CC-Canary: Detect Regressions in Claude Code with Local JSONL Analysis

✍️ OpenClawRadar📅 Published: April 24, 2026🔗 Source
CC-Canary: Detect Regressions in Claude Code with Local JSONL Analysis
Ad

CC-Canary is a drift detection tool for Claude Code, packaged as two installable Agent Skills. It scans the JSONL session logs that Claude Code already writes to ~/.claude/projects/, detects whether the model has been drifting on your own work, and produces a shareable forensic report. No network, no account, no telemetry, no background daemon — runs on data already on your disk. Status: 0.x / pre-alpha.

Installation

Install via npx skills:

npx skills add delta-hq/cc-canary

Or install individual skills:

npx skills add delta-hq/cc-canary --skill cc-canary npx skills add delta-hq/cc-canary --skill cc-canary-html

Requirements: Python 3.8+ on PATH. macOS/Linux/WSL for auto-open of HTML report (falls back to printing path).

Usage

From a Claude Code session:

/cc-canary 60d /cc-canary-html 30d

The window defaults to 60 days; accepts 7d, 14d, 30d, 60d, 90d, 180d.

What You Get

  • Verdict — HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE
  • Headline metrics table — pre vs post comparison with green/yellow/red bands
  • Weekly trend bars — cost (USD, verified against ccusage), read:edit ratio, reasoning loops, tokens/turn
  • Cross-version comparison — same user, different model versions, controlling for task mix
  • Auto-detected inflection date — composite health-score break
  • Findings with model-side / user-side / ambiguous classification
  • Appendices — hour-of-day thinking depth, word-frequency shift, three-period thinking-visibility transition, per-turn behavior rates
Ad

Metrics Tracked

  • Read:Edit ratio — file reads per edit; proxy for investigation thoroughness
  • Write share of mutations — Write / (Edit + Write); high share = rewriting instead of surgical edits
  • Reasoning loops / 1K tool calls — phrases like "let me try again", "oh wait", "actually"
  • Frustration rate — rate of frustration words in your prompts
  • Thinking redaction rate — fraction of thinking blocks redacted vs visible
  • Mean thinking length — reasoning-depth proxy
  • API turns per user turn — API calls per user message
  • Tokens per user turn — total token volume per user message

Plus appendices for premature stopping, self-admitted errors, shortcut vocabulary, user interrupts, etc.

How It Works

  1. Scan — Python script (stdlib only) walks ~/.claude/projects/**/*.jsonl, filters by window, excludes subagent sessions.
  2. Dedupe — Assistant messages deduped on (message.id, requestId) because Claude Code writes the same message into multiple JSONLs when sessions are resumed or branched.
  3. Aggregate — Per-session metrics: tool-mix, read:edit ratio, reasoning-loop phrases, self-admitted errors, premature stops, interrupts, token usage, cost (current Claude 4.x rates), hour-of-day thinking depth.
  4. Detect inflection — Composite health score per day; argmax of |before − after| over candidate dates with 0.75σ floor. Falls back to median-timestamp split if no break clears.
  5. Pre-render report — Script writes markdown/HTML skeleton with every table and bar chart filled in. ~20 narrative slots left for Claude to fill.
  6. Fill & save — Claude reads skeleton, writes narrative, saves final file. Total runtime: ~2.5s script + 10–20s Claude narrative.

📖 Read the full source: HN AI Agents

Ad

👀 See Also