Detect Claude Code Regressions with CC-Canary JSONL Analysis

CC-Canary is a drift detection tool for Claude Code, packaged as two installable Agent Skills. It scans the JSONL session logs that Claude Code already writes to ~/.claude/projects/, checks whether the model's behavior has drifted on your own work, and produces a shareable forensic report. It needs no network, account, telemetry, or background daemon; it runs entirely on data already on your disk. Status: 0.x / pre-alpha.

Installation

Install via npx skills:

npx skills add delta-hq/cc-canary

Or install individual skills:

npx skills add delta-hq/cc-canary --skill cc-canary
npx skills add delta-hq/cc-canary --skill cc-canary-html

Requirements: Python 3.8+ on PATH. Auto-opening the HTML report requires macOS, Linux, or WSL; otherwise the report path is printed instead.

Usage

From a Claude Code session:

/cc-canary 60d
/cc-canary-html 30d

The window defaults to 60 days; accepted values are 7d, 14d, 30d, 60d, 90d, and 180d.

What You Get

Verdict — HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE
Headline metrics table — pre vs post comparison with green/yellow/red bands
Weekly trend bars — cost (USD, verified against ccusage), read:edit ratio, reasoning loops, tokens/turn
Cross-version comparison — same user, different model versions, controlling for task mix
Auto-detected inflection date — composite health-score break
Findings with model-side / user-side / ambiguous classification
Appendices — hour-of-day thinking depth, word-frequency shift, three-period thinking-visibility transition, per-turn behavior rates

Metrics Tracked

Read:Edit ratio — file reads per edit; proxy for investigation thoroughness
Write share of mutations — Write / (Edit + Write); high share = rewriting instead of surgical edits
Reasoning loops / 1K tool calls — phrases like "let me try again", "oh wait", "actually"
Frustration rate — rate of frustration words in your prompts
Thinking redaction rate — fraction of thinking blocks redacted vs visible
Mean thinking length — reasoning-depth proxy
API turns per user turn — API calls per user message
Tokens per user turn — total token volume per user message

Plus appendices for premature stopping, self-admitted errors, shortcut vocabulary, user interrupts, etc.
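
To make the ratio-style metrics above concrete, here is a minimal Python sketch of how three of them could be computed from already-extracted data. It is an illustration, not CC-Canary's code: the function names are hypothetical, the tool names "Read", "Edit", and "Write" are assumed to be what appears in the logs, and the phrase list contains only the three examples quoted above.

```python
from collections import Counter

# Hypothetical phrase list: only the three examples named in this README.
LOOP_PHRASES = ("let me try again", "oh wait", "actually")


def tool_mix_metrics(tool_names):
    """Read:Edit ratio and Write share from a flat list of tool-call names.

    Assumes tool calls are labelled "Read", "Edit", and "Write".
    Returns None for a metric whose denominator is zero.
    """
    counts = Counter(tool_names)
    reads, edits, writes = counts["Read"], counts["Edit"], counts["Write"]
    mutations = edits + writes
    return {
        "read_edit_ratio": reads / edits if edits else None,
        "write_share": writes / mutations if mutations else None,
    }


def loops_per_1k(assistant_texts, n_tool_calls, phrases=LOOP_PHRASES):
    """Reasoning-loop phrase hits per 1,000 tool calls (case-insensitive)."""
    hits = sum(text.lower().count(p) for text in assistant_texts for p in phrases)
    return 1000 * hits / n_tool_calls if n_tool_calls else None
```

For example, six reads, three edits, and one write give a Read:Edit ratio of 2.0 and a Write share of 0.25. Plain substring counting is the simplest possible matcher; a real implementation would likely need word boundaries to avoid over-counting a common word like "actually".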

How It Works

Scan — Python script (stdlib only) walks ~/.claude/projects/**/*.jsonl, filters by window, excludes subagent sessions.
Dedupe — Assistant messages deduped on (message.id, requestId) because Claude Code writes the same message into multiple JSONLs when sessions are resumed or branched.
Aggregate — Per-session metrics: tool-mix, read:edit ratio, reasoning-loop phrases, self-admitted errors, premature stops, interrupts, token usage, cost (current Claude 4.x rates), hour-of-day thinking depth.
Detect inflection — Composite health score per day; argmax of |before − after| over candidate dates with 0.75σ floor. Falls back to median-timestamp split if no break clears.
Pre-render report — Script writes markdown/HTML skeleton with every table and bar chart filled in. ~20 narrative slots left for Claude to fill.
Fill & save — Claude reads skeleton, writes narrative, saves final file. Total runtime: ~2.5s script + 10–20s Claude narrative.
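
The Dedupe and Detect inflection steps above can be sketched in stdlib Python as follows. This is a rough illustration under stated assumptions, not the tool's actual implementation: the function names and the min_side parameter are hypothetical, the "type" field is an assumed way to recognize assistant records, the 0.75σ floor is interpreted as 0.75 times the population standard deviation of all daily scores, and the fallback uses the middle day as a stand-in for the median-timestamp split.

```python
import statistics


def dedupe_assistant(records):
    """Drop assistant records whose (message.id, requestId) was already seen.

    Assumes each record is a dict parsed from one JSONL line and that
    assistant records carry type == "assistant" (an assumption here).
    """
    seen = set()
    kept = []
    for rec in records:
        if rec.get("type") == "assistant":
            key = (rec.get("message", {}).get("id"), rec.get("requestId"))
            if key in seen:
                continue  # same message written into another session file
            seen.add(key)
        kept.append(rec)
    return kept


def find_inflection(daily_scores, min_side=3, floor_sigmas=0.75):
    """Pick the split date that maximizes |mean(before) - mean(after)|.

    daily_scores: non-empty list of (date, score) pairs sorted by date.
    Returns (date, used_fallback). The date is the first day of the
    "after" period. min_side (hypothetical) keeps both sides non-trivial.
    """
    scores = [score for _, score in daily_scores]
    sigma = statistics.pstdev(scores)
    best_gap, best_idx = 0.0, None
    for i in range(min_side, len(scores) - min_side + 1):
        gap = abs(statistics.mean(scores[:i]) - statistics.mean(scores[i:]))
        if gap > best_gap:
            best_gap, best_idx = gap, i
    if best_idx is not None and best_gap >= floor_sigmas * sigma:
        return daily_scores[best_idx][0], False
    # No break clears the floor: fall back to a midpoint split.
    return daily_scores[len(daily_scores) // 2][0], True
```

On a clean step series such as four days at 1.0 followed by four days at 0.0, the split lands on the fifth day. A flat or evenly alternating series clears no break and takes the fallback path.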