Be brief beats caveman plugin in Claude Code compression benchmark

Max Taylor benchmarked the popular Claude Code compression plugin 'caveman' against a trivial baseline: prepending 'be brief.' to each prompt. The results are surprisingly flat — but reveal where the plugin actually delivers value.
Benchmark methodology
24 prompts across six categories (bug diagnosis, concept explanation, architecture tradeoffs, multi-step setup, security/destructive ops, error interpretation). Each prompt had a rubric with required key points, required terms, and forbidden claims. Five arms were tested: baseline (no instruction), 'be brief.', and caveman at three intensity levels (lite, full, ultra). All ran via claude -p on claude-opus-4-7. Responses were scored by claude-sonnet-4-6 against the rubric.
Quality results
All arms scored within 1.5% of each other:
- Baseline: 0.985
- Brief: 0.985
- Lite: 0.976
- Full: 0.975
- Ultra: 0.970
Every arm hit 100% of key points. Zero forbidden claims triggered across 120 responses. Compression did not drop substantive content.
Token counts
| Arm | Mean tokens |
|---|---|
| Baseline | 636 |
| Brief | 419 (34% cut) |
| Lite | 401 |
| Full | 404 |
| Ultra | 449 |
'Be brief.' cut tokens by 34% vs baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three — but the category split tells a different story.
The category split reveals caveman's design
On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied. Compression works as advertised. On multi-step setup and security warnings, all caveman modes show higher token counts. The reason: caveman's 'Auto-Clarity' rule explicitly disables compression for safety warnings, irreversible actions, and multi-step sequences. The safety escape engages, and compression stops — by design.
So what is caveman actually for?
If 'be brief.' matches on tokens and quality, the plugin's value is structural:
- Consistent output shape — every response follows the same pattern, useful for downstream tooling or uniform session feel.
- Intensity dial — slash commands to switch lite/full/ultra mid-session.
- Persistence across long sessions — caveman re-injects its ruleset via
SessionStartandUserPromptSubmithooks to prevent drift (not tested in this single-shot benchmark).
The full dataset and harness are open source.
📖 Read the full source: HN AI Agents
👀 See Also

Super Claude browser extension tracks Claude AI usage velocity and limit predictions
A developer built a browser extension called Super Claude that adds usage velocity indicators and time-to-100% predictions directly in the Claude UI, helping users monitor their 5-hour allocation consumption.

Pilot Protocol: A P2P Network Stack for AI Agents Built with Claude
A developer built Pilot Protocol, a pure user-space peer-to-peer virtual network stack in Go specifically for autonomous AI agents, enabling direct communication without centralized infrastructure. The protocol uses UDP multiplexing, NAT traversal, and end-to-end encryption, with benchmarks showing 89 MB/s local throughput and 2.1 MB/s cross-continent WAN throughput.

Claude Sleuth: A 56-Task Investigation Workflow for Claude AI
Claude Sleuth is a structured investigation workflow for Claude AI with 6 phases and 56 tasks, featuring persistent state storage via Cloudflare D1 and standardized output conventions including ISO 8601 timestamps, POLE entity records, and ICD 203 probability language.

Heartbeat-gateway: Event-driven replacement for cron polling in OpenClaw
Heartbeat-gateway is an open-source Python tool that replaces cron-based polling with webhook-driven events for OpenClaw, reducing API costs from ~$86/month to ~$4.50/month and improving latency from up to 30 minutes to under 2 seconds.