3,400 Tests From Bad Instructions: TDD Pipeline Fix

The Problem: Literal Interpretation at Scale

A developer created a multi-agent TDD pipeline using Claude Code, with different agents handling specific jobs: one writes tests, one writes code to pass them, one reviews everything, and one hunts for edge cases. The initial instruction was simple: "write tests for everything."

The system appeared to work: the test count kept climbing and CI stayed green. An audit of the 3,400 generated tests told a different story:

44% valid
30% needed rework
26% complete garbage

The garbage tests included:

Tests that constructed a JSON config object and then asserted it equaled itself
Tests that checked whether a TypeScript interface had the right shape by building an object and asserting that it matched what they had just built
Tests for static files that will never change
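
To make the first pattern concrete, here is a minimal Python sketch contrasting a tautological test with behavioral ones. The original tests are not shown in the source and targeted a TypeScript codebase, so everything here is illustrative: `load_timeout`, the `timeout_seconds` key, and its default of 30 are hypothetical names and values.

```python
import json


def load_timeout(raw_config: str) -> int:
    """Hypothetical function under test: parse a JSON config and return the timeout."""
    config = json.loads(raw_config)
    timeout = config.get("timeout_seconds", 30)  # assumed default, for illustration only
    if timeout <= 0:
        raise ValueError("timeout_seconds must be positive")
    return timeout


def test_config_equals_itself():
    # Tautological: builds an object and compares it with itself.
    # No production code runs, so this test can never fail.
    config = {"timeout_seconds": 30}
    assert config == config


def test_missing_timeout_falls_back_to_default():
    # Behavioral: exercises load_timeout and pins down an observable outcome.
    assert load_timeout("{}") == 30


def test_non_positive_timeout_is_rejected():
    # Behavioral: a broken validation rule would make this test fail.
    try:
        load_timeout('{"timeout_seconds": 0}')
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-positive timeout")
```

The first test stays green no matter what happens to `load_timeout`, which is why a rising test count and a green CI run said nothing about quality.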

The developer deleted almost 20,000 lines of test code and identified the core issue: "Claude didn't screw up. I did. I said 'write tests for everything' and it heard me loud and clear. Every file. Every config. Every type definition. My instructions were the problem, and the agent followed them perfectly."

The Solution: Classification and Review

The fix involved two key changes:

1. Classifying work items before testing:

Features get 3-5 behavioral tests (does this thing actually work?)
Tasks get 1-2 smoke tests (did it break anything obvious?)
Bugs get 2-3 regression tests (will this specific bug come back?)
Enhancements only test new or changed behavior
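
The classification above can be sketched as a small lookup that turns a work item type into a scoped instruction for the test-writing agent. This is an assumption about how such a rule might be encoded, not the developer's actual implementation: `WorkItemKind`, `TEST_BUDGETS`, and `instruction_for` are hypothetical names, and only the counts and test styles come from the source.

```python
from enum import Enum


class WorkItemKind(Enum):
    FEATURE = "feature"
    TASK = "task"
    BUG = "bug"
    ENHANCEMENT = "enhancement"


# (minimum tests, maximum tests, test style), copied from the list above.
# Enhancements are absent because the source gives a scope rule, not a count.
TEST_BUDGETS = {
    WorkItemKind.FEATURE: (3, 5, "behavioral"),
    WorkItemKind.TASK: (1, 2, "smoke"),
    WorkItemKind.BUG: (2, 3, "regression"),
}


def instruction_for(kind: WorkItemKind) -> str:
    """Build the instruction handed to the test-writing agent for one work item."""
    if kind is WorkItemKind.ENHANCEMENT:
        return "Write tests only for new or changed behavior."
    low, high, style = TEST_BUDGETS[kind]
    return f"Write {low}-{high} {style} tests."


for kind in WorkItemKind:
    print(f"{kind.value}: {instruction_for(kind)}")
```

The point of the design is that the agent never again receives an unbounded instruction like "write tests for everything"; each work item arrives with an explicit scope and an upper limit.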

2. Adding a review agent: A separate agent looks at both tests and implementation with fresh context, catching issues the writing agents missed because they were too close to their own output.

Results After the Fix

Test count fell from 3,400 to 2,525
Execution time dropped from 117 seconds to ~50 seconds
Every remaining test validates actual behavior
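
As a quick check on these figures, the arithmetic can be worked through directly. The inputs are the numbers reported above; the only assumption is treating "~50 seconds" as exactly 50.

```python
tests_before, tests_after = 3400, 2525
seconds_before, seconds_after = 117, 50  # the source reports roughly 50 seconds

removed = tests_before - tests_after
removed_share = removed / tests_before
time_saved_share = (seconds_before - seconds_after) / seconds_before

# The audit rated 26% of the original suite as garbage.
garbage_estimate = round(0.26 * tests_before)

print(f"tests removed: {removed} ({removed_share:.1%} of the suite)")
print(f"tests the audit rated as garbage: about {garbage_estimate}")
print(f"runtime reduction: {time_saved_share:.1%}")
```

The 875 removed tests are close to the roughly 884 that the audit rated as garbage, which is consistent with the garbage being deleted and most of the "needs rework" tests being repaired rather than dropped, although the source does not state this explicitly.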

Key Insight

"Building with AI agents makes your sloppy thinking visible at scale. A human writes bad tests, you get a few bad tests. Give a bad instruction to an agent pipeline processing hundreds of work items? You get hundreds of bad tests. Same bad thinking, just amplified across everything it touches. Fix the thinking, fix the output."

📖 Read the full source: r/ClaudeAI