AI TDD Pipeline: How Bad Instructions Created 3,400 Tests and What Fixed It

The Problem: Literal Interpretation at Scale
A developer created a multi-agent TDD pipeline using Claude Code, with different agents handling specific jobs: one writes tests, one writes code to pass them, one reviews everything, and one hunts for edge cases. The initial instruction was simple: "write tests for everything."
The system appeared to work: the test count kept climbing and CI stayed green. However, an audit of the 3,400 generated tests told a different story:
- 44% valid
- 30% needed rework
- 26% complete garbage
The garbage tests included:
- Tests that constructed a JSON config object and then asserted it equaled itself
- Tests that checked whether a TypeScript interface had the right shape by building the object and asserting it matches what they just built
- Tests for static files that will never change
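To make the first two patterns concrete, here is a minimal sketch contrasting a self-comparing test with a behavioral one. The `RetryConfig` shape, the `resolveRetryConfig` function, and the `check` helper are all hypothetical, invented for illustration; none of them come from the developer's codebase.

```typescript
// Hypothetical config shape and resolver, invented for illustration only.
interface RetryConfig {
  maxRetries: number;
  backoffMs: number;
}

// Code under test: merges overrides onto defaults and rejects negative values.
function resolveRetryConfig(overrides: Partial<RetryConfig>): RetryConfig {
  const defaults: RetryConfig = { maxRetries: 3, backoffMs: 200 };
  const merged: RetryConfig = { ...defaults, ...overrides };
  if (merged.maxRetries < 0 || merged.backoffMs < 0) {
    throw new RangeError("retry settings must be non-negative");
  }
  return merged;
}

// Minimal assertion helper so the sketch needs no test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`FAILED: ${message}`);
}

// Tautological test: builds an object and compares it with itself.
// It never calls resolveRetryConfig, so it cannot fail when that function breaks.
function tautologicalTest(): void {
  const config: RetryConfig = { maxRetries: 3, backoffMs: 200 };
  check(JSON.stringify(config) === JSON.stringify(config), "config equals itself");
}

// Behavioral test: exercises the code and checks outcomes a bug would change.
function behavioralTest(): void {
  const resolved = resolveRetryConfig({ maxRetries: 5 });
  check(resolved.maxRetries === 5, "override is applied");
  check(resolved.backoffMs === 200, "default is preserved");

  let rejected = false;
  try {
    resolveRetryConfig({ backoffMs: -1 });
  } catch (err) {
    rejected = err instanceof RangeError;
  }
  check(rejected, "negative backoff is rejected");
}

tautologicalTest();
behavioralTest();
```

Both tests pass, which is the point: a green result from the first one says nothing about whether `resolveRetryConfig` works.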
The developer deleted almost 20,000 lines of test code and identified the core issue: "Claude didn't screw up. I did. I said 'write tests for everything' and it heard me loud and clear. Every file. Every config. Every type definition. My instructions were the problem, and the agent followed them perfectly."
The Solution: Classification and Review
The fix involved two key changes:
1. Classifying work items before testing:
- Features get 3-5 behavioral tests (does this thing actually work?)
- Tasks get 1-2 smoke tests (did it break anything obvious?)
- Bugs get 2-3 regression tests (will this specific bug come back?)
- Enhancements only test new or changed behavior
2. Adding a review agent: A separate agent looks at both tests and implementation with fresh context, catching issues the writing agents missed because they were too close to their own output.
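The classification step can be sketched as a lookup from work item type to a test budget, which is then turned into a scoped instruction for the test-writing agent. This is only an illustration under stated assumptions: the type names, the `TEST_PLANS` table, `buildTestInstruction`, and the instruction wording are hypothetical, and the source does not describe how the pipeline actually encodes these rules. Only the categories and counts come from the article.

```typescript
// Work item categories named in the article.
type WorkItemKind = "feature" | "task" | "bug" | "enhancement";

interface TestPlan {
  // [min, max] number of tests; null where the article gives no count.
  range: [number, number] | null;
  // What the tests should focus on (wording is an assumption).
  focus: string;
}

// Hypothetical lookup table encoding the budgets listed above.
const TEST_PLANS: Record<WorkItemKind, TestPlan> = {
  feature: { range: [3, 5], focus: "behavioral tests that check the feature actually works" },
  task: { range: [1, 2], focus: "smoke tests that check nothing obvious broke" },
  bug: { range: [2, 3], focus: "regression tests that would fail if this bug came back" },
  enhancement: { range: null, focus: "tests covering only new or changed behavior" },
};

// Builds a scoped instruction in place of a blanket "write tests for everything".
function buildTestInstruction(kind: WorkItemKind, title: string): string {
  const plan = TEST_PLANS[kind];
  const count = plan.range === null ? "" : `${plan.range[0]}-${plan.range[1]} `;
  return `[${kind}] ${title}: write ${count}${plan.focus}.`;
}

console.log(buildTestInstruction("bug", "Login timeout"));
// prints "[bug] Login timeout: write 2-3 regression tests that would fail if this bug came back."
```

Keeping the budget in data rather than in free-form prompt text makes the rule easy to audit: an unclassified work item has no entry in the table, so it cannot silently fall back to "test everything".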
Results After the Fix
- Test count dropped from 3,400 to 2,525
- Execution time dropped from 117 seconds to ~50 seconds
- Every remaining test validates actual behavior
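For scale, the reported figures work out as follows. This is plain arithmetic on the numbers quoted in this summary; the 50-second value is the approximate figure given above, so the time reduction is approximate too.

```typescript
// Figures quoted in the article.
const testsBefore = 3400;
const testsAfter = 2525;
const secondsBefore = 117;
const secondsAfter = 50; // the article says "~50", so treat this as approximate

// Percentage drop from `before` to `after`.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const removed = testsBefore - testsAfter;
const testReduction = percentReduction(testsBefore, testsAfter);
const timeReduction = percentReduction(secondsBefore, secondsAfter);

// 26% of the audited tests were classed as garbage.
const auditedGarbage = Math.round(testsBefore * 0.26);

console.log(`${removed} tests removed (${testReduction.toFixed(1)}%)`);
// prints "875 tests removed (25.7%)"
console.log(`execution time down about ${timeReduction.toFixed(1)}%`);
// prints "execution time down about 57.3%"
console.log(`tests audited as garbage: about ${auditedGarbage}`);
// prints "tests audited as garbage: about 884"
```

The 875 tests removed is close to the roughly 884 that the audit classed as garbage.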
Key Insight
"Building with AI agents makes your sloppy thinking visible at scale. A human writes bad tests, you get a few bad tests. Give a bad instruction to an agent pipeline processing hundreds of work items? You get hundreds of bad tests. Same bad thinking, just amplified across everything it touches. Fix the thinking, fix the output."
📖 Read the full source: r/ClaudeAI