11 Multi-Agent Builds: Scope Enforcement Works Mechanically, Not Via Prompts

Key Technical Findings from Multi-Agent System Experiments

Analysis of 11 autonomous multi-agent software builds without programmatic scaffolding, based on 295M tokens, 98 agent sessions, and 6.1M lines of worker output, reveals practical insights for developers working with AI coding agents.

Scope Enforcement and Orchestration

Scope enforcement is solved mechanically, not through prompts: Prompt-based approaches failed 0/20 times under compiler pressure, while mechanical approaches (letting agents edit everything and using git revert for out-of-scope files) succeeded 20/20 times. The key insight: don't ask models to respect boundaries—enforce them after the fact.

Orchestrator costs are memory-bound: Approximately 95% of input spend is re-reading conversation history. The "statefulness premium" means a frontier orchestrator that writes zero shipped code can cost as much as the entire worker fleet. Optimization should target fewer turns and less re-ingestion, not cheaper reasoning.

Coordination and Scaling Dynamics

Models don't independently discover coordination: Bare-prompt Opus with full tool access never delegated, never wrote specs, and never discovered parallel dispatch—it just built everything solo. The coordination template does real work.

Depth scales differently than quality: Flat dispatch beats hierarchy at ≤10 domains on throughput, token efficiency, and wall time. Above 10 domains, hierarchy enables parallelism that flat dispatch can't achieve.

Solo outperforms coordination until context limits bind: Solo throughput is approximately 325 LOC/min and invariant to project size. Pyramid throughput scales with workers. Below ~30K LOC, delegation is pure overhead.

Worker Performance and Type Systems

Worker model capability drives throughput: Same architecture, same spec, three worker models produced: 17,761 LOC vs 6,001 vs 1,818—a 9.8x gap. Architecture enables parallel throughput; the worker model determines it.

Type contracts provide shared vocabulary: Integration succeeds without contracts at every scale tested (6–36 modules), even under read-only constraints. But without contracts, parallel workers silently produce structurally incompatible types that compile only because nothing cross-references. A single 984-line contract written blind held across 10 independent domains.

Type contracts eliminate coordination overhead at scale: Controlled scaling test (1–20 workers, fixed spec) showed zero integration errors across 50 domain builds. Sweet spot at 10 workers: 2.05x wall-time speedup. At 20 workers, serial phase dependencies negate parallelism gains (Amdahl's serial fraction ~44%).

Context and Delegation Patterns

Context priming works; format doesn't matter: 0% formula transfer cold, 100% with design context present (N=10 per condition). A static reference document produces identical transfer rates to a synthetic boot conversation.

Delegation compression is inherent: Each delegation layer acts as a lossy summarizer. Quantitative requirements ("80 weapons") vanish; structural requirements (type interfaces) survive. Fix: workers should read full specs from the filesystem rather than relying on compressed prompt chains.

Compaction recovery is robust with good summaries: Zero task relapse across 11 compaction events. The model states expected state, then reads disk to verify.

Failure Modes and Fixes

Abstraction reflex: Builds an orchestrator instead of orchestrating—name it in the prompt
Self-model error: Claims false capabilities—document available tools explicitly
Identity paradox: Can't hold dual roles—use separate model instances
Delegation compression: Use enumerative specs plus filesystem access

📖 Read the full source: r/ClaudeAI