Multi-Agent AI: Harness vs Engineering Org Model

Anthropic has published a harness design for long-running application development, while the Agyn multi-agent system for team-based autonomous software engineering was open-sourced last month and described on arXiv. Both approaches reject the "monolithic agent" model and instead structure AI agents to work like real engineering teams, with role separation, structured handoffs, and review loops.

Core Architecture Differences

Anthropic's system uses a GAN-inspired architecture with three roles: planner → generator → evaluator. The evaluator uses Playwright to interact with the running application like a real user, then provides structured critique back to the generator.
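
To make the planner → generator → evaluator loop concrete, here is a minimal sketch of the control flow only. It is not Anthropic's code: run_harness, Critique, and the toy_* functions are hypothetical names, and the evaluator here is a stub that compares lists, where the real one would drive the running application through Playwright.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    passed: bool
    notes: List[str]  # structured feedback handed back to the generator


def run_harness(
    prompt: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[List[str], List[str]], List[str]],
    evaluate: Callable[[List[str], List[str]], Critique],
    max_rounds: int = 3,
) -> Tuple[List[str], int]:
    """Plan once, then alternate generate/evaluate until the critique passes."""
    spec = plan(prompt)
    feedback: List[str] = []
    build: List[str] = []
    for round_no in range(1, max_rounds + 1):
        build = generate(spec, feedback)
        critique = evaluate(spec, build)
        if critique.passed:
            return build, round_no
        feedback = critique.notes  # the next attempt sees the critique
    return build, max_rounds


# Toy stand-ins for the three roles (assumptions for illustration only).
def toy_plan(prompt: str) -> List[str]:
    return [f"{prompt}: login page", f"{prompt}: dashboard"]


def toy_generate(spec: List[str], feedback: List[str]) -> List[str]:
    # First attempt covers only the first item; later attempts add
    # whatever the evaluator flagged as missing.
    return [spec[0]] + [item for item in spec if item in feedback]


def toy_evaluate(spec: List[str], build: List[str]) -> Critique:
    missing = [item for item in spec if item not in build]
    return Critique(passed=not missing, notes=missing)


build, rounds = run_harness("todo app", toy_plan, toy_generate, toy_evaluate)
```

The point of the shape is that the generator never grades its own work: it only ever sees the evaluator's notes from the previous round.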

Agyn models the process as an engineering organization with four roles: coordination → research → implementation → review. Agents operate in isolated sandboxes and communicate through defined contracts.
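
One way to picture "defined contracts" between the four roles is as typed records passed from one role to the next. The sketch below is an assumption about the general shape, not Agyn's actual schema: TaskSpec, Patch, Review, and the role functions are hypothetical names, and each role is reduced to a plain function.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSpec:  # contract: research -> implementation
    issue: str
    acceptance_criteria: List[str]


@dataclass(frozen=True)
class Patch:  # contract: implementation -> review
    spec: TaskSpec
    changed_files: List[str]
    tests_added: List[str]


@dataclass(frozen=True)
class Review:  # contract: review -> coordination
    approved: bool
    comments: List[str]


def research(issue: str) -> TaskSpec:
    return TaskSpec(issue, [f"fixes: {issue}", "regression test added"])


def implement(spec: TaskSpec) -> Patch:
    # Hypothetical file names; a real implementer would edit a repository.
    return Patch(spec, ["app/handlers.py"], ["tests/test_handlers.py"])


def review(patch: Patch) -> Review:
    comments = []
    if not patch.changed_files:
        comments.append("empty patch")
    if not patch.tests_added:
        comments.append("no tests added")
    return Review(approved=not comments, comments=comments)


def coordinate(issue: str) -> Review:
    """The coordinator only sequences roles and passes contracts along."""
    return review(implement(research(issue)))


result = coordinate("crash on empty input")
```

Because each role sees only the record it is handed, a role can be swapped out or run in its own sandbox without the others needing to change.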

Shared Solutions to Common Problems

Models losing coherence over long tasks: Anthropic uses context resets with structured handoff artifacts, while Agyn uses compaction with structured handoffs between roles.
Self-evaluation being too lenient: Both systems separate evaluation from generation. Anthropic uses a separate evaluator agent calibrated on few-shot examples, while Agyn has a dedicated review role separated from implementation.
Ambiguous "done" criteria: Anthropic uses sprint contracts negotiated before work starts, while Agyn has a task specification phase with explicit acceptance criteria and required tests.
Complex task decomposition: Anthropic's planner expands one-sentence prompts into full specifications, while Agyn's researcher agent decomposes issues and produces specifications before implementation begins.
Context anxiety: Anthropic uses resets for clean slates, while Agyn uses compaction with a memory layer.
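
The difference between a context reset and compaction is easy to see on a toy message history. This is a rough sketch under stated assumptions: the function names, the "HANDOFF"/"SUMMARY" prefixes, and the handoff fields are all hypothetical, and the summarizer is a stand-in for whatever a real system would use.

```python
import json
from typing import Callable, Dict, List


def reset_with_handoff(history: List[str], handoff: Dict) -> List[str]:
    """Reset: drop the whole history; the new context holds only the artifact."""
    return ["HANDOFF " + json.dumps(handoff, sort_keys=True)]


def compact(
    history: List[str],
    keep_last: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Compaction: replace older turns with a summary, keep recent turns verbatim."""
    if len(history) <= keep_last:
        return list(history)
    cut = len(history) - keep_last
    older, recent = history[:cut], history[cut:]
    return ["SUMMARY " + summarize(older)] + recent


history = [f"turn {i}" for i in range(6)]

# Hypothetical handoff artifact carrying the agreed "done" criteria forward.
handoff = {
    "completed": ["login page"],
    "remaining": ["dashboard"],
    "acceptance_criteria": ["all required tests pass"],
}

fresh = reset_with_handoff(history, handoff)
compacted = compact(history, 2, lambda msgs: f"{len(msgs)} earlier turns")
```

With a reset, anything not written into the artifact is gone, which is why the handoff has to be structured. With compaction, recent turns survive intact and only the older ones are lossy.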

Agyn's Distinctive Features

Agyn includes two features not present in Anthropic's harness:

Isolated sandboxes per agent: Each agent operates in its own isolated file and network namespace, preventing collisions on shared state during parallel or sequential work.
GitHub as shared state: The system uses GitHub primitives (commits, comments, PRs, reviews) that human teams already understand, providing a full audit log without requiring custom communication protocols.
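
As a small illustration of why per-agent isolation avoids collisions, the sketch below gives each agent its own scratch directory. It covers file isolation only, using temporary directories; real sandboxes with separate file and network namespaces would need containers or OS-level namespaces, and nothing here reflects Agyn's actual implementation. agent_workspace and demo are hypothetical names.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Tuple


@contextmanager
def agent_workspace(agent_name: str) -> Iterator[Path]:
    """Yield a private scratch directory that is removed on exit."""
    with tempfile.TemporaryDirectory(prefix=f"{agent_name}-") as root:
        yield Path(root)


def demo() -> Tuple[bool, str, str]:
    with agent_workspace("implementer") as impl, agent_workspace("reviewer") as rev:
        # Both agents write a file with the same name; neither overwrites the other.
        (impl / "notes.txt").write_text("draft patch")
        (rev / "notes.txt").write_text("review comments")
        return (
            impl != rev,
            (impl / "notes.txt").read_text(),
            (rev / "notes.txt").read_text(),
        )


separate, impl_notes, rev_notes = demo()
```

Anything the agents need to share would then travel through an explicit channel, which in Agyn's case is GitHub, not through a common working directory.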

Implementation Differences

Anthropic's harness is built tightly around Claude, using the Claude Agent SDK and Playwright MCP for the evaluation loop. The evaluator navigates the live, running application before scoring it.

Agyn is model-agnostic by design, supporting Claude, Codex, and open-weight models. The system allows a different model per role, which reportedly outperforms using a single model for every role.
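
Mixing models per role amounts to a lookup from role to model. The sketch below shows one plausible shape for such a mapping; it is not Agyn's configuration format, and the model identifiers are placeholders, not real model names.

```python
from typing import Dict, Optional

# Placeholder identifiers (assumptions); substitute whatever models you run.
ROLE_MODELS: Dict[str, str] = {
    "coordination": "provider-a/general-model",
    "research": "provider-a/general-model",
    "implementation": "provider-b/coding-model",
    "review": "local/open-weight-model",
}


def model_for(role: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Resolve the model for a role, letting per-run overrides win."""
    if role not in ROLE_MODELS:
        raise KeyError(f"unknown role: {role}")
    table = {**ROLE_MODELS, **(overrides or {})}
    return table[role]


default_reviewer = model_for("review")
swapped_reviewer = model_for("review", {"review": "provider-a/general-model"})
```

Keeping the mapping in one place makes it cheap to try a different model for a single role and compare results, without touching the other roles.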

📖 Read the full source: r/ClaudeAI