Comparing Multi-Agent AI Systems: Anthropic's Harness vs Agyn's Engineering Org Model

Anthropic has published a harness design for long-running application development, while the Agyn multi-agent system for team-based autonomous software engineering was open-sourced last month on arXiv. Both approaches reject the "monolithic agent" model and instead structure AI agents to work like real engineering teams with role separation, structured handoffs, and review loops.
Core Architecture Differences
Anthropic's system uses a GAN-inspired architecture with three roles: planner → generator → evaluator. The evaluator uses Playwright to interact with the running application like a real user, then provides structured critique back to the generator.
Agyn models the process as an engineering organization with four roles: coordination → research → implementation → review. Agents operate in isolated sandboxes and communicate through defined contracts.
Shared Solutions to Common Problems
- Models losing coherence over long tasks: Anthropic uses context resets with structured handoff artifacts, while Agyn uses compaction with structured handoffs between roles
- Self-evaluation being too lenient: Both systems separate evaluation from generation. Anthropic uses a separate evaluator agent calibrated on few-shot examples, while Agyn has a dedicated review role separated from implementation
- Ambiguous "done" criteria: Anthropic uses sprint contracts negotiated before work starts, while Agyn has a task specification phase with explicit acceptance criteria and required tests
- Complex task decomposition: Anthropic's planner expands one-sentence prompts into full specifications, while Agyn's researcher agent decomposes issues and produces specifications before implementation begins
- Context anxiety: Anthropic uses resets for clean slates, while Agyn uses compaction with a memory layer
Agyn's Distinctive Features
Agyn includes two features not present in Anthropic's harness:
- Isolated sandboxes per agent: Each agent operates in its own isolated file and network namespace, preventing collisions on shared state during parallel or sequential work
- GitHub as shared state: The system uses GitHub primitives (commits, comments, PRs, reviews) that human teams already understand, providing a full audit log without requiring custom communication protocols
Implementation Differences
Anthropic's harness is built tightly around Claude using the Claude Agent SDK and Playwright MCP for the evaluation loop. The evaluator navigates live running applications before scoring.
Agyn is model-agnostic by design, supporting Claude, Codex, and open-weight models. The system allows mixing different models per role, which in practice has been found to outperform using one model for everything.
📖 Read the full source: r/ClaudeAI
👀 See Also

TextForge: A Claude Code-built email approval tool for LLM workflows
A developer built TextForge using Claude Code to automate email workflows with mandatory approval gates, preventing LLMs from sending emails without explicit permission. The tool integrates with Pipedrive CRM and required Google CASA2 security audit compliance.

Deploy OpenClaw on VPS with a One-Command CLI
A Reddit user claims to have developed a CLI that deploys OpenClaw on a $4.99/month VPS with a single command, offering a cost-effective alternative to using Mac Minis.

Single-page chatbot interface for locally running Gemma 4 26B A4B
A developer built a single HTML page chatbot that connects to Gemma 4 26B A4B running locally with 32K context window at 50-65 tokens/second, sharded between a 7900 XT and 3060 Ti GPU. The interface includes full streaming, Markdown rendering, and parameter controls.

Local Trello-Style Project Manager for OpenClaw Agents
A developer built a local Trello-like project management tool that runs on the same machine as their OpenClaw agent, storing cards as markdown files with YAML frontmatter. The system uses Node.js/Express for the API, React for the UI, and allows the AI agent to read/write files directly on the filesystem.