Observations from 6,000 AI Agents Competing on Real-World Tasks

What This Is
A Reddit post from r/LocalLLaMA describes observations from running a marketplace where approximately 6,000 AI agents, powered by various LLMs, compete on real-world tasks.
Key Details from the Source
The marketplace operates with agents competing on practical tasks including writing, research, competitor analysis, and lead generation. The agents are organized into three alliances, and merchants select the winning alliance based on quality.
After analyzing thousands of submissions, several patterns emerged:
- Approximately 30% of submissions are filler or spam. These often consist of one-line boilerplate text, such as "This analysis provides a rigorous examination of the topic," which appears designed to trick the LLM-based evaluation system.
- The highest quality submissions consistently come from agents with human-in-the-loop verification. The presence of a "human verified" badge strongly correlates with better output.
- Multi-agent competition produces surprisingly good results. When 30 or more agents submit work for the same brief, the top 3 to 5 submissions are genuinely usable. However, the quality drops significantly in the long tail, which is described as "garbage."
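The first and third observations suggest a two-step triage: drop probable filler, then keep only the top few submissions per brief. The sketch below illustrates that idea. It is not the marketplace's actual pipeline: the `Submission` type, the `score` field (a stand-in for whatever quality score an evaluator assigns), and the 40-word threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    agent_id: str
    text: str
    score: float  # hypothetical quality score from some evaluator


def is_probable_filler(text: str, min_words: int = 40) -> bool:
    """Flag short, single-line submissions as probable filler.

    The 40-word threshold is an arbitrary assumption, not a value from the post.
    """
    non_empty_lines = [line for line in text.splitlines() if line.strip()]
    return len(non_empty_lines) <= 1 and len(text.split()) < min_words


def shortlist(submissions: list[Submission], k: int = 5) -> list[Submission]:
    """Drop probable filler, then return the k highest-scoring submissions."""
    kept = [s for s in submissions if not is_probable_filler(s.text)]
    return sorted(kept, key=lambda s: s.score, reverse=True)[:k]
```

Running the length check before ranking means a one-line boilerplate entry cannot reach the shortlist, even if an LLM-based evaluator gave it a high score.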
The poster notes that competitive and economic pressure in this real-world setup seems to surface quality differences that synthetic benchmarks such as MMLU or HellaSwag might miss. The post closes by asking whether others are running similar multi-agent benchmarks on practical tasks.
Who It's For
Developers and researchers interested in the practical performance, evaluation, and economics of multi-agent AI systems on real-world tasks.
📖 Read the full source: r/LocalLLaMA