LLM Skirmish: A Real-Time Strategy Game Benchmark for AI Coding Agents

What LLM Skirmish Is
LLM Skirmish is a benchmark environment where large language models compete in 1v1 real-time strategy games by writing code strategies. The project draws on the Screeps API paradigm - originally an "MMO RTS sandbox for programmers" - where code executes directly in the game environment.
Tournament Structure
Each tournament consists of five rounds. In round one, LLMs write initial strategies. For rounds 2-5, they can review match results from previous rounds and adapt their scripts. Every player faces all other players once per round, resulting in 10 matches per round and 50 matches per tournament.
The objective is to eliminate the opponent's spawn building within 2,000 game frames (each player gets up to one second of runtime computation per frame). If no spawn is eliminated, victory is determined by score.
Technical Implementation
The system uses OpenCode, an open-source agentic coding harness, running in isolated Docker containers. Agents receive:
OBJECTIVE.md- game rules, API documentation, and script writing instructionsNEXT_ROUND.md- instructions for reviewing previous match logs (rounds 2-5 only)- Two example strategies as reference
Scripts are validated after creation, with agents getting up to 3 attempts to fix errors before the round proceeds.
Performance Results
Current standings from testing:
- Claude Opus 4.5: 85 wins, 15 losses (85% win rate, 1778 ELO)
- GPT 5.2 (high reasoning level): 68 wins, 32 losses (68% win rate, 1625 ELO)
- Grok 4.1 Fast: 39 wins, 61 losses (39% win rate, 1427 ELO)
- GLM 4.7: 32 wins, 68 losses (32% win rate, 1372 ELO)
- Gemini 3 Pro: 26 wins, 74 losses (26% win rate, 1297 ELO)
Most models showed improved performance across rounds, indicating in-context learning: Claude Opus 4.5 (+20% win rate from round 1 to 5), GLM 4.7 (+16%), GPT 5.2 (+7%), Grok 4.1 Fast (+6%). Gemini 3 Pro was an anomaly with 70% win rate in round 1 but only 15% in rounds 2-5.
Development Notes
The creator spent significant time on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading opponent strategies. Claude Opus 4.5 showed dominance but was overly focused on economy in early rounds.
Future testing is planned with newer models like Claude 4.6 Opus and GPT 5.3 Codex.
Getting Started
You can run local matches via CLI. The hosted match runner uses Google Cloud Run with isolated-vm, and match visualizations are served from Cloudflare. A community ladder accepts strategy submissions via CLI without authentication. The CLI plus skill.md documentation is sufficient for AI agents to begin immediately.
📖 Read the full source: HN AI Agents
👀 See Also

CSS Modern Features Agent Skill: Enforce Modern CSS Practices in AI Coding Agents
An agent skill that enforces 57+ modern CSS features across color, layout, selectors, animation, typography, positioning, and component patterns, compatible with Claude Code, Cursor, Windsurf, Codex, Cline, and GitHub Copilot.

Memento v1.0: Local Persistent Memory for AI Coding Agents
Memento v1.0 is a fully local memory layer for AI coding agents that runs embeddings, storage, and search on your machine with no cloud dependencies. It uses all-MiniLM-L6-v2 embeddings, HNSW indexing, and supports multiple IDEs with 17 MCP tools.

Claude Code Built Treelo: A Free Video Transcription Tool
A video editor used Claude Code to build Treelo, a free tool that transcribes video/audio files, removes filler words, allows SFX placement at exact timestamps, and exports SRT for Premiere or ASS for DaVinci Resolve.

Building a Coding Agent for 8k Context: Planner/Executor Split, Token Budgeting, and Parallel Execution
A detailed breakdown of building a CLI coding agent designed around 8k token limits, using a planner/executor architecture, strict token budgeting, and parallel task execution.