Open-Source Benchmark Runner for Testing OpenClaw Agents on Real Workflows

A Reddit user has released an open-source tool called personal_agent_eval (repo: github.com/javiersgjavi/personal_agent_eval) for benchmarking OpenClaw agents on realistic, messy workflows rather than public toy datasets.
Workflow
Define test cases as YAML files containing:
- Input messages
- Expected artifacts
- Evaluation criteria
- Deterministic checks
- Run profiles and judge profiles
The runner executes cases against an actual OpenClaw instance, stores outputs, evaluates runs, and generates reports and charts.
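To make the idea of deterministic checks concrete, here is a minimal sketch in Python. The case layout, the field names (`input_messages`, `expected_artifacts`, `deterministic_checks`), the check types, and the `run_checks` helper are all assumptions for illustration; the repo's actual YAML schema and runner API may look different. The case is written as a Python dict that mirrors what such a YAML file might contain.

```python
import re
from pathlib import Path

# Hypothetical case layout mirroring a YAML test case.
# Field names and check types are illustrative, not the repo's real schema.
CASE = {
    "id": "weekly-report",
    "input_messages": ["Summarize this week's notes into report.md"],
    "expected_artifacts": ["report.md"],
    "deterministic_checks": [
        {"type": "file_exists", "path": "report.md"},
        {"type": "contains", "path": "report.md", "pattern": r"^# Weekly Report"},
    ],
}


def run_checks(case: dict, workspace: Path) -> dict[str, bool]:
    """Run each deterministic check against the files an agent left in `workspace`.

    Returns a mapping of "<check type>:<path>" to pass/fail.
    """
    results: dict[str, bool] = {}
    for check in case["deterministic_checks"]:
        target = workspace / check["path"]
        name = f'{check["type"]}:{check["path"]}'
        if check["type"] == "file_exists":
            results[name] = target.is_file()
        elif check["type"] == "contains":
            # A missing file fails the check instead of raising.
            results[name] = target.is_file() and bool(
                re.search(check["pattern"], target.read_text(encoding="utf-8"), re.MULTILINE)
            )
        else:
            raise ValueError(f"Unknown check type: {check['type']}")
    return results
```

Calling `run_checks(CASE, Path("path/to/run/output"))` after an agent run returns one boolean per check. Checks like these need no judge model, so they give the same verdict on every rerun.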
Key Feature: Real Workspace Import
You can import your actual OpenClaw workspace — including memory, skills, files, prompts, and context — instead of a stripped-down imitation. The agent runs in a real OpenClaw instance, testing the exact agent you use daily.
Private Evaluation Sets
The author deliberately keeps their own evaluation sets private, since benchmarks tend to go stale once they are public. However, the repo includes example cases, configs, evaluation profiles, deterministic checks, and chart generation so you can build your own private suite.
SKILL.md for Agent Assistance
A SKILL.md file in the repo is designed to give an agent enough context to help you define new benchmark cases, run profiles, evaluation criteria, and deterministic checks — reducing manual editing.
Sample Results (Author’s Private Run)
The author shared a single-run comparison (metric unclear, likely weighted average 0-10):
- Claude Opus 4.6: 9.44
- GLM 5.1: 9.31
- GPT-5.5: 9.31
- Claude Sonnet 4.6: 9.25
- DeepSeek V4 Flash: 8.61
- Gemma 4 31B: 8.39
- DeepSeek V4 Pro: 8.28
- Kimi K2.6: 7.97
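Since the metric behind these numbers is not specified, the following is only a sketch of how a weighted 0-10 average could be computed from per-criterion judge scores. The criterion names, the weights, and the `weighted_score` function are hypothetical and are not taken from the repo or from the author's results.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (each on a 0-10 scale) into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"Missing scores for criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical criteria and weights, for illustration only.
example_scores = {"task_completion": 9.0, "tool_use": 8.0, "output_quality": 10.0}
example_weights = {"task_completion": 2.0, "tool_use": 1.0, "output_quality": 1.0}

print(weighted_score(example_scores, example_weights))  # (18 + 8 + 10) / 4 = 9.0
```

Under a scheme like this, two models can land on the same overall score while differing sharply on individual criteria, which is one reason to look at per-criterion results and not only the headline number.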
More interesting than scores: failure modes. Some models reason well but are clumsy with tools; cheaper models degrade on long or stateful tasks; some failures are model behavior, others are OpenClaw/tooling edge cases exposed by the benchmark.
Who It’s For
OpenClaw users who run agents for real work and want to compare models on their own private tasks rather than arguing from vibes or generic leaderboards.
📖 Read the full source: r/openclaw