Open-Source Benchmark Runner for Testing OpenClaw Agents on Real Workflows

✍️ OpenClawRadar | 📅 Published: May 14, 2026 | 🔗 Source

A Reddit user has released personal_agent_eval (repo: github.com/javiersgjavi/personal_agent_eval), an open-source tool for benchmarking OpenClaw agents on realistic, messy workflows rather than public toy datasets.

Workflow

Define test cases as YAML files containing:

  • Input messages
  • Expected artifacts
  • Evaluation criteria
  • Deterministic checks
  • Run profiles and judge profiles

The runner executes cases against an actual OpenClaw instance, stores outputs, evaluates runs, and generates reports and charts.
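To make the shape of a case more concrete, here is a minimal Python sketch of how the fields listed above might look once loaded, and how deterministic checks could be applied to a run's output. Everything in it is an assumption made for illustration: the `BenchmarkCase` fields, the check helpers, `run_case`, and the stub agent are hypothetical names and do not reflect the actual schema or API of personal_agent_eval. The stub also stands in for a real OpenClaw instance, which the actual runner talks to.

```python
from dataclasses import dataclass, field
from typing import Callable

# A check takes the artifacts produced by a run (name -> text) and returns pass/fail.
Check = Callable[[dict[str, str]], bool]


@dataclass
class BenchmarkCase:
    """Hypothetical in-memory form of one test case (not the repo's real schema)."""
    case_id: str
    input_messages: list[str]
    expected_artifacts: list[str]
    evaluation_criteria: list[str]  # free-text criteria, meant for a judge model
    deterministic_checks: list[Check] = field(default_factory=list)


def artifact_contains(name: str, needle: str) -> Check:
    """Build a check that passes when the named artifact contains `needle`."""
    return lambda artifacts: needle in artifacts.get(name, "")


def run_case(case: BenchmarkCase, agent: Callable[[list[str]], dict[str, str]]) -> dict:
    """Run one case against an agent callable and apply the deterministic checks."""
    artifacts = agent(case.input_messages)
    missing = [name for name in case.expected_artifacts if name not in artifacts]
    results = [check(artifacts) for check in case.deterministic_checks]
    return {
        "case_id": case.case_id,
        "missing_artifacts": missing,
        "checks_passed": sum(results),
        "checks_total": len(results),
        "passed": not missing and all(results),
    }


def stub_agent(messages: list[str]) -> dict[str, str]:
    """Stand-in for a real agent; returns a fixed artifact regardless of input."""
    return {"summary.md": "Invoice total: 120 EUR"}


case = BenchmarkCase(
    case_id="invoice-summary",
    input_messages=["Summarize the attached invoice into summary.md"],
    expected_artifacts=["summary.md"],
    evaluation_criteria=["Summary states the invoice total correctly"],
    deterministic_checks=[
        artifact_contains("summary.md", "120"),
        artifact_contains("summary.md", "EUR"),
    ],
)

result = run_case(case, stub_agent)
print(result)
```

The split mirrors the list above: deterministic checks are cheap pass/fail functions over the produced artifacts, while the free-text evaluation criteria are left for a judge to score separately.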

Key Feature: Real Workspace Import

You can import your actual OpenClaw workspace (memory, skills, files, prompts, and context) instead of a stripped-down imitation. Cases run in a real OpenClaw instance, so the benchmark tests the exact agent you use daily.

Private Evaluation Sets

The author deliberately does not publish their private evaluation sets, so that they do not go stale the way public benchmarks do. However, the repo includes example cases, configs, evaluation profiles, deterministic checks, and chart generation so you can build your own private suite.

SKILL.md for Agent Assistance

A SKILL.md file in the repo is designed to give an agent enough context to help you define new benchmark cases, run profiles, evaluation criteria, and deterministic checks — reducing manual editing.

Sample Results (Author’s Private Run)

The author shared a comparison from a single run. The metric is not stated; it is likely a weighted average on a 0-10 scale:

Claude Opus 4.6 - 9.44
GLM 5.1 - 9.31
GPT-5.5 - 9.31
Claude Sonnet 4.6 - 9.25
DeepSeek V4 Flash - 8.61
Gemma 4 31B - 8.39
DeepSeek V4 Pro - 8.28
Kimi K2.6 - 7.97
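
Since the scoring metric is not stated, the following is only a sketch of what a weighted average on a 0-10 scale could look like if per-criterion scores were combined into one number. The criterion names, weights, and scores are made up for illustration and are not taken from the author's run or from the repo.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted average (0-10)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Illustrative numbers only; not real benchmark data.
example_scores = {"task_completion": 9.0, "tool_use": 8.0, "output_quality": 10.0}
example_weights = {"task_completion": 2.0, "tool_use": 1.0, "output_quality": 1.0}

overall = weighted_score(example_scores, example_weights)
print(overall)  # (9*2 + 8*1 + 10*1) / 4 = 9.0
```

Under a scheme like this, two models can land on the same overall number with very different per-criterion profiles, which is one reason to read a single aggregate from a single run with caution.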

More interesting than scores: failure modes. Some models reason well but are clumsy with tools; cheaper models degrade on long or stateful tasks; some failures are model behavior, others are OpenClaw/tooling edge cases exposed by the benchmark.

Who It’s For

OpenClaw users who run agents for real work and want to compare models on their own private tasks rather than arguing from vibes or generic leaderboards.

📖 Read the full source: r/openclaw
