PhAIL Benchmark Tests VLA Models on Real Warehouse Robot Tasks

✍️ OpenClawRadar | 📅 Published: April 1, 2026

PhAIL is a physical AI benchmark that measures how well vision-language-action (VLA) models perform on commercial robotics tasks. The creator built it because they couldn't find honest performance numbers for these models in practical applications.

Benchmark Details

The benchmark tests four VLA models on bin-to-bin order picking, one of the most common warehouse operations:

OpenPI/pi0.5
GR00T
ACT
SmolVLA

All models run on the same hardware, a Franka FR3 arm with a Robotiq 2F-85 gripper (the DROID setup), and handle identical objects. Results come from hundreds of blind runs, in which the operator does not know which model is running.
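The article does not say how the blinding was implemented. As a rough illustration of the idea only, the sketch below shuffles model assignments and gives the operator opaque run codes. The function names, run-code format, and runs-per-model count are all hypothetical and are not taken from PhAIL.

```python
import random

# Model names as listed in the article.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def blind_schedule(models, runs_per_model, seed=None):
    """Return a shuffled list of (run_code, model) pairs.

    Hypothetical helper: each model gets the same number of runs,
    and the order is randomized so run position reveals nothing.
    """
    rng = random.Random(seed)
    assignments = [m for m in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)
    return [(f"run-{i:04d}", m) for i, m in enumerate(assignments, start=1)]


def operator_view(schedule):
    """Return only the opaque run codes, which is all the operator sees."""
    return [code for code, _ in schedule]


if __name__ == "__main__":
    schedule = blind_schedule(MODELS, runs_per_model=3, seed=0)
    for code in operator_view(schedule):
        print(code)
```

The full mapping from run code to model would be kept away from the operator and joined to the results only after the runs are scored.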

Performance Results

The best model reached about one-fifth of the throughput of a human teleoperating the same robot, and under 5% of a human doing the task by hand:

Best model performance: 64 units per hour (UPH)
Human teleoperating the same robot: 330 UPH
Human performing the task by hand: 1,300+ UPH
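To put the three figures on a common footing, the short sketch below converts each one into seconds per unit and into the best model's share of that throughput. The by-hand figure is reported as "1,300+", so 1300 is treated as a lower bound here. The helper names are illustrative.

```python
# Throughput figures reported in the article, in units per hour (UPH).
# "human_by_hand" is reported as "1,300+", so 1300 is a lower bound.
REPORTED_UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,
}


def relative_throughput(uph, baseline_uph):
    """Return uph as a fraction of baseline_uph."""
    return uph / baseline_uph


def seconds_per_unit(uph):
    """Convert units per hour into average seconds spent per unit."""
    return 3600.0 / uph


if __name__ == "__main__":
    best = REPORTED_UPH["best_model"]
    for name, uph in REPORTED_UPH.items():
        share = relative_throughput(best, uph)
        print(f"{name}: {uph} UPH, {seconds_per_unit(uph):.1f} s/unit, "
              f"best model at {share:.1%} of this rate")
```

On these numbers the best model averages just under a minute per unit, against roughly 11 seconds for teleoperation and under 3 seconds by hand.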

Open Data and Methodology

Everything from the benchmark is publicly available:

Every run with synced video and telemetry data
The fine-tuning dataset used for training
Training scripts
An open leaderboard accepting new submissions

The creator is available to answer questions about methodology, the specific models tested, or observations from the benchmark runs.

📖 Read the full source: HN AI Agents
