PhAIL Benchmark Tests VLA Models on Real Warehouse Robot Tasks

PhAIL is a physical AI benchmark that measures how well vision-language-action (VLA) models perform on commercial robotics tasks. The creator built it because they couldn't find honest performance numbers for these models in practical applications.
Benchmark Details
The benchmark tests four VLA models on bin-to-bin order picking, one of the most common warehouse operations:
- OpenPI/pi0.5
- GR00T
- ACT
- SmolVLA
All tests use the same hardware, a Franka FR3 robot with a Robotiq 2F-85 gripper (the DROID setup), and the same set of objects. Results come from hundreds of blind runs, in which the operator does not know which model is running.
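One way to picture the blind-run protocol is a shuffled schedule in which the operator sees only opaque run IDs while the model assignment is kept in a separate answer key. The sketch below is an illustration under that assumption, not PhAIL's actual tooling: the function name `blinded_schedule`, the run ID format, and the runs-per-model count are all hypothetical.

```python
import random

# Model names as listed in the article.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def blinded_schedule(models, runs_per_model, seed):
    """Return (operator_view, answer_key) for a blinded evaluation.

    operator_view: opaque run IDs, the only thing the operator sees.
    answer_key:    run ID -> model name, kept hidden until scoring.
    Hypothetical helper; the benchmark's real scheduling is not described.
    """
    rng = random.Random(seed)
    # Every model gets the same number of runs, then the order is shuffled
    # so run position carries no information about the model.
    assignments = [m for m in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)
    operator_view = [f"run-{i:03d}" for i in range(len(assignments))]
    answer_key = dict(zip(operator_view, assignments))
    return operator_view, answer_key


operator_view, answer_key = blinded_schedule(MODELS, runs_per_model=3, seed=0)
print(operator_view[:4])
```

Keeping the answer key apart from the operator's view is what makes the runs blind: the person resetting the scene and scoring picks cannot favor a model, even unconsciously.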
Performance Results
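The best model falls well short of human throughput, reaching roughly one-fifth of the teleoperated rate and no more than about one-twentieth of the manual rate: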
- Best model performance: 64 units per hour (UPH)
- Human teleoperating the same robot: 330 UPH
- Human performing the task by hand: 1,300+ UPH
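The size of these gaps follows directly from the reported figures. The short sketch below works through the arithmetic; the `uph` helper assumes UPH means units completed per hour of run time, which the article does not spell out, and the manual figure is treated as a lower bound because it is reported as "1,300+".

```python
def uph(units_moved, elapsed_seconds):
    """Units per hour, assuming UPH = completed units / elapsed hours."""
    return units_moved * 3600.0 / elapsed_seconds


# Figures reported in the article.
BEST_MODEL_UPH = 64
TELEOP_UPH = 330
MANUAL_UPH = 1300  # reported as "1,300+", so this is a lower bound

teleop_gap = TELEOP_UPH / BEST_MODEL_UPH  # how much faster teleoperation is
manual_gap = MANUAL_UPH / BEST_MODEL_UPH  # how much faster hand picking is

print(f"Teleoperation is about {teleop_gap:.1f}x the best model")
print(f"Hand picking is at least {manual_gap:.1f}x the best model")

# Hypothetical example: 16 units moved in a 15-minute run is 64 UPH.
print(uph(16, 15 * 60))
```

In other words, the best model trails a human teleoperating the same robot by about 5x and a human working by hand by at least 20x.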
Open Data and Methodology
Everything from the benchmark is publicly available:
- Every run with synced video and telemetry data
- The fine-tuning dataset used for training
- Training scripts
- An open leaderboard accepting new submissions
The creator is available to answer questions about methodology, the specific models tested, or observations from the benchmark runs.
📖 Read the full source: HN AI Agents