TinyFish Web Agent Outperforms Competitors in Benchmarking

The TinyFish Web Agent achieved an 81.9% success rate on hard tasks in the Online-Mind2Web benchmark, which consists of 300 tasks across 136 live websites. By comparison, OpenAI Operator, a major competitor, managed only a 43.2% success rate on similar tasks.

The Online-Mind2Web benchmark tests web agents on tasks that range from easy, such as browsing credit card offers on Marriott, to complex, such as booking event tickets with dynamic pricing. Tasks involve multiple steps on live websites, including handling form validation and pop-ups, which makes the benchmark a more realistic test than less reliable benchmarks such as WebVoyager.

TinyFish distinguishes itself by limiting compounding errors across multi-step tasks. Its success rate drops only 15.6 points from easy to hard tasks, compared with much larger drops for other systems, which points to robustness in real-world scenarios. TinyFish has also published all 300 task runs, including the 40 failures, giving visibility into its performance characteristics and failure cases, such as infrastructure-level anti-bot blocks encountered on sites like apartments.com.
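
The figures quoted above imply a few numbers that the article does not state directly. The short Python sketch below derives them from the reported values only; the function and variable names are illustrative, and the derived rates are back-of-the-envelope arithmetic, not results published by TinyFish.

```python
# Figures as reported in the article.
TOTAL_TASKS = 300             # tasks in Online-Mind2Web
FAILED_TASKS = 40             # published TinyFish failures
HARD_SUCCESS_PCT = 81.9       # TinyFish success rate on hard tasks
EASY_TO_HARD_DROP_PTS = 15.6  # reported drop from easy to hard tasks
OPERATOR_HARD_PCT = 43.2      # OpenAI Operator rate quoted for comparison


def overall_success_pct(total: int, failed: int) -> float:
    """Overall success rate implied by the total and failure counts."""
    return 100.0 * (total - failed) / total


def implied_easy_success_pct(hard_pct: float, drop_pts: float) -> float:
    """Easy-task rate implied by the hard-task rate plus the reported drop."""
    return hard_pct + drop_pts


overall = overall_success_pct(TOTAL_TASKS, FAILED_TASKS)
easy = implied_easy_success_pct(HARD_SUCCESS_PCT, EASY_TO_HARD_DROP_PTS)
gap = HARD_SUCCESS_PCT - OPERATOR_HARD_PCT

print(f"Implied overall success rate: {overall:.1f}%")
print(f"Implied easy-task success rate: {easy:.1f}%")
print(f"Gap to Operator on hard tasks: {gap:.1f} points")
```

Under these assumptions, 40 failures out of 300 runs corresponds to roughly an 86.7% overall success rate, and a 15.6-point drop ending at 81.9% implies about 97.5% on easy tasks.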

Developers looking for a robust web automation tool may find TinyFish's open-source cookbook repository useful; it provides insight into the agent's architecture and execution methodology.

📖 Read the full source: HN AI Agents
