TinyFish Web Agent Outperforms Competitors in Web Task Benchmarking

The TinyFish Web Agent achieved an 81.9% success rate on hard tasks in the Online-Mind2Web benchmark, which consists of 300 tasks across 136 live websites. Major competitors scored far lower: OpenAI Operator, for example, managed only a 43.2% success rate on similar tasks.
The Online-Mind2Web benchmark is a rigorous measure of a web agent's capabilities. Its tasks range from easy ones, like browsing credit card offers on Marriott, to complex challenges such as booking event tickets with dynamic pricing. Each task involves multiple steps on live websites, including handling form validation and pop-ups, which makes it a more realistic test than less reliable benchmarks like WebVoyager.
TinyFish distinguishes itself by handling compounding errors effectively. Its success rate drops only 15.6 points from easy to hard tasks, compared with much larger drops for other systems, which highlights its robustness in real-world scenarios. Notably, TinyFish has published all 300 task runs, including the 40 failures. This offers transparency into its performance characteristics and failure cases, such as infrastructure-level anti-bot blocks encountered on sites like apartments.com.
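The quoted figures can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions the article does not confirm: that the 40 published failures are spread across all 300 tasks with one run per task, and that the OpenAI Operator figure refers to the same hard-task subset. The variable names are made up for this example.

```python
# Figures quoted in the article.
HARD_SUCCESS_PCT = 81.9        # TinyFish success rate on hard tasks
EASY_TO_HARD_DROP_PTS = 15.6   # drop in percentage points from easy to hard
TOTAL_TASKS = 300              # tasks in Online-Mind2Web
FAILED_RUNS = 40               # published failures (assumed: out of all 300 runs)
OPERATOR_PCT = 43.2            # OpenAI Operator (assumed: same hard-task subset)

# Easy-task success rate implied by the stated drop.
implied_easy_pct = round(HARD_SUCCESS_PCT + EASY_TO_HARD_DROP_PTS, 1)

# Overall success rate implied by 40 failures in 300 runs.
overall_pct = round(100 * (TOTAL_TASKS - FAILED_RUNS) / TOTAL_TASKS, 1)

# Gap to OpenAI Operator in percentage points.
gap_pts = round(HARD_SUCCESS_PCT - OPERATOR_PCT, 1)

print(f"Implied easy-task success rate: {implied_easy_pct}%")
print(f"Implied overall success rate:   {overall_pct}%")
print(f"Gap to OpenAI Operator:         {gap_pts} points")
```

Under those assumptions, the stated drop implies an easy-task success rate of about 97.5%, and 40 failures in 300 runs implies an overall success rate of about 86.7%.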
Developers looking for a robust web automation tool may find TinyFish's open-source cookbook repository of interest; it provides insight into the agent's architecture and execution methodology.
📖 Read the full source: HN AI Agents