Canary: AI QA Agent for Automated Testing Based on Code Changes

What Canary Does
Canary builds AI agents that connect to your codebase to understand application structure including routes, controllers, and validation logic. When you push a pull request, it reads the diff, understands the intent behind changes, then generates and executes tests against your preview app to check real user workflows end-to-end.
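As a rough illustration of the diff-to-workflow step described above, the sketch below maps changed file paths to the user workflows that might need testing. The `WORKFLOW_PATHS` table, its path prefixes, and the `affected_workflows` helper are all hypothetical; the source does not say how Canary performs this analysis, and a real system would presumably reason about code structure rather than match path prefixes.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from source path prefixes to user workflows.
# None of these paths or workflow names come from the source.
WORKFLOW_PATHS: Dict[str, List[str]] = {
    "app/routes/checkout": ["checkout"],
    "app/controllers/billing": ["checkout", "invoicing"],
    "app/validators/auth": ["login", "signup"],
}


def affected_workflows(changed_files: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated workflows touched by the changed files."""
    hits = set()
    for path in changed_files:
        for prefix, workflows in WORKFLOW_PATHS.items():
            if path.startswith(prefix):
                hits.update(workflows)
    return sorted(hits)
```

For example, `affected_workflows(["app/controllers/billing/invoice.py", "README.md"])` selects the checkout and invoicing workflows, while a change that touches only `README.md` selects none.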
Key Features
- Analyzes PR diffs to understand what actually changed
- Generates and runs tests for every affected user workflow
- Comments directly on PRs with test results and screen recordings
- Flags behaviors that don't match expectations
- Allows triggering specific user workflow tests via PR comments
- Tests generated from PRs can be moved into regression suites
- Creates tests from plain-English prompts, generating full test suites from your codebase
- Schedules and runs tests continuously
Technical Approach
According to the founders, a single foundation model cannot handle this alone. QA spans multiple modalities: source code, DOM/ARIA, device emulators, visual verifications, screen recording analysis, network/console logs, and live browser state. The system requires custom browser fleets, user sessions, ephemeral environments, on-device farms, and data seeding to run tests reliably.
Catching the second-order effects of a code change requires a specialized harness that tries to break the application in many different ways, across different user types, that ordinary happy-path testing would not cover.
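One way to picture going beyond the happy path is a scenario matrix that crosses user types with ways of stressing a workflow. The sketch below is only an assumption about what such a harness might enumerate; the `USER_TYPES` and `PERTURBATIONS` lists and the `build_scenarios` helper are hypothetical and do not come from the source.

```python
from itertools import product
from typing import List, NamedTuple


class Scenario(NamedTuple):
    user_type: str
    workflow: str
    perturbation: str


# Hypothetical user roles and failure modes, chosen for illustration only.
USER_TYPES = ["admin", "member", "guest"]
PERTURBATIONS = ["happy_path", "invalid_input", "expired_session", "slow_network"]


def build_scenarios(workflows: List[str]) -> List[Scenario]:
    """Cross every user type with every workflow and every perturbation."""
    return [
        Scenario(user, workflow, perturbation)
        for user, workflow, perturbation in product(USER_TYPES, workflows, PERTURBATIONS)
    ]
```

With three user types and four perturbations, a single workflow expands to twelve scenarios, only three of which are happy-path runs.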
Benchmark Results
The team published QA-Bench v0, which they describe as the first benchmark for code verification. They tested their purpose-built QA agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset. The benchmark scored results on three dimensions: Relevance, Coverage, and Coherence.
Coverage showed the largest performance gap. Canary leads by:
- 11 points over GPT 5.4
- 18 points over Claude Code
- 26 points over Sonnet 4.6
Real-World Example
One construction tech customer had an invoicing flow where the amount due drifted from the original proposal total by approximately $1,600. Canary caught this regression in their invoice flow before release.
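The kind of invariant behind this catch can be sketched as a simple check that an invoice's amount due still matches the proposal total. The function names, the zero default tolerance, and the figures in the usage note are assumptions made for illustration; the source does not describe the customer's data model or how Canary expressed the test.

```python
from decimal import Decimal


def amount_due_drift(proposal_total: Decimal, invoice_amount_due: Decimal) -> Decimal:
    """Return how far the invoice amount due has moved from the proposal total."""
    return invoice_amount_due - proposal_total


def check_invoice_matches_proposal(
    proposal_total: Decimal,
    invoice_amount_due: Decimal,
    tolerance: Decimal = Decimal("0.00"),
) -> None:
    """Raise AssertionError if the drift exceeds the allowed tolerance."""
    drift = amount_due_drift(proposal_total, invoice_amount_due)
    if abs(drift) > tolerance:
        raise AssertionError(
            f"amount due drifted from proposal total by {drift}"
        )
```

Calling `check_invoice_matches_proposal(Decimal("25000.00"), Decimal("26600.00"))` with these made-up totals raises an error reporting a drift of 1600.00, whereas equal totals pass silently. `Decimal` is used rather than `float` so that currency comparisons are exact.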
Founder Background
The founders previously built AI coding tools at Windsurf, Cognition, and Google. They observed that while AI tools made teams faster at shipping, nobody was testing real user behavior before merge, leading to production issues in checkout, auth, and billing flows.
📖 Read the full source: HN AI Agents