Automated Testing: Canary AI QA Agent for PRs

What Canary Does

Canary builds AI agents that connect to your codebase and learn the application's structure, including routes, controllers, and validation logic. When you open a pull request, an agent reads the diff, infers the intent behind the change, then generates and runs tests against your preview app to check real user workflows end to end.
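
To make the diff-to-workflow step concrete, here is a minimal sketch of how changed files in a unified diff could be mapped to the user workflows they touch. It is not Canary's implementation: the path patterns, workflow names, and both function names are hypothetical, and a real agent would presumably infer the mapping from routes and controllers rather than a hand-written table.

```python
import re

# Hypothetical mapping from source path patterns to user workflows.
# These paths and workflow names are illustrative assumptions only.
WORKFLOW_PATTERNS = {
    r"^app/controllers/checkout/": "checkout",
    r"^app/controllers/sessions": "login",
    r"^app/models/invoice": "invoicing",
}


def changed_files(diff_text: str) -> list[str]:
    """Extract the post-change file paths from a unified diff."""
    paths = []
    for line in diff_text.splitlines():
        match = re.match(r"^\+\+\+ b/(.+)$", line)
        if match:
            paths.append(match.group(1))
    return paths


def affected_workflows(diff_text: str) -> list[str]:
    """Return the sorted workflows whose patterns match any changed file."""
    hits = set()
    for path in changed_files(diff_text):
        for pattern, workflow in WORKFLOW_PATTERNS.items():
            if re.search(pattern, path):
                hits.add(workflow)
    return sorted(hits)


SAMPLE_DIFF = """\
diff --git a/app/models/invoice.rb b/app/models/invoice.rb
--- a/app/models/invoice.rb
+++ b/app/models/invoice.rb
@@ -10,7 +10,7 @@
-    total - discount
+    total - discount - deposit
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-old
+new
"""

print(affected_workflows(SAMPLE_DIFF))  # ['invoicing']
```

Only the invoice model matches a pattern, so the README change triggers no workflow tests in this sketch.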

Key Features

Analyzes PR diffs to understand what actually changed
Generates and runs tests for every affected user workflow
Comments directly on PRs with test results and screen recordings
Flags behaviors that don't match expectations
Lets you trigger specific user workflow tests via PR comments
Lets you move tests generated from PRs into regression suites
Generates full test suites from your codebase when prompted in plain English
Schedules and runs tests continuously

Technical Approach

According to the founders, a single foundation model cannot handle this alone. QA spans multiple modalities: source code, DOM/ARIA, device emulators, visual verifications, screen recording analysis, network/console logs, and live browser state. Running tests reliably also requires custom browser fleets, user sessions, ephemeral environments, on-device farms, and data seeding.

Catching second-order effects of a code change requires a specialized harness that tries to break the application in many ways, across different user types, that ordinary happy-path testing would not cover.
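
One way to picture such a harness is as a matrix of user types crossed with ways to break a workflow. The sketch below is an assumption for illustration, not a description of Canary's harness; the role names, perturbation names, and the `scenarios` helper are all hypothetical.

```python
from itertools import product

# Hypothetical user types and failure modes, chosen only for illustration.
USER_TYPES = ["admin", "member", "guest"]
PERTURBATIONS = ["expired_session", "empty_input", "double_submit"]


def scenarios(workflow: str) -> list[dict]:
    """Enumerate every user type crossed with every perturbation."""
    return [
        {"workflow": workflow, "user": user, "perturbation": perturbation}
        for user, perturbation in product(USER_TYPES, PERTURBATIONS)
    ]


cases = scenarios("checkout")
print(len(cases))  # 9
print(cases[0])
```

A happy-path suite would cover roughly one cell of this matrix per workflow; the remaining cells are where second-order regressions tend to hide.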

Benchmark Results

The team published QA-Bench v0, which they describe as the first benchmark for code verification. They tested their purpose-built QA agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset. The benchmark scored results on three dimensions: Relevance, Coverage, and Coherence.

Coverage showed the largest performance gap. Canary leads by:

11 points over GPT 5.4
18 points over Claude Code
26 points over Sonnet 4.6
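
The leads above also imply how far apart the baselines are from each other. A short sketch of that arithmetic, assuming all four systems were scored on the same Coverage scale so that the leads can be subtracted directly (the `gap_between` helper is hypothetical):

```python
# Coverage leads reported for Canary over each baseline, in points.
canary_lead = {
    "GPT 5.4": 11,
    "Claude Code": 18,
    "Sonnet 4.6": 26,
}


def gap_between(better: str, worse: str) -> int:
    """Points by which `better` outscores `worse` on Coverage.

    If Canary leads A by a points and B by b points, A leads B by b - a.
    """
    return canary_lead[worse] - canary_lead[better]


print(gap_between("GPT 5.4", "Claude Code"))     # 7
print(gap_between("Claude Code", "Sonnet 4.6"))  # 8
print(gap_between("GPT 5.4", "Sonnet 4.6"))      # 15
```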

Real-World Example

One construction tech customer had an invoicing flow in which the amount due drifted from the original proposal total by approximately $1,600. Canary caught the regression before release.
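
The kind of check that surfaces this class of bug can be sketched as a simple invariant between the proposal and the invoice. The business rule used here (amount due equals proposal total minus payments received), the function name, and all dollar figures are assumptions made for illustration; the source gives only the size of the drift.

```python
from decimal import Decimal


def amount_due_drift(
    proposal_total: Decimal,
    payments: list[Decimal],
    invoice_amount_due: Decimal,
) -> Decimal:
    """Return how far the invoice's amount due is from the expected value.

    Assumed rule: expected amount due = proposal total - payments received.
    """
    expected = proposal_total - sum(payments, Decimal("0"))
    return invoice_amount_due - expected


# Hypothetical figures, chosen only to produce a drift of the size described.
drift = amount_due_drift(
    proposal_total=Decimal("48000.00"),
    payments=[Decimal("12000.00")],
    invoice_amount_due=Decimal("37600.00"),
)
print(drift)  # 1600.00
```

An end-to-end test would assert that this drift is zero after walking the proposal-to-invoice flow in the preview app.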

Founder Background

The founders previously built AI coding tools at Windsurf, Cognition, and Google. They observed that while AI tools made teams faster at shipping, nobody was testing real user behavior before merge, leading to production issues in checkout, auth, and billing flows.

📖 Read the full source: HN AI Agents