Spec27: Spec-Driven Validation for AI Agents – API-Level Testing Without Internal Access

Safe Intelligence has launched Spec27, a spec-driven validation tool for AI agents. Unlike traditional LLM eval frameworks that score general model behavior, Spec27 lets teams define reusable specifications for the specific mission an agent must fulfill. Tests are generated automatically from those specs and run against the agent's primary interfaces only — no assumption about internal stack, no SDKs or gateways required.
Key Features
- Outside-in testing: All tests execute against the agent's exposed API or UI. No need to instrument the agent's internals, which is crucial for agents built on vendor platforms where you don't control the stack.
- Spec-driven test generation: Define specs in terms of expected behavior (e.g., “when asked X, must do Y and not Z”). Spec27 auto-generates adversarial and robustness checks, surfacing sensitivities and regressions as models, prompts, or tools change.
- Early access: Currently strongest for single-turn agent and application validation. Multi-turn interactions and richer telemetry/tool-call integration are on the roadmap.
Who Is It For
Teams deploying internal agents, vendor agents, or any AI system where reliability matters more than benchmark scores. If you're testing agents on platforms that don't expose internals, Spec27's black-box approach directly addresses that gap.
Getting Started
Spec27 is open to try for HN readers. The launch site offers a sample flow so you can explore without setup. Sign up at spec27.ai/launch.
📖 Read the full source: HN AI Agents
👀 See Also

Orkestra: Cost-Aware LLM Routing Layer for OpenClaw Reduces API Costs by 60-80%
Orkestra is a modular routing layer that sits in front of LLM calls in OpenClaw, using semantic classification to route prompts to budget, balanced, or premium model tiers. The approach reduced API costs by 60-80% without prompt rewriting or complex rules.

Cross-Model Review Loop for AI Coding Agents Catches Critical Planning Flaws
A developer built a cross-model review system where a second AI model reviews plans from coding agents before execution, catching critical flaws like rollback failures and security holes. The tool is MIT licensed and includes a TUI dashboard.

MCP Server Adds Persistent Memory with Retrieval Scoring to Claude Code
A developer built an MCP server called engram-mcp that gives Claude Code persistent memory across sessions and projects, featuring automatic retrieval scoring based on outcome success and drift detection for stale knowledge.

Open-source AI job search system built with Claude Code evaluates offers, generates tailored resumes
A developer open-sourced a Claude Code project that turns your terminal into a job search command center. The system evaluates job offers across 10 dimensions, generates ATS-optimized PDF resumes, scans 45+ company career pages, and includes 14 skill modes.