NerfGuard: A Classifier That Routes Coding Requests to Cheaper Models, Cutting Spend Up to 3x

A team that switched from Claude Code to Codex for speed and steerability ran into steep per-token costs. Looking at their daily bill, they noticed they were running top-tier models at maximum reasoning for every task, even trivial ones. So they built NerfGuard, a fast classifier that routes each request to the least expensive model and reasoning depth that can handle it.
The core is a classifier that determines the minimum intelligence needed for a given coding request. On top of that, it applies automated token efficiency techniques. The result: roughly the same quality at several times lower token spend, and because model capability and reasoning depth are matched to each task, responses also come back considerably faster. The team reports up to 3x savings, plus hours per person per day no longer spent waiting on tool turns and agent responses.
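To make the routing idea concrete, here is a minimal sketch of picking the cheapest tier that fits a request. It is not NerfGuard's implementation, which the source does not describe: the keyword heuristic stands in for whatever classifier the tool actually uses, and the tier names, reasoning labels, and hint lists are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str      # placeholder tier name, not a real model ID
    reasoning: str  # placeholder reasoning-depth label


# Hypothetical tiers, ordered from cheapest to most expensive.
ROUTES = [
    Route("small-model", "low"),
    Route("mid-model", "medium"),
    Route("top-model", "high"),
]

# Hypothetical keyword signals. A real classifier would likely be a
# trained model rather than a keyword list.
HARD_HINTS = ("refactor", "architecture", "race condition", "deadlock", "migrate")
EASY_HINTS = ("rename", "typo", "format", "add a comment", "bump version")


def estimate_difficulty(request: str) -> int:
    """Return 0 (trivial), 1 (moderate), or 2 (hard) from crude text signals."""
    text = request.lower()
    if any(hint in text for hint in HARD_HINTS):
        return 2
    if any(hint in text for hint in EASY_HINTS) and len(text) < 200:
        return 0
    # Default to the middle tier when no signal is found.
    return 1


def route(request: str) -> Route:
    """Pick the cheapest tier whose difficulty level matches the request."""
    return ROUTES[estimate_difficulty(request)]
```

Calling `route("Fix the typo in README")` selects the cheapest tier here, while a request mentioning a refactor or a race condition goes to the top tier. Defaulting to the middle tier when nothing matches is one possible design choice: it avoids paying top-tier prices for unclassified requests without sending them to the weakest model.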
Key details from the source:
- Classifier routes to cheapest model + reasoning depth for each request
- Additional automatic token efficiency techniques
- Result: up to 3x the usage for the same spend
- Speed improvements: hours saved per person per day
- More usage before hitting throttling limits
NerfGuard is currently in use by engineers at multiple AI companies. The tool is available at nerfguard.com.
Who it's for: Teams using coding agents (Claude Code, Codex, etc.) who want to maximize output per dollar and reduce wait times.
📖 Read the full source: HN AI Agents