Building a Sub-500ms Voice Agent: Architecture and Performance Insights

Voice Agent Architecture and Performance
Nick Tikhonov built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That figure covers the full speech-to-text (STT) → LLM → text-to-speech (TTS) loop, with clean barge-ins and no precomputed responses. The implementation was about 2× faster than an equivalent Vapi setup.
Core Technical Insights
The key realization was that voice is a turn-taking problem, not a transcription problem. Voice Activity Detection (VAD) alone fails; semantic end-of-turn detection is required. The system reduces to one loop with two states: speaking vs listening.
The critical transitions are:
- Cancel instantly on barge-in
- Respond instantly on end-of-turn
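The loop and its two transitions can be sketched as a tiny state machine. This is a minimal illustration, not the author's code: the class, the event names (`speech_start`, `end_of_turn`), and the recorded action strings are all hypothetical.

```python
from enum import Enum, auto


class TurnState(Enum):
    USER_SPEAKING = auto()   # the agent listens
    USER_LISTENING = auto()  # the agent speaks


class TurnLoop:
    """Two-state turn-taking machine. Event names are assumptions."""

    def __init__(self):
        self.state = TurnState.USER_SPEAKING
        self.actions = []  # record of side effects, for illustration only

    def on_event(self, event: str) -> None:
        if self.state is TurnState.USER_LISTENING and event == "speech_start":
            # Barge-in: stop whatever the agent is producing.
            self.actions.append("cancel_response")
            self.state = TurnState.USER_SPEAKING
        elif self.state is TurnState.USER_SPEAKING and event == "end_of_turn":
            # The user is done: start responding right away.
            self.actions.append("start_response")
            self.state = TurnState.USER_LISTENING
        # Any other event leaves the state unchanged.
```

In a real system the two actions would trigger the streaming pipeline and its cancellation; here they are only recorded so the transitions are easy to inspect.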
Technical Requirements
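STT → LLM → TTS must stream end to end. A sequential pipeline, in which each stage waits for the previous one to finish, is too slow for natural conversation. Time to first token (TTFT) dominates perceived latency in voice interfaces because the first token sits on the critical path: speech synthesis cannot start until the LLM has emitted something. Groq's ~80ms TTFT was identified as the single biggest performance win.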
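To illustrate why streaming matters, here is a small sketch that chains stages as async generators so synthesis starts on the first LLM token instead of waiting for the full reply. `fake_llm` and `fake_tts` are hypothetical stand-ins for real providers, and the log exists only to show the interleaving.

```python
import asyncio


async def fake_llm(prompt, log):
    """Hypothetical LLM stub that yields tokens one at a time."""
    for token in ["Sure,", " one", " moment."]:
        await asyncio.sleep(0)  # stand-in for network latency
        log.append(f"llm:{token}")
        yield token


async def fake_tts(tokens, log):
    """Hypothetical TTS stub that synthesizes each token as it arrives."""
    async for token in tokens:
        log.append(f"tts:{token}")
        yield f"<audio {token.strip()}>"


async def run_streaming(prompt):
    log = []
    # Each stage consumes the previous stage's stream directly.
    audio = [chunk async for chunk in fake_tts(fake_llm(prompt, log), log)]
    return log, audio
```

Running `asyncio.run(run_streaming("hello"))` produces a log in which the first `tts:` entry comes right after the first `llm:` entry, well before the LLM has finished. That ordering is the property a sequential pipeline lacks.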
Infrastructure Considerations
Geography matters more than prompts. All components must be colocated; otherwise network round trips between services make latency prohibitive before the system even starts processing. The build took approximately one day and roughly $100 in API credits.
Why Voice Agents Are Challenging
Voice agents represent a significant complexity increase compared to text agents. The orchestration is continuous and real-time, requiring careful management of multiple models simultaneously. The system must constantly decide whether the user is speaking or listening, with transitions between these states being the most difficult aspect.
When the user starts speaking, the agent must immediately stop talking: cancel generation, cancel speech synthesis, and flush any buffered audio. When the user stops speaking, the system must confidently decide they're done and start responding with minimal delay.
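One way to picture the barge-in path is with `asyncio` task cancellation plus a queue flush. This is a sketch under assumptions: `speak` is a hypothetical stand-in for generation and synthesis combined, and the playback buffer is modeled as a plain `asyncio.Queue`.

```python
import asyncio


async def speak(playback: asyncio.Queue) -> None:
    """Stand-in for LLM generation + TTS: keeps queueing audio chunks."""
    n = 0
    while True:
        await playback.put(f"chunk-{n}")
        n += 1
        await asyncio.sleep(0)  # yield control, as real network I/O would


def flush(playback: asyncio.Queue) -> int:
    """Drop audio that was synthesized but not yet played."""
    dropped = 0
    while not playback.empty():
        playback.get_nowait()
        dropped += 1
    return dropped


async def barge_in_demo():
    playback = asyncio.Queue()
    response = asyncio.create_task(speak(playback))
    for _ in range(3):
        await asyncio.sleep(0)  # let the agent "talk" for a few ticks

    # User starts speaking: cancel generation/synthesis, then flush buffers.
    response.cancel()
    try:
        await response
    except asyncio.CancelledError:
        pass
    dropped = flush(playback)
    return response.cancelled(), dropped, playback.qsize()
```

Cancelling the task alone is not enough, because chunks already queued would still be played over the user. The flush step is what makes the interruption feel immediate.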
Architecture Approach
The developer started by iterating on architecture with ChatGPT outside the editor to build a mental model first. The entire problem was reduced to a single loop and a tiny state machine. The core question a voice agent needs to answer is: is the user speaking, or listening?
The two states are:
- The user is speaking
- The user is listening
This turn-detection logic forms the core of every voice system. The implementation is available on GitHub for reference and further development.
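As a rough illustration of why silence alone is not enough for turn detection, the sketch below combines a VAD silence duration with a semantic completeness score. It is not taken from the implementation: the function, the `completeness` input (imagined as the output of some end-of-turn model), and every threshold value are assumptions chosen for the example.

```python
def is_end_of_turn(
    silence_ms: float,
    completeness: float,
    min_silence_ms: float = 200,   # hypothetical: ignore very short pauses
    max_silence_ms: float = 1500,  # hypothetical: long silence always ends the turn
    threshold: float = 0.7,        # hypothetical cutoff for the semantic score
) -> bool:
    """Decide whether the user has finished their turn."""
    if silence_ms < min_silence_ms:
        return False  # probably a breath or mid-sentence pause
    if silence_ms >= max_silence_ms:
        return True   # fallback so the agent never waits forever
    # In between, let the semantic signal decide.
    return completeness >= threshold
```

With these illustrative defaults, a 300ms pause after a complete-sounding sentence ends the turn, while the same pause after a trailing "and then I..." does not.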
📖 Read the full source: HN AI Agents