Anam Cara-3: Faster, Interactive AI Avatars with Audio-to-Video Pipeline

Anam has released its latest model, Cara-3, designed to power interactive avatars. The model uses a two-stage pipeline. In the first stage, a diffusion transformer converts audio into motion embeddings that capture head position, eye gaze, lip shape, and expression. In the second stage, those embeddings are applied to a reference image to generate video frames, so any face can be animated without retraining.
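The data flow of the two stages can be illustrated with a minimal sketch. It is not Anam's code: the names `MotionEmbedding`, `audio_to_motion`, `render_frame`, and `animate` are hypothetical, and both stages are replaced by trivial stand-ins so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MotionEmbedding:
    """Hypothetical container for the motion signals the article lists."""
    head_position: float
    eye_gaze: float
    lip_shape: float
    expression: float


def audio_to_motion(audio_chunk: List[float]) -> MotionEmbedding:
    """Stage 1 stand-in. A real system would run a diffusion transformer here;
    this toy version just maps mean audio energy to lip opening."""
    energy = sum(abs(s) for s in audio_chunk) / max(len(audio_chunk), 1)
    return MotionEmbedding(head_position=0.0, eye_gaze=0.0,
                           lip_shape=energy, expression=0.0)


def render_frame(reference_image: str, motion: MotionEmbedding) -> Dict[str, object]:
    """Stage 2 stand-in. Identity comes only from the reference image and
    motion only from the embedding, so a new face needs no retraining."""
    return {"identity": reference_image, "mouth_open": motion.lip_shape}


def animate(reference_image: str, audio_chunks: List[List[float]]) -> List[Dict[str, object]]:
    """Run both stages once per audio chunk and collect the frames."""
    return [render_frame(reference_image, audio_to_motion(chunk))
            for chunk in audio_chunks]


frames = animate("face_a.png", [[0.0, 0.0], [0.5, -0.5]])
```

The point of the split is that the face is an input to the second stage only, so swapping `"face_a.png"` for another image changes the output identity without touching the audio-to-motion stage.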

Cara-3 achieves a time-to-first-frame of approximately 70ms on an H200, fast enough to support many concurrent avatar sessions on a single GPU. The speed is partly due to a novel flow matching variant used for the audio-to-motion stage, which Anam adopted because conventional techniques proved unstable.
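For readers unfamiliar with flow matching, the sketch below shows the textbook form: training pairs are built by interpolating linearly between a noise sample and a data sample, and sampling integrates a velocity field with a few Euler steps. This is not Anam's variant, whose details are not given here. The function names and the `TARGET` vector are assumptions for illustration, and the "learned" velocity field is replaced by a closed-form toy field.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the network along that path: constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target motion embedding, standing in for real data.
TARGET: Vector = [0.2, -0.1, 0.7]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: the exact field that carries any
    point to TARGET along a straight line by t=1."""
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, TARGET)]


sample = euler_sample(toy_velocity, [0.0, 0.0, 0.0], steps=4)
```

Each Euler step costs one network evaluation, so samplers that need only a handful of steps are a natural fit for low-latency generation. How many steps Cara-3 actually uses is not stated.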

An independent blind evaluation showed that Cara-3 outperformed competitors such as HeyGen, Tavus, and D-ID, scoring 24% higher on average across the metrics tested. Responsiveness correlated more strongly with user experience (Spearman correlation coefficient 0.697) than visual quality did (0.473), suggesting that how quickly an avatar reacts matters more to users than how good it looks.
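As a reminder of what a Spearman coefficient measures, the sketch below computes it from scratch: each list of scores is converted to ranks (ties share the average rank), and the Pearson correlation of the ranks is returned. The ratings are made-up numbers for illustration, not data from the evaluation.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-session ratings (not from the actual study).
responsiveness = [1, 2, 3, 4, 5]
overall_experience = [5, 6, 7, 8, 7]
rho = spearman(responsiveness, overall_experience)
```

Because it works on ranks, the coefficient captures whether one rating tends to rise with the other, without assuming the relationship is linear.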

Anam has also open-sourced Metaxy, the backbone of its training data pipeline, which supports iterative development without rerunning costly steps.

📖 Read the full source: HN AI Agents

