Microsoft VibeVoice: 60-Min ASR and 90-Min TTS Models Open-Sourced

✍️ OpenClawRadar | 📅 Published: April 28, 2026

Microsoft has open-sourced VibeVoice, a family of frontier voice AI models covering both automatic speech recognition (ASR) and text-to-speech (TTS). The ASR model, VibeVoice-ASR-7B, processes up to 60 minutes of long-form audio in a single pass within a 64K-token context window and outputs structured transcriptions containing speaker IDs, timestamps, and text. It supports more than 50 languages and accepts user-defined hotwords for domain-specific terms. The TTS model, VibeVoice-TTS-1.5B, can synthesize up to 90 minutes of multi-speaker speech with up to 4 speakers. A real-time variant, VibeVoice-Realtime-0.5B, supports streaming text input and long-form generation, with multilingual voices in 9 languages and 11 English style voices.
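To make the "structured transcription" idea concrete, here is a minimal Python sketch of how speaker-attributed, timestamped output could be represented and rendered. The `Segment` class, its field names, and the sample data are hypothetical illustrations, not VibeVoice's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span: who spoke, when, and what (hypothetical schema)."""
    speaker: str    # speaker ID, e.g. "Speaker 1"
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str       # transcribed text


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render(segments: list[Segment]) -> str:
    """Join segments into a readable, speaker-attributed transcript."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] "
        f"{s.speaker}: {s.text}"
        for s in segments
    )


# Made-up example data, not real model output.
demo = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks for having me."),
]
print(render(demo))
```

Keeping speaker, time span, and text together in one record is what lets a single pass answer "who, when, what" without a separate diarization step to align afterwards.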

Key Technical Details

  • Core innovation: Continuous speech tokenizers (Acoustic and Semantic) at an ultra-low frame rate of 7.5 Hz, preserving audio fidelity while boosting computational efficiency for long sequences.
  • Architecture: Next-token diffusion framework, in which an LLM handles textual context and dialogue flow while a diffusion head generates high-fidelity acoustic details.
  • ASR capabilities: Single-pass 60-minute audio, joint ASR + diarization + timestamping (Who, When, What), customizable hotwords.
  • TTS capabilities: 90-minute long-form synthesis with up to 4 distinct speakers; real-time streaming via VibeVoice-Realtime-0.5B.
  • Inference speedup: vLLM inference supported (see vllm-asr).
  • Finetuning: ASR finetuning code is available.
  • Hugging Face integration: VibeVoice-ASR is now part of the Transformers release (2026-03-06).
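A quick back-of-the-envelope calculation shows why the 7.5 Hz frame rate matters for long audio. The sketch below assumes one token per tokenizer frame and that "64K" means 65,536 tokens; it ignores text and prompt tokens. The 50 Hz figure is a hypothetical comparison rate, not a claim about any specific model.

```python
FRAME_RATE_HZ = 7.5         # tokenizer frame rate stated in the article
CONTEXT_TOKENS = 64 * 1024  # assumption: "64K" means 65,536 tokens


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


asr_frames = frames_for(60)  # 60-minute ASR input
tts_frames = frames_for(90)  # 90-minute TTS output

# Hypothetical 50 Hz tokenizer, included only for comparison.
dense_frames = frames_for(60, frame_rate_hz=50.0)

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 50 Hz:  {dense_frames} frames")
print(f"60-minute share of a 64K window: {asr_frames / CONTEXT_TOKENS:.0%}")
```

Under these assumptions, an hour of audio takes 27,000 frames and 90 minutes takes 40,500, both of which fit inside a 64K window, whereas the same hour at 50 Hz would need 180,000 frames.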

Note: The VibeVoice-TTS code was removed from the repo on 2025-09-05 due to misuse concerns, but the ASR and real-time TTS code remains active.

📖 Read the full source: HN AI Agents
