Microsoft Open-Sources VibeVoice: 60-min ASR & 90-min TTS Models

Microsoft has open-sourced VibeVoice, a family of frontier voice AI models covering both automatic speech recognition (ASR) and text-to-speech (TTS). The ASR model (VibeVoice-ASR-7B) handles up to 60 minutes of long-form audio in a single pass (64K token window) and outputs structured transcriptions with speaker ID, timestamps, and text, with support for over 50 languages. It also supports user-customized hotwords for domain-specific terms. The TTS model (VibeVoice-TTS-1.5B) can synthesize up to 90 minutes of multi-speaker speech (up to 4 speakers). A real-time variant (VibeVoice-Realtime-0.5B) supports streaming text input and long-form generation, with multilingual voices (9 languages) and 11 English style voices.
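
The structured output described above (speaker ID, timestamps, text) can be pictured as a list of time-stamped segments. The sketch below is purely illustrative: the `Segment` class, its field names, and the helper functions are hypothetical and are not VibeVoice's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One hypothetical transcript entry: who spoke, when, and what."""
    speaker: str   # speaker ID, e.g. "Speaker 1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed text


def format_transcript(segments):
    """Render segments in chronological order, one line per segment."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        lines.append(f"[{seg.start:07.2f}-{seg.end:07.2f}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def speaking_time(segments):
    """Sum the seconds spoken per speaker."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


# Made-up example data, not real model output.
example = [
    Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
]
print(format_transcript(example))
print(speaking_time(example))
```

Keeping speaker, time span, and text together in one record is what makes downstream tasks such as per-speaker summaries or subtitle export straightforward.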

Key Technical Details

Core innovation: Continuous speech tokenizers (Acoustic and Semantic) at an ultra-low frame rate of 7.5 Hz, preserving audio fidelity while boosting computational efficiency for long sequences.
Architecture: Next-token diffusion framework, in which an LLM handles textual context and dialogue flow and a diffusion head generates high-fidelity acoustic details.
ASR capabilities: Single-pass 60-minute audio, joint ASR + diarization + timestamping (Who, When, What), customizable hotwords.
TTS capabilities: 90-minute long-form synthesis with up to 4 distinct speakers; real-time streaming via VibeVoice-Realtime-0.5B.
Inference speedup: vLLM inference supported (see vllm-asr).
Finetuning: ASR finetuning code is available.
Hugging Face integration: VibeVoice-ASR is now part of the Transformers release (2026-03-06).
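
To see why the 7.5 Hz frame rate matters for long sequences, a back-of-envelope count of speech frames helps. The sketch below rests on stated assumptions: it counts one frame per tokenizer step, ignores text tokens, prompts, and any per-tokenizer duplication, takes "64K" to mean 65,536, and uses 50 Hz only as a hypothetical comparison rate that does not come from the source.

```python
FRAME_RATE_HZ = 7.5          # frame rate stated for the VibeVoice tokenizers
CONTEXT_WINDOW = 64 * 1024   # assumption: "64K" read as 65,536 tokens
COMPARISON_RATE_HZ = 50.0    # hypothetical higher frame rate, for contrast only


def speech_frames(minutes, frame_rate_hz=FRAME_RATE_HZ):
    """Number of speech frames produced for `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


for label, minutes in [("60 min (ASR limit)", 60), ("90 min (TTS limit)", 90)]:
    low = speech_frames(minutes)
    high = speech_frames(minutes, COMPARISON_RATE_HZ)
    print(f"{label}: {low:,} frames at 7.5 Hz vs {high:,} at 50 Hz")

# Share of the ASR model's 64K window taken by 60 minutes of speech frames alone.
share = speech_frames(60) / CONTEXT_WINDOW
print(f"60 min uses about {share:.0%} of a 65,536-token window")
```

Under these assumptions, 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500, whereas the hypothetical 50 Hz rate would yield 180,000 frames for 60 minutes, well beyond a 64K window.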