
Microsoft VibeVoice Hits 27K GitHub Stars as Open-Source Voice AI Goes Mainstream

Michael Ouroumis · 4 min read

Open-source voice AI has had a credibility problem. Most projects are too slow for real-time use, fall apart on audio longer than a few minutes, or require cloud infrastructure that defeats the point of open-source. Microsoft's VibeVoice is making a case that the field has genuinely matured.

The project — a family of open-source frontier voice AI models covering both speech recognition and text-to-speech — has climbed past 27,000 GitHub stars and hit number two on GitHub Trending, adding over 1,190 stars in a single day. That growth spike followed the March 6 integration of VibeVoice-ASR into Hugging Face Transformers v5.3.0, which turned an interesting research project into something developers can drop into production without custom plumbing.

What VibeVoice Can Actually Do

The headline capability of VibeVoice-ASR is long-form audio transcription at a level that simply wasn't available in open-source tooling a year ago. The model handles 60-minute audio files in a single pass — producing structured output that includes who is speaking, when they speak, and what they said. That's speaker diarization, timestamping, and transcription in one inference call, natively multilingual across 50+ languages.

The technical innovation behind this is the continuous speech tokenizer architecture. Most ASR systems chunk audio into short segments (typically 30 seconds) and stitch the results together — a process that introduces errors at boundaries, loses track of speakers across chunks, and degrades on audio with overlapping speech. VibeVoice's tokenizers operate at an ultra-low frame rate of 7.5 Hz, compressing the audio representation efficiently enough to process extended recordings as a single sequence.
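The effect of that frame rate on sequence length is easy to quantify. The sketch below compares an hour of audio at VibeVoice's reported 7.5 Hz against an assumed conventional tokenizer rate of 50 Hz (the 50 Hz figure is an illustrative assumption, not from the technical report):

```python
# Sequence length a model must attend over for a 60-minute recording.
SECONDS = 60 * 60  # one hour of audio

vibevoice_frames = int(SECONDS * 7.5)  # 7.5 Hz -> 27,000 frames
conventional_frames = SECONDS * 50     # assumed 50 Hz -> 180,000 frames

print(vibevoice_frames, conventional_frames)  # 27000 180000
```

A sequence roughly 6-7x shorter is what makes attending over an hour of audio in a single pass tractable, rather than stitching together 30-second chunks.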

The result is a model that Microsoft's research team describes as designed specifically for "long-form audio in a single pass." The technical report on arXiv — published in January 2026 — provides the architecture details. The ICLR 2026 acceptance of the TTS model (as an oral presentation, one of the highest recognition tiers) suggests the core research holds up to peer review.

The TTS Situation Is More Complicated

VibeVoice's text-to-speech side has a complicated history. Microsoft open-sourced VibeVoice-TTS in August 2025 — a model capable of generating up to 90 minutes of speech with up to four distinct speakers. The model was immediately recognized as technically impressive, and the accompanying paper went on to be accepted as an oral presentation at ICLR 2026.

But within two weeks of the TTS release, Microsoft removed the code from the repository. The explanation was brief: "We discovered instances where the tool was used in ways inconsistent with the stated intent." Voice cloning is the obvious concern — a model that synthesizes realistic multi-speaker audio is a powerful tool for creating synthetic versions of real voices, with all the fraud and disinformation implications that carries.

Microsoft hasn't said whether VibeVoice-TTS will be re-released with additional safeguards, or whether it's permanently restricted. The ASR model — which transcribes speech rather than generating it — remains fully available and is the component driving the current adoption surge.

Developer Adoption Is Accelerating

The Transformers integration is what shifted VibeVoice from "interesting research" to "production-ready tool." Before March 6, using VibeVoice required pulling code from the GitHub repository and navigating custom inference pipelines. After March 6, it's a few lines of Hugging Face API calls.
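As a rough sketch of what those few lines look like: the Transformers `pipeline` API and its `return_timestamps` option are real, but the model ID `microsoft/VibeVoice-ASR`, the presence of a `speaker` field in each chunk, and the exact output shape are assumptions for illustration — check the model card for the actual interface.

```python
def format_segments(segments):
    """Render diarized ASR segments as 'speaker [start-end]: text' lines."""
    lines = []
    for seg in segments:
        start, end = seg["timestamp"]
        lines.append(f'{seg["speaker"]} [{start:.1f}-{end:.1f}s]: {seg["text"]}')
    return "\n".join(lines)

def transcribe(audio_path):
    # pipeline() and return_timestamps are standard Transformers API;
    # the model ID and "speaker" field are assumptions for this sketch.
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
    result = asr(audio_path, return_timestamps=True)
    return format_segments(result["chunks"])

# Demo with hand-written segments (no model download needed):
demo = [
    {"speaker": "SPEAKER_1", "timestamp": (0.0, 2.4), "text": "Welcome back."},
    {"speaker": "SPEAKER_2", "timestamp": (2.4, 5.1), "text": "Thanks for having me."},
]
print(format_segments(demo))
```

The point is less the specific calls than the shape of the workflow: one function call in, one structured transcript out, with no custom inference plumbing.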

That accessibility shows up in the adoption curve. Vibing — a voice input method for macOS and Windows built on VibeVoice-ASR — launched on March 29, just over three weeks after the Transformers integration. It's a consumer-facing app that turns VibeVoice's transcription capability into a voice-to-text input method, which is exactly the kind of application the Transformers integration unlocked.

More integrations are coming. With 27,000 stars and native Transformers support, VibeVoice-ASR is now the obvious starting point for any developer building applications that need accurate long-form speech recognition — meeting transcription tools, podcast processing pipelines, voice-first interfaces, accessibility software.

Where This Fits in the Voice AI Landscape

VibeVoice's rise reflects a broader shift in the voice AI ecosystem. Until recently, production-quality speech recognition meant OpenAI's Whisper or proprietary APIs from Google, Amazon, or Microsoft's own Azure Speech services. Whisper was a genuine breakthrough when it launched in 2022, but it has architectural limitations on long audio — its 30-second chunking is a real constraint for enterprise use cases.

VibeVoice-ASR appears to be the first open-source model that meaningfully surpasses Whisper's long-form transcription accuracy while also providing structured output. For enterprise developers who want on-premise deployment, that's a significant capability unlock.

Microsoft hasn't announced commercial licensing terms, but VibeVoice is currently published under a research license that permits derivative use. The Transformers integration suggests Microsoft intends for the model to see broad adoption — even if the TTS component remains on hold while the voice cloning risks are managed.


VibeVoice-ASR is available via Hugging Face Transformers and the Microsoft GitHub repository. The vLLM inference backend is also supported for high-throughput deployments.
