VibeVoice: Microsoft’s Open‑Source Cutting‑Edge Speech AI Models

The article introduces Microsoft’s open‑source VibeVoice project, detailing its long‑audio ASR‑7B and real‑time TTS‑0.5B models, the continuous speech tokenizer and next‑token diffusion techniques, and provides quick‑start instructions for online demos and local deployment via Hugging Face.

Geek Labs

May 3, 2026

Overview

VibeVoice is an open‑source speech AI project released by Microsoft and hosted on GitHub, where it has more than 45,000 stars. It offers two major capabilities: automatic speech recognition (ASR) and text‑to‑speech synthesis (TTS).

Core Models

VibeVoice‑ASR‑7B: a long‑audio speech recognition model with the following features:

Supports processing a single 60‑minute audio file.

Recognizes more than 50 languages.

Automatically annotates speakers, timestamps, and transcribed content.

Allows custom hot‑word lists to improve accuracy on domain‑specific terminology.
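
The speaker, timestamp, and content fields listed above map naturally onto a list of segment records, which can then be rendered as subtitles. The following is a minimal sketch under that assumption: the `Segment` type and the `srt_timestamp` and `to_srt` helpers are hypothetical names introduced for illustration and are not VibeVoice's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical record for one diarized utterance (not VibeVoice's real schema)."""
    speaker: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT subtitles."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as SRT cues, prefixing each line with its speaker label."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"
        )
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


# Made-up sample data, only to exercise the helpers.
segments = [
    Segment("Speaker 1", 0.0, 2.5, "Welcome to the show."),
    Segment("Speaker 2", 2.5, 4.75, "Thanks for having me."),
]
print(to_srt(segments))
```

Keeping the speaker label inside the cue text is one simple choice; a real pipeline might instead map speakers to subtitle styles or separate tracks.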

VibeVoice‑Realtime‑0.5B: a real‑time speech synthesis model that provides:

Streaming text input.

Generation of up to 90 minutes of speech from text.

Multi‑language and multi‑style voice output.

Technical Principles

The core innovation of VibeVoice is the use of a continuous speech tokenizer that operates at an ultra‑low 7.5 Hz frame rate. This tokenizer preserves audio fidelity while dramatically improving computational efficiency for long sequences.
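
To make the efficiency claim concrete, the arithmetic below counts how many tokenizer frames cover the 60‑minute and 90‑minute durations mentioned earlier. The 7.5 Hz rate comes from the article; the 50 Hz comparison rate and the `frames_for` helper are assumptions chosen only for illustration.

```python
def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


VIBEVOICE_HZ = 7.5    # frame rate stated in the article
COMPARISON_HZ = 50.0  # assumed comparison rate, for illustration only

for minutes in (60, 90):
    low = frames_for(minutes, VIBEVOICE_HZ)
    high = frames_for(minutes, COMPARISON_HZ)
    print(f"{minutes} min: {low} frames at 7.5 Hz vs {high} frames at 50 Hz")
# 60 min: 27000 frames at 7.5 Hz vs 180000 frames at 50 Hz
# 90 min: 40500 frames at 7.5 Hz vs 270000 frames at 50 Hz
```

At 7.5 Hz, an hour of audio fits in 27,000 frames, which is what keeps such long sequences within reach of a language model's context window.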

The project adopts a next‑token diffusion framework. A large language model first captures textual context and dialogue flow, then a diffusion head generates high‑fidelity acoustic details.
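
The two‑stage loop described above can be sketched as a toy program. This shows the control flow only and is not VibeVoice's implementation: `context_model`, `predict_noise`, and `diffusion_head` are hypothetical, untrained stand‑ins, and the latent size and step count are arbitrary.

```python
import random

LATENT_DIM = 4      # toy size of one continuous speech latent (assumption)
DENOISE_STEPS = 10  # toy number of refinement steps per frame (assumption)


def context_model(text: str, history: list[list[float]]) -> list[float]:
    """Stand-in for the LLM: maps the text and the latents generated so far
    to a conditioning vector. A real model would be a trained transformer."""
    rng = random.Random(sum(map(ord, text)) + len(history))
    return [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]


def predict_noise(latent: list[float], condition: list[float]) -> list[float]:
    """Stand-in for the diffusion head's noise estimate: here simply the gap
    between the current latent and the conditioning vector."""
    return [z - c for z, c in zip(latent, condition)]


def diffusion_head(condition: list[float], rng: random.Random) -> list[float]:
    """Start from Gaussian noise and repeatedly remove part of the predicted noise."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(DENOISE_STEPS):
        noise = predict_noise(latent, condition)
        latent = [z - 0.5 * n for z, n in zip(latent, noise)]
    return latent


def generate(text: str, num_frames: int, seed: int = 0) -> list[list[float]]:
    """Autoregressive loop: one context step, then one diffusion step, per frame."""
    rng = random.Random(seed)
    history: list[list[float]] = []
    for _ in range(num_frames):
        condition = context_model(text, history)  # stage 1: context and flow
        latent = diffusion_head(condition, rng)   # stage 2: acoustic detail
        history.append(latent)                    # fed back as next-step context
    return history


frames = generate("Hello, world.", num_frames=3)
print(len(frames), len(frames[0]))
```

In a real system the latents would be decoded back to a waveform by the tokenizer's decoder; that step is omitted here.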

Quick Start

Online Demos:

ASR Playground: https://aka.ms/vibevoice-asr

Google Colab notebook for interactive testing.

Local Deployment:

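VibeVoice‑ASR is integrated into the Hugging Face Transformers library, so the model and its processor can be loaded with a few lines of Python:

# Model class as given in the source; confirm the exact class on the model card.
from transformers import AutoModelForCTC, AutoProcessor

# Download (or load from the local cache) the model weights.
model = AutoModelForCTC.from_pretrained("microsoft/VibeVoice-ASR")
# The processor converts raw audio into the inputs the model expects.
processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-ASR")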

Applicable Scenarios

Meeting transcription.

Automatic subtitles for podcasts.

Voice dialogue systems.

Multilingual translation.

Voice content analysis.

GitHub: https://github.com/microsoft/VibeVoice

Stars: 45,150

Language: Python

License: MIT

