Microsoft Open‑Sources VibeVoice‑ASR: One‑Shot 60‑Minute Transcription with Speaker ID and Timestamps
Microsoft’s newly open‑sourced VibeVoice‑ASR model transcribes up to 60 minutes of audio in a single pass, preserving global context while providing built‑in speaker diarization and timestamps. It supports 50+ languages, accepts custom hot‑word injection, and can be deployed via Docker, Gradio, or vLLM for high‑throughput API services.
Problem with existing ASR
Whisper processes long audio by chopping it into 30‑second segments, which breaks context and does not provide native speaker diarization. Users must attach external models such as pyannote.audio, adding latency and complexity.
VibeVoice‑ASR Overview
VibeVoice‑ASR handles a single 60‑minute audio chunk with a 64K token window, preserving global context for more accurate transcription. It integrates automatic speech recognition (ASR), speaker diarization, and timestamping into one model, answering “Who?”, “When?” and “What?”.
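As a rough illustration of what the three‑in‑one output enables downstream, the sketch below parses diarized transcript lines of the form `[Speaker 1] [00:01:23 - 00:01:30] text` into structured (who, when, what) records. The line format here is an assumption for illustration, not the documented VibeVoice‑ASR output; adjust the regex to match the real layout.

```python
import re
from dataclasses import dataclass

# Assumed line format: "[Speaker 1] [00:01:23 - 00:01:30] some text"
# The real VibeVoice-ASR output layout may differ; adapt the regex as needed.
LINE_RE = re.compile(
    r"\[(?P<speaker>[^\]]+)\]\s*"
    r"\[(?P<start>\d{2}:\d{2}:\d{2})\s*-\s*(?P<end>\d{2}:\d{2}:\d{2})\]\s*"
    r"(?P<text>.*)"
)

@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str

def _to_seconds(ts: str) -> float:
    h, m, s = (int(x) for x in ts.split(":"))
    return float(h * 3600 + m * 60 + s)

def parse_transcript(raw: str) -> list[Segment]:
    """Turn diarized transcript text into structured segments."""
    segments = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            segments.append(Segment(
                speaker=m["speaker"],
                start=_to_seconds(m["start"]),
                end=_to_seconds(m["end"]),
                text=m["text"].strip(),
            ))
    return segments
```

Because the whole hour is transcribed in one pass, a segment 30 minutes in still carries a consistent speaker label, so this kind of post‑processing stays trivial.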
Architecture
Core Features
🕒 60‑minute single‑pass : Supports up to 64K tokens, eliminating the need for segment‑wise processing.
👤 Custom hot‑words (Context Injection) : Domain‑specific terms can be supplied as prompts for accurate recognition.
📝 Three‑in‑one : Simultaneous ASR, speaker identification, and timestamp generation.
🌍 50+ language auto‑switch : Handles code‑switching between Chinese and English without manual language selection.
Installation
The recommended environment is Docker, specifically NVIDIA’s PyTorch container.
# 1. Start NVIDIA PyTorch container
sudo docker run --privileged --net=host --ipc=host \
--ulimit memlock=-1:-1 --ulimit stack=-1:-1 \
--gpus all --rm -it nvcr.io/nvidia/pytorch:25.12-py3
# 2. Clone and install
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
In a local Python environment, run pip install -e .; note that building flash‑attn may require additional steps.
Usage
Command‑line inference
python demo/vibevoice_asr_inference_from_file.py \
--model_path microsoft/VibeVoice-ASR \
--audio_files /path/to/your/audio.wav
Gradio web UI
# Requires ffmpeg installed
python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --share
Production deployment with vLLM
Microsoft provides vLLM integration, allowing the model to be served as an OpenAI‑compatible API with high concurrency.
# Deploy as vLLM API service
docker run -d --gpus all --name vibevoice-vllm \
-p 8000:8000 \
-v $(pwd):/app \
vllm/vllm-openai:latest \
-c "python3 /app/vllm_plugin/scripts/start_server.py"
After startup, the service can be called like GPT‑4, with support for streaming output.
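A client‑side sketch of what such a call might look like. The exact request schema VibeVoice‑ASR expects through the vLLM plugin (chat vs. transcription endpoint, audio field name, model id) is an assumption here; check the repo’s vllm_plugin documentation for the real contract.

```python
import base64

# Assumed endpoint; port 8000 matches the docker run command above.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(audio_bytes: bytes, hot_words=None, stream=True) -> dict:
    """Package audio as base64 plus optional hot-word context.

    The content structure ("type": "audio" / "text") is a hypothetical
    shape for illustration, not a documented VibeVoice-ASR schema.
    """
    content = [{
        "type": "audio",
        "data": base64.b64encode(audio_bytes).decode("ascii"),
    }]
    if hot_words:
        # Context injection: supply domain-specific terms as prompt text.
        content.append({"type": "text",
                        "text": "Context terms: " + ", ".join(hot_words)})
    return {
        "model": "microsoft/VibeVoice-ASR",
        "messages": [{"role": "user", "content": content}],
        "stream": stream,  # the article notes streaming output is supported
    }

# To actually send it (requires the running server):
#   import json, urllib.request
#   body = json.dumps(build_payload(open("audio.wav", "rb").read())).encode()
#   req = urllib.request.Request(API_URL, body,
#                                {"Content-Type": "application/json"})
#   resp = urllib.request.urlopen(req)
```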
Troubleshooting
CUDA out of memory : Reduce --gpu-memory-utilization, lower --max-num-seqs, or decrease --max-model-len to save memory.
Audio decoding failed : Ensure ffmpeg is installed inside the container (ffmpeg -version) and that the audio format is supported.
Model not found : Verify the model directory contains config.json and weight files; regenerate missing tokenizer files if needed.
Plugin not loaded : Confirm the vibevoice package is installed (pip show vibevoice) and that vLLM’s entry point has not been removed.
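Some of the checks above can be automated before launching a server. A minimal preflight sketch; the weight‑file extensions (*.safetensors, *.bin) are assumptions based on typical Hugging Face checkpoints:

```python
import pathlib
import shutil

def preflight(model_dir: str) -> list[str]:
    """Report common failure causes from the troubleshooting list above."""
    problems = []
    # "Audio decoding failed": ffmpeg must be on PATH inside the container.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    d = pathlib.Path(model_dir)
    # "Model not found": the directory must contain config.json plus weights.
    if not (d / "config.json").is_file():
        problems.append("config.json missing from model directory")
    if not (list(d.glob("*.safetensors")) or list(d.glob("*.bin"))):
        problems.append("no weight files (*.safetensors / *.bin) found")
    return problems
```

Run preflight() before starting the Gradio demo or vLLM service; an empty list means these particular checks passed.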
Fine‑tuning (Advanced)
LoRA fine‑tuning adapts the base model for domain‑specific vocabularies (e.g., medical or legal) without sacrificing inference speed.
Data format requires audio files (mp3/wav) and a JSON annotation containing audio_path, segments, and optionally customized_context for domain terms.
{
"audio_path": "0.mp3",
"segments": [ ... ],
"customized_context": ["term1", "term2", "example sentence with specific background."]
}
Example LoRA fine‑tuning command (single‑GPU):
# Single‑GPU LoRA fine‑tune
torchrun --nproc_per_node=1 lora_finetune.py \
--model_path microsoft/VibeVoice-ASR \
--data_dir ./your_dataset \
--output_dir ./output \
--lora_r 16 \
--learning_rate 1e-4 \
--bf16
After training, LoRA weights can be merged back into the base model to retain the original inference speed.
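Annotation records in the format shown above can be generated and sanity‑checked programmatically. A minimal sketch; the top‑level keys come from the example, but the contents of the elided segments list (start/end/speaker/text fields here) are assumptions:

```python
import json

def make_record(audio_path: str, segments: list,
                customized_context=None) -> dict:
    """Build one annotation record matching the documented top-level keys."""
    record = {"audio_path": audio_path, "segments": list(segments)}
    if customized_context:
        # Domain hot-words or short example sentences for context injection.
        record["customized_context"] = list(customized_context)
    return record

def validate_record(record: dict) -> None:
    """Fail fast on required keys before launching a training run."""
    for key in ("audio_path", "segments"):
        if key not in record:
            raise ValueError(f"missing required key: {key}")
    # The article states mp3/wav audio is expected.
    if not record["audio_path"].endswith((".mp3", ".wav")):
        raise ValueError("audio must be mp3 or wav")

# Hypothetical segment fields, for illustration only:
rec = make_record(
    "0.mp3",
    segments=[{"start": 0.0, "end": 4.2, "speaker": "S1", "text": "..."}],
    customized_context=["term1", "term2"],
)
validate_record(rec)
print(json.dumps(rec, ensure_ascii=False, indent=2))
```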
Evaluation
Official benchmarks on the MLC‑Challenge dataset show strong performance, especially on DER (diarization error rate): Vietnamese DER = 0.16, Japanese DER = 0.82, and Chinese performs well on the complex AISHELL‑4 scenario.
Compared with traditional “concatenated” long‑audio pipelines, VibeVoice’s end‑to‑end long‑context model maintains speaker identity across minutes, remembering who spoke 30 minutes earlier.
Conclusion
VibeVoice‑ASR addresses key pain points of the large‑model era: ultra‑long audio, native speaker diarization, and custom vocabulary correction. It is suited for meeting transcription, podcast conversion, and long‑video subtitle generation, and offers straightforward deployment via Docker, Gradio, or high‑throughput vLLM services.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
