Microsoft VibeVoice‑ASR Open‑Source: One‑Shot 60‑Minute Transcription with Speaker ID and Timestamps

Microsoft’s newly open‑sourced VibeVoice‑ASR model transcribes up to 60 minutes of audio in a single pass, preserving global context. It provides built‑in speaker diarization and timestamps, supports 50+ languages, accepts custom hot‑word injection, and can be deployed via Docker, Gradio, or vLLM for high‑throughput API serving.


The Problem with Existing ASR

Whisper processes long audio by chopping it into 30‑second segments, which breaks context and does not provide native speaker diarization. Users must attach external models such as pyannote.audio, adding latency and complexity.
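
For context, the conventional workaround stitches two separate models together. A minimal sketch of that two‑stage pipeline follows (model names and token handling are illustrative; pyannote requires a Hugging Face auth token):

# Conventional two-stage pipeline: Whisper for text, pyannote for speakers
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("large-v3")
result = asr.transcribe("meeting.wav")  # internally chunked into 30 s windows

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
turns = diarizer("meeting.wav")

# The two outputs must then be aligned by timestamp -- an extra, lossy step
for turn, _, speaker in turns.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")

This timestamp-alignment step is exactly what an end‑to‑end model removes.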

VibeVoice‑ASR Overview

VibeVoice‑ASR handles a single 60‑minute audio chunk with a 64K token window, preserving global context for more accurate transcription. It integrates automatic speech recognition (ASR), speaker diarization, and timestamping into one model, answering “Who?”, “When?” and “What?”.
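
Conceptually, each inference pass yields speaker‑attributed, time‑stamped segments. The record shape below is illustrative only; the actual output schema is defined by the repository, not by this sketch:

# Illustrative record shape for the model's three-in-one output
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # Who?  e.g. "Speaker 2"
    start: float   # When? start time in seconds
    end: float     # When? end time in seconds
    text: str      # What? the transcribed utterance

# e.g. Segment("Speaker 2", 1834.2, 1841.7, "Let's revisit the Q3 numbers.")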

Architecture

[Figure: VibeVoice‑ASR architecture]

Core Features

🕒 60‑minute single pass: Supports up to 64K tokens, eliminating the need for segment‑wise processing.

👤 Custom hot words (context injection): Domain‑specific terms can be supplied as prompts for accurate recognition.

📝 Three‑in‑one: Simultaneous ASR, speaker identification, and timestamp generation.

🌍 50+ languages with auto‑switching: Handles code‑switching between Chinese and English without manual language selection.

Installation

The recommended environment is Docker, specifically an NVIDIA PyTorch container.

# 1. Start NVIDIA PyTorch container
sudo docker run --privileged --net=host --ipc=host \
  --ulimit memlock=-1:-1 --ulimit stack=-1:-1 \
  --gpus all --rm -it nvcr.io/nvidia/pytorch:25.12-py3

# 2. Clone and install
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .

In a local Python environment, run pip install -e . directly; note that building flash‑attn from source may require additional steps (it is commonly installed with pip install flash-attn --no-build-isolation).

Usage

Command‑line inference

python demo/vibevoice_asr_inference_from_file.py \
    --model_path microsoft/VibeVoice-ASR \
    --audio_files /path/to/your/audio.wav

Gradio web UI

# Requires ffmpeg installed
python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --share

Production deployment with vLLM

Microsoft provides vLLM integration, allowing the model to be served as an OpenAI‑compatible API with high concurrency.

# Deploy as vLLM API service
# --entrypoint bash overrides the image's default server so -c runs
docker run -d --gpus all --name vibevoice-vllm \
  -p 8000:8000 \
  -v $(pwd):/app \
  --entrypoint bash \
  vllm/vllm-openai:latest \
  -c "python3 /app/vllm_plugin/scripts/start_server.py"

After startup, the service can be called like any OpenAI‑compatible API (just as you would call GPT‑4), with streaming output supported.
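
For example, here is a minimal client sketch using the openai Python SDK, assuming the plugin exposes vLLM's OpenAI‑compatible transcription route on port 8000 (the model name and endpoint path are assumptions; check the plugin docs):

# Call the local vLLM server with the standard OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Assumes the plugin serves the OpenAI-style /v1/audio/transcriptions route
with open("meeting.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="microsoft/VibeVoice-ASR",
        file=audio,
    )
print(result.text)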

Troubleshooting

CUDA out of memory: Reduce --gpu-memory-utilization, lower --max-num-seqs, or decrease --max-model-len (see the sketch after this list).

Audio decoding failed: Ensure ffmpeg is installed inside the container (ffmpeg -version) and that the audio format is supported.

Model not found: Verify that the model directory contains config.json and the weight files; regenerate missing tokenizer files if needed.

Plugin not loaded: Confirm the vibevoice package is installed (pip show vibevoice) and that vLLM’s entry point has not been removed.
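
The same memory knobs are exposed when constructing the engine from Python. A sketch, assuming the model loads through vLLM's standard LLM class once the plugin is installed:

# Trade throughput for headroom when hitting CUDA OOM
from vllm import LLM

llm = LLM(
    model="microsoft/VibeVoice-ASR",
    gpu_memory_utilization=0.80,  # default is 0.90; lower to leave headroom
    max_num_seqs=4,               # fewer concurrent sequences
    max_model_len=32768,          # shrink the context if 64K won't fit
)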

Fine‑tuning (Advanced)

LoRA fine‑tuning adapts the base model for domain‑specific vocabularies (e.g., medical or legal) without sacrificing inference speed.

The data format pairs audio files (mp3/wav) with a JSON annotation containing audio_path, segments, and optionally customized_context for domain terms.

{
  "audio_path": "0.mp3",
  "segments": [ ... ],
  "customized_context": ["term1", "term2", "example sentence with specific background."]
}
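
A small helper for generating such records might look like the sketch below; the per‑segment fields (speaker, start, end, text) are assumptions inferred from the model's three‑in‑one output, so verify them against the repository's dataset examples:

# Hypothetical helper for writing one training annotation
import json

def make_annotation(audio_path, segments, hot_words=None):
    record = {
        "audio_path": audio_path,
        # Assumed per-segment fields; confirm with the repo's examples
        "segments": [
            {"speaker": s, "start": b, "end": e, "text": t}
            for (s, b, e, t) in segments
        ],
    }
    if hot_words:
        record["customized_context"] = hot_words
    return record

with open("0.json", "w", encoding="utf-8") as f:
    json.dump(
        make_annotation(
            "0.mp3",
            [("Speaker 1", 0.0, 4.2, "Welcome to the weekly sync.")],
            hot_words=["term1", "term2"],
        ),
        f, ensure_ascii=False, indent=2,
    )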

Example LoRA fine‑tuning command (single‑GPU):

# Single‑GPU LoRA fine‑tune
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./your_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --learning_rate 1e-4 \
    --bf16

After training, LoRA weights can be merged back into the base model to retain original inference speed.
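
A typical merge step uses the peft library, as in the sketch below (assuming the checkpoint loads via transformers' AutoModel; the actual model class used by the VibeVoice repo may differ):

# Merge LoRA weights back into the base model
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("microsoft/VibeVoice-ASR", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "./output")   # attach trained adapter

merged = model.merge_and_unload()   # fold low-rank updates into base weights
merged.save_pretrained("./vibevoice-asr-merged")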

Evaluation

Official benchmarks on the MLC‑Challenge dataset show strong performance, particularly on diarization error rate (DER): Vietnamese DER = 0.16, Japanese DER = 0.82, and Chinese performs well on the complex AISHELL‑4 scenario.

Compared with traditional chunk‑and‑concatenate long‑audio pipelines, VibeVoice’s end‑to‑end long‑context model maintains speaker identity across the entire recording, remembering who spoke 30 minutes earlier.

Conclusion

VibeVoice‑ASR addresses key pain points of the large‑model era: ultra‑long audio, native speaker diarization, and custom vocabulary correction. It is suited for meeting transcription, podcast conversion, and long‑video subtitle generation, and offers straightforward deployment via Docker, Gradio, or high‑throughput vLLM services.
