Open-Source ASR That Runs Faster on CPU Than Whisper on GPU
FunASR is an industrial-grade, open-source speech-recognition toolkit that combines VAD, transcription, punctuation, speaker diarization, and emotion detection in one call. It reaches up to 170× real-time on GPU and 17× on CPU, outpacing Whisper, while supporting 50+ languages and offering an OpenAI-compatible API.
Overview
FunASR is an open-source, industrial-grade, end-to-end speech-recognition toolkit from Alibaba's ModelScope team. It is MIT-licensed and hosted at https://github.com/modelscope/FunASR; the latest PyPI release at the time of writing is funasr 1.3.9.
It splits speech processing into five components and combines them behind a single AutoModel interface:
ASR – default models: SenseVoice‑Small, Paraformer, Fun‑ASR‑Nano
VAD – default model: fsmn‑vad
Punctuation – default model: ct‑punc
Speaker diarization – default model: cam++
Emotion – default model: emotion2vec+large
Each component can be swapped or upgraded independently, and the combined pipeline can achieve up to 170× real‑time on GPU and 17× on CPU.
Key Features
Speed: SenseVoice-Small processes 1 h of audio in under 22 s on GPU (170× real-time) and runs at 17× real-time on CPU.
Language coverage: Fun-ASR-Nano supports 31 languages; Qwen3-ASR detects 52 languages automatically; GLM-ASR-Nano is optimised for 17 Chinese dialects.
One-stop pipeline: VAD, transcription, punctuation, speaker diarization, and emotion detection are performed in a single call.
Emotion recognition: emotion2vec+large outputs labels such as happy, sad, and angry.
Streaming and offline modes: paraformer-zh-streaming for real-time subtitles over WebSocket; paraformer-zh or SenseVoice for long offline audio.
OpenAI-compatible API: funasr-server --device cuda starts a service that exposes /v1/audio/transcriptions, matching the OpenAI Whisper transcription API.
Agent integration: a built-in MCP service can attach to Claude or Cursor; the OpenAI-compatible endpoint works with LangChain, Dify, and AutoGen.
Installation
pip install funasr
Source installation:
git clone https://github.com/modelscope/FunASR.git
cd FunASR
pip install -e ./
Requirements: Python ≥ 3.8, PyTorch ≥ 1.13, torchaudio.
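The stated requirements can be checked from Python before installing. The sketch below is a hypothetical helper (check_requirements is not part of FunASR) that uses only the standard library. It reports whether the interpreter meets the Python minimum and which versions of torch and torchaudio are installed, leaving the comparison against PyTorch 1.13 to the reader.

```python
import sys
from importlib import metadata


def check_requirements(min_python=(3, 8), packages=("torch", "torchaudio")):
    """Report the Python check and installed package versions (None if missing)."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in packages:
        try:
            # Distribution version string as recorded by pip, e.g. "2.1.0"
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for key, value in check_requirements().items():
        print(f"{key}: {value}")
```

A value of None for torch or torchaudio means the package still needs to be installed.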
Start the OpenAI‑compatible server:
pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
Usage Examples
Chinese meeting transcription (VAD + recognition + punctuation + speaker)
from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",  # ASR model
    vad_model="fsmn-vad",         # voice activity detection
    spk_model="cam++",            # speaker diarization
    device="cuda",
)
result = model.generate(input="meeting.wav")
Sample output:
[00:00.4 → 00:03.8] Speaker0: 我们今天讨论一下 Q3 的计划
[00:04.2 → 00:07.1] Speaker1: 好的,我有三个要点
[00:07.5 → 00:12.3] Speaker0: 请讲,我们还有 30 分钟
Multi-language / dialect with Fun-ASR-Nano
from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    vad_model="fsmn-vad",
    device="cuda",
)
result = model.generate(input="meeting.wav")
Batch processing with vLLM acceleration
from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=1,
)
results = model.generate(["audio1.wav", "audio2.wav"], language="auto")
Streaming real-time transcription
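from funasr import AutoModel

model = AutoModel(model="paraformer-zh-streaming", device="cuda")
cache = {}  # keep and reuse this dict across successive chunks of one stream
result = model.generate(
    input="chunk.wav",
    cache=cache,
    chunk_size=[0, 10, 5],
)
chunk_size=[0, 10, 5] is a typical latency/look-ahead configuration for live subtitles; per the FunASR README, it corresponds to 600 ms of audio per chunk with 300 ms of look-ahead.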
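A live stream is fed to the model chunk by chunk, with the same cache passed on every call. The sketch below shows only that chunking loop, under stated assumptions: audio is 16 kHz mono, one chunk_size unit is 60 ms (960 samples), and the last chunk is flagged as final. The names iter_chunks, stream_transcribe, and the recognize callback are hypothetical helpers, not FunASR APIs.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

SAMPLES_PER_UNIT = 960  # assumption: 60 ms per chunk_size unit at 16 kHz


def iter_chunks(
    samples: Sequence[float], chunk_size: Sequence[int] = (0, 10, 5)
) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs of chunk_size[1] * 960 samples each."""
    stride = chunk_size[1] * SAMPLES_PER_UNIT
    total = max(1, -(-len(samples) // stride))  # ceiling division, at least one chunk
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


def stream_transcribe(
    samples: Sequence[float],
    recognize: Callable[[Sequence[float], Dict, bool], str],
    chunk_size: Sequence[int] = (0, 10, 5),
) -> List[str]:
    """Feed chunks to a recognizer callback, sharing one cache for the stream."""
    cache: Dict = {}  # one cache per stream, reused for every chunk
    pieces: List[str] = []
    for chunk, is_final in iter_chunks(samples, chunk_size):
        text = recognize(chunk, cache, is_final)
        if text:
            pieces.append(text)
    return pieces
```

In practice the recognize callback might wrap model.generate(input=chunk, cache=cache, is_final=is_final, chunk_size=[0, 10, 5]) and return the recognized text. The is_final argument and the shape of the returned result are assumptions to confirm against the FunASR README.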
Emotion detection
from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")
The result contains an emotion label such as happy, sad, or angry.
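To reduce the result to a single label, one option is to pick the highest-scoring class. The sketch below assumes the result is a list whose first entry holds parallel "labels" and "scores" lists. That shape is an assumption to verify against the actual output, and top_emotion is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def top_emotion(result: List[Dict]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score.

    Assumes result[0] has parallel "labels" and "scores" lists.
    """
    entry = result[0]
    return max(zip(entry["labels"], entry["scores"]), key=lambda pair: pair[1])


# Mock data in the assumed shape, not real model output
mock_result = [{"key": "audio", "labels": ["angry", "happy", "sad"], "scores": [0.1, 0.8, 0.1]}]
label, score = top_emotion(mock_result)
print(label, score)
```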
Benchmark Data
Official benchmark (source: modelscope.github.io/FunASR/zh/benchmark.html):
SenseVoice‑Small: 170× GPU, 17× CPU, 13× faster than Whisper‑large‑v3.
Paraformer-Large: 120× GPU, 15× CPU, 9× faster than Whisper-large-v3.
Whisper-large-v3-turbo: 46× GPU; listed in the benchmark as not runnable on CPU.
Fun-ASR-Nano: 17× GPU, 3.6× CPU, 1.3× faster than Whisper-large-v3.
For a one-hour meeting recording, Whisper-large-v3 needs about 4.6 minutes on GPU, while SenseVoice-Small finishes in about 21 seconds (a 13× speed gap). On CPU, SenseVoice-Small's 17× real-time exceeds Whisper-large-v3's 13× real-time on GPU.
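These figures follow directly from the real-time factors: processing time is audio duration divided by the speed multiple. The short sketch below reproduces the arithmetic using only the numbers quoted above; processing_seconds is an illustrative helper name.

```python
def processing_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Time needed to process audio at a given multiple of real-time."""
    return audio_seconds / realtime_multiple


HOUR = 3600  # one hour of audio, in seconds

sensevoice_gpu = processing_seconds(HOUR, 170)  # SenseVoice-Small on GPU
sensevoice_cpu = processing_seconds(HOUR, 17)   # SenseVoice-Small on CPU
whisper_gpu = processing_seconds(HOUR, 13)      # Whisper-large-v3 on GPU

print(f"SenseVoice-Small GPU: {sensevoice_gpu:.0f} s")
print(f"SenseVoice-Small CPU: {sensevoice_cpu / 60:.1f} min")
print(f"Whisper-large-v3 GPU: {whisper_gpu / 60:.1f} min")
print(f"GPU speed gap: {whisper_gpu / sensevoice_gpu:.0f}x")
```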
Pros
One‑stop pipeline reduces the need to stitch multiple repositories.
Best Chinese performance: the Paraformer series builds on eight years of Alibaba DAMO research and is reported to surpass Whisper on dialects, accents, and noise robustness.
CPU‑friendly: works on servers without GPUs.
Service-ready: funasr-server provides an OpenAI-compatible API, allowing existing Whisper SDKs to be reused.
Agent integration: MCP service, OpenAI API, and Gradio demo are all provided.
Cons
Many model choices can be confusing for newcomers.
SenseVoice-Small is fast but has only 234 M parameters; Whisper-large (1.55 B) still has a slight edge on complex English audio.
Fun-ASR-Nano achieves its best speed with vLLM, which has a non-trivial installation overhead.
Documentation mixes Chinese and English; some API parameters are only described in example scripts.
Deployment Recommendations
Chinese meeting transcription – use Paraformer‑zh + cam++ + ct‑punc.
Multi‑language / Chinese dialect – use Fun‑ASR‑Nano (800 M, 31 languages with dialects).
Global 52 languages – use Qwen3‑ASR (1.7 B) with automatic language detection.
Live streaming subtitles – use paraformer‑zh‑streaming (WebSocket).
Emotion analysis / customer‑service QA – use emotion2vec+large (stand‑alone).
Servers without GPUs – use SenseVoice‑Small on CPU (17× real‑time).
Migration from Whisper – run funasr-server --model sensevoice to expose a Whisper-compatible endpoint.
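For the migration case, an existing client mostly needs a new base URL. The sketch below builds the multipart request by hand with the standard library, under these assumptions: the server is reachable at http://localhost:8000 (the port is not stated in this article), the form fields "file" and "model" follow the OpenAI transcription API, and the JSON response carries a "text" field. Both function names are hypothetical.

```python
import json
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes, filename="meeting.wav", model="sensevoice"):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # "model" form field
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        # "file" form field carrying the raw audio bytes
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode() + audio_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url, path):
    """Send one audio file and return the transcript (requires a running server)."""
    with open(path, "rb") as f:
        request = build_transcription_request(base_url, f.read(), filename=path)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]  # assumed response field
```

With the server running, transcribe("http://localhost:8000", "meeting.wav") would return the transcript text if those assumptions hold.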
Conclusion
FunASR industrialises the entire speech pipeline, adding dialect support, streaming, speaker diarization, emotion detection, and Agent integration under an MIT license. For Chinese‑focused applications such as meeting minutes, customer‑service QA, or live subtitles, FunASR provides a ready‑to‑use solution with significant speed advantages over Whisper, especially on CPU‑only servers. For pure English or GPU‑only offline workloads, Whisper remains a viable alternative.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
