Open-Source ASR That Runs Faster on CPU Than Whisper on GPU

FunASR is an industrial‑grade, open‑source speech‑recognition toolkit that combines VAD, transcription, punctuation, speaker diarization, and emotion detection in a single call. It reaches up to 170× real‑time on GPU and 17× on CPU, outperforms Whisper on speed, supports 50+ languages, and offers an OpenAI‑compatible API.

Old Zhang's AI Learning

Jun 9, 2026

Overview

FunASR is an open‑source, industrial‑grade, end‑to‑end speech‑recognition toolkit from Alibaba’s ModelScope team. It is MIT‑licensed and hosted at https://github.com/modelscope/FunASR; the latest PyPI release at the time of writing is funasr 1.3.9.

It splits ASR into five components and welds them together with a single AutoModel:

ASR – default models: SenseVoice‑Small, Paraformer, Fun‑ASR‑Nano

VAD – default model: fsmn‑vad

Punctuation – default model: ct‑punc

Speaker diarization – default model: cam++

Emotion – default model: emotion2vec+large

Each component can be swapped or upgraded independently, and the combined pipeline can achieve up to 170× real‑time on GPU and 17× on CPU.
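
To make the "swap independently" point concrete, here is a minimal sketch that assembles the keyword arguments for AutoModel from separate component choices. The helper name pipeline_kwargs is made up for illustration; vad_model, punc_model and spk_model are the AutoModel keyword names, and the model IDs are the ones mentioned in this article.

```python
def pipeline_kwargs(asr, vad=None, punc=None, spk=None, device="cpu"):
    """Collect AutoModel keyword arguments, skipping components that are not set."""
    optional = {"vad_model": vad, "punc_model": punc, "spk_model": spk}
    kwargs = {"model": asr, "device": device}
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs


# Swap the recognizer without touching the other components.
meeting = pipeline_kwargs("paraformer-zh", vad="fsmn-vad", punc="ct-punc", spk="cam++")
fast = dict(meeting, model="iic/SenseVoiceSmall")
print(fast)
# With funasr installed: model = AutoModel(**fast)
```

Whether every combination of sub-models works together depends on the FunASR version, so treat the dictionary as a starting point rather than a guarantee.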

Figure: FunASR one‑stop pipeline architecture.

Key Features

Speed: SenseVoice‑Small transcribes 1 h of audio in under 22 s on GPU (170× real‑time) and runs at 17× real‑time on CPU.

Language coverage: Fun‑ASR‑Nano supports 31 languages; Qwen3‑ASR detects 52 languages automatically; GLM‑ASR‑Nano is optimised for 17 Chinese dialects.

One‑stop pipeline: VAD, transcription, punctuation, speaker diarization, and emotion detection run in a single call.

Emotion recognition: emotion2vec+large outputs labels such as happy, sad, and angry.

Streaming and offline modes: paraformer‑zh‑streaming for real‑time subtitles over WebSocket; paraformer‑zh or SenseVoice for long offline audio.

OpenAI‑compatible API: funasr-server --device cuda starts a service exposing /v1/audio/transcriptions, the same endpoint as OpenAI’s Whisper API.

Agent integration: the built‑in MCP service can attach to Claude or Cursor; the OpenAI‑compatible endpoint works with LangChain, Dify, and AutoGen.

Installation

pip install funasr

Source installation:

git clone https://github.com/modelscope/FunASR.git
cd FunASR
pip install -e ./

Requirements: Python ≥ 3.8, PyTorch ≥ 1.13, torchaudio.

Start the OpenAI‑compatible server:

pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
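
Once the server is up, any OpenAI‑style client should be able to talk to it. The sketch below uses the official openai Python package. The host, the port (8000), the model name "sensevoice" and the placeholder API key are assumptions for illustration, not values confirmed by the FunASR documentation; use whatever funasr-server reports on startup.

```python
def openai_base_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the base URL for the OpenAI SDK (it appends /audio/transcriptions itself)."""
    return f"http://{host}:{port}/v1"


def transcribe(path: str, base_url: str, model: str = "sensevoice") -> str:
    """Send one audio file to the OpenAI-compatible endpoint and return the text."""
    # Imported lazily so the URL helper works even without the SDK installed.
    from openai import OpenAI

    # The SDK requires some api_key value; whether the server checks it is an assumption.
    client = OpenAI(base_url=base_url, api_key="not-needed")
    with open(path, "rb") as audio:
        response = client.audio.transcriptions.create(model=model, file=audio)
    return response.text


# Example, with a server running and the openai package installed:
# print(transcribe("meeting.wav", openai_base_url()))
```

Because only the base URL changes, code already written against OpenAI's transcription endpoint should need little more than this one setting.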

Usage Examples

Chinese meeting transcription (VAD + recognition + punctuation + speaker)

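from funasr import AutoModel

# One AutoModel wires the sub-models together.
model = AutoModel(
    model="iic/SenseVoiceSmall",  # recognition
    vad_model="fsmn-vad",         # voice activity detection
    spk_model="cam++",            # speaker diarization
    # punc_model="ct-punc" can be added if the recognizer emits no punctuation
    device="cuda",                # use "cpu" on servers without a GPU
)
result = model.generate(input="meeting.wav")
print(result)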

Sample output:

[00:00.4 → 00:03.8] Speaker0: 我们今天讨论一下 Q3 的计划
[00:04.2 → 00:07.1] Speaker1: 好的，我有三个要点
[00:07.5 → 00:12.3] Speaker0: 请讲，我们还有 30 分钟
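
Lines like the sample output above can be rendered from the raw result with a small formatter. This sketch assumes each diarized sentence is a dict with start and end in milliseconds, an integer spk, and text. Those key names are an assumption about the sentence_info entries FunASR returns and may differ by version, so inspect result[0] before relying on them.

```python
def fmt_ms(ms: int) -> str:
    """Format milliseconds as MM:SS.t (tenths of a second, truncated)."""
    total_tenths = ms // 100
    minutes, rest = divmod(total_tenths, 600)
    seconds, tenth = divmod(rest, 10)
    return f"{minutes:02d}:{seconds:02d}.{tenth}"


def format_sentences(sentences: list) -> list:
    """Render one line per sentence: [start → end] SpeakerN: text."""
    return [
        f"[{fmt_ms(s['start'])} → {fmt_ms(s['end'])}] Speaker{s['spk']}: {s['text']}"
        for s in sentences
    ]


# Hypothetical data in the assumed shape, mirroring the sample output.
sentences = [
    {"start": 400, "end": 3800, "spk": 0, "text": "我们今天讨论一下 Q3 的计划"},
    {"start": 4200, "end": 7100, "spk": 1, "text": "好的，我有三个要点"},
]
for line in format_sentences(sentences):
    print(line)
```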

Multi‑language / dialect with Fun‑ASR‑Nano

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    vad_model="fsmn-vad",  # splits long audio into speech segments first
    device="cuda",
)
result = model.generate(input="meeting.wav")

Batch processing with vLLM acceleration

from funasr.auto.auto_model_vllm import AutoModelVLLM
model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=1,
)
results = model.generate(["audio1.wav", "audio2.wav"], language="auto")

Streaming real‑time transcription

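from funasr import AutoModel

model = AutoModel(model="paraformer-zh-streaming", device="cuda")
cache = {}  # reuse the same dict for every chunk of one stream
result = model.generate(
    input="chunk.wav",
    cache=cache,
    is_final=False,  # set True on the last chunk to flush the decoder
    chunk_size=[0, 10, 5],
)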

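chunk_size=[0, 10, 5] is a typical latency/look‑ahead configuration for live subtitles. In the FunASR README each unit is 60 ms, so this setting emits text every 600 ms (10 × 60 ms) using 300 ms of look‑ahead (5 × 60 ms).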
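
To feed a live stream, the audio has to be cut into chunks that match chunk_size. The sketch below shows only the slicing arithmetic, assuming 16 kHz mono samples and the 60 ms unit described in the FunASR README (so one unit is 960 samples). The helper name iter_chunks is made up for illustration. Each yielded chunk would be passed to model.generate with the same cache dict, and with is_final=True on the last one.

```python
SAMPLE_RATE = 16_000                      # assumed input sample rate (Hz)
UNIT_SAMPLES = SAMPLE_RATE * 60 // 1000   # one 60 ms unit -> 960 samples


def iter_chunks(samples, chunk_size=(0, 10, 5)):
    """Yield (chunk, is_final) pairs sized for the given chunk_size."""
    stride = chunk_size[1] * UNIT_SAMPLES        # 10 units -> 9600 samples = 600 ms
    total = max(1, -(-len(samples) // stride))   # ceiling division, at least one chunk
    for i in range(total):
        chunk = samples[i * stride:(i + 1) * stride]
        yield chunk, i == total - 1


# Two seconds of silence stands in for real audio.
audio = [0.0] * (2 * SAMPLE_RATE)
sizes = [(len(chunk), final) for chunk, final in iter_chunks(audio)]
print(sizes)  # [(9600, False), (9600, False), (9600, False), (3200, True)]
```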

Emotion detection

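from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large", device="cuda")
# granularity="utterance" returns one prediction for the whole clip
result = model.generate(input="audio.wav", granularity="utterance")

The result carries an emotion label such as happy, sad, or angry.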
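
A typical next step is to pick the highest‑scoring label. The sketch assumes the result is a list with one dict per input, holding parallel labels and scores lists. That shape and the sample values are assumptions for illustration, so inspect the real output before relying on it.

```python
def top_emotion(entry: dict) -> tuple:
    """Return (label, score) for the highest-scoring emotion in one result entry."""
    scores = entry["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return entry["labels"][best], scores[best]


# Hypothetical result in the assumed shape.
result = [{"key": "audio", "labels": ["angry", "happy", "sad"], "scores": [0.05, 0.85, 0.10]}]
label, score = top_emotion(result[0])
print(label, score)  # happy 0.85
```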

Benchmark Data

Official benchmark (source: modelscope.github.io/FunASR/zh/benchmark.html):

SenseVoice‑Small: 170× real‑time on GPU, 17× on CPU; 13× faster than Whisper‑large‑v3.

Paraformer‑Large: 120× real‑time on GPU, 15× on CPU; 9× faster than Whisper‑large‑v3.

Whisper‑large‑v3‑turbo: 46× real‑time on GPU; listed as unable to run on CPU.

Fun‑ASR‑Nano: 17× real‑time on GPU, 3.6× on CPU; 1.3× faster than Whisper‑large‑v3.

For a one‑hour meeting recording, Whisper‑large‑v3 needs about 4.6 minutes on GPU, while SenseVoice‑Small finishes in about 21 seconds (a 13× gap). On CPU, SenseVoice‑Small’s 17× real‑time still exceeds Whisper‑large‑v3’s 13× real‑time on GPU.
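
These figures all follow from one relation: processing time equals audio duration divided by the real‑time multiple. A quick check in Python, using only the numbers quoted in this section:

```python
def processing_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Time needed to transcribe audio at a given multiple of real time."""
    return audio_seconds / realtime_multiple


ONE_HOUR = 3600.0
sensevoice_gpu = processing_seconds(ONE_HOUR, 170)   # about 21 s
whisper_gpu = processing_seconds(ONE_HOUR, 13)       # about 277 s, i.e. 4.6 min
sensevoice_cpu = processing_seconds(ONE_HOUR, 17)    # about 212 s, i.e. 3.5 min

print(f"SenseVoice-Small GPU: {sensevoice_gpu:.0f} s")
print(f"Whisper-large-v3 GPU: {whisper_gpu / 60:.1f} min")
print(f"SenseVoice-Small CPU: {sensevoice_cpu / 60:.1f} min")
print(f"GPU speed gap: {whisper_gpu / sensevoice_gpu:.1f}x")
```

The CPU run of SenseVoice‑Small (about 3.5 minutes per hour of audio) comes in under the GPU run of Whisper‑large‑v3 (about 4.6 minutes), which is the comparison behind the headline.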

Pros

One‑stop pipeline reduces the need to stitch together multiple repositories.

Strong Chinese performance: the Paraformer series builds on eight years of Alibaba DAMO Academy research and is reported to surpass Whisper on dialects, accents, and noise robustness.

CPU‑friendly: works on servers without GPUs.

Service‑ready: funasr-server provides an OpenAI‑compatible API, so existing client code written against the Whisper API can be reused.

Agent integration: MCP service, OpenAI API, and Gradio demo are all provided.

Cons

The large number of model choices can confuse newcomers.

SenseVoice‑Small is fast but has only 234 M parameters; Whisper‑large (1.55 B) still has a slight edge on complex English audio.

Fun‑ASR‑Nano reaches its best speed only with vLLM, which is non‑trivial to install.

Documentation mixes Chinese and English, and some API parameters are described only in example scripts.

Deployment Recommendations

Chinese meeting transcription – use Paraformer‑zh + cam++ + ct‑punc.

Multi‑language / Chinese dialect – use Fun‑ASR‑Nano (800 M parameters, 31 languages including dialects).

Global 52 languages – use Qwen3‑ASR (1.7 B) with automatic language detection.

Live streaming subtitles – use paraformer‑zh‑streaming (WebSocket).

Emotion analysis / customer‑service QA – use emotion2vec+large (stand‑alone).

Servers without GPUs – use SenseVoice‑Small on CPU (17× real‑time).

Migration from Whisper – run funasr-server --model sensevoice to expose a Whisper‑compatible endpoint.

Conclusion

FunASR industrialises the entire speech pipeline, adding dialect support, streaming, speaker diarization, emotion detection, and Agent integration under an MIT license. For Chinese‑focused applications such as meeting minutes, customer‑service QA, or live subtitles, FunASR provides a ready‑to‑use solution with significant speed advantages over Whisper, especially on CPU‑only servers. For pure English or GPU‑only offline workloads, Whisper remains a viable alternative.

