Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512 is a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025. It delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, and runs on consumer GPUs. This article covers installation and deployment with transformers, vLLM and SGLang.

Old Zhang's AI Learning

Jan 23, 2026

Model Overview

GLM‑ASR‑Nano‑2512 was released in December 2025 by Zhipu (Z.AI). It has 1.5 B parameters and a small footprint, and the official evaluation shows it outperforming OpenAI Whisper V3 on Chinese benchmarks.

Dialect support: optimized for Cantonese and other Chinese dialects, where general‑purpose ASR models often fail once dialects are mixed with Mandarin.

Low‑volume speech: trained for whisper‑level audio such as distant speakers, weak telephone recordings, and quiet speech in noisy environments.

SOTA performance: an average error rate of 4.10 % on Wenet Meeting (real meeting recordings) and Aishell‑1 (standard Mandarin).

Language coverage: 17 languages with a word error rate (WER) of 20 % or lower.
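The error rates quoted above are edit-distance metrics. The sketch below shows how such a number could be computed; it is an illustration only, not the official evaluation script, and the function names are hypothetical. Chinese benchmarks are usually scored per character (CER) rather than per word, so both units are supported.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # token missing from hypothesis
                            curr[j - 1] + 1,      # extra token in hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER when unit == "word"; CER (spaces ignored) when unit == "char"."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted word out of five gives a WER of 0.2 (20 %).
print(error_rate("the cat sat on mat", "the cat sit on mat"))
```

A benchmark score is this ratio aggregated over a whole test set, so a single utterance can score far above or below the reported average.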

Benchmark

Official benchmark results show GLM‑ASR‑Nano leads across reported metrics.

Comparison with Whisper

Scenarios where GLM‑ASR‑Nano is preferred:

You need to recognize Cantonese, Sichuanese, or other Chinese dialects.

Your meeting recordings contain many low‑volume utterances.

You require on‑premises deployment, so that data never leaves your own domain.

You plan to fine‑tune on domain‑specific data (medical, legal, finance).

You want a cost‑effective solution without API fees.

Scenarios where Whisper is preferred:

Coverage of 100+ languages.

Mature community ecosystem and extensive documentation.

Built‑in transcribe‑and‑translate capability.

Processing of global accents.

Hardware Requirements

Minimum configuration:

GPU: 8 GB+ VRAM (e.g., RTX 3060)

Memory: 16 GB

Storage: 5 GB for model weights

Production recommendation:

GPU: NVIDIA A100, V100 or equivalent

Memory: 32 GB+

Storage: SSD for faster loading

For comparison, Whisper accelerated with faster‑whisper can achieve faster‑than‑real‑time decoding even on mid‑range GPUs such as a down‑clocked 1080 Ti.
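As a rough sanity check on the VRAM figures above, the memory taken by the weights alone can be estimated from the parameter count and the data type. This is a back-of-the-envelope sketch with a hypothetical helper name; it ignores activations, decoding caches, and framework overhead, all of which add to the real footprint.

```python
# Bytes needed to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Approximate memory used by the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# 1.5 B parameters, the figure quoted in this article.
print(round(weight_memory_gib(1.5e9, "bfloat16"), 2))  # prints 2.79
print(round(weight_memory_gib(1.5e9, "float32"), 2))   # prints 5.59
```

At bfloat16 the weights take under 3 GiB, which is consistent with the 8 GB minimum leaving headroom for everything else the runtime needs.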

Installation

pip install -r requirements.txt
sudo apt install ffmpeg
pip install git+https://github.com/huggingface/transformers   # installs transformers 5.0.0 from source
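After running the commands above, it can help to confirm that ffmpeg is on the PATH and that the expected packages are visible to Python. The following is a minimal sketch using only the standard library; the function name and the default package list are illustrative assumptions, not part of the project's tooling.

```python
import shutil
from importlib import metadata


def check_environment(packages=("transformers", "torch")):
    """Report whether ffmpeg is on PATH and which package versions are installed."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `None` entry or `ffmpeg: False` points at the installation step that needs to be repeated.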

Basic Usage (Transformers 5.0.0)

from transformers import AutoModel, AutoProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "zai-org/GLM-ASR-Nano-2512"

processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "url": "example_zh.wav"},
        {"type": "text", "text": "Please transcribe this audio into text"}
    ]
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
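When transcribing more than one file, the message structure from the snippet above can be factored into a small helper. This is only a convenience sketch: the function name is hypothetical, and the dictionary layout is copied from the example rather than from any formal schema.

```python
DEFAULT_PROMPT = "Please transcribe this audio into text"


def build_transcription_messages(audio_path, prompt=DEFAULT_PROMPT):
    """Build the single-turn chat message used in the basic usage example."""
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_path},
            {"type": "text", "text": prompt},
        ],
    }]


messages = build_transcription_messages("example_zh.wav")
print(messages[0]["content"][0])
```

The returned list can be passed to `processor.apply_chat_template` exactly as `messages` is in the example above.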

Service Deployment with vLLM

Upgrade to vLLM 0.14.0 and install a matching transformers version.

pip install git+https://github.com/huggingface/transformers
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/GLM-ASR-Nano-2512 \
    --served-model-name GLM-ASR-Nano-2512 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000

Client example (OpenAI‑compatible):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="GLM-ASR-Nano-2512", file=audio_file)
    print(transcript.text)
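The `model` argument in the client call has to match a name the server actually reports, otherwise the request is rejected. A quick way to check is to query the OpenAI-compatible `/v1/models` endpoint. The sketch below uses only the standard library; the helper names are hypothetical, and the base URL assumes the host and port from the launch command above.

```python
import json
from urllib.request import urlopen


def extract_model_ids(payload):
    """Pull the model IDs out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5):
    """Ask a running server which model names it accepts."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# Offline demonstration with a hand-written response body.
sample = {"object": "list", "data": [{"id": "GLM-ASR-Nano-2512", "object": "model"}]}
print(extract_model_ids(sample))
```

Calling `list_served_models()` against the live server returns the names to use for `model=`.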

Service Deployment with SGLang

docker pull lmsysorg/sglang:dev
pip install git+https://github.com/huggingface/transformers
python -m sglang.launch_server \
    --model-path zai-org/GLM-ASR-Nano-2512 \
    --served-model-name glm-asr \
    --host 0.0.0.0 \
    --port 8000

OpenAI‑compatible call:

from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")
response = client.chat.completions.create(
    model="glm-asr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "example_zh.wav"}},
            {"type": "text", "text": "Please transcribe this audio into text"}
        ]
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content.strip())
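The example above passes a file path as the audio URL, which only works if the server process can read that path. When the client runs on a different machine, one common workaround is to embed the audio as a base64 data URL. Whether a given server version accepts data URLs in `audio_url` is an assumption to verify; the helper below is a hypothetical sketch that only builds the string.

```python
import base64
from pathlib import Path

# Small explicit mapping; extend it for other formats you use.
AUDIO_MIME_TYPES = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac"}


def audio_file_to_data_url(path):
    """Encode a local audio file as a base64 data URL."""
    path = Path(path)
    mime = AUDIO_MIME_TYPES.get(path.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported audio extension: {path.suffix}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result would replace `"example_zh.wav"` in the `audio_url` field. Base64 inflates the payload by roughly a third, so this suits short clips better than long recordings.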

Batch Inference

from transformers import GlmAsrForConditionalGeneration, AutoProcessor
processor = AutoProcessor.from_pretrained("zai-org/GLM-ASR-Nano-2512")
model = GlmAsrForConditionalGeneration.from_pretrained(
    "zai-org/GLM-ASR-Nano-2512", dtype="auto", device_map="auto"
)

inputs = processor.apply_transcription_request(["audio1.mp3", "audio2.mp3"])
inputs = inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded)
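For more than a handful of files, sending everything in one call can exhaust GPU memory. A simple approach is to split the file list into fixed-size batches and run the snippet above once per batch. The helper below is a generic sketch; its name is hypothetical and the batch size is something to tune for your hardware.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


files = ["audio1.mp3", "audio2.mp3", "audio3.mp3", "audio4.mp3", "audio5.mp3"]
for batch in iter_batches(files, batch_size=2):
    print(batch)
```

Each `batch` would be passed to `processor.apply_transcription_request` in place of the hard-coded two-file list.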

Application Scenarios

Enterprise meeting transcription with mixed dialects and distant speakers.

Call‑center handling regional accents.

Medical record dictation with low‑volume, fast speech.

Media & broadcasting for local TV or online streams.

Edge‑device deployment; 1.5 B parameters run on consumer‑grade GPUs.

Download Links

🤗 Hugging Face: https://huggingface.co/zai-org/GLM-ASR-Nano-2512

🤖 ModelScope: https://modelscope.cn/models/ZhipuAI/GLM-ASR-Nano-2512

GitHub: https://github.com/zai-org/GLM-ASR

Note: Models downloaded before 27 December 2025 must be re‑pulled because the weight format was updated for compatibility with transformers and SGLang.

Advantages & Limitations

Advantages

Strong Cantonese and other dialect recognition.

Effective low‑volume speech handling.

Open‑source, free, supports local deployment and fine‑tuning.

Compatible with major inference frameworks: transformers 5.x, vLLM, SGLang.

Limitations

Language coverage limited to 17 languages (vs 100+ for Whisper).

Community ecosystem still under development.

Requires building transformers from source (5.0.0).
