Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition
A developer has re-implemented the state-of-the-art Qwen3-ASR model in MLX, enabling native execution on Apple M1–M4 chips. The port achieves real-time factors as low as 0.08, a 4.7× speedup from 4-bit quantization, multilingual support for 52 languages, and features such as word-level timestamps and streaming transcription.
Qwen3‑ASR is among the strongest open‑source speech‑recognition models, outperforming Whisper‑large‑v3 in multilingual and noisy environments, but its official PyTorch/CUDA implementation cannot fully exploit the GPU capabilities of Apple chips.
MLX Native Re‑implementation
A developer rebuilt the model from scratch using MLX, rewriting each layer—including interleaved MRoPE and blockwise encoder attention—to run natively on M1, M2, M3, and M4 silicon. This is not a simple wrapper but a full layer‑by‑layer port.
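To give a sense of what such a port involves, the sketch below shows a generic interleaved rotary position embedding in NumPy. It is an illustrative stand-in, not the project's MLX code; Qwen3-ASR's interleaved MRoPE additionally splits the rotary frequency channels across multiple position dimensions.

```python
import numpy as np

def interleaved_rope(x, positions, base=10000.0):
    """Apply interleaved rotary position embedding to x of shape (seq, dim).

    Even/odd feature pairs are rotated by a position-dependent angle.
    Generic illustration only, not the project's actual MRoPE layer.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * inv_freq[None, :]       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

At position 0 the transform is the identity, and every position applies a pure rotation to each feature pair, so vector norms are preserved.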
Performance Test
Results on an M4 Pro:
2.5-second audio: transcription completed in 0.46 s, real-time factor ≈ 0.18.
10-second audio: 0.83 s, real-time factor 0.08.
4-bit quantization: 4.7× speedup, with WER rising slightly from 2.29% to 2.72% on LibriSpeech test-clean.
Multilingual comparison: on the multilingual-100 set, the MLX version achieved 15.99% WER, marginally better than the official PyTorch version's 16.69%.
Memory usage: the 0.6B model occupies ~1.2 GB; the 1.7B model, ~3.4 GB.
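Real-time factor is simply processing time divided by audio duration; values below 1.0 mean faster-than-real-time transcription:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1.0 means the model transcribes faster than real time."""
    return processing_s / audio_s

# Figures from the benchmark above: 0.83 s to transcribe 10 s of audio.
print(round(real_time_factor(0.83, 10.0), 2))  # 0.08
```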
Core Features
Dual-model support: 0.6B (fast) and 1.7B (high-accuracy) versions.
52 languages: 30 core languages plus 22 Chinese dialects.
Word-level timestamps: the native MLX aligner is 2.6× faster than the PyTorch solution.
Quantization support: 4-bit and 8-bit modes with controllable quality loss.
Long-audio handling: single chunks up to 20 minutes.
Speaker separation: experimental offline speaker labeling.
Streaming transcription: real-time microphone input and stream processing.
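Recordings longer than the 20-minute window have to be split before transcription. The splitter below is a naive fixed-size sketch, not the project's code; a production implementation would cut at silences to avoid splitting words across chunk boundaries.

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int,
                max_minutes: float = 20.0) -> list:
    """Split a mono waveform into chunks no longer than max_minutes.

    Illustrative only: real long-audio handling would cut on silence
    rather than at arbitrary sample offsets.
    """
    max_len = int(max_minutes * 60 * sample_rate)
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]
```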
Installation & Usage
pip install mlx-qwen3-asr
Basic Python usage:
from mlx_qwen3_asr import transcribe
result = transcribe("audio.wav")
print(result.text)
print(result.language)
Command-line tool:
mlx-qwen3-asr audio.wav --timestamps -f srt
Technical Details
The project incorporates several optimizations:
Pre‑allocated KV cache with in‑place slice writes and safe pruning.
Grouped‑query fused attention.
Hybrid encoder windowing, delivering a 4.2× speedup for long‑context processing.
Native WAV fast path that bypasses ffmpeg process startup.
Built‑in BPE tokenizer; the inference path does not depend on the transformers library.
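The KV-cache idea is generic: allocate the full buffer once, then write each decode step's keys and values into a slice, so no per-token reallocation occurs. A minimal single-head NumPy sketch with assumed shapes, not the project's code:

```python
import numpy as np

class PreallocKVCache:
    """Fixed-capacity key/value cache with in-place slice writes.

    Illustrative sketch for one layer and one head. The real
    implementation also covers grouped-query heads and safe pruning.
    """
    def __init__(self, capacity: int, head_dim: int):
        self.k = np.zeros((capacity, head_dim), dtype=np.float32)
        self.v = np.zeros((capacity, head_dim), dtype=np.float32)
        self.length = 0  # number of valid positions

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        n = k_new.shape[0]
        if self.length + n > self.k.shape[0]:
            raise ValueError("cache capacity exceeded")
        self.k[self.length:self.length + n] = k_new  # in-place slice write
        self.v[self.length:self.length + n] = v_new
        self.length += n

    def view(self):
        """Valid portion of the cache, without copying."""
        return self.k[:self.length], self.v[:self.length]
```

Writing into a pre-allocated buffer keeps each decode step allocation-free, which matters on unified memory where repeated growth would trigger copies.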
All optimizations pass quality‑gate tests to ensure functional parity with the official implementation. The repository contains 441 test cases, each with reproducible JSON benchmark artifacts.
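Parity checks of this kind typically hinge on word error rate, i.e. word-level Levenshtein distance divided by reference length. A minimal illustrative implementation (not the project's harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```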
For developers needing high‑quality speech recognition on Apple devices, this implementation offers near‑native performance while preserving the flexibility of an open‑source model, especially advantageous in multilingual and noisy scenarios.
Project URL: https://github.com/moona3k/mlx-qwen3-asr/
AI Engineering
Covering cutting-edge products, technology news, and hands-on experience in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
