Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition
A developer has re-implemented the state-of-the-art Qwen3-ASR model in MLX, enabling native execution on Apple M1–M4 chips. The port achieves real-time factors as low as 0.08, a 4.7× speedup from 4-bit quantization, multilingual support for 52 languages, and features such as word-level timestamps and streaming transcription.
Qwen3‑ASR is among the strongest open‑source speech‑recognition models, outperforming Whisper‑large‑v3 in multilingual and noisy environments, but its official PyTorch/CUDA implementation cannot fully exploit the GPU capabilities of Apple chips.
MLX Native Re‑implementation
A developer rebuilt the model from scratch using MLX, rewriting each layer—including interleaved MRoPE and blockwise encoder attention—to run natively on M1, M2, M3, and M4 silicon. This is not a simple wrapper but a full layer‑by‑layer port.
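To give a sense of what such a port involves, the sketch below shows a generic interleaved rotary position embedding in NumPy. It is an illustrative stand-in, not the project's MLX code; Qwen3-ASR's interleaved MRoPE additionally splits the rotary frequency channels across multiple position dimensions.

```python
import numpy as np

def interleaved_rope(x, positions, base=10000.0):
    """Apply interleaved rotary position embedding to x of shape (seq, dim).

    Even/odd feature pairs are rotated by a position-dependent angle.
    Generic illustration only, not the project's actual MRoPE layer.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * inv_freq[None, :]       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

At position 0 the transform is the identity, and every position applies a pure rotation to each feature pair, so vector norms are preserved.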
Performance Test
Results on an M4 Pro:
2.5-second audio: transcription completed in 0.46 s, real-time factor ≈ 0.18.
10-second audio: 0.83 s, real-time factor 0.08.
4-bit quantization: 4.7× speedup, with WER rising slightly from 2.29% to 2.72% on LibriSpeech test-clean.
Multilingual comparison: on the multilingual-100 set, the MLX version achieved 15.99% WER, marginally better than the official PyTorch version's 16.69%.
Memory usage: the 0.6B model occupies ~1.2 GB; the 1.7B model, ~3.4 GB.
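Real-time factor is simply processing time divided by audio duration; values below 1.0 mean faster-than-real-time transcription:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1.0 means the model transcribes faster than real time."""
    return processing_s / audio_s

# Figures from the benchmark above: 0.83 s to transcribe 10 s of audio.
print(round(real_time_factor(0.83, 10.0), 2))  # 0.08
```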
Core Features
Dual-model support: 0.6B (fast) and 1.7B (high-accuracy) versions.
52 languages: 30 core languages plus 22 Chinese dialects.
Word-level timestamps: the native MLX aligner is 2.6× faster than the PyTorch solution.
Quantization support: 4-bit and 8-bit modes with controllable quality loss.
Long-audio handling: single chunks up to 20 minutes.
Speaker separation: experimental offline speaker labeling.
Streaming transcription: real-time microphone input and stream processing.
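Recordings longer than the 20-minute window have to be split before transcription. The splitter below is a naive fixed-size sketch, not the project's code; a production implementation would cut at silences to avoid splitting words across chunk boundaries.

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int,
                max_minutes: float = 20.0) -> list:
    """Split a mono waveform into chunks no longer than max_minutes.

    Illustrative only: real long-audio handling would cut on silence
    rather than at arbitrary sample offsets.
    """
    max_len = int(max_minutes * 60 * sample_rate)
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]
```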
Installation & Usage
pip install mlx-qwen3-asr
Basic Python usage:
from mlx_qwen3_asr import transcribe
result = transcribe("audio.wav")
print(result.text)
print(result.language)
Command-line tool:
mlx-qwen3-asr audio.wav --timestamps -f srt
Technical Details
The project incorporates several optimizations:
Pre‑allocated KV cache with in‑place slice writes and safe pruning.
Grouped‑query fused attention.
Hybrid encoder windowing, delivering a 4.2× speedup for long‑context processing.
Native WAV fast path that bypasses ffmpeg process startup.
Built‑in BPE tokenizer; the inference path does not depend on the transformers library.
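The KV-cache idea is generic: allocate the full buffer once, then write each decode step's keys and values into a slice, so no per-token reallocation occurs. A minimal single-head NumPy sketch with assumed shapes, not the project's code:

```python
import numpy as np

class PreallocKVCache:
    """Fixed-capacity key/value cache with in-place slice writes.

    Illustrative sketch for one layer and one head. The real
    implementation also covers grouped-query heads and safe pruning.
    """
    def __init__(self, capacity: int, head_dim: int):
        self.k = np.zeros((capacity, head_dim), dtype=np.float32)
        self.v = np.zeros((capacity, head_dim), dtype=np.float32)
        self.length = 0  # number of valid positions

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        n = k_new.shape[0]
        if self.length + n > self.k.shape[0]:
            raise ValueError("cache capacity exceeded")
        self.k[self.length:self.length + n] = k_new  # in-place slice write
        self.v[self.length:self.length + n] = v_new
        self.length += n

    def view(self):
        """Valid portion of the cache, without copying."""
        return self.k[:self.length], self.v[:self.length]
```

Writing into a pre-allocated buffer keeps each decode step allocation-free, which matters on unified memory where repeated growth would trigger copies.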
All optimizations pass quality‑gate tests to ensure functional parity with the official implementation. The repository contains 441 test cases, each with reproducible JSON benchmark artifacts.
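Parity checks of this kind typically hinge on word error rate, i.e. word-level Levenshtein distance divided by reference length. A minimal illustrative implementation (not the project's harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```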
For developers needing high‑quality speech recognition on Apple devices, this implementation offers near‑native performance while preserving the flexibility of an open‑source model, especially advantageous in multilingual and noisy scenarios.
Project URL: https://github.com/moona3k/mlx-qwen3-asr/
AI Engineering
Covering cutting-edge products, technology news, and hands-on experience in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
