vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs
This article reviews how four leading LLM inference stacks (oMLX, mlx‑vlm, llama.cpp, and vLLM) are integrating Google's TurboQuant compression, covering up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and the integration status and steps for each project.
Framework status (quick overview)
oMLX (Apple Silicon) – released v0.2.21, supports 128K context with 79% KV‑cache reduction.
mlx‑vlm (Apple Silicon) – PR in progress, Metal kernel implementation approaching full‑precision decoding.
llama.cpp (all platforms) – experimental branch compiled, community evaluating TurboQuant support.
vLLM (CUDA) – detailed six‑step integration plan posted, PR pending.
oMLX: TurboQuant KV‑Cache on macOS
oMLX is a macOS‑optimized local LLM inference server with menu‑bar management, batch processing, and a two‑tier KV cache (memory + SSD). TurboQuant KV‑Cache is toggled via the Admin UI.
Prefill uses full fp16 (zero quality loss); the first decode token quantizes the accumulated KV cache into 3‑bit or 4‑bit codebook indices. Decode attention runs on a fused two‑pass Flash‑Attention Metal kernel that reads directly from the packed indices, avoiding de‑quantisation and intermediate fp16 tensors.
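As a rough illustration of the codebook step, here is a NumPy toy, not oMLX's Metal kernel; the shapes and the quantile-based per-channel codebook are assumptions for the sketch:
# Toy NumPy illustration of collapsing an fp16 K/V tensor into 3-bit codebook indices.
import numpy as np

bits, n_tokens, head_dim = 3, 4096, 128
kv = np.random.randn(n_tokens, head_dim).astype(np.float16)       # stand-in for accumulated K or V

# Per-channel codebook: 2**bits representative values per channel.
levels = np.linspace(0.0, 1.0, 2**bits)
codebook = np.quantile(kv.astype(np.float32), levels, axis=0)      # shape (8, head_dim)

# Replace each element with the index of its nearest codebook entry.
idx = np.abs(kv[None].astype(np.float32) - codebook[:, None]).argmin(axis=0).astype(np.uint8)

fp16_bytes   = kv.nbytes                                           # 2 bytes per element
packed_bytes = idx.size * bits // 8 + codebook.astype(np.float16).nbytes
print(f"fp16 cache: {fp16_bytes/1e6:.2f} MB -> 3-bit indices + codebook: {packed_bytes/1e6:.2f} MB")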
KV‑cache memory savings for Qwen3.5‑35B‑A3B (3‑bit TurboQuant):
32K context: 735 MB → 195 MB (73% saved)
64K context: 1,407 MB → 327 MB (77% saved)
128K context: 2,749 MB → 589 MB (79% saved) – zero quality loss
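The saved percentages follow directly from the figures above; the ratio improves with context length, presumably because the fixed codebook overhead amortises over more tokens. A quick check:
# Recompute the savings quoted above (values in MB).
rows = [("32K", 735, 195), ("64K", 1407, 327), ("128K", 2749, 589)]
for ctx, fp16_mb, tq_mb in rows:
    print(f"{ctx} context: {fp16_mb} MB -> {tq_mb} MB ({100 * (1 - tq_mb / fp16_mb):.0f}% saved)")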
Relative speed compared with fp16 baseline:
Qwen3.5‑35B‑A3B – Prefill 95%, Decode 87%
Qwen3.5‑27B – Prefill 97%, Decode 95%
Installation:
# Install oMLX
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Start the service
brew services start omlx

The release also includes oQ+, which adds GPTQ weight optimisation on top of mixed‑precision quantisation plus batch‑processing acceleration for MoE models. Compressing Qwen3.5‑35B‑A3B (256 experts × 40 layers) takes six minutes, a 15× speedup over sequential processing.
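A minimal sketch of the batch‑processing idea, assuming the speedup comes from quantising the independent experts in parallel; quantize_expert() below is a stand‑in using plain round‑to‑nearest, not oQ+'s actual GPTQ pipeline:
# Toy parallel per-expert quantization; sizes and the worker strategy are illustrative only.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def quantize_expert(job):
    layer, expert = job
    w = np.random.randn(512, 512).astype(np.float32)         # stand-in for one expert's weight matrix
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # 4-bit symmetric range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return layer, expert, q.nbytes + scale.nbytes

if __name__ == "__main__":
    jobs = [(l, e) for l in range(40) for e in range(256)]    # 40 layers x 256 experts
    with ProcessPoolExecutor() as pool:
        done = list(pool.map(quantize_expert, jobs, chunksize=64))
    print(f"quantized {len(done)} expert matrices")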
mlx‑vlm: Metal kernels approaching full precision
PR #858 (https://github.com/Blaizzy/mlx-vlm/pull/858) adds a complete TurboQuant inference chain. Five commits introduce the following kernels (a toy Python analogue of the pack/unpack step appears after these kernel lists):
_mse_score_kernel – MSE scoring
_pack_lowbit_kernel / _unpack_lowbit_kernel – low‑bit pack/unpack
_qjl_score_kernel – 1‑bit residual correction
_prod_score_kernel – inner‑product calculation
scaled_dot_product_attention – adapted for the TurboQuant fast‑decode path (single‑query inputs)
Multi‑head optimisation kernels:
_prod_score_multi_kernel – multi‑head batch processing
_mse_weighted_rot_multi_kernel – weighted rotation multi‑head
_prod_score_repeat_kernel – repetition‑mode optimisation
4‑bit PolarQuant path adds:
_polar_prod_score_kernel – polar‑coordinate inner product
_polar_turbo_score_repeat_kernel – polar‑coordinate repetition optimisation
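The kernels listed above are Metal shaders, but the pack/unpack step is easy to picture in plain Python; this is a 4‑bit toy analogue for illustration, not the PR's implementation:
# Conceptual analogue of a low-bit pack/unpack pair (4-bit case: two indices per byte).
import numpy as np

def pack_4bit(idx):                                        # idx: uint8 values in [0, 15], even length
    return (idx[1::2] << 4) | idx[0::2]

def unpack_4bit(packed):
    return np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)

idx = np.random.randint(0, 16, size=128, dtype=np.uint8)
assert np.array_equal(unpack_4bit(pack_4bit(idx)), idx)    # round-trips exactly, at half the bytes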
Decoding speed reaches 70‑85% of full‑precision performance and continues to improve.
llama.cpp: community effort
Issue #20977 (https://github.com/ggml-org/llama.cpp/issues/20977) requests TurboQuant support. Developer @mudler forked a feat/turbo-quant branch (https://github.com/mudler/llama.cpp/tree/feat/turbo-quant) that already compiles and runs; evaluation is ongoing.
vLLM: six‑step integration plan
Issue #38171 (https://github.com/vllm-project/vllm/issues/38171) outlines the following steps:
1. Extend CacheDType with "turboquant".
2. Create a TurboQuantConfig class using the @register_quantization_config decorator (a rough sketch follows this list).
3. Implement the KV‑cache method by inheriting BaseKVCacheMethod and registering the codebook parameters.
4. Update quantisation detection so is_quantized_kv_cache() recognises TurboQuant.
5. Implement CUDA/Triton kernels for encoding (quantised storage) and decoding (restoring values ahead of attention).
6. Update memory management to accommodate codebook overhead and variable compression rates.
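A rough sketch of what step 2 might look like, following vLLM's out‑of‑tree quantization plugin pattern; this is not code from the issue or any PR, and the QuantizationConfig method names reflect recent vLLM versions and may differ:
# Hypothetical registration of a "turboquant" config; interface details are assumptions.
from typing import Any
import torch
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def __init__(self, bits: int = 3):
        super().__init__()
        self.bits = bits

    @classmethod
    def get_name(cls) -> str:
        return "turboquant"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80                                  # assumption: Ampere or newer

    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 3))

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Step 3 would return a BaseKVCacheMethod subclass here for attention layers.
        return None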
For cloud inference, vLLM + TurboQuant yields a 4‑5× KV‑cache compression, allowing an H100 GPU to serve more concurrent requests and longer contexts.
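A back‑of‑envelope view of what that buys on a single 80 GB H100; the weight budget and per‑request fp16 KV footprint below are assumptions for illustration, not measurements:
# Illustrative capacity math only.
hbm_gb, weights_gb = 80, 40        # assumed memory left for KV cache after weights/activations: 40 GB
kv_per_request_gb = 2.5            # assumed fp16 KV footprint of one long-context request
for ratio in (1, 4, 5):
    concurrent = int((hbm_gb - weights_gb) / (kv_per_request_gb / ratio))
    print(f"{ratio}x KV compression -> ~{concurrent} concurrent long-context requests")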
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
