vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

This article reviews how leading LLM inference frameworks (oMLX, mlx‑vlm, llama.cpp, and vLLM) are integrating Google’s TurboQuant compression, covering up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and the integration status of each project.


Framework status (quick overview)

oMLX (Apple Silicon) – released v0.2.21, supports 128K context with 79% KV‑cache reduction.

mlx‑vlm (Apple Silicon) – PR in progress, Metal kernel implementation approaching full‑precision decoding.

llama.cpp (all platforms) – experimental branch compiled, community evaluating TurboQuant support.

vLLM (CUDA) – detailed six‑step integration plan posted, PR pending.

oMLX: TurboQuant KV‑Cache on macOS

oMLX is a macOS‑optimized local LLM inference server with menu‑bar management, batch processing, and a two‑tier KV cache (memory + SSD). TurboQuant KV‑Cache is toggled via the Admin UI.

Prefill runs in full fp16 (zero quality loss); on the first decode step, the accumulated KV cache is quantised into 3‑bit or 4‑bit codebook indices. Decode attention runs on a fused two‑pass Flash‑Attention Metal kernel that reads the packed indices directly, avoiding de‑quantisation and intermediate fp16 tensors.
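The fused Metal kernels and TurboQuant’s actual codebooks are not published in the article, but the general shape of that first‑decode‑step quantisation can be sketched in NumPy. Everything below (the group size, the symmetric 4‑bit mapping, the per‑group fp16 scales) is an assumption for illustration, not oMLX’s implementation, and the real decode kernel never materialises the dequantised fp16 produced by the round‑trip helper.

# Illustrative sketch only: pack an accumulated fp16 KV cache into 4-bit
# indices plus per-group scales at the first decode step. Group size, the
# symmetric [-7, 7] mapping, and fp16 scales are assumptions, not
# TurboQuant's actual codebook scheme (the article's table uses 3-bit;
# 4-bit is used here for byte alignment).
import numpy as np

GROUP = 64  # hypothetical quantisation group size along the head dimension

def quantise_kv_4bit(kv: np.ndarray):
    """kv: fp16 array of shape (..., head_dim), head_dim divisible by GROUP."""
    g = kv.astype(np.float32).reshape(*kv.shape[:-1], -1, GROUP)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    idx = (np.clip(np.round(g / scale), -7, 7) + 7).astype(np.uint8)  # 0..14
    packed = (idx[..., 0::2] << 4) | idx[..., 1::2]                   # two indices per byte
    return packed, scale.astype(np.float16)

def dequantise_kv_4bit(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Round-trip check only; the fused decode kernel skips this step entirely."""
    hi = (packed >> 4).astype(np.float32) - 7.0
    lo = (packed & 0x0F).astype(np.float32) - 7.0
    idx = np.stack([hi, lo], axis=-1).reshape(*packed.shape[:-1], GROUP)
    return (idx * scale).reshape(*packed.shape[:-2], -1).astype(np.float16)

# Example: 8 KV heads, 4,096 cached tokens, head_dim 128
kv = np.random.randn(8, 4096, 128).astype(np.float16)
packed, scales = quantise_kv_4bit(kv)
ratio = (packed.nbytes + scales.nbytes) / kv.nbytes
print(f"packed cache is {ratio:.1%} of fp16")  # about 27% of fp16 under these assumptions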

KV‑cache memory savings for Qwen3.5‑35B‑A3B (3‑bit TurboQuant):

32K context: 735 MB → 195 MB (73% saved)

64K context: 1,407 MB → 327 MB (77% saved)

128K context: 2,749 MB → 589 MB (79% saved) – zero quality loss
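The percentages above follow directly from the quoted sizes, and a naive bits-per-element estimate lands in the same range; the group size and per-group metadata assumed below are illustrative, not published oMLX figures.

# Sanity-check the quoted savings and compare with simple bits-per-element
# arithmetic: a 3-bit payload plus one fp16 scale per group of 32 values
# (the group size is an assumption, not an oMLX-published figure).
FP16_BITS = 16
Q_BITS = 3
GROUP = 32
per_elem_bits = Q_BITS + FP16_BITS / GROUP                 # 3.5 bits per cached value
print(f"theoretical size vs fp16: {per_elem_bits / FP16_BITS:.1%}")  # ~21.9%

table = [("32K", 735, 195), ("64K", 1407, 327), ("128K", 2749, 589)]
for ctx, fp16_mb, tq_mb in table:
    print(f"{ctx}: {1 - tq_mb / fp16_mb:.0%} saved ({tq_mb / fp16_mb:.1%} of fp16)")

The measured ratio approaches the theoretical 3‑bit floor as context grows (26.5% at 32K down to 21.4% at 128K), which is consistent with fixed per‑cache overheads being amortised over longer contexts.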

Relative speed compared with fp16 baseline:

Qwen3.5‑35B‑A3B – Prefill 95%, Decode 87%

Qwen3.5‑27B – Prefill 97%, Decode 95%

Installation:

# Install oMLX
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Start the service
brew services start omlx

The release also includes oQ+, which adds GPTQ weight optimisation on top of mixed‑precision quantisation, plus batch‑processing acceleration for MoE models. Compressing Qwen3.5‑35B‑A3B (256 experts × 40 layers) takes six minutes, a 15× speedup over sequential processing.
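oQ+’s batching internals are not shown in the release notes; the claimed 15× gain is consistent with simply quantising independent experts in parallel rather than one at a time. A minimal sketch of that idea follows, where quantise_expert() and the worker count are placeholders rather than oMLX code.

# Minimal sketch of the idea behind oQ+'s batch acceleration: MoE experts are
# independent, so they can be quantised in parallel instead of sequentially.
# quantise_expert() and the worker count are placeholders, not oMLX internals.
from concurrent.futures import ProcessPoolExecutor

N_LAYERS, N_EXPERTS = 40, 256  # Qwen3.5-35B-A3B layout quoted in the article

def quantise_expert(job: tuple[int, int]) -> tuple[int, int]:
    layer, expert = job
    # Placeholder: load this expert's weights, run the GPTQ pass, write the shard.
    return layer, expert

def quantise_moe(max_workers: int = 16) -> None:
    jobs = [(layer, expert) for layer in range(N_LAYERS) for expert in range(N_EXPERTS)]
    done = 0
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        for _layer, _expert in pool.map(quantise_expert, jobs, chunksize=64):
            done += 1
    print(f"quantised {done} experts across {N_LAYERS} layers")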

oMLX TurboQuant KV Cache UI

mlx‑vlm: Metal kernels approaching full precision

PR #858 (https://github.com/Blaizzy/mlx-vlm/pull/858) adds a complete TurboQuant inference chain. Five commits introduce the following kernels:

_mse_score_kernel – MSE scoring

_pack_lowbit_kernel / _unpack_lowbit_kernel – low‑bit pack/unpack

_qjl_score_kernel – 1‑bit residual correction

_prod_score_kernel – inner‑product calculation

scaled_dot_product_attention – adapted for the TurboQuant fast‑decode path (single‑query inputs)

Multi‑head optimisation kernels:

_prod_score_multi_kernel – multi‑head batch processing

_mse_weighted_rot_multi_kernel – weighted rotation, multi‑head

_prod_score_repeat_kernel – repetition‑mode optimisation

The 4‑bit PolarQuant path adds:

_polar_prod_score_kernel – polar‑coordinate inner product

_polar_turbo_score_repeat_kernel – polar‑coordinate repetition optimisation
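The Metal shader sources live in the PR itself; conceptually, the scoring kernels choose, for each group of values, the candidate (codebook entry, scale, or rotation) that minimises reconstruction error. A rough NumPy sketch of that MSE‑scoring idea, with a made‑up codebook and no claim to match the PR’s kernels:

# Conceptual sketch of MSE-based codebook scoring (the role of _mse_score_kernel
# in the PR, greatly simplified). The codebook here is random and illustrative.
import numpy as np

def mse_select(groups: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """groups: (n, g) value groups; codebook: (k, g) candidate reconstructions.
    Returns, for each group, the index of the candidate with the lowest MSE."""
    # Broadcast to (n, k, g), average the squared error over g, argmin over k.
    err = ((groups[:, None, :] - codebook[None, :, :]) ** 2).mean(axis=-1)
    return err.argmin(axis=1)

rng = np.random.default_rng(0)
groups = rng.standard_normal((1024, 8)).astype(np.float32)   # 1,024 groups of 8 values
codebook = rng.standard_normal((16, 8)).astype(np.float32)   # 16 hypothetical entries
indices = mse_select(groups, codebook)                        # one 4-bit index per group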

Decoding speed reaches 70‑85% of full‑precision performance and continues to improve.

llama.cpp: community effort

Issue #20977 (https://github.com/ggml-org/llama.cpp/issues/20977) requests TurboQuant support. Developer @mudler forked a feat/turbo-quant branch (https://github.com/mudler/llama.cpp/tree/feat/turbo-quant) that already compiles and runs; evaluation is ongoing.

vLLM: six‑step integration plan

Issue #38171 (https://github.com/vllm-project/vllm/issues/38171) outlines the following steps:

1. Extend CacheDType with "turboquant".

2. Create a TurboQuantConfig class using the @register_quantization_config decorator.

3. Implement the KV‑cache method by inheriting BaseKVCacheMethod and registering codebook parameters (steps 2 and 3 are sketched after this list).

4. Update quantisation detection so is_quantized_kv_cache() recognises TurboQuant.

5. Implement CUDA/Triton kernels for encoding (quantised storage) and decoding (restoring values ahead of attention).

6. Update memory management to accommodate codebook overhead and variable compression rates.
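A rough sketch of steps 2 and 3, using vLLM’s out‑of‑tree quantisation plugin interface. The import paths and abstract‑method names reflect that interface at the time of writing and should be checked against your vLLM version; everything TurboQuant‑specific below is a placeholder, not code from the issue.

# Sketch only: register a "turboquant" quantisation config with vLLM and route
# attention layers to a KV-cache method. TurboQuantKVCacheMethod and the `bits`
# knob are placeholders; verify imports against your vLLM version.
from typing import Any, Optional

import torch
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


@register_quantization_config("turboquant")   # step 2: register the new method name
class TurboQuantConfig(QuantizationConfig):

    def __init__(self, bits: int = 3):
        super().__init__()
        self.bits = bits                        # hypothetical knob, not a real option

    def get_name(self) -> str:
        return "turboquant"

    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80                               # e.g. Ampere and newer

    @staticmethod
    def get_config_filenames() -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 3))

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional["BaseKVCacheMethod"]:
        # Step 3: attention layers get a KV-cache method that would register the
        # codebook parameters and own the quantised store; other layers are left
        # untouched in this sketch.
        from vllm.attention.layer import Attention
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None


class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    """Placeholder for the method that would pack K/V into codebook indices."""
    pass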

For cloud inference, vLLM + TurboQuant is projected to yield 4‑5× KV‑cache compression, allowing a single H100 GPU to serve more concurrent requests and longer contexts.

Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.