vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs
This article reviews how four leading LLM inference stacks (oMLX, mlx‑vlm, llama.cpp, and vLLM) are integrating Google's TurboQuant compression, covering up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and the integration status and steps for each project.
Framework status (quick overview)
oMLX (Apple Silicon) – released v0.2.21, supports 128K context with 79% KV‑cache reduction.
mlx‑vlm (Apple Silicon) – PR in progress, Metal kernel implementation approaching full‑precision decoding.
llama.cpp (all platforms) – experimental branch compiled, community evaluating TurboQuant support.
vLLM (CUDA) – detailed six‑step integration plan posted, PR pending.
oMLX: TurboQuant KV‑Cache on macOS
oMLX is a macOS‑optimized local LLM inference server with menu‑bar management, batch processing, and a two‑tier KV cache (memory + SSD). TurboQuant KV‑Cache is toggled via the Admin UI.
Prefill uses full fp16 (zero quality loss); the first decode token quantizes the accumulated KV cache into 3‑bit or 4‑bit codebook indices. Decode attention runs on a fused two‑pass Flash‑Attention Metal kernel that reads directly from the packed indices, avoiding de‑quantisation and intermediate fp16 tensors.
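As a rough illustration of the codebook step, here is a NumPy toy, not oMLX's Metal kernel; the shapes and the quantile-based per-channel codebook are assumptions for the sketch:
# Toy NumPy illustration of collapsing an fp16 K/V tensor into 3-bit codebook indices.
import numpy as np

bits, n_tokens, head_dim = 3, 4096, 128
kv = np.random.randn(n_tokens, head_dim).astype(np.float16)       # stand-in for accumulated K or V

# Per-channel codebook: 2**bits representative values per channel.
levels = np.linspace(0.0, 1.0, 2**bits)
codebook = np.quantile(kv.astype(np.float32), levels, axis=0)      # shape (8, head_dim)

# Replace each element with the index of its nearest codebook entry.
idx = np.abs(kv[None].astype(np.float32) - codebook[:, None]).argmin(axis=0).astype(np.uint8)

fp16_bytes   = kv.nbytes                                           # 2 bytes per element
packed_bytes = idx.size * bits // 8 + codebook.astype(np.float16).nbytes
print(f"fp16 cache: {fp16_bytes/1e6:.2f} MB -> 3-bit indices + codebook: {packed_bytes/1e6:.2f} MB")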
KV‑cache memory savings for Qwen3.5‑35B‑A3B (3‑bit TurboQuant):
32K context: 735 MB → 195 MB (73% saved)
64K context: 1,407 MB → 327 MB (77% saved)
128K context: 2,749 MB → 589 MB (79% saved) – zero quality loss
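The saved percentages follow directly from the figures above; the ratio improves with context length, presumably because the fixed codebook overhead amortises over more tokens. A quick check:
# Recompute the savings quoted above (values in MB).
rows = [("32K", 735, 195), ("64K", 1407, 327), ("128K", 2749, 589)]
for ctx, fp16_mb, tq_mb in rows:
    print(f"{ctx} context: {fp16_mb} MB -> {tq_mb} MB ({100 * (1 - tq_mb / fp16_mb):.0f}% saved)")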
Relative speed compared with fp16 baseline:
Qwen3.5‑35B‑A3B – Prefill 95%, Decode 87%
Qwen3.5‑27B – Prefill 97%, Decode 95%
Installation:
# Install oMLX
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Start the service
brew services start omlx

The release also includes oQ+, which adds GPTQ weight optimisation on top of mixed‑precision quantisation plus batch‑processing acceleration for MoE models. Compressing Qwen3.5‑35B‑A3B (256 experts × 40 layers) takes six minutes, a 15× speedup over sequential processing.
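A minimal sketch of the batch‑processing idea, assuming the speedup comes from quantising the independent experts in parallel; quantize_expert() below is a stand‑in using plain round‑to‑nearest, not oQ+'s actual GPTQ pipeline:
# Toy parallel per-expert quantization; sizes and the worker strategy are illustrative only.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def quantize_expert(job):
    layer, expert = job
    w = np.random.randn(512, 512).astype(np.float32)         # stand-in for one expert's weight matrix
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # 4-bit symmetric range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return layer, expert, q.nbytes + scale.nbytes

if __name__ == "__main__":
    jobs = [(l, e) for l in range(40) for e in range(256)]    # 40 layers x 256 experts
    with ProcessPoolExecutor() as pool:
        done = list(pool.map(quantize_expert, jobs, chunksize=64))
    print(f"quantized {len(done)} expert matrices")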
mlx‑vlm: Metal kernels approaching full precision
PR #858 (https://github.com/Blaizzy/mlx-vlm/pull/858) adds a complete TurboQuant inference chain. Five commits introduce the following kernels (a toy Python analogue of the pack/unpack step appears after these kernel lists):
_mse_score_kernel – MSE scoring
_pack_lowbit_kernel / _unpack_lowbit_kernel – low‑bit pack/unpack
_qjl_score_kernel – 1‑bit residual correction
_prod_score_kernel – inner‑product calculation
scaled_dot_product_attention – adapted for the TurboQuant fast‑decode path (single‑query inputs)
Multi‑head optimisation kernels:
_prod_score_multi_kernel – multi‑head batch processing
_mse_weighted_rot_multi_kernel – weighted rotation multi‑head
_prod_score_repeat_kernel – repetition‑mode optimisation
4‑bit PolarQuant path adds:
_polar_prod_score_kernel – polar‑coordinate inner product
_polar_turbo_score_repeat_kernel – polar‑coordinate repetition optimisation
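The kernels listed above are Metal shaders, but the pack/unpack step is easy to picture in plain Python; this is a 4‑bit toy analogue for illustration, not the PR's implementation:
# Conceptual analogue of a low-bit pack/unpack pair (4-bit case: two indices per byte).
import numpy as np

def pack_4bit(idx):                                        # idx: uint8 values in [0, 15], even length
    return (idx[1::2] << 4) | idx[0::2]

def unpack_4bit(packed):
    return np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)

idx = np.random.randint(0, 16, size=128, dtype=np.uint8)
assert np.array_equal(unpack_4bit(pack_4bit(idx)), idx)    # round-trips exactly, at half the bytes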
Decoding speed reaches 70‑85% of full‑precision performance and continues to improve.
llama.cpp: community effort
Issue #20977 (https://github.com/ggml-org/llama.cpp/issues/20977) requests TurboQuant support. Developer @mudler forked a feat/turbo-quant branch (https://github.com/mudler/llama.cpp/tree/feat/turbo-quant) that already compiles and runs; evaluation is ongoing.
vLLM: six‑step integration plan
Issue #38171 (https://github.com/vllm-project/vllm/issues/38171) outlines the following steps:
1. Extend CacheDType with "turboquant".
2. Create a TurboQuantConfig class using the @register_quantization_config decorator (a rough sketch follows this list).
3. Implement the KV‑cache method by inheriting BaseKVCacheMethod and registering the codebook parameters.
4. Update quantisation detection so is_quantized_kv_cache() recognises TurboQuant.
5. Implement CUDA/Triton kernels for encoding (quantised storage) and decoding (restoring values ahead of attention).
6. Update memory management to accommodate codebook overhead and variable compression rates.
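A rough sketch of what step 2 might look like, following vLLM's out‑of‑tree quantization plugin pattern; this is not code from the issue or any PR, and the QuantizationConfig method names reflect recent vLLM versions and may differ:
# Hypothetical registration of a "turboquant" config; interface details are assumptions.
from typing import Any
import torch
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def __init__(self, bits: int = 3):
        super().__init__()
        self.bits = bits

    @classmethod
    def get_name(cls) -> str:
        return "turboquant"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80                                  # assumption: Ampere or newer

    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 3))

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Step 3 would return a BaseKVCacheMethod subclass here for attention layers.
        return None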
For cloud inference, vLLM + TurboQuant yields a 4‑5× KV‑cache compression, allowing an H100 GPU to serve more concurrent requests and longer contexts.
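A back‑of‑envelope view of what that buys on a single 80 GB H100; the weight budget and per‑request fp16 KV footprint below are assumptions for illustration, not measurements:
# Illustrative capacity math only.
hbm_gb, weights_gb = 80, 40        # assumed memory left for KV cache after weights/activations: 40 GB
kv_per_request_gb = 2.5            # assumed fp16 KV footprint of one long-context request
for ratio in (1, 4, 5):
    concurrent = int((hbm_gb - weights_gb) / (kv_per_request_gb / ratio))
    print(f"{ratio}x KV compression -> ~{concurrent} concurrent long-context requests")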
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
