How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference
This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.
Introduction
Large‑language‑model (LLM) inference is moving toward longer contexts, multimodal interaction, and agent‑style workloads. In pure Transformers, attention compute scales quadratically with sequence length and the KV‑Cache grows linearly with it, exhausting GPU memory at long contexts; pure Mamba state‑space models (SSMs) offer linear compute and a fixed‑size state but limited recall, because their states are updated in place and cannot be rolled back. Hybrid architectures that interleave attention and SSM layers aim to combine the strengths of both, but they introduce system‑level challenges such as mismatched state granularity and incompatible scheduling.
Hybrid Architecture Design
Dual memory pools: SGLang creates separate memory pools for the token‑granular Transformer KV‑Cache and the request‑granular Mamba SSM states, eliminating fragmentation and out‑of‑memory risk.
State‑snapshot technique: By snapshotting SSM states after each request, the framework restores the rollback capability that in‑place SSM updates otherwise lose, enabling cache reuse and speculative decoding (a minimal sketch follows this list).
Performance impact : Experiments on Qwen3‑Next and other hybrid models show substantial speed‑up when running on SGLang.
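The snapshot idea itself is small: because an SSM state is just a tensor that the kernel overwrites in place, copying it out at a checkpoint makes the update reversible again. The sketch below illustrates this in PyTorch‑style Python; MambaStateSlot, snapshot, and restore are hypothetical names for illustration, not SGLang's actual API.

```python
import torch

class MambaStateSlot:
    """One request's SSM state, which the Mamba kernel updates in place (illustrative)."""
    def __init__(self, state: torch.Tensor):
        self.state = state

    def snapshot(self) -> torch.Tensor:
        # Copy the state out so later in-place updates cannot corrupt the saved version.
        return self.state.detach().clone()

    def restore(self, saved: torch.Tensor) -> None:
        # Roll the slot back to a previously captured snapshot, e.g. to reuse a cached
        # prefix or to discard the effect of rejected speculative tokens.
        self.state.copy_(saved)
```

Because a snapshot is a plain copy, it can be keyed by the token prefix that produced it and handed to a later request, which is what the hybrid prefix cache described below builds on.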
Memory Management
Dual‑pool architecture – At service start‑up the total GPU memory is split using the --mamba-full-memory-ratio flag into a fixed‑size Mamba state pool and a KV‑Cache pool. The Mamba pool is managed per request (HybridReqToTokenPool) and reclaimed immediately after the request finishes, while the KV‑Cache pool continues fine‑grained token‑level allocation (HybridLinearKVPool); a simplified sketch follows this list.
Elastic pool – SGLang pre‑allocates an oversized virtual address space and maps physical pages to the pools on demand via CUDA virtual memory. When one pool’s usage drops, its idle pages are unmapped and reassigned to the other pool, keeping total GPU memory usage within a static budget.
Centralised scheduler – A lightweight scheduler monitors pool utilisation, triggers expansion or contraction requests, and performs safe, atomic re‑allocation without restarting the service.
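As a rough illustration of the split described above, the sketch below divides one memory budget into a request‑granular Mamba slot pool and a token‑granular KV page pool. HybridReqToTokenPool and HybridLinearKVPool are the SGLang class names mentioned in the list; the DualPoolBudget class, its fields, and the slot arithmetic here are simplified assumptions, and the CUDA virtual‑memory remapping used by the elastic pool is deliberately omitted.

```python
from dataclasses import dataclass

@dataclass
class DualPoolBudget:
    """Split one GPU memory budget between a Mamba state pool and a KV-Cache pool,
    mirroring the --mamba-full-memory-ratio split (simplified, sizes in bytes)."""
    total_bytes: int
    mamba_ratio: float     # fraction of the budget reserved for per-request SSM states
    mamba_slot_bytes: int  # fixed size of one request's full SSM state
    kv_page_bytes: int     # size of one token-granular KV-Cache page

    def __post_init__(self):
        mamba_bytes = int(self.total_bytes * self.mamba_ratio)
        self.free_mamba_slots = list(range(mamba_bytes // self.mamba_slot_bytes))
        self.free_kv_pages = list(range((self.total_bytes - mamba_bytes) // self.kv_page_bytes))

    # Request-granular side: one SSM slot per running request, reclaimed on completion.
    def acquire_mamba_slot(self) -> int:
        if not self.free_mamba_slots:
            raise MemoryError("Mamba state pool exhausted")
        return self.free_mamba_slots.pop()

    def release_mamba_slot(self, slot: int) -> None:
        self.free_mamba_slots.append(slot)

    # Token-granular side: KV-Cache pages are allocated and freed page by page.
    def acquire_kv_pages(self, n: int) -> list[int]:
        if len(self.free_kv_pages) < n:
            raise MemoryError("KV-Cache pool exhausted")
        return [self.free_kv_pages.pop() for _ in range(n)]

    def release_kv_pages(self, pages: list[int]) -> None:
        self.free_kv_pages.extend(pages)
```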
Hybrid Prefix Cache
Traditional prefix caching works for the token‑granular KV‑Cache but not for SSM states, which are updated in place and cannot be truncated. SGLang introduces MambaRadixCache, a hybrid radix tree that stores both KV‑Cache pointers and full SSM state snapshots. During lookup, the longest matching prefix is found; the KV‑Cache can be reused directly, while the SSM state is copied into a new buffer to preserve isolation.
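The lookup can be sketched as below. This is a much‑simplified stand‑in, not SGLang's actual MambaRadixCache code: it uses a token‑level prefix tree rather than a true radix tree, and the node layout is an assumption. Each node carries the KV‑Cache pages for its prefix plus an optional SSM snapshot; a match reuses the pages directly but clones the snapshot.

```python
import torch

class PrefixNode:
    """One node of a simplified prefix tree; a real radix tree compresses node chains."""
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}    # next token id -> child node
        self.kv_pages: list[int] = []                  # KV-Cache pages added along this edge
        self.ssm_snapshot: torch.Tensor | None = None  # SSM state captured at this prefix

def match_prefix(root: PrefixNode, tokens: list[int]):
    """Return (matched_len, kv_pages, ssm_state) for the longest cached prefix.
    KV pages are shared by reference; the SSM snapshot is cloned so the new
    request's in-place updates cannot corrupt the cached copy."""
    node, kv_pages, matched = root, [], 0
    for tok in tokens:
        child = node.children.get(tok)
        if child is None:
            break
        node, matched = child, matched + 1
        kv_pages.extend(child.kv_pages)
    ssm_state = node.ssm_snapshot.clone() if node.ssm_snapshot is not None else None
    return matched, kv_pages, ssm_state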
Speculative Decoding Adaptation
Speculative decoding relies on reversible KV‑Cache updates, which SSM layers lack. SGLang allocates an independent Mamba cache slot for each candidate token, forming isolated state sandboxes. When a candidate is validated, its slot’s final state is promoted to the main SSM state, avoiding recomputation. For top‑K > 1 scenarios, parent‑node indices are recorded so the system can trace back and recursively update the appropriate sandbox.
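The per‑candidate sandbox idea can be sketched as follows. step_fn stands in for one Mamba recurrence step, and the function name, signature, and tree encoding (parents[i] is the index of draft i's parent, with -1 meaning the committed state) are assumptions made for illustration rather than SGLang's implementation.

```python
import torch

def verify_draft_tree(committed_state: torch.Tensor,
                      draft_tokens: list[int],
                      parents: list[int],
                      step_fn,
                      accepted_index: int) -> torch.Tensor:
    """Give every draft token its own SSM state sandbox (illustrative sketch).
    For top-k > 1, parents[] records each draft's parent node so a child starts
    from its parent's sandbox rather than from the committed state."""
    sandboxes: list[torch.Tensor] = []
    for i, tok in enumerate(draft_tokens):
        source = committed_state if parents[i] < 0 else sandboxes[parents[i]]
        sandbox = source.clone()       # isolated slot: in-place updates stay local
        step_fn(sandbox, tok)          # advance the SSM state by this draft token
        sandboxes.append(sandbox)
    # Verification picks the accepted leaf; promote its final state and drop the rest,
    # so the accepted path never has to be recomputed.
    return sandboxes[accepted_index]
```

The returned tensor then overwrites the main SSM state, which corresponds to the promotion step described above.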
PD‑Separation Extension
The existing PD‑separation architecture is extended with dedicated transmission channels for non‑attention states (e.g., Mamba SSM states). During Prefill, the final SSM state is transferred as a whole to the Decode instance, which has pre‑allocated slots for both KV‑Cache pages and Mamba states, ensuring seamless continuation of inference.
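A minimal sketch of that hand‑off, with the transport layer and page registration abstracted away; pack_prefill_output, install_on_decode, and kv_slot_table.register are hypothetical names for illustration, not SGLang's PD‑separation API.

```python
import torch

def pack_prefill_output(kv_pages: list[int], final_ssm_state: torch.Tensor) -> dict:
    """Prefill side: bundle the KV-Cache page ids with the whole final SSM state."""
    return {"kv_pages": kv_pages, "ssm_state": final_ssm_state.detach().cpu()}

def install_on_decode(payload: dict, kv_slot_table, mamba_slot: torch.Tensor) -> None:
    """Decode side: copy the transferred state into slots pre-allocated for this
    request, so decoding continues as if prefill had run locally."""
    kv_slot_table.register(payload["kv_pages"])                   # hypothetical hook
    mamba_slot.copy_(payload["ssm_state"].to(mamba_slot.device))
```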
Performance Validation
Running SGLang v0.5.5 on an H200 GPU with the Qwen3‑Next‑80B‑A3B‑Instruct‑FP8 model, prefix matching reduced time to first token (TTFT) to 57.63% of the baseline. Speculative decoding benchmarks (batch size = 1) showed throughput improvements:
MTP = 2, top‑k = 1 → 257 tokens/s (average accepted length ≈ 2.71 tokens)
MTP = 3, top‑k = 1 → 307 tokens/s (average accepted length ≈ 3.41 tokens)
MTP = 4, top‑k = 4, 8 draft tokens → 325 tokens/s (average accepted length ≈ 4.23 tokens)
Future Directions
Generalise the MambaRadixCache to support flexible page sizes and deeper integration with Multi‑Token Prediction, the Overlap Scheduler, and Branching Position mechanisms.
Integrate Alibaba Cloud Tair KVCache’s HiCache hierarchical cache with SGLang’s hybrid pipeline for higher hit rates in massive data scenarios.
Advance bit‑level deterministic inference to eliminate nondeterministic numerical drift, improving reproducibility and production reliability.
References
SGLang Hybrid Models – https://pytorch.org/blog/hybrid-models-meet-sglang-more-than-full-attention/