Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.


Why inference is slow

Large language models generate text autoregressively, one token at a time. For each new token the model computes attention by taking the dot‑product of the new token's query with the keys of all previous tokens, then using the resulting weights to mix the values. Without caching, step i re‑encodes all i previous tokens, so generating n tokens costs ∑_{i=1}^{n} i = n(n+1)/2, i.e. O(n²) total operations, which is the fundamental bottleneck. Yet the keys and values of a token never change once computed, so most of that work is redundant.

KV Cache: From O(n²) to O(n)

The key insight is to cache the keys and values after they are first computed. During the prefill stage the model processes the entire prompt once and stores K and V for every token. In the subsequent decode stage each step only computes the query for the new token and reuses the cached K/V. For a 9‑token prompt followed by 16 generated tokens the total attention operations drop from 264 (∑_{i=9}^{24} i) to 24, an 11× reduction in raw computation and roughly a 2× wall‑clock speedup on a 1.5 B parameter model.
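As a toy illustration (random weights and a stand‑in hidden‑state update, not the course's TinyLLM code), the prefill/decode split looks like this: the prompt is projected to K and V once, and each decode step projects only the newest token before appending it to the cache.

import torch

d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # softmax(q K^T / sqrt(d)) V for a single query vector
    scores = (q @ K.T) / (d ** 0.5)
    return torch.softmax(scores, dim=-1) @ V

# Prefill: compute and cache K/V for the whole 9-token prompt once
prompt = torch.randn(9, d)
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each step projects only the newest token and reuses the cache
x = prompt[-1]
for _ in range(16):
    out = attend(x @ Wq, K_cache, V_cache)
    x = out  # toy stand-in for the next token's hidden state
    K_cache = torch.cat([K_cache, (x @ Wk).unsqueeze(0)])
    V_cache = torch.cat([V_cache, (x @ Wv).unsqueeze(0)])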

def _attention_impl(q, k, v, scale, mask):
    # core: softmax(Q @ K^T / sqrt(d_k)) @ V; mask is boolean, True means "attend"
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    scores = scores.masked_fill(~mask, float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

The course demonstrates monkey‑patching F.scaled_dot_product_attention with a custom implementation so that every layer runs the rewritten code while producing identical token‑level outputs.
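The article does not reproduce the patch itself, but a minimal sketch of the idea might look like the following (assuming boolean attention masks and reusing _attention_impl from above; GQA head expansion is handled separately in the next section, and the course's actual hook may differ):

import torch
import torch.nn.functional as F

_original_sdpa = F.scaled_dot_product_attention  # keep a handle to restore later

def patched_sdpa(query, key, value, attn_mask=None, dropout_p=0.0,
                 is_causal=False, scale=None, **kwargs):
    # Sketch only: route every attention call through the reference implementation above.
    d_k = query.shape[-1]
    scale = scale if scale is not None else 1.0 / (d_k ** 0.5)
    L_q, L_k = query.shape[-2], key.shape[-2]
    if attn_mask is not None:
        mask = attn_mask  # assumed boolean (True = attend)
    else:
        mask = torch.ones(L_q, L_k, dtype=torch.bool, device=query.device)
        if is_causal:
            # right-aligned causal mask so cached prefixes (L_q < L_k) still work
            mask = mask.tril(diagonal=L_k - L_q)
    return _attention_impl(query, key, value, scale, mask)

F.scaled_dot_product_attention = patched_sdpa  # every layer now runs the rewritten code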

Grouped Query Attention (GQA)

Modern models such as DeepSeek and Llama use GQA, in which many query heads share a smaller set of key/value heads (e.g., 64 query heads sharing 8 KV heads). In that configuration the KV cache shrinks by a factor of 8, with negligible accuracy loss.

def simple_causal_attention(query, key, value, **kwargs):
    # query: (B, H_q, L_q, Dh); key/value: (B, H_kv, L_k, Dh)
    Dh = query.shape[-1]
    scale = 1.0 / (Dh ** 0.5)
    # expand the shared KV heads so every query head sees its group's KV head
    gqa_group_size = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(gqa_group_size, dim=1)
    value = value.repeat_interleave(gqa_group_size, dim=1)
    # right-aligned causal mask so cached prefixes (L_q < L_k) still work
    L_q, L_k = query.shape[-2], key.shape[-2]
    mask = torch.ones(L_q, L_k, dtype=torch.bool, device=query.device).tril(diagonal=L_k - L_q)
    # finish with the standard attention core defined above
    return _attention_impl(query, key, value, scale, mask)

RadixAttention: Cross‑request caching

KV Cache eliminates duplicate work within a single request, but in Retrieval‑Augmented Generation (RAG) many users ask different questions about the same document, causing the same document tokens to be recomputed for each request. RadixAttention solves this by indexing cached KV pairs in a radix (prefix) tree.

class CacheEntry:
    def __init__(self, token_ids, kv_cache):
        self.token_ids = list(token_ids)
        self.kv_cache = kv_cache

class FlatRadixTree:
    def __init__(self):
        self.entries = []
    def insert(self, token_ids, kv_cache):
        self.entries.append(CacheEntry(token_ids, kv_cache))
    def search(self, token_ids):
        # longest‑prefix match
        best_match, best_len = None, 0
        for entry in self.entries:
            match_len = 0
            for a, b in zip(entry.token_ids, token_ids):
                if a != b:
                    break
                match_len += 1
            if match_len > best_len:
                best_len, best_match = match_len, entry
        return best_match

Production SGLang uses an O(log n) search; the simplified version uses linear scanning for clarity.
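
A quick way to see the longest‑prefix behaviour, using placeholder token ids and string payloads instead of real KV tensors:

radix = FlatRadixTree()
radix.insert([1, 2, 3, 4, 5], "kv-for-doc-A")
radix.insert([1, 2, 9, 9], "kv-for-doc-B")

# A request sharing the first four tokens of doc A matches that entry,
# even though its tail diverges.
hit = radix.search([1, 2, 3, 4, 7, 8])
print(hit.token_ids)  # [1, 2, 3, 4, 5]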

Four‑step runtime flow

Traverse: radix.search(token_ids) finds the longest cached prefix.

Reuse: If a match is found, the cached KV tensors are loaded.

Compute: Only the uncached suffix (e.g., the user’s new question) is processed by the model.

Store: The newly computed KV pairs are inserted back into the tree with radix.insert().

Example usage:

radix = FlatRadixTree()
for question in article_questions:
    prompt = construct_prompt(article, question)
    token_ids = tiny_llm.tokenize(prompt)
    prefix_cache = radix.search(token_ids)  # 1. traverse: find the longest cached prefix
    output, cached_req = tiny_llm.generate_with_prefix_cache(
        prompt, max_new_tokens=16, prefix_cache=prefix_cache, temperature=0)  # 2-3. reuse + compute
    radix.insert(cached_req.token_ids, cached_req.kv_cache)  # 4. store the new KV pairs

Experimental results

Two 2,000‑character technical articles, each with six questions.

Without prefix caching each question re‑processes the whole document.

With RadixAttention the first question incurs a cache miss; the remaining five hit the cache, skipping roughly 97 % of document tokens and achieving a ~2× speedup (≈20 s saved).

In a mixed‑document experiment with 12 randomly ordered prompts from the two articles, only the first request per article missed the cache (2 misses total), yielding an 83 % hit rate. The tree structure allows multiple branches to coexist, so switching documents does not evict other caches.

Core principle: the longer the shared prefix, the greater the acceleration.

Applicable scenarios

RAG systems: cache document context; typical speedup 5‑10×.

Chatbots: cache system prompt + conversation history; typical speedup 3‑5×.

Few‑shot learning: cache example samples; typical speedup 4‑8×.

Code generation: cache repository context; typical speedup 3‑6×.

Takeaways

The KV Cache eliminates redundant computation within a single request, while RadixAttention extends the benefit across requests that share prefixes. Combined, they provide multiplicative performance gains for LLM serving workloads. The hands‑on code (three‑line cache lookup, generation, and insertion) makes the underlying mechanisms transparent, enabling engineers to debug and extend SGLang or other inference engines.
