Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference
The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.
1. How an LLM Generates Tokens
The Transformer first processes the entire input sequence, producing a hidden state for each token. These hidden states are projected to logits, but only the last token's logits are used to sample the next token. After sampling, the new token is appended to the sequence and the process repeats.
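The loop above can be sketched in a few lines. This is a toy: `forward` is a stand-in for a real Transformer forward pass (here a trivial deterministic rule over a tiny vocabulary), and sampling is greedy.

```python
def forward(tokens):
    """Stand-in for a Transformer forward pass: returns one logits
    vector per input token (a trivial rule, not a real model)."""
    vocab_size = 8
    return [[(t + i) % vocab_size for i in range(vocab_size)] for t in tokens]

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        logits = forward(tokens)            # hidden states -> logits for every token
        last = logits[-1]                   # only the last token's logits are used
        next_token = last.index(max(last))  # greedy sampling
        tokens.append(next_token)           # append and repeat
    return tokens
```

Note that `forward` is called on the whole growing sequence every step; the rest of the article is about removing exactly that waste.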
2. What Attention Computes
Each Transformer layer creates three vectors for every token: query (Q), key (K), and value (V). Attention takes the dot product of the query with every key to obtain scores, normalizes the scores with a softmax, and uses the resulting weights to form a weighted sum of the values. For the newest token, this calculation needs its own Q together with the K and V of all previous tokens.
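A minimal sketch of that computation for the newest token only, in plain Python (vectors as lists; a single head, no projections or masking):

```python
import math

def attend_last(q, keys, values):
    """Scaled dot-product attention for the newest token: its query q
    against all keys/values seen so far (lists of equal-length vectors)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

The key observation for the rest of the article: `q` belongs to one token, but `keys` and `values` span the entire history.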
3. Where Redundant Computation Occurs
Generating token 50 requires K and V for tokens 1‑50; token 51 requires K and V for tokens 1‑51. The K and V for tokens 1‑49 were already computed and never change, yet without caching they are recomputed at every step, costing O(n) redundant work per token and O(n²) over the whole generation.
4. How KV Cache Solves the Problem
The solution is to store the already‑computed K and V. For each new token the engine:
Computes Q, K, V only for the newest token.
Appends the new K and V to the cache.
Retrieves the full history of K and V from the cache.
Runs attention using the new Q against the complete cached K and V.
This means each layer adds only one new K and V per step; all other values are read from memory.
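The four steps above can be sketched as a per-layer cache object. This is a minimal single-head illustration, not a production implementation (real engines store tensors per head and per layer, often in paged blocks):

```python
import math

class KVCache:
    """Per-layer cache sketch: stores every token's key/value vectors."""
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, q, k, v):
        self.keys.append(k)                  # append only the newest K, V
        self.values.append(v)
        d = len(q)
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]      # new Q against the full cached history
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        return [sum(wi / total * val[i] for wi, val in zip(w, self.values))
                for i in range(d)]
```

Prefill populates the cache once by running `decode_step` over every prompt token; after that, each generated token costs one projection plus a linear scan of cached entries, instead of recomputing the whole history.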
5. Why the First Token Is Slow (TTFT)
The initial prompt triggers a “prefill” phase where the model computes K and V for every token in the prompt and fills the cache. This full forward pass dominates the request latency. Once the cache is warm, subsequent tokens require only a single‑token forward, dramatically reducing latency.
The latency of this first token is called time‑to‑first‑token (TTFT). Longer prompts increase TTFT, and various techniques such as chunked prefill, speculative decoding, or prompt caching aim to reduce it.
6. Cost: Saving Compute, Consuming Memory
KV cache trades memory for compute. Storing K and V for every layer and token can consume several gigabytes of GPU memory—for example, a 72‑billion‑parameter model with 80 layers and a 32K context may use multiple GB per request.
High concurrency amplifies this pressure: across many simultaneous requests, the aggregate KV cache can exceed the size of the model weights themselves. Techniques like grouped‑query attention (GQA) and multi‑query attention (MQA) share key/value heads across query heads to cut memory usage with limited quality loss.
This memory pressure also explains why doubling the context window is difficult: the cache size grows linearly with context length, reducing the number of concurrent users a server can support.
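The back-of-envelope arithmetic is straightforward. The 80-layer and 32K-context figures come from the article; the KV head count, head dimension, and fp16 dtype below are illustrative assumptions (a GQA configuration with 8 KV heads of dimension 128), not published specs:

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one request: a K and a V vector per token,
    per layer, per KV head, at bytes_per_elem per element (2 = fp16)."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

size = kv_cache_bytes(layers=80, seq_len=32_768, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB per request")  # 10.0 GiB under these assumptions
```

Doubling `seq_len` doubles the cache per request, which is exactly why context-window growth translates directly into fewer concurrent users per GPU.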
7. One‑Page Summary
KV cache removes the repeated K and V calculations in autoregressive generation. Each historical token's K and V are computed once and cached; each new token adds only its own Q, K, and V and reuses the full cache for attention. In practice this yields roughly a 5× speedup at the expense of increased GPU memory, a constraint that often becomes the bottleneck in large‑scale serving stacks such as vLLM, TGI, and TensorRT‑LLM.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
AI Tech Publishing
In the fast-evolving AI era, we thoroughly explain stable technical foundations.
