Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

This article explains how the KV cache eliminates redundant computation in autoregressive LLM generation, detailing the attention mechanism, the O(n²) cost of recomputing K and V, the cache-based solution, its effect on time-to-first-token, and the memory-versus-speed trade-off.


1. How an LLM Generates Tokens

The Transformer first processes the entire input sequence, producing a hidden state for each token. Those hidden states are projected to logits, but only the last token's logits are used to sample the next token. After sampling, the new token is appended to the sequence and the process repeats.

[Figure: Token generation flow]
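
A minimal sketch of this loop, using a random-projection stand-in for the Transformer stack (all names, shapes, and values below are illustrative, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 100, 16
W_out = rng.standard_normal((HIDDEN, VOCAB))       # projects hidden states to logits

def forward(token_ids):
    # Stand-in for the full Transformer: one hidden state per input token.
    return rng.standard_normal((len(token_ids), HIDDEN))

tokens = [1, 5, 7]                                  # the prompt
for _ in range(4):
    hidden = forward(tokens)                        # process the whole sequence
    logits = hidden[-1] @ W_out                     # only the last token's logits are used
    next_token = int(np.argmax(logits))             # sample (greedily here)
    tokens.append(next_token)                       # append and repeat
print(tokens)
```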

2. What Attention Computes

Each Transformer layer projects every token into three vectors: a query (Q), a key (K), and a value (V). Attention takes the dot product of Q with each K to obtain scores, normalizes them with a softmax, and uses the resulting weights to form a weighted sum of the V vectors. For the newest token, this calculation needs only the latest Q, but it needs the K and V of every token in the sequence.

[Figure: Attention vectors]
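
As a concrete sketch, one attention head's computation for the newest token might look like the following (the shapes and random Q/K/V are purely illustrative):

```python
import numpy as np

n, d = 6, 8                        # 6 tokens so far, head dimension 8 (illustrative)
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))    # one query per token
K = rng.standard_normal((n, d))    # one key per token
V = rng.standard_normal((n, d))    # one value per token

q_last = Q[-1]                                    # query of the newest token
scores = K @ q_last / np.sqrt(d)                  # dot product with every key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all positions
out = weights @ V                                 # weighted sum of all values
print(out.shape)                                  # (8,): one output vector for the token
```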

3. Where Redundant Computation Occurs

Generating the token after position 50 requires K and V for tokens 1-50; the next step requires K and V for tokens 1-51. The K and V for tokens 1-50 were already computed in the previous step and never change, yet without caching they are recomputed from scratch at every step, which means O(n) redundant work per generated token and O(n²) over the whole generation.

[Figure: Redundant computation diagram]
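
The sketch below makes the waste explicit: without a cache, every decode step redoes the K and V projections for the entire prefix (the matrices and sizes are illustrative):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d))    # key projection (illustrative)
W_v = rng.standard_normal((d, d))    # value projection (illustrative)
X = rng.standard_normal((52, d))     # hidden inputs for tokens 1..52

projections = 0
for prefix_len in (50, 51, 52):      # three consecutive decode steps
    K = X[:prefix_len] @ W_k         # recomputes K for tokens 1..prefix_len
    V = X[:prefix_len] @ W_v         # recomputes V for tokens 1..prefix_len
    projections += 2 * prefix_len    # almost all of this work was already done last step
print(projections)                   # grows as O(n) per step, O(n^2) over a generation
```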

4. How the KV Cache Solves the Problem

The solution is to store the already‑computed K and V. For each new token the engine:

Computes Q, K, V only for the newest token.

Appends the new K and V to the cache.

Retrieves the full history of K and V from the cache.

Runs attention using the new Q against the complete cached K and V.

[Figure: KV cache workflow]

This means each layer adds only one new K and V per step; all other values are read from memory.
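
A minimal single-head, single-layer sketch of the four steps above (the projection matrices and the Python-list cache are illustrative stand-ins for a real engine's tensors):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []            # filled during prefill and earlier decode steps

def decode_step(x_new):
    # 1. Project Q, K, V only for the newest token.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    # 2. Append the new K and V to the cache.
    k_cache.append(k)
    v_cache.append(v)
    # 3. Retrieve the full history of K and V from the cache.
    K = np.stack(k_cache)
    V = np.stack(v_cache)
    # 4. Run attention: the new Q against all cached K and V.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

for _ in range(5):                   # each step adds exactly one new K and V
    out = decode_step(rng.standard_normal(d))
print(out.shape, len(k_cache))       # (8,) 5
```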

5. Why the First Token Is Slow (TTFT)

The initial prompt triggers a “prefill” phase where the model computes K and V for every token in the prompt and fills the cache. This full forward pass dominates the request latency. Once the cache is warm, subsequent tokens require only a single‑token forward, dramatically reducing latency.

[Figure: Prefill phase]

The latency of this first token is called time‑to‑first‑token (TTFT). Longer prompts increase TTFT, and various techniques such as chunked prefill, speculative decoding, or prompt caching aim to reduce it.
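
One way to see the prefill/decode split in practice is sketched below using the Hugging Face transformers API, with gpt2 as a stand-in model; the prompt and the exact timings are illustrative and depend entirely on hardware and prompt length:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Explain why the first token of an LLM response is slow. " * 20,
          return_tensors="pt").input_ids

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(ids, use_cache=True)            # prefill: whole prompt, fills the KV cache
    t1 = time.perf_counter()

    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    out = model(next_id, past_key_values=out.past_key_values,
                use_cache=True)                 # decode: a single-token forward pass
    t2 = time.perf_counter()

print(f"prefill (dominates TTFT): {t1 - t0:.3f}s, one decode step: {t2 - t1:.3f}s")
```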

6. Cost: Saving Compute, Consuming Memory

KV cache trades memory for compute. Storing K and V for every layer and every token adds up quickly: a 72-billion-parameter model with 80 layers and a 32K-token context can consume several gigabytes of GPU memory for a single request.

High concurrency amplifies this pressure; with many simultaneous requests, the combined KV cache can grow larger than the model weights themselves. Techniques such as grouped-query attention (GQA) and multi-query attention (MQA) share key/value heads across query heads, cutting memory usage with limited quality loss.
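
A back-of-envelope calculation shows the scale and how GQA helps. The layer count and head dimension below roughly match a 72B-class model, but the exact head counts are assumptions chosen for illustration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2 tensors (K and V) per layer; one vector of size head_dim per KV head per token.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative fp16 numbers for one 32K-token request of an 80-layer model.
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)
gqa      = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=32_768)
print(f"full multi-head attention: {full_mha / 2**30:.0f} GiB per request")  # ~80 GiB
print(f"grouped-query attention:   {gqa / 2**30:.0f} GiB per request")       # ~10 GiB
```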

This memory pressure also explains why doubling the context window is difficult: the cache size grows linearly with context length, reducing the number of concurrent users a server can support.

7. One‑Page Summary

KV cache removes the repeated K and V calculations in autoregressive generation. Historical tokens' K and V are computed once and cached; each new token adds only its own Q, K, and V and reuses the full cache for attention. In practice this yields roughly a 5× speedup at the expense of increased GPU memory, a constraint that often becomes the bottleneck in large-scale serving engines such as vLLM, TGI, and TensorRT-LLM.

[Figure: Speedup illustration]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, inference optimization, attention, KV cache, memory tradeoff, time-to-first-token
Written by AI Tech Publishing

In the fast-evolving AI era, we thoroughly explain stable technical foundations.