How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs
This article explains the KV Cache mechanism, which stores previously computed key/value vectors to avoid redundant Transformer computation and delivers roughly a 5× speedup. It also details why generating output tokens is far more expensive than processing input tokens: output generation is serial, and the cache trades GPU memory for compute.
Autoregressive Generation
Large language models generate text token by token. For each step the model reads the entire sequence generated so far, predicts the next token, appends it, and repeats until the response is complete.
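To make the loop concrete, here is a minimal sketch of greedy autoregressive decoding. `model_forward` is a hypothetical stand-in for a full Transformer forward pass that returns next-token logits; real decoders also add sampling, batching, and stopping criteria.

```python
# Minimal greedy autoregressive decoding loop (illustrative sketch).
# `model_forward` is a hypothetical function: it takes the full token list
# and returns a list of logits over the vocabulary for the NEXT token.

def generate(prompt_ids, model_forward, eos_id, max_new_tokens=256):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_forward(tokens)                              # reads the entire sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)   # greedy argmax
        tokens.append(next_id)                                      # append and repeat
        if next_id == eos_id:
            break
    return tokens
```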
Redundant Computation Without KV Cache
During each Transformer attention step the model computes Query (Q), Key (K) and Value (V) vectors. Only the newest token needs a fresh Q; the K and V of earlier tokens never change. Without caching, generating token n still recomputes K and V for all n‑1 previous tokens, so an n‑token output costs O(n²) work.
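A quick way to see the quadratic blow-up is to count K/V projections while generating n output tokens, with and without a cache (illustrative arithmetic only, ignoring the prompt and per-layer details):

```python
def kv_projections_without_cache(n):
    # Step t re-projects K/V for all t tokens seen so far: 1 + 2 + ... + n, i.e. O(n^2).
    return sum(range(1, n + 1))

def kv_projections_with_cache(n):
    # Step t projects K/V only for the newest token: n total, i.e. O(n).
    return n

print(kv_projections_without_cache(1000))  # 500500
print(kv_projections_with_cache(1000))     # 1000
```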
KV Cache Mechanism
1. Compute Q, K, V only for the newly generated token.
2. Append the new K and V to a persistent cache.
3. When computing attention, fetch all K and V from the cache and use them with the fresh Q.
This turns the per‑token K/V computation from O(n) to O(1), cutting the total redundant work from O(n²) to O(n); attention itself still reads all cached keys and values. In practice the speedup is about 5×. Major inference frameworks such as vLLM, TGI and TensorRT‑LLM embed KV Cache as a core feature.
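As a sketch of how a single decode step uses the cache, the NumPy snippet below implements one attention head with no batching, masking, or multi-head logic; `W_q`, `W_k`, `W_v` stand for one layer's projection matrices, and the shapes are simplified assumptions rather than any particular framework's API.

```python
import numpy as np

def decode_step(x_new, cache_k, cache_v, W_q, W_k, W_v):
    """One single-head attention decode step using a KV cache.

    x_new:   (d,)   hidden state of the newest token only
    cache_k: (t, d) cached keys for the t previous tokens
    cache_v: (t, d) cached values for the t previous tokens
    """
    q = x_new @ W_q                              # fresh Q for the new token
    k = x_new @ W_k                              # K for the new token...
    v = x_new @ W_v                              # ...and V for the new token
    cache_k = np.vstack([cache_k, k])            # append to the persistent cache
    cache_v = np.vstack([cache_v, v])
    scores = cache_k @ q / np.sqrt(q.shape[-1])  # attend over ALL cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over past + current tokens
    out = weights @ cache_v                      # weighted sum of cached values
    return out, cache_k, cache_v
```

A real implementation keeps one such cache per layer (and per KV head) for every in-flight request, which is exactly where the memory cost discussed below comes from.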
Prefill Phase and TTFT
When a request arrives, the model first processes the entire prompt to populate the KV cache – the “prefill” stage. Prefill is parallelized across the GPU, so the cost per input token is low, but its duration determines the Time‑To‑First‑Token (TTFT): longer prompts mean longer prefill and a longer wait before the first output token appears. After prefill, token generation proceeds serially (each token must wait for the previous one), which is what makes the decode phase slow.
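The split can be sketched with a back-of-the-envelope latency model; the throughput numbers below are made-up placeholders, not measurements of any particular GPU or model.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tok_per_s=5_000.0,  # placeholder: parallel prefill throughput
                     decode_tok_per_s=50.0):     # placeholder: serial decode throughput
    ttft = prompt_tokens / prefill_tok_per_s         # TTFT is roughly the prefill time
    decode_time = output_tokens / decode_tok_per_s   # serial, one token at a time
    return ttft, ttft + decode_time

ttft, total = estimate_latency(prompt_tokens=8_000, output_tokens=500)
print(f"TTFT ~ {ttft:.2f} s, total ~ {total:.2f} s")  # TTFT ~ 1.60 s, total ~ 11.60 s
```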
Memory Cost of KV Cache
KV Cache trades compute for GPU memory. For example, Qwen 2.5 72B (80 layers, 32K context, hidden size 8192) consumes several gigabytes of memory per request for the cache, and memory usage grows linearly with context length (a rough sizing sketch follows the list below). Two engineering mitigations are:
GQA/MQA: groups of query heads (in MQA, all of them) share a single K/V head, reducing cache memory with minimal quality loss (used in Llama 3, Qwen, Gemma, etc.).
PagedAttention: manages the KV cache in memory pages to avoid fragmentation (the core technique of vLLM).
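To put rough numbers on this trade-off, the sketch below sizes the cache as 2 (K and V) × layers × context length × KV width × bytes per value. The head configuration assumed here for Qwen 2.5 72B (64 query heads, 8 KV heads, head dim 128, i.e. GQA) is my reading of the published config; verify it against the actual model card before relying on it.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    # 2 tensors (K and V) per layer, each of shape (seq_len, num_kv_heads * head_dim),
    # stored in fp16/bf16 (2 bytes per value) by default.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

# Full multi-head attention (64 KV heads) vs. GQA (8 KV heads) at 32K context:
mha = kv_cache_bytes(num_layers=80, seq_len=32_768, num_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(num_layers=80, seq_len=32_768, num_kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**30:.1f} GiB per request")  # ~80 GiB
print(f"GQA: {gqa / 2**30:.1f} GiB per request")  # ~10 GiB
```

The GQA figure is the one roughly consistent with the several-gigabytes-per-request estimate above, and it shows why sharing K/V heads matters so much at long context.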
Token Pricing Differences
Regular input token: $3 / 1M tokens – processed in parallel during prefill, high GPU utilization.
Cache write (first‑time cache): $3.75 / 1M tokens – slightly higher because the K/V vectors must be persisted.
Cache hit (read): $0.30 / 1M tokens – computation is skipped; only a memory read is needed.
Output token: $15 / 1M tokens – generated serially, so the GPU spends much of its time waiting on the previous token.
The price gap reflects the difference between parallel input processing and serial output generation.
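As a toy illustration of that gap, here is a cost calculation using the per-million-token prices listed above for a batch of requests sharing one long system prompt; the request shape (1,000 requests, 10K-token shared prompt, 200-token question, 500-token answer) is invented for the example.

```python
PRICE = {  # USD per 1M tokens, from the list above
    "input": 3.00,
    "cache_write": 3.75,
    "cache_read": 0.30,
    "output": 15.00,
}

def batch_cost(n_requests, system_tokens, user_tokens, output_tokens, use_cache):
    # Per-request question and answer are always billed at the input/output rates.
    cost = n_requests * (user_tokens * PRICE["input"] + output_tokens * PRICE["output"]) / 1e6
    if use_cache:
        # First request writes the shared system prompt into the cache; the rest read it.
        cost += (system_tokens * PRICE["cache_write"]
                 + (n_requests - 1) * system_tokens * PRICE["cache_read"]) / 1e6
    else:
        cost += n_requests * system_tokens * PRICE["input"] / 1e6
    return cost

print(batch_cost(1000, 10_000, 200, 500, use_cache=False))  # 38.10 USD
print(batch_cost(1000, 10_000, 200, 500, use_cache=True))   # ~11.13 USD
```

In this toy example the spend on the shared prompt itself drops from $30.00 to about $3.03, roughly a 90% reduction; how much of the total bill that saves depends on how large the shared prefix is relative to the outputs.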
Practical Implication
When using LLM APIs for batch workloads, keep the system prompt constant across requests so that the KV cache can be reused. Prompt caching can reduce API costs by more than 80%.
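As a concrete example, Anthropic's Messages API (see the prompt caching documentation cited below) lets you mark the shared system prompt as cacheable with a `cache_control` block. The sketch below follows that documentation; the model name and prompt are placeholders, and field names and minimum cacheable prompt sizes should be checked against the current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the large shared instructions/context reused across requests

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix for reuse
        }
    ],
    messages=[{"role": "user", "content": "First question about the shared context."}],
)
print(response.content[0].text)
```

Keeping the cached prefix byte-identical across requests is what lets subsequent calls hit the cache at the discounted read rate.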
Citations
Original X post on KV Caching: https://x.com/_avichawla/status/2034902650534187503
Anthropic Prompt Caching documentation: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
ShiZhen AI
Tech blogger with over 10 years of experience at leading tech firms; an AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure (AI leisure community). 🛰 szzdzhp001
