How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs
This article explains the KV Cache mechanism, which stores previously computed key/value vectors to avoid redundant Transformer computation and delivers roughly a 5× speedup. It also details why generating output tokens is far more expensive than processing input tokens: output generation is serial, and the cache trades GPU memory for compute.
Autoregressive Generation
Large language models generate text token by token. For each step the model reads the entire sequence generated so far, predicts the next token, appends it, and repeats until the response is complete.
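To make the loop concrete, here is a minimal sketch of greedy autoregressive decoding. `model_forward` is a hypothetical stand-in for a full Transformer forward pass that returns next-token logits; real decoders also add sampling, batching, and stopping criteria.

```python
# Minimal greedy autoregressive decoding loop (illustrative sketch).
# `model_forward` is a hypothetical function: it takes the full token list
# and returns a list of logits over the vocabulary for the NEXT token.

def generate(prompt_ids, model_forward, eos_id, max_new_tokens=256):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_forward(tokens)                              # reads the entire sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)   # greedy argmax
        tokens.append(next_id)                                      # append and repeat
        if next_id == eos_id:
            break
    return tokens
```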
Redundant Computation Without KV Cache
During each Transformer attention step the model computes Query (Q), Key (K) and Value (V) vectors. Only the newest token needs a fresh Q; the K and V of earlier tokens never change. Without caching, generating token n still recomputes K and V for all n‑1 previous tokens, so an n‑token output costs O(n²) work.
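A quick way to see the quadratic blow-up is to count K/V projections while generating n output tokens, with and without a cache (illustrative arithmetic only, ignoring the prompt and per-layer details):

```python
def kv_projections_without_cache(n):
    # Step t re-projects K/V for all t tokens seen so far: 1 + 2 + ... + n, i.e. O(n^2).
    return sum(range(1, n + 1))

def kv_projections_with_cache(n):
    # Step t projects K/V only for the newest token: n total, i.e. O(n).
    return n

print(kv_projections_without_cache(1000))  # 500500
print(kv_projections_with_cache(1000))     # 1000
```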
KV Cache Mechanism
1. Compute Q, K, V only for the newly generated token.
2. Append the new K and V to a persistent cache.
3. When computing attention, fetch all K and V from the cache and use them with the fresh Q.
This turns the per‑token K/V computation from O(n) to O(1), cutting the total redundant work from O(n²) to O(n); attention itself still reads all cached keys and values. In practice the speedup is about 5×. Major inference frameworks such as vLLM, TGI and TensorRT‑LLM embed KV Cache as a core feature.
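As a sketch of how a single decode step uses the cache, the NumPy snippet below implements one attention head with no batching, masking, or multi-head logic; `W_q`, `W_k`, `W_v` stand for one layer's projection matrices, and the shapes are simplified assumptions rather than any particular framework's API.

```python
import numpy as np

def decode_step(x_new, cache_k, cache_v, W_q, W_k, W_v):
    """One single-head attention decode step using a KV cache.

    x_new:   (d,)   hidden state of the newest token only
    cache_k: (t, d) cached keys for the t previous tokens
    cache_v: (t, d) cached values for the t previous tokens
    """
    q = x_new @ W_q                              # fresh Q for the new token
    k = x_new @ W_k                              # K for the new token...
    v = x_new @ W_v                              # ...and V for the new token
    cache_k = np.vstack([cache_k, k])            # append to the persistent cache
    cache_v = np.vstack([cache_v, v])
    scores = cache_k @ q / np.sqrt(q.shape[-1])  # attend over ALL cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over past + current tokens
    out = weights @ cache_v                      # weighted sum of cached values
    return out, cache_k, cache_v
```

A real implementation keeps one such cache per layer (and per KV head) for every in-flight request, which is exactly where the memory cost discussed below comes from.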
Prefill Phase and TTFT
When a request arrives, the model first processes the entire prompt to populate the KV cache – the “prefill” stage. Prefill is parallelized across the GPU, so the cost per input token is low, but its duration determines the Time‑To‑First‑Token (TTFT): longer prompts mean longer prefill and a longer wait before the first output token appears. After prefill, token generation proceeds serially (each token must wait for the previous one), which is what makes the decode phase slow.
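The split can be sketched with a back-of-the-envelope latency model; the throughput numbers below are made-up placeholders, not measurements of any particular GPU or model.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tok_per_s=5_000.0,  # placeholder: parallel prefill throughput
                     decode_tok_per_s=50.0):     # placeholder: serial decode throughput
    ttft = prompt_tokens / prefill_tok_per_s         # TTFT is roughly the prefill time
    decode_time = output_tokens / decode_tok_per_s   # serial, one token at a time
    return ttft, ttft + decode_time

ttft, total = estimate_latency(prompt_tokens=8_000, output_tokens=500)
print(f"TTFT ~ {ttft:.2f} s, total ~ {total:.2f} s")  # TTFT ~ 1.60 s, total ~ 11.60 s
```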
Memory Cost of KV Cache
KV Cache trades compute for GPU memory. For example, Qwen 2.5 72B (80 layers, 32K context, hidden size 8192) consumes several gigabytes of memory per request for the cache, and memory usage grows linearly with context length (a rough sizing sketch follows the list below). Two engineering mitigations are:
GQA/MQA: groups of query heads (in MQA, all of them) share a single K/V head, reducing cache memory with minimal quality loss (used in Llama 3, Qwen, Gemma, etc.).
PagedAttention: manages the KV cache in memory pages to avoid fragmentation (the core technique of vLLM).
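To put rough numbers on this trade-off, the sketch below sizes the cache as 2 (K and V) × layers × context length × KV width × bytes per value. The head configuration assumed here for Qwen 2.5 72B (64 query heads, 8 KV heads, head dim 128, i.e. GQA) is my reading of the published config; verify it against the actual model card before relying on it.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    # 2 tensors (K and V) per layer, each of shape (seq_len, num_kv_heads * head_dim),
    # stored in fp16/bf16 (2 bytes per value) by default.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

# Full multi-head attention (64 KV heads) vs. GQA (8 KV heads) at 32K context:
mha = kv_cache_bytes(num_layers=80, seq_len=32_768, num_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(num_layers=80, seq_len=32_768, num_kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**30:.1f} GiB per request")  # ~80 GiB
print(f"GQA: {gqa / 2**30:.1f} GiB per request")  # ~10 GiB
```

The GQA figure is the one roughly consistent with the several-gigabytes-per-request estimate above, and it shows why sharing K/V heads matters so much at long context.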
Token Pricing Differences
Regular input token: $3 / 1M tokens – processed in parallel during prefill, high GPU utilization.
Cache write (first‑time cache): $3.75 / 1M tokens – slightly higher because the K/V vectors must be persisted.
Cache hit (read): $0.30 / 1M tokens – computation is skipped; only a memory read is needed.
Output token: $15 / 1M tokens – generated serially, so the GPU spends much of its time waiting on the previous token.
The price gap reflects the difference between parallel input processing and serial output generation.
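As a toy illustration of that gap, here is a cost calculation using the per-million-token prices listed above for a batch of requests sharing one long system prompt; the request shape (1,000 requests, 10K-token shared prompt, 200-token question, 500-token answer) is invented for the example.

```python
PRICE = {  # USD per 1M tokens, from the list above
    "input": 3.00,
    "cache_write": 3.75,
    "cache_read": 0.30,
    "output": 15.00,
}

def batch_cost(n_requests, system_tokens, user_tokens, output_tokens, use_cache):
    # Per-request question and answer are always billed at the input/output rates.
    cost = n_requests * (user_tokens * PRICE["input"] + output_tokens * PRICE["output"]) / 1e6
    if use_cache:
        # First request writes the shared system prompt into the cache; the rest read it.
        cost += (system_tokens * PRICE["cache_write"]
                 + (n_requests - 1) * system_tokens * PRICE["cache_read"]) / 1e6
    else:
        cost += n_requests * system_tokens * PRICE["input"] / 1e6
    return cost

print(batch_cost(1000, 10_000, 200, 500, use_cache=False))  # 38.10 USD
print(batch_cost(1000, 10_000, 200, 500, use_cache=True))   # ~11.13 USD
```

In this toy example the spend on the shared prompt itself drops from $30.00 to about $3.03, roughly a 90% reduction; how much of the total bill that saves depends on how large the shared prefix is relative to the outputs.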
Practical Implication
When using LLM APIs for batch workloads, keep the system prompt constant across requests so that the KV cache can be reused. Prompt caching can reduce API costs by more than 80%.
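As a concrete example, Anthropic's Messages API (see the prompt caching documentation cited below) lets you mark the shared system prompt as cacheable with a `cache_control` block. The sketch below follows that documentation; the model name and prompt are placeholders, and field names and minimum cacheable prompt sizes should be checked against the current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the large shared instructions/context reused across requests

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix for reuse
        }
    ],
    messages=[{"role": "user", "content": "First question about the shared context."}],
)
print(response.content[0].text)
```

Keeping the cached prefix byte-identical across requests is what lets subsequent calls hit the cache at the discounted read rate.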
Citations
Original X post on KV Caching: https://x.com/_avichawla/status/2034902650534187503
Anthropic Prompt Caching documentation: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
ShiZhen AI
Tech blogger with over 10 years of experience at leading tech firms; an AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure (AI leisure community). 🛰 szzdzhp001
