Decoding Prompt Caching: From PagedAttention Mechanics to Cost‑Saving Practices

The article explains how Prompt Caching leverages vLLM's PagedAttention and block‑level hashing to reuse KV cache across identical prefixes, dramatically cutting LLM inference latency and cost, and provides concrete engineering tips for maximizing cache hit rates.


Why Cache?

LLM inference consists of a compute‑heavy prefill stage that builds a KV cache for all input tokens, followed by a memory‑bandwidth‑intensive decode stage that reads the KV cache token by token. Without caching, even when 90% of the prompt repeats, the prefill computation must be redone for every request, inflating latency and cost. Prompt Caching skips the prefill for repeated prefixes, reusing the existing KV cache; for example, Anthropic Claude shows cached token pricing at roughly 10% of the original cost with noticeably faster first‑token generation.
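
A quick back-of-envelope calculation makes the savings concrete. The prices below are illustrative assumptions, not official rates:

BASE_INPUT_PRICE = 3.00 / 1_000_000          # assumed: $3 per million input tokens
CACHED_READ_PRICE = 0.10 * BASE_INPUT_PRICE   # cache reads billed at ~10% of base

prompt_tokens = 10_000     # total prompt length
cached_prefix = 9_000      # 90% of the prompt repeats across requests

without_cache = prompt_tokens * BASE_INPUT_PRICE
with_cache = cached_prefix * CACHED_READ_PRICE + (prompt_tokens - cached_prefix) * BASE_INPUT_PRICE

print(f"no cache:   ${without_cache:.4f} per request")
print(f"with cache: ${with_cache:.4f} per request ({with_cache / without_cache:.0%} of the original)")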

Technical Deep Dive: PagedAttention

Traditional KV cache allocation reserves a large contiguous block of GPU memory per request, leading to severe fragmentation. vLLM adopts an operating‑system‑style paging approach called PagedAttention: the KV cache is split into fixed‑size blocks (typically 16 tokens each) that can reside non‑contiguously in GPU memory, enabling efficient memory management and laying the groundwork for cache reuse.

Prompt Caching (also called prefix caching) builds on PagedAttention by hashing each block. If a block’s hash matches a previously computed block, the system reuses the stored KV vectors.
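
In a self‑hosted vLLM deployment, prefix caching is typically enabled with a single engine argument. A minimal sketch (the model name is only an example):

from vllm import LLM, SamplingParams

# enable_prefix_caching switches on block-level hashing and KV reuse.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

# A long, stable prefix; hits require at least one full 16-token block.
system_prompt = (
    "You are a support assistant for ExampleCorp. Answer politely, cite the "
    "relevant policy section, and escalate anything involving refunds.\n"
)
params = SamplingParams(max_tokens=128)

# Both prompts share the same prefix, so the second request's prefill
# reuses the KV blocks cached while serving the first.
llm.generate([system_prompt + "User: summarize ticket 101"], params)
llm.generate([system_prompt + "User: summarize ticket 202"], params)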

Hash Chain (Block Hashing) – The Key to Cache Hits

Each block’s hash depends on its token IDs and the hash of the preceding block, forming a chain similar to Git commits. Any change in earlier blocks invalidates all downstream hashes, meaning only fully matching prefixes can be cached; partial matches are insufficient.

When a new request arrives, its block hashes are computed and looked up in a global hash table. A hit points to existing GPU blocks, bypassing the prefill; a miss triggers normal computation and stores the new block hashes.
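
A toy sketch of the chained hashing and lookup described above (the block size mirrors vLLM's default, but the hashing details are simplified):

import hashlib

BLOCK_SIZE = 16  # tokens per block, matching vLLM's typical block size

def block_hashes(token_ids):
    # Each block's hash covers its own tokens plus the previous block's hash,
    # Git-commit style, so any change upstream invalidates everything after it.
    hashes, prev = [], b""
    for i in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        if len(block) < BLOCK_SIZE:
            break  # only full blocks are cacheable
        h = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes

kv_block_table = {}  # global, content-addressed: hash -> physical KV block id

def prefill_with_cache(token_ids):
    hits = 0
    for h in block_hashes(token_ids):
        if h in kv_block_table:
            hits += 1                                # reuse the stored KV block
        else:
            kv_block_table[h] = len(kv_block_table)  # compute and register a new block
    return hits

shared_prefix = list(range(64))  # e.g. a shared system prompt (4 full blocks)
print(prefill_with_cache(shared_prefix + [100, 101]))  # 0 hits: cold cache
print(prefill_with_cache(shared_prefix + [200, 201]))  # 4 hits: prefix blocks reused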

Common Misunderstanding: Cache Is Global, Not Private

The cache is content‑based, not tied to a user session. Identical system prompts or tool definitions can be reused across different users. In self‑hosted vLLM deployments this global reuse yields substantial savings, whereas cloud APIs may impose TTLs, minimum token thresholds, or require explicit cache breakpoints.
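
With a cloud API the reuse is still content‑based, but you may need to mark the cacheable prefix explicitly. For example, Anthropic's Messages API uses cache_control breakpoints; the field names below follow their published prompt‑caching documentation, while the model ID and prompt text are placeholders:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "Long, stable system prompt and tool instructions ...",
            # Everything up to this breakpoint becomes the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User-specific question goes here"}],
)
print(response.usage)  # reports cache_creation_input_tokens / cache_read_input_tokens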

Practical Tips for Maximizing Cache Hit Rate

Keep Prefix Stable (Stable Prefix)

Place all static content—system prompts, tool definitions, example texts—at the very beginning of the prompt. Avoid putting variable data such as timestamps or usernames at the start, as that changes the first block’s hash and breaks the entire chain.
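
For instance, a hypothetical prompt builder (the helper names and contents are made up for illustration):

from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support agent for ExampleCorp ..."  # static
TOOL_DEFS = '{"tools": []}'                                    # static

def build_prompt_bad(user_msg):
    # Timestamp first: the very first block's hash changes on every request,
    # so the whole chain, including the long static part, misses the cache.
    now = datetime.now(timezone.utc).isoformat()
    return f"{now}\n{SYSTEM_PROMPT}\n{TOOL_DEFS}\n{user_msg}"

def build_prompt_good(user_msg):
    # Static content first, variable content last: the prefix blocks stay
    # byte-identical across requests and keep hitting the cache.
    now = datetime.now(timezone.utc).isoformat()
    return f"{SYSTEM_PROMPT}\n{TOOL_DEFS}\n{user_msg}\nCurrent time: {now}"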

Deterministic Serialization

When passing JSON data (e.g., for tool calls), ensure a deterministic key order. In Python use json.dumps(..., sort_keys=True). Even though {"a":1, "b":2} and {"b":2, "a":1} are semantically identical, their byte representations differ, causing cache misses.
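
A minimal demonstration in Python:

import json

payload = {"b": 2, "a": 1}

print(json.dumps(payload))           # '{"b": 2, "a": 1}'  (insertion order)
print(json.dumps({"a": 1, "b": 2}))  # '{"a": 1, "b": 2}'  different bytes, cache miss

# Canonical form: sorted keys and fixed separators give identical bytes every time.
print(json.dumps(payload, sort_keys=True, separators=(",", ":")))  # '{"a":1,"b":2}'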

Append‑Only History

When maintaining multi‑turn dialogue, only append new content at the end. Modifying or truncating earlier history changes the hash chain and invalidates subsequent cached blocks.
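
Sketched with a typical role/content message list (the format is the common chat convention, not tied to a specific SDK):

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
]

# Good: append only; earlier messages (and their cached KV blocks) stay intact.
history.append({"role": "user", "content": "Follow-up question"})

# Bad: editing or truncating earlier turns changes the hash chain from that
# point onward and invalidates every downstream cached block.
# history[1]["content"] = "Rewritten first question"
# history = history[-2:]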

Watch Tool Definition Changes

Tool definitions are usually concatenated near the system prompt. Dynamically enabling or disabling tools per user alters the prefix, breaking cache hits. Mitigate by sorting and fixing tool definitions or maintaining separate prefix segments for different tool sets.
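
One way to keep this stable (serialize_tools and the tool names are hypothetical):

import json

def serialize_tools(tools):
    # Sort by name and serialize deterministically so the prefix bytes only
    # change when a tool definition actually changes.
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

# One stable prefix per tool set, instead of toggling tools inside a single
# prefix and invalidating the cache for every other user.
TOOLSET_PREFIXES = {
    "search_only": serialize_tools([{"name": "web_search", "parameters": {}}]),
    "full": serialize_tools([
        {"name": "web_search", "parameters": {}},
        {"name": "code_exec", "parameters": {}},
    ]),
}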

Summary

Prompt Caching’s essence is reusing computation results. By consistently placing immutable elements (system instructions, background documents, tool lists) at the front and mutable elements (user queries, dynamic variables) at the back, developers can achieve global cache reuse, dramatically reducing redundant computation and cost.


Tags: vLLM, cost optimization, hashing, LLM inference, KV cache, Prompt Caching, PagedAttention
Written by

Shi's AI Notebook

AI technology observer documenting AI evolution and industry news, sharing development practices.
