How Claude Code Achieves a 92% Prompt Caching Hit Rate with Three Unbreakable Engineering Rules

Claude Code’s prompt caching reaches a 92% hit rate, cutting the cost of a 50‑round agent session from $6 to $1.15. It gets there by separating a stable prefix from a dynamic tail, layering the cache in three tiers, matching on exact token sequences, and enforcing three strict engineering rules that keep the cache hot and reliable.

Claude Code’s team has publicly disclosed that its prompt caching reaches a 92% hit rate, reducing the cost of a 50‑round code‑generation session (roughly one million prompt tokens) from $6 to $1.15.

Why caching matters for long‑task agents

Long‑running agents suffer from two compounding problems: each round recomputes the entire history, inflating cost, and the growing context dilutes the model’s attention. Prompt caching addresses this by cleanly separating a “stable prefix” (unchanging context) from a “dynamic tail” (session‑specific updates).

Three‑layer cache architecture

The cache is split into three layers, each with a defined reuse scope:

Global layer (stable prefix): system prompts, tool definitions, and shared project documentation (e.g., CLAUDE.md). Reused across projects and sessions, cutting repeated global computation.

Project layer (stable prefix): project‑specific conventions and documentation. Reused within the same project, avoiding re‑loading per conversation.

Dynamic tail: task history, tool outputs, and the current instruction. Valid only for the current session; it grows with interaction, so keeping it small is what controls incremental cost.
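
The split above maps directly onto cache breakpoints in the request itself. The sketch below uses the Anthropic Messages API and its cache_control blocks to mark where the global and project layers end; the model id, tool definition, and prompt text are placeholders, so read it as an illustration of the layering rather than Claude Code’s actual request builder.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GLOBAL_SYSTEM = "You are a coding agent. <global rules, style guide, ...>"
PROJECT_DOCS = "<contents of CLAUDE.md: project conventions, build commands, ...>"

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model id
    max_tokens=1024,
    # Tools come first in the cached prefix; a breakpoint on the last tool caches them all.
    tools=[
        {
            "name": "read_file",
            "description": "Read a file from the workspace.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
            "cache_control": {"type": "ephemeral"},  # end of the tool block
        },
    ],
    # System blocks follow the tools; one block per stable layer.
    system=[
        {"type": "text", "text": GLOBAL_SYSTEM,
         "cache_control": {"type": "ephemeral"}},    # end of the global layer
        {"type": "text", "text": PROJECT_DOCS,
         "cache_control": {"type": "ephemeral"}},    # end of the project layer
    ],
    # The dynamic tail (task history, tool results, current instruction) stays uncached.
    messages=[
        {"role": "user", "content": "Add input validation to src/api/users.py"},
    ],
)
print(response.usage)  # includes cache_creation_input_tokens / cache_read_input_tokens
```

Everything up to the last breakpoint can be served from cache on the next call, while the messages array is free to grow.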

Cache hit mechanics

Prompt caching reuses the KV tensors generated during the prefill stage of a Transformer. Because a token’s KV vector depends only on preceding tokens, an exact token‑sequence match (tools → system → messages) allows the model to skip the expensive prefill and reuse the stored state. Any change—extra space, reordered sentence, or dynamic content—breaks the match.
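
A toy model makes the matching rule concrete. The snippet below is purely conceptual (the real cache holds KV tensors inside the serving stack, and entries are created at explicit breakpoints rather than for every prompt); it only shows that reuse is keyed on the exact token sequence, so any early change forces a full re‑prefill.

```python
# Conceptual illustration: reuse is keyed on the exact token sequence.
cached_prefixes: set[tuple[int, ...]] = set()

def prefill(tokens: list[int]) -> str:
    """Return how much of this prompt could be read from cache."""
    hit = 0
    for n in range(len(tokens), 0, -1):   # longest previously cached prefix wins
        if tuple(tokens[:n]) in cached_prefixes:
            hit = n
            break
    cached_prefixes.add(tuple(tokens))    # cache this prompt for later calls
    return f"{hit} tokens from cache, {len(tokens) - hit} tokens prefilled"

stable = [11, 12, 13, 14]                 # stands in for a 20k-token stable prefix
print(prefill(stable + [99]))             # cold start: everything prefilled
print(prefill(stable + [99, 100]))        # extends a cached prefix: mostly read
print(prefill([11, 12, 7, 14, 99]))       # one early token changed: full miss
```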

Cost breakdown

Prefill accounts for >80% of inference cost; decode is cheap. A cache write costs 1.25× a normal prefill, while a cache read costs only 0.1×. Example: a 20k‑token stable prefix repeated over 50 rounds would cost ~$6 without caching; with a single write and 49 reads the session cost drops to $1.15, an 81% saving.
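
The arithmetic is easy to reproduce. The helper below is a back‑of‑the‑envelope model using the 1.25×/0.1× multipliers above; the $6‑per‑million base price is an assumption chosen to match the $6 uncached figure, and the dynamic tail is ignored, which is why a real session total (such as the $1.15 above) lands somewhat higher than this prefix‑only estimate.

```python
def prefix_cost(prefix_tokens: int, rounds: int, price_per_mtok: float) -> dict:
    """Prefix-only cost with and without caching (write = 1.25x, read = 0.1x)."""
    per_tok = price_per_mtok / 1_000_000
    uncached = rounds * prefix_tokens * per_tok
    cached = (prefix_tokens * 1.25 * per_tok              # one cache write
              + (rounds - 1) * prefix_tokens * 0.1 * per_tok)  # reads thereafter
    return {"uncached": round(uncached, 2),
            "cached": round(cached, 2),
            "saving": f"{(1 - cached / uncached):.0%}"}

print(prefix_cost(prefix_tokens=20_000, rounds=50, price_per_mtok=6.0))
# -> {'uncached': 6.0, 'cached': 0.74, 'saving': '88%'}
```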

Practical scenario: 30‑minute session

Minute 0 (cold start) loads the stable prefix (≈20k tokens) and writes it once. Minutes 1–5 see cache reads, cutting cost by 90%. Subsequent minutes maintain a >90% hit rate because sub‑agents return only summaries, keeping the dynamic tail small.

Three immutable engineering rules

Order stability: strictly follow the API‑enforced order tools → system → messages. Changing the order invalidates the cache.

Prefix cleanliness: remove all dynamic noise (timestamps, random IDs, unordered JSON, mid‑session tool changes) from the stable prefix.

State shift: append any state change (e.g., “entering plan mode”) as the next message or tool output; never modify the stable prefix.
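
Rules two and three amount to very little code. The sketch below is an illustration rather than Claude Code’s internals; clean_prefix and the [state] marker are hypothetical, and the point is simply that volatile values never enter the prefix while state changes travel in the dynamic tail.

```python
import re

def clean_prefix(text: str) -> str:
    """Rule 2: strip dynamic noise so the prefix bytes are identical on every call."""
    # Replace ISO-style timestamps and UUIDs with stable placeholders.
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+Z?", "<timestamp>", text)
    text = re.sub(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
                  "<uuid>", text)
    return text

# Rule 3: a state change becomes the next message in the dynamic tail; the
# already-cached system prompt is never rewritten.
messages = [{"role": "user", "content": "Refactor the auth module."}]
messages.append({"role": "user",
                 "content": "[state] Entering plan mode: propose a plan before editing."})
```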

Official pitfalls and settings

Enable caching by marking breakpoints with the cache_control parameter; the API then reuses any matching cached prefix automatically.

Prefix lookback covers only ~20 content blocks before each breakpoint; longer conversations need additional manual breakpoints (see the sketch after this list).

Minimum cacheable length differs by model (Opus 4.5+ requires 4096 tokens, Sonnet 4.6 requires 2048 tokens).

Send concurrent requests only after the first response has begun; a cache entry only becomes usable at that point, so firing requests earlier causes duplicate prefill work.
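
In practice the first two settings look roughly like the sketch below, written against the Anthropic Messages API. Content is kept in block form so cache_control can attach to it, and a single manual breakpoint is carried on the newest turn so the reusable prefix stays within the ~20‑block lookback window; the API currently allows up to four breakpoints per request, and the surrounding session plumbing here is hypothetical.

```python
def add_turn(messages: list[dict], role: str, text: str) -> list[dict]:
    """Append a turn and move the manual cache breakpoint onto it."""
    # Drop the breakpoint from earlier turns so only the newest block carries it.
    for msg in messages:
        for block in msg["content"]:
            block.pop("cache_control", None)
    messages.append({
        "role": role,
        "content": [{"type": "text", "text": text,
                     "cache_control": {"type": "ephemeral"}}],
    })
    return messages

history: list[dict] = []
add_turn(history, "user", "Run the test suite and summarize the failures.")
add_turn(history, "assistant", "3 failures, all in test_auth.py; details attached.")
add_turn(history, "user", "Fix the first failure.")   # breakpoint now sits here
```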

From caching to context engineering

Prompt caching eliminates repeated computation of stable content, while context engineering decides which content should stay in the prompt. Claude Code’s hybrid strategy keeps stable content in cache, delegates lengthy operations to sub‑agents, and avoids feeding everything into the main context.
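
A minimal sketch of that delegation pattern, with hypothetical run_subagent and summarize helpers (Claude Code’s real interfaces are not public): the long transcript stays inside the sub‑agent’s own context, and only a short summary is appended to the main conversation, so the dynamic tail and the next prefill stay small.

```python
def run_subagent(task: str) -> str:
    # Placeholder: in a real system this is a separate model call with its own context.
    return f"(long transcript of working on: {task})"

def summarize(transcript: str, max_chars: int = 300) -> str:
    # Placeholder: in a real system this is a cheap summarization pass.
    return transcript[:max_chars]

def delegate(main_messages: list[dict], task: str) -> None:
    transcript = run_subagent(task)   # thousands of tokens never leave the sub-agent
    summary = summarize(transcript)   # only this crosses into the main context
    main_messages.append({"role": "user",
                          "content": f"[sub-agent result] {summary}"})
```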

7‑step checklist to boost hit rate

Draw a prefix‑layer diagram to locate cost hotspots.

Clean prefix noise by stripping dynamic elements.

Trim the prefix; move detailed data into on‑demand Skills.

Control tool density: keep a lightweight tool list in the prefix and load heavy tools on demand.

Divert long outputs to hooks/sub‑agents.

Monitor cache write/read token counts and alert if the hit rate falls below 80% (see the monitoring sketch after this checklist).

Compress commands without altering the stable prefix.
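
For step six, the usage block returned with each Anthropic API response already carries the needed counters (input_tokens, cache_creation_input_tokens, cache_read_input_tokens). The sketch below computes the hit rate as the share of prompt tokens served from cache; treat the field handling and the 80% threshold as a starting point to adapt to your own monitoring stack.

```python
def cache_hit_rate(usage) -> float:
    """Share of prompt tokens served from cache for one API response."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0       # 0.1x-priced tokens
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0  # 1.25x-priced tokens
    fresh = usage.input_tokens                                     # uncached prompt tokens
    total = read + wrote + fresh
    return read / total if total else 0.0

def alert_if_cold(usage, threshold: float = 0.80) -> None:
    rate = cache_hit_rate(usage)
    if rate < threshold:
        print(f"ALERT: cache hit rate {rate:.0%} is below {threshold:.0%}")
```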

In summary, Claude Code’s 92% cache hit rate is not a lucky trick but the result of disciplined context separation and strict adherence to three engineering rules, turning prompt caching from a “nice‑to‑have” feature into a cost‑saving necessity for high‑performance agents.
