How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%

This article explains the prompt‑caching mechanism used by Claude Code, showing how separating static prefixes from dynamic tails and leveraging KV‑tensor caching reduces the O(n²) complexity of transformer pre‑fill to O(n), achieving a 92% cache hit rate and up to 81% cost savings in long‑running AI agent sessions.

Background: The Context Tax

Every step an AI agent takes incurs a "context tax" because it must reread the entire prompt—including system instructions, tool definitions, and previously loaded project context—before processing new user input, leading to massive redundant token computation.

Prompt Caching Concept

Prompt caching stores the static part of the prompt (the static prefix) once and reuses it for subsequent requests, while the dynamic tail (user messages, tool outputs, observations) is processed anew each turn.

Static Prefix: system instructions, tool definitions, project context, behavioral rules – unchanged across all turns.

Dynamic Tail: user messages, tool results, terminal observations – varies per request.

The static prefix is cached as a mathematical state (KV tensors) after the transformer’s pre‑fill stage, allowing later requests to skip recomputation of those tokens.
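
As a concrete sketch of this split, the snippet below marks a static system prompt for caching with Anthropic's Messages API and keeps the per-turn conversation as the dynamic tail. The cache_control breakpoint and the anthropic Python SDK follow the public API; the model name and the exact layout Claude Code uses internally are assumptions for illustration.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Static prefix: kept byte-identical on every request so its KV state can be reused.
STATIC_SYSTEM = [
    {
        "type": "text",
        "text": "You are a coding agent. <system instructions, project context, rules>",
        "cache_control": {"type": "ephemeral"},  # cache everything up to this block
    }
]

def run_turn(history: list[dict], user_msg: str):
    """One agent turn: only the dynamic tail (history + new message) is new work."""
    history = history + [{"role": "user", "content": user_msg}]
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: any cache-capable model, kept fixed all session
        max_tokens=1024,
        system=STATIC_SYSTEM,       # byte-identical on every call -> served from cache
        messages=history,           # dynamic tail, prefilled from scratch each turn
    )
    return response, history
```

Because the system block never changes, subsequent calls read its KV state from cache; only the growing messages list pays full prefill cost.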

Transformer Mechanics

Stage 1: Prefill

During prefill the model processes the entire input prompt, performing dense matrix multiplications for every token to produce Query, Key, and Value vectors. This stage is compute‑bound and expensive.

Stage 2: Decode

During decode the model generates output tokens one by one, primarily reading the previously computed KV state; this stage is memory‑bound.

Key and Value vectors depend only on preceding tokens, so once they are computed for the static prefix they never change, enabling caching.
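
A toy single-head attention sketch in NumPy (random stand-in weights, no positional encoding, nothing to do with Claude's real architecture) showing why the prefix's Key/Value state can be computed once during prefill and only extended during decode:

```python
import numpy as np

d = 64                                        # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def prefill(prefix_embeddings):
    """Compute Key/Value tensors for the whole static prefix (the compute-bound step)."""
    return prefix_embeddings @ Wk, prefix_embeddings @ Wv  # this is what a prompt cache stores

def decode_step(cached_K, cached_V, new_token):
    """Each decode step appends one K/V row and attends over the cached state (memory-bound)."""
    K = np.vstack([cached_K, new_token @ Wk])
    V = np.vstack([cached_V, new_token @ Wv])
    q = new_token @ Wq
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, K, V

prefix = rng.standard_normal((200, d))        # stand-in for a 200-token static prefix
K, V = prefill(prefix)                        # paid once, reused on every later turn
out, K, V = decode_step(K, V, rng.standard_normal(d))
```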

Cost Impact

Without caching, a 20,000‑token system prompt run for 50 rounds would waste 1 million tokens on redundant computation. Prompt caching reduces the complexity from O(n²) to O(n), dramatically cutting token usage.

Anthropic’s pricing example:

Cache Reads: only 10% of the base input price (a 90% discount per cached token).

Cache Writes: 25% more than the base input price (a small fee for storing the KV tensors).

1‑Hour Cache TTL: cache writes at the extended 1-hour TTL cost twice the base price.

High cache‑hit rates are essential for these savings to be worthwhile.
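
In relative terms (base input price normalized to 1.0 and the 5-minute-TTL write premium from the list above; the dynamic tail is ignored), the arithmetic behind these savings looks like this:

```python
# Relative pricing: base input price normalized to 1.0 per MTok.
BASE, WRITE, READ = 1.00, 1.25, 0.10   # multipliers from the list above

def uncached(prefix_mtok: float, turns: int) -> float:
    """Re-prefill the prefix at full price on every turn."""
    return prefix_mtok * BASE * turns

def cached(prefix_mtok: float, turns: int) -> float:
    """Write the prefix to cache on turn 1, read it on every later turn."""
    return prefix_mtok * (WRITE + (turns - 1) * READ)

# The article's example: a 20k-token prefix over 50 rounds = 0.02 MTok x 50.
print(uncached(0.02, 50))  # 1.000 -> one million tokens at full price
print(cached(0.02, 50))    # 0.123 -> roughly 88% cheaper on the prefix alone
```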

Claude Code Session Walk‑through (30 minutes)

Minute 0 : Load system prompt and tool definitions (≈20 k tokens) – paid once.

Minutes 1‑5 : User asks to review the auth module. The Explore sub‑agent runs grep, adding output to the dynamic tail while the static prefix is served from cache (read cost ≈ $0.30/MTok).

Minutes 6‑15 : Plan sub‑agent receives exploration results, generates a plan, and the user approves it; each iteration reads the cached prefix.

Minutes 16‑25 : Iterative adjustments add more tool calls and terminal output, growing the dynamic tail but still reusing the cached static prefix.

Minute 28 : Execute /cost to view token accounting.

Without caching this session would exceed 2 million tokens (≈ $6.00 at Sonnet 4.5 rates); with caching, >80% of tokens are read from cache at $0.30/MTok, reducing cost by over 80%.
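
Back-of-the-envelope math on the article's numbers (Sonnet-class input at $3/MTok and cache reads at $0.30/MTok; the 90% cached share is an assumption consistent with the ">80% read from cache" figure, and cache-write premiums are ignored for simplicity):

```python
TOTAL_MTOK = 2.0                       # "would exceed 2 million tokens"
CACHED_SHARE = 0.90                    # assumption: share of input served from cache
INPUT_PRICE, READ_PRICE = 3.00, 0.30   # $/MTok, Sonnet-class rates cited above

uncached = TOTAL_MTOK * INPUT_PRICE                      # $6.00
cached = (TOTAL_MTOK * CACHED_SHARE * READ_PRICE
          + TOTAL_MTOK * (1 - CACHED_SHARE) * INPUT_PRICE)
print(f"${uncached:.2f} -> ${cached:.2f}, "
      f"{1 - cached / uncached:.0%} saved")              # $6.00 -> $1.14, 81% saved
```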

Rules to Avoid Cache Misses

Do not add or remove tools mid‑session; tool definitions are part of the cached prefix.

Never switch models during a session; caches are model‑specific.

Do not modify the static prefix to change state; instead, convey state changes with tags in subsequent user messages (see the sketch below).
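
To illustrate that last rule with the earlier sketch's structure (the <session_state> tag name is hypothetical, not a documented Claude Code convention):

```python
# Cache-busting (avoid): editing the cached prefix invalidates every token after the edit.
# STATIC_SYSTEM[0]["text"] += "\nCurrent phase: implementation"

# Cache-safe: record the state change in the dynamic tail instead.
history = []  # the conversation so far (dynamic tail)
history.append({
    "role": "user",
    "content": "<session_state>phase: implementation</session_state>",  # hypothetical tag
})
```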

Guidelines for Building Prompts

Top: System instructions and immutable rules.

Middle: Pre‑load all required tools (no additions/removals).

Later: Static context and documents.

Bottom: Dialogue history and tool outputs (dynamic).

Enable auto‑caching (now supported by Anthropic’s API) to automatically shift cache boundaries forward.
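
A sketch of that layout as a single request body (Anthropic Messages API field names; the specific tools, model name, and breakpoint placement are illustrative assumptions, not Claude Code's actual configuration):

```python
import anthropic

client = anthropic.Anthropic()

history = [{"role": "user", "content": "Review the auth module"}]  # dynamic tail

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    # Static prefix, part 1: every tool pre-loaded up front, never added or removed mid-session.
    "tools": [
        {"name": "grep", "description": "Search the repository",
         "input_schema": {"type": "object", "properties": {}}},
        {"name": "bash", "description": "Run a shell command",
         "input_schema": {"type": "object", "properties": {}}},
    ],
    # Static prefix, part 2: immutable rules first, then static project context,
    # with the cache breakpoint on the last static block.
    "system": [
        {"type": "text", "text": "Immutable behavioral rules and instructions ..."},
        {"type": "text", "text": "Static project context and documents ...",
         "cache_control": {"type": "ephemeral"}},
    ],
    # Dynamic tail: dialogue history and tool outputs, appended turn by turn.
    "messages": history,
}

response = client.messages.create(**request)
```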

Monitoring Cache Efficiency

cache_creation_input_tokens: tokens newly written to (stored in) the cache.

cache_read_input_tokens: tokens read from the cache.

input_tokens: tokens processed normally, without caching.

Cache efficiency score = cache reads ÷ cache creations; treat it like uptime monitoring.
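
A small sketch of how these counters could be turned into the score above (the usage field names match Anthropic's Messages API responses; the numbers are illustrative):

```python
from types import SimpleNamespace

def cache_report(usage) -> dict:
    """Summarize prompt-cache behavior from a Messages API response's `usage` block."""
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    uncached = usage.input_tokens or 0
    total = created + read + uncached
    return {
        "cache_hit_rate": read / total if total else 0.0,                 # share served from cache
        "efficiency_score": read / created if created else float("inf"),  # reads per creation
    }

# Illustrative numbers; a real `usage` comes from client.messages.create(...).usage
usage = SimpleNamespace(cache_creation_input_tokens=20_000,
                        cache_read_input_tokens=276_000,
                        input_tokens=4_000)
print(cache_report(usage))   # {'cache_hit_rate': 0.92, 'efficiency_score': 13.8}
```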

Conclusion

Prompt caching is not a toggle but an architectural discipline that requires a stable static prefix, pre‑loaded tools, and careful session design. Claude Code demonstrates a practical blueprint, achieving 92% cache hit rates and 81% cost reductions, proving that managing the "context tax" can turn AI agent deployments from cost‑inefficient to economically viable.
