How Claude Code Achieves 92% Prompt Cache Hit Rate and Cuts Costs by 81% – A Deep Dive

This article explains the mechanics of prompt‑caching for large language models, breaks down static versus dynamic context, details KV‑cache operation and its pricing, and shows how Claude Code’s 30‑minute programming session reached a 92% cache hit rate that reduced inference costs by 81%, concluding with three production‑grade design rules.

AI Tech Publishing

Case Study: Claude Code Reaches 92% Cache Hit Rate

Every step an AI agent takes resends the full conversation history (system instructions, tool definitions, and previously processed context) to the LLM, incurring full-price token charges each round. For long-running workflows this redundant computation can dominate costs: the same 50-turn dialog can cost five times more for a user whose requests never hit the cache than for one whose requests do.

Prompt‑caching eliminates this waste, but to use it effectively you must understand the underlying mechanism.

1. Static vs. Dynamic Context

Before optimizing prompts, identify which parts change and which stay constant:

Static prefix: unchanged across turns – system instructions, tool definitions, project context, behavior rules.

Dynamic suffix: grows each turn – user messages, assistant replies, tool outputs, observations.

Separating these enables caching: the infrastructure stores the mathematical state of the static prefix so later requests with the same prefix can skip recomputation and read from memory.
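As a simplified picture of this split (all names and contents here are illustrative, not Claude Code's actual internals):

```python
# Illustrative only: the static prefix must stay byte-identical across turns,
# while the dynamic suffix grows with the conversation.
STATIC_PREFIX = [
    {"part": "instructions", "text": "You are a coding agent. Follow project rules."},
    {"part": "tools",        "text": "read_file, grep, edit ..."},
    {"part": "project",      "text": "Contents of CLAUDE.md ..."},
]

dynamic_suffix: list[dict] = []

def build_request(user_turn: str) -> list[dict]:
    """Assemble one request: unchanged prefix first, growing suffix last."""
    dynamic_suffix.append({"part": "dialogue", "text": user_turn})
    return STATIC_PREFIX + dynamic_suffix
```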

1.1 How KV Cache Works

During each LLM inference request two phases occur:

Prefill phase: processes the entire input prompt, performing dense matrix multiplications on every token to compute Query, Key, and Value vectors. The Key and Value vectors at a given position depend only on that token and the ones before it, so once computed they never change.

Decode phase: generates tokens one by one, mainly reading the cached state; it is memory‑bound rather than compute‑intensive.

Without caching, the Key and Value tensors for the static prefix are discarded after each request, forcing a full recompute on the next request. KV caching persists these tensors on the inference server, indexed by a hash of the token sequence. When a new request arrives with an identical prefix, the hash matches and the tensors are loaded directly, skipping the prefill computation for that portion.

Consequently, a follow-up request pays only O(n) attention work for its new tokens instead of an O(n²) re-prefill of the entire sequence, yielding massive savings when a 20 000-token prefix is reused over many rounds.
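A toy NumPy sketch makes the asymmetry concrete (one attention head, random weights, nothing production-grade): the prefix's Key/Value tensors are computed once during prefill, and decoding only appends to them.

```python
import numpy as np

d = 64                                    # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Causal attention for one new query over all cached keys/values: O(n)."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: compute K/V for the whole static prefix once, then keep them.
prefix = rng.normal(size=(20, d))         # stand-in for embedded prefix tokens
K_cache, V_cache = prefix @ Wk, prefix @ Wv

# Decode: each new token computes only its own q/k/v and appends to the
# cache; nothing in the prefix is ever recomputed.
for _ in range(3):
    x = rng.normal(size=d)                # stand-in for the newest token
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)     # uses cached prefix state directly
```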

1.2 Economic Accounting

Cache savings only materialize when hit rates stay high. Reading from cache costs 0.1× the base input price, i.e. a 90 % per-token discount. Writing to the cache costs 1.25× the base price for the default 5-minute TTL, and 2.0× for the optional one-hour TTL.

At Anthropic's published per-token prices for Claude models, the arithmetic is stark: recomputing a 20 000-token static prefix every round amounts to a full-price million input tokens after just 50 rounds.
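A back-of-the-envelope check, assuming Sonnet-class rates of $3 per million input tokens with the 1.25× write and 0.1× read multipliers above (verify current pricing before relying on exact figures):

```python
BASE = 3.00 / 1_000_000   # $ per input token (illustrative Sonnet-class rate)
PREFIX = 20_000           # static prefix tokens
ROUNDS = 50

# Without caching the prefix is re-prefilled at full price every round;
# with caching you pay the 1.25x write once, then 0.1x reads thereafter.
# (Both paths also bill the dynamic suffix normally; it is omitted here.)
no_cache = PREFIX * ROUNDS * BASE
with_cache = PREFIX * BASE * 1.25 + PREFIX * (ROUNDS - 1) * BASE * 0.10

print(f"without cache: ${no_cache:.2f}")                  # $3.00
print(f"with cache:    ${with_cache:.2f}")                # ~$0.37
print(f"prefix savings: {1 - with_cache / no_cache:.0%}") # ~88%
```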

2. Claude Code 30‑Minute Programming Session

Claude Code is designed to keep the cache “hot”. The following timeline shows its billing impact:

0 min: Loads the 20 000-token system prompt, tool definitions, and project file (CLAUDE.md). The cache-write premium for this prefix is paid once.

1–5 min: User issues commands; the Explore Sub-agent navigates code, runs grep, and so on. The static prefix is read from cache at a tenth of the base input price, while the dynamic suffix grows.

6–15 min: The Plan Sub‑agent receives a concise summary instead of raw output to avoid suffix bloat. Cache hit rate climbs above 90 %; each access resets TTL, keeping the cache hot.

16–25 min: Additional modification requests add more tool calls and terminal output, but each round still reads the 20 000‑token prefix from cache.

28 min: Running /cost shows the impact: without caching, 2 million input tokens at Sonnet 4.5 pricing (about $3 per million) would cost roughly $6 for this session; with caching the single-task cost drops by 81 %, to about $1.15.

This demonstrates the "hot cache" in practice: you pay the write premium for the static layer once, then read it back at a tenth of the price; only the dynamic tail is billed at the full rate.

2.1 Fragility of Hash‑Based Caching

The cache hashes the entire token sequence from the start. Any change, even swapping the order of two elements, produces a different hash and invalidates the cached prefix from that point on. Real-world failures include the following (a sketch of the first pitfall appears after the list):

Adding timestamps to system prompts (hash changes each request).

Inconsistent JSON‑schema field ordering in tool definitions.

Modifying AgentTool parameters mid‑session.
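The first pitfall is easy to reproduce; a hypothetical sketch of the anti-pattern and its fix:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Anti-pattern: the prefix differs on every request, so its hash never
# matches and the entire cached prefix is rebuilt each time.
system_bad = f"You are a coding agent. Current time: {now}"

# Cache-friendly: keep the prefix byte-identical and carry volatile state
# in the dynamic suffix (e.g. the latest user message) instead.
system_good = "You are a coding agent."
user_turn = f"(current time: {now})\nPlease continue with the task."
```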

These cases motivate three production rules.

2.2 Three Design Rules for Reliable Prompt Caching

Do not modify tools during a session; tool definitions are part of the cached prefix, and any addition or removal invalidates the cache.

Do not switch models mid‑session; caches are model‑specific, and changing to a cheaper model forces a full cache rebuild.

Avoid editing the system prompt to update state; instead append reminder tags in the next user message so the prefix remains unchanged.
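A minimal sketch of the third rule; the <system-reminder> tag is an illustrative convention, not a documented API:

```python
messages = [
    {"role": "user", "content": "Refactor the parser module."},
    {"role": "assistant", "content": "Done. Anything else?"},
]

# Wrong: mutating the cached system prompt to record new state invalidates
# the prefix. Right: append the state update inside the next user turn, so
# the static prefix stays byte-identical.
messages.append({
    "role": "user",
    "content": ("<system-reminder>The target branch is now 'release'.</system-reminder>\n"
                "Apply the same refactor there."),
})
```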

3. Applying the Rules to Your Own Agents

Whether using Claude Code or building your own agent, follow this prompt ordering:

Top: system instructions and behavior rules (static, never change).

Next: load all tool definitions at once (static, never change).

Then: retrieved context and reference documents (static during the session).

Bottom: dialogue history and tool outputs (dynamic suffix).

On Anthropic's API, caching is opt-in: you mark breakpoints with cache_control blocks, and the system matches the longest previously cached prefix. Clients such as Claude Code move the breakpoint forward automatically as the conversation grows; manage token boundaries by hand and a single boundary error means missing the cache entirely.
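A minimal sketch of this ordering against Anthropic's Messages API, using explicit cache_control breakpoints; the model id, tool schema, and prompt contents are placeholders:

```python
import anthropic

SYSTEM_INSTRUCTIONS = "You are a careful coding agent. Follow the project rules."
PROJECT_CONTEXT = "Contents of CLAUDE.md go here."  # static for the session

TOOLS = [{
    "name": "read_file",
    "description": "Read a file from the repository.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
    # Breakpoint on the last tool caches all tool definitions above it.
    "cache_control": {"type": "ephemeral"},
}]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model id
    max_tokens=1024,
    tools=TOOLS,
    system=[{
        "type": "text",
        "text": SYSTEM_INSTRUCTIONS + "\n\n" + PROJECT_CONTEXT,
        "cache_control": {"type": "ephemeral"},  # breakpoint after static context
    }],
    messages=[{"role": "user", "content": "Explore the repo structure."}],  # dynamic suffix
)
```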

When approaching the context limit, use a “cache‑safe fork” – keep the same system prompt, tools, and history, and add a compression instruction as a new message. Only the compression instruction is billed.
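The fork itself is just list concatenation; a sketch, assuming conversation holds the session history in the provider's message format:

```python
conversation = [
    {"role": "user", "content": "Refactor the parser module."},
    {"role": "assistant", "content": "Done. The tests pass."},
    # ... the rest of the session ...
]

# Keep system prompt, tools, and history byte-identical; append only the
# compression request. Everything before it reads from cache, so only this
# new message is billed at the full input rate.
fork = conversation + [{
    "role": "user",
    "content": ("Compress this session into a brief I can continue from: "
                "open tasks, key decisions, and file paths touched."),
}]
```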

3.1 Monitoring Cache Health

Check three fields in the API response:

cache_creation_input_tokens: tokens written to the cache on this request.

cache_read_input_tokens: tokens read from the cache.

input_tokens: tokens processed without caching.

Cache efficiency = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens). Treat this metric like system uptime and track it over time.
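Computing the ratio from a response is a one-liner; a sketch assuming the Anthropic Python SDK's usage object:

```python
def cache_efficiency(usage) -> float:
    """Fraction of cacheable prefix tokens served from cache (formula above)."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = read + written
    return read / total if total else 0.0

# response = client.messages.create(...)  # any request with cache breakpoints
# print(f"cache efficiency: {cache_efficiency(response.usage):.1%}")
```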

Core Takeaways

Prompt caching is not a toggle; it requires architectural discipline. Structure prompts so static content sits at the top and dynamic content grows at the bottom. The infrastructure hashes and stores the static prefix, then charges only a tenth of the base input price on reads. Discipline lies in the details: avoid timestamps in system prompts, keep tool definitions stable, never switch models mid-session, and never modify the cached prefix.

Claude Code proves the approach scales: 92 % cache hit rate translates to an 81 % cost reduction. Ignoring these design principles means surrendering a large portion of potential profit.

Tags: AI agents, LLM, cost optimization, Claude Code, KV cache, prompt caching, Anthropic API
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.