How Claude’s New Prompt Caching Cuts Token Costs by 90% for Long‑Running Agents
Claude’s API now automatically caches static parts of prompts—system instructions, tool definitions, and context—so repeated calls reuse these sections at only 10% of the standard token price, dramatically reducing costs for multi‑turn agents, but developers must manage prefixes and avoid cache‑breaking changes.
How Prompt Caching Works
Claude processes each request in two phases:
Prefill phase – the model reads the entire prompt (system instruction, tool definitions, conversation history) and processes every input token before the first response token is generated.
Decode phase – the model generates the response token‑by‑token.
Prefill is the expensive part.
If a new request shares the same prefix as a previous one, the prefill computation can be skipped and only 10% of the standard input‑token price is charged.
Prefix Matching
When a block is marked with the cache_control parameter, the API creates a cryptographic hash of everything from the start of the request up to that point. On a subsequent request, if the prefix is byte-for-byte identical, the hash matches, the prefill is skipped, and the input cost drops to 10% of the normal rate.
Any character difference – a timestamp, reordered tool, extra space – changes the hash and breaks the cache.
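The sensitivity to single-character changes can be illustrated with a short sketch. The real API's hashing is internal and unspecified; SHA-256 over canonical JSON here is only a stand-in for the idea:

```python
import hashlib
import json

def prefix_hash(system_prompt: str, tools: list, messages: list) -> str:
    """Hash everything up to the cache breakpoint, byte for byte."""
    payload = json.dumps(
        {"system": system_prompt, "tools": tools, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

base = prefix_hash("You are a code reviewer.", ["read_file"], [])

# Identical prefix -> identical hash -> cache hit
assert prefix_hash("You are a code reviewer.", ["read_file"], []) == base

# A single trailing space changes the hash -> cache miss
assert prefix_hash("You are a code reviewer. ", ["read_file"], []) != base
```

The same effect applies to a reordered tool list or an embedded timestamp: any byte-level change anywhere in the prefix produces a different hash.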
Token‑Math Example
Consider a code‑review agent with the following static context:
System prompt (instructions & role): 8,000 tokens
Tool definitions (read file, search, run tests): 4,000 tokens
Project context (CLAUDE.md, coding standards): 3,000 tokens
Total static context: 15,000 tokens
Running a 40‑round review session:
Without cache:
15,000 tokens × 40 rounds = 600,000 input tokens billed at full price
With cache:
Round 1: 15,000 tokens at full price (cache write)
Rounds 2‑40: 15,000 tokens × 39 rounds × 10% = 58,500 tokens billed
Total static token cost: 15,000 + 58,500 = 73,500 token-equivalents
This reduces the billed cost for the static portion of a single session from 600,000 tokens to 73,500 tokens, roughly an 88% reduction.
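The arithmetic above generalizes to any session length. A small helper makes the pricing model explicit (full price on round 1 as the cache write, 10% on every later round, following the example's assumptions):

```python
def static_token_cost(static_tokens: int, rounds: int, cached_rate: float = 0.10) -> int:
    """Billed token-equivalents for the static prefix over a session.

    Round 1 writes the cache at full price; rounds 2..N read it at cached_rate.
    """
    first_round = static_tokens
    cached_rounds = round(static_tokens * (rounds - 1) * cached_rate)
    return first_round + cached_rounds

# The 40-round code-review session from the example:
assert static_token_cost(15_000, 40) == 73_500   # with cache
assert 15_000 * 40 == 600_000                    # without cache
```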
Manual Cache Breakpoints
Before automatic caching, developers inserted a cache_control breakpoint to indicate where Claude should cache.
```json
{
  "messages": [
    { "role": "user", "content": "Review this file" },
    { "role": "assistant", "content": "Here are my findings..." },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Now run tests",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
```

Claude caches everything up to and including that block.
On the next request it searches up to 20 previous blocks for a matching hash.
Missing a round or moving the breakpoint incorrectly forces a full‑price prefill.
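A helper that always pins the breakpoint to the newest user turn avoids that bookkeeping mistake. This is a sketch, not the official SDK: the model name is illustrative, and only the payload shape (cache_control on a content block) follows the example above:

```python
def build_request(history: list, new_user_text: str) -> dict:
    """Assemble a Messages API payload with the cache breakpoint on the last block."""
    last_turn = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": new_user_text,
                # Breakpoint: everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
    return {"model": "claude-sonnet-4", "max_tokens": 1024, "messages": history + [last_turn]}

req = build_request(
    [{"role": "user", "content": "Review this file"},
     {"role": "assistant", "content": "Here are my findings..."}],
    "Now run tests",
)
```

Because the breakpoint is recomputed from the growing history on every round, it can never lag behind or skip a turn.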
Automatic Caching
Automatic caching requires only a top‑level cache_control field.
```json
{
  "cache_control": { "type": "ephemeral" },
  "messages": [
    { "role": "user", "content": "Review this file" },
    { "role": "assistant", "content": "Here are my findings..." },
    { "role": "user", "content": "Now run tests" }
  ]
}
```

The API finds the longest matching prefix, moves the cache breakpoint to the last cacheable block, and reuses it in subsequent rounds.
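Conceptually, the longest-prefix lookup works like the toy model below. This is not the server's actual implementation; it only shows the scan from the longest candidate prefix down to the shortest:

```python
import hashlib
import json

def longest_cached_prefix(messages: list, cache: set) -> int:
    """Return the number of leading messages covered by a cached prefix.

    `cache` holds hashes of prefixes seen on earlier requests; we try the
    longest candidate first and stop at the first hit.
    """
    for n in range(len(messages), 0, -1):
        digest = hashlib.sha256(
            json.dumps(messages[:n], sort_keys=True).encode()
        ).hexdigest()
        if digest in cache:
            return n
    return 0

# Round 1 writes a one-message prefix into the cache...
round1 = [{"role": "user", "content": "Review this file"}]
cache = {hashlib.sha256(json.dumps(round1, sort_keys=True).encode()).hexdigest()}

# ...and round 2, which extends that prefix, reuses it.
round2 = round1 + [{"role": "assistant", "content": "Here are my findings..."}]
assert longest_cached_prefix(round2, cache) == 1
```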
Session Example Breakdown
Round 1 – cache is written for system prompt, tools, and static context.
Round 2 – cache hit; only new messages are billed at full price.
Rounds 10, 20, 40 – same pattern; only newly added tokens incur full cost.
Long‑running coding agents and document‑analysis pipelines become economically feasible because static content can represent 60–70% of total token usage.
Compatibility with Manual Breakpoints
Automatic caching does not eliminate fine‑grained control. You can still place explicit cache_control blocks for sections that need separate caching (e.g., a globally shared system prompt while keeping per‑project context separate).
Rules for Keeping the Cache Active
Claude’s cache is prefix‑based, so request ordering determines cache effectiveness. The recommended ordering is:
Base system instruction – global cache across all sessions
Tool definitions – global cache across all sessions
Project‑specific files (e.g., CLAUDE.md) – cache per project
Session state – cache per session
Conversation messages – grow each round
Static content at the start guarantees a cache hit for every session; dynamic content later means only new tokens are billed at full price.
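The recommended ordering can be enforced by a single assembly helper. The field names here are illustrative, not API fields; the point is that the dict is always built from most static to most dynamic:

```python
def assemble_prompt(base_system: str, tools: list, project_files: list,
                    session_state: dict, messages: list) -> dict:
    """Order blocks from most static to most dynamic so the shared prefix is maximal."""
    return {
        "system": base_system,             # global: identical across all sessions
        "tools": tools,                    # global: identical across all sessions
        "project_context": project_files,  # stable per project (e.g., CLAUDE.md)
        "session_state": session_state,    # stable per session
        "messages": messages,              # grows every round
    }
```

Centralizing assembly in one function also prevents call sites from accidentally interleaving dynamic data into the static prefix.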
Common patterns that break the prefix:
Embedded timestamps in the system prompt
Non‑deterministic loading order of tool definitions
Parameters that vary slightly between sessions
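Two of these pitfalls, non-deterministic tool order and incidental byte differences, are cheap to fix at serialization time. A minimal sketch:

```python
import json

def serialize_tools(tools: list) -> str:
    """Sort by name and use canonical JSON so the byte sequence is stable."""
    return json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)

# Same tools, different load order -> identical prefix bytes.
a = serialize_tools([{"name": "search"}, {"name": "read_file"}])
b = serialize_tools([{"name": "read_file"}, {"name": "search"}])
assert a == b
```

Timestamps need a different fix: keep them out of the static prefix entirely and deliver them in a later message, as the next section describes.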
Do Not Change Tools Mid‑Session
Adding or removing tools changes the prefix, wiping the cache and requiring a full rebuild. The Claude team avoids this by keeping tool definitions constant and using mode‑switching tools such as EnterPlanMode and ExitPlanMode to signal changes via system messages.
Use System Messages Instead of Editing the Prompt
When session data changes (file updates, timestamps, user preferences), embed the update in a <system-reminder> tag passed as part of the next user message or tool result. The model reads the tag and applies the update, while the cached prefix remains unchanged.
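Sketched as a helper (the <system-reminder> tag follows the convention described above; the function name is made up):

```python
def inject_reminder(messages: list, update: str, user_text: str) -> list:
    """Deliver a session update inside the next user turn.

    Earlier messages (the cached prefix) are left byte-identical.
    """
    turn = {
        "role": "user",
        "content": f"<system-reminder>{update}</system-reminder>\n{user_text}",
    }
    return messages + [turn]

history = [{"role": "user", "content": "Review this file"}]
out = inject_reminder(history, "main.py was modified since the last turn", "Re-run the review")
assert out[:1] == history  # cached prefix untouched
```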
Avoid Switching Models Within a Session
Cache entries are model‑specific. Switching from Opus to Haiku discards the existing cache and forces a full‑price rebuild of the entire context. If a sub‑task requires a cheaper model, use a sub‑agent: let the primary model hand off a concise context to the secondary model while preserving the main session cache.
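The sub-agent handoff can be sketched as follows. The model name and payload shape are illustrative; the point is that the cheap model gets a small, fresh request while the primary session's request (and therefore its cache) is never modified:

```python
def make_subagent_request(task_summary: str, cheap_model: str = "claude-haiku") -> dict:
    """Fresh, small request for the cheaper model.

    The primary model produces `task_summary` as a concise handoff; the
    primary session's own messages are not reused here, so its cache survives.
    """
    return {
        "model": cheap_model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": task_summary}],
    }
```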
Compression (Safe Fork)
Long‑running agents eventually hit the context‑window limit and must summarize the conversation. A naïve compression call with a different system prompt breaks the cache. The safe‑fork pattern preserves caching:
Use the exact same system prompt, tools, and static context as the parent conversation.
Prepend the full conversation history.
Append the compressed summary as the final user message.
The request is almost identical to the parent’s last round, so the cache is hit and only the compressed summary tokens are billed at full price.
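The safe-fork steps above can be sketched as follows (field names follow the earlier examples, not the literal API; the key property is that the parent's prefix is reused byte for byte):

```python
def safe_fork_request(parent: dict, summary: str) -> dict:
    """Build a compression request that reuses the parent's exact cached prefix."""
    fork_turn = {
        "role": "user",
        # Only these tokens are new; everything before them hits the cache.
        "content": summary,
    }
    return {
        "system": parent["system"],    # byte-identical system prompt
        "tools": parent["tools"],      # byte-identical tool definitions
        "messages": parent["messages"] + [fork_turn],  # full history + summary
    }
```

A naive alternative, sending the summary under a new "summarizer" system prompt, would produce a different prefix from the first byte and pay full price for the entire context.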
When Prompt Caching Is Most Effective
Prompt caching yields the greatest savings when:
The static context exceeds ~1,000 tokens.
The session runs many rounds (e.g., >5 rounds).
Typical use cases include:
Coding agents with large system prompts, many tool definitions, and multi‑round interactions.
Document‑analysis pipelines that repeatedly use the same instructions and context.
Customer‑support bots that rely on extensive, unchanging knowledge bases.
Research assistants that read, synthesize, and iterate over many rounds.
If static context is small and sessions are short, automatic caching provides little benefit.
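Under the article's pricing model (full price on round 1, 10% on each later round), the fraction of static-prefix cost saved depends only on the round count, which is why short sessions barely benefit:

```python
def savings_fraction(rounds: int, cached_rate: float = 0.10) -> float:
    """Fraction of static-prefix token cost saved versus no caching."""
    if rounds == 0:
        return 0.0
    billed = 1 + (rounds - 1) * cached_rate  # in units of one static prefix
    return 1 - billed / rounds

assert savings_fraction(1) == 0.0                 # single round: nothing to reuse
assert round(savings_fraction(40), 4) == 0.8775   # the 40-round example: ~88% saved
```

Savings climb steeply at first (a 2-round session already saves 45%) and approach the 90% ceiling as sessions get longer.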
Overall, automatic prompt caching in Claude's API reduces token costs for any developer building multi‑round agents, provided that cache‑breaking patterns (changing static ordering, editing system prompts, swapping tools, or switching models) are avoided.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.
