How Claude’s New Auto‑Caching Cuts API Token Costs by 90%
By adding a single field to Claude API requests, developers can automatically cache static prompt parts, reducing token billing to just 10% of the original cost and dramatically lowering expenses for multi‑turn AI agents.
The mechanism is a single change: adding a cache_control field lets the Claude API automatically cache the static parts of a prompt (system instructions, tool definitions, and context) and reuse them across turns, billing those reused tokens at only 10% of the normal input price.
How Prompt Caching Works
When a request is sent to Claude, the model processes the prompt in two phases: the prefill phase, which reads every token of the full prompt, and the decode phase, which generates the response token by token. Prefill is the expensive part.
If successive requests share a common prefix, the prefill work for that prefix can be skipped: the computed state is stored under a hash of the prefix and looked up on later requests.
Prefix Matching
Claude creates a cryptographic hash of the request up to the point where the cache_control marker appears. On the next request, if the prefix is identical, the cached result is used and only 10% of the standard token price is charged for those tokens.
Any change—different timestamps, reordered tools, extra spaces—produces a different hash and breaks the cache.
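The prefix-matching behavior can be sketched in a few lines of Python. This is a toy simulation of the idea, not Anthropic's actual implementation: the cache is keyed by a hash of the serialized prefix, so any byte-level change produces a new key and a full-price miss.

```python
import hashlib

# Toy model of prefix matching: the cache is keyed by a hash of the
# prompt prefix up to the cache breakpoint.
cache = {}

def prefill(prefix: str) -> str:
    """Return a billing outcome for this prefix; full price only on a miss."""
    key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    if key in cache:
        return "cache hit: 10% of standard input price"
    cache[key] = True
    return "cache miss: full-price prefill (cache write)"

static_prefix = "SYSTEM PROMPT\nTOOL DEFINITIONS\nPROJECT CONTEXT\n"

print(prefill(static_prefix))        # first turn: miss (cache write)
print(prefill(static_prefix))        # second turn: identical prefix, hit
print(prefill(static_prefix + " "))  # one extra space: new hash, miss
```

Note that the third call misses even though the prompt is semantically identical, which is exactly why timestamps, reordered tools, or stray whitespace silently break caching.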
Concrete Token Calculation
Consider a code‑review agent with a static context of 15,000 tokens (system prompt, tool definitions, project files). Without caching, 40 turns would cost 600,000 input tokens at full price.
15,000 tokens × 40 turns = 600,000 tokens billed at full price

With auto‑caching, the first turn writes the cache (full price) and the remaining 39 turns hit the cache, billing those static tokens at only 10%.
Turn 1: 15,000 tokens at full price (cache write)
Turns 2–40: 15,000 tokens × 39 turns × 10% = 58,500 tokens
Total static token cost: 73,500 full‑price‑equivalent tokens, roughly an 88% reduction from 600,000.

This reduction can be the difference between a financially viable product and an unsustainable one when hundreds of concurrent users run similar sessions.
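The arithmetic above can be checked directly:

```python
STATIC_TOKENS = 15_000
TURNS = 40
CACHED_RATE = 0.10  # cached tokens are billed at 10% of the standard price

without_caching = STATIC_TOKENS * TURNS                   # 600,000 tokens
first_turn = STATIC_TOKENS                                # cache write, full price
cached_turns = STATIC_TOKENS * (TURNS - 1) * CACHED_RATE  # 58,500 token-equivalents
with_caching = first_turn + cached_turns                  # 73,500 token-equivalents

print(f"without caching: {without_caching:,} full-price tokens")
print(f"with caching:    {with_caching:,.0f} full-price-equivalent tokens")
print(f"saving:          {1 - with_caching / without_caching:.1%}")
```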
Manual vs. Automatic Caching
Before auto‑caching, developers had to insert cache_control breakpoints manually, telling Claude where to cache and how far back to search for a matching hash. This required moving breakpoints forward as the conversation grew, and missing a turn could cause a full‑price miss.
Automatic caching removes that burden: adding a top‑level cache_control field lets the API find the longest matching prefix, move the cache breakpoint automatically, and reuse it in subsequent turns.
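Structurally, such a request keeps the static content first and appends only per-turn messages. The payload below is illustrative only — the exact field name and accepted values for the automatic-caching opt-in are assumptions here, so check the official API reference before relying on them:

```python
# Illustrative request payload. "cache_control" at the top level and its
# value are assumptions for this sketch, not a confirmed wire format.
request = {
    "model": "claude-example-model",    # placeholder model name
    "max_tokens": 1024,
    "cache_control": {"type": "auto"},  # assumed top-level opt-in field
    "system": "You are a code-review agent. Follow the project style guide.",
    "tools": [
        {
            "name": "read_file",
            "description": "Read a project file by path.",
            "input_schema": {"type": "object"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Review the latest diff."},
    ],
}

# Static parts (system, tools) come first and never change mid-session;
# each new turn only appends to "messages", so the cacheable prefix
# stays byte-identical across requests.
```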
Session Example
Turn 1: API writes the cache for system prompt, tools, and context.
Turn 2: API hits the cache; only new user messages are billed at full price.
Turns 10, 20, 40: Same pattern—cached prefix reused, only new tokens charged.
Keeping static content at the beginning of each request ensures every session starts with a cache hit, while dynamic content appears later and is billed normally.
Best‑Practice Rules
Order prompts from static to dynamic: system instruction → tool definitions → project context → session state → conversation messages.
Never change tools mid‑session; altering them changes the cache prefix and forces a full rebuild.
When updates are needed, add them as <system‑reminder> tags in the next user or tool message instead of editing the system prompt.
Avoid switching models mid‑conversation; each model has its own cache, and a switch discards the existing cache.
For long‑running agents that hit the context window limit, use cache‑safe forking: keep the same system prompt, tools, and context, prepend the full conversation history, and add the compression prompt as the final user message.
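The last rule, cache-safe forking, can be sketched as a small helper. This is a schematic with hypothetical field names, not a confirmed API shape: the point is that the static prefix is copied byte-for-byte and only the tail changes.

```python
def fork_session(base_request: dict, history: list, compression_prompt: str) -> dict:
    """Build a cache-safe fork of a session (schematic; field names illustrative).

    Reuses the exact static prefix (model, system prompt, tools), replays
    the full conversation history, and appends the compression request as
    the final user message.
    """
    return {
        "model": base_request["model"],    # same model: switching would discard the cache
        "system": base_request["system"],  # byte-identical system prompt
        "tools": base_request["tools"],    # byte-identical tool definitions
        "messages": history + [
            {"role": "user", "content": compression_prompt},
        ],
    }

base = {"model": "claude-example-model", "system": "SYSTEM PROMPT", "tools": []}
history = [
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
]
fork = fork_session(base, history, "Summarize the session so far.")
```

Because the fork's prefix is identical to the original session's, the first forked request should land as a cache hit rather than a full-price rebuild.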
When Auto‑Caching Is Most Beneficial
It shines when static context exceeds ~1,000 tokens and sessions run many turns. Typical high‑impact scenarios include programming agents, document‑analysis pipelines, customer‑support bots, and research assistants. For short sessions with minimal static context, the benefit is limited.
Auto‑caching is now a built‑in feature of the Claude API, turning a complex manual optimization into a default capability for anyone building multi‑turn agents.