How Claude’s New Auto‑Caching Cuts API Token Costs by 90%

By adding a single field to Claude API requests, developers can automatically cache static prompt parts, reducing token billing to just 10% of the original cost and dramatically lowering expenses for multi‑turn AI agents.


With a single change, adding a cache_control field, the Claude API can automatically cache the static parts of a prompt (system instructions, tool definitions, and context) and reuse them across turns, cutting the cost of those tokens to just 10% of the normal price.
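For concreteness, here is a minimal sketch of such a request with the Anthropic Python SDK. The model name and the placement of the cache_control marker (shown here on a system content block, the form the SDK documents for manual breakpoints) are assumptions, and the placeholder prompt and tool schema stand in for real project context.

import anthropic

# Placeholder static content; a real agent would carry thousands of tokens here.
STATIC_SYSTEM_PROMPT = "You are a code-review agent. <project context goes here>"
TOOL_DEFINITIONS = [
    {
        "name": "read_file",
        "description": "Read a file from the project being reviewed",
        "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
    }
]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # model name is an assumption
    max_tokens=1024,
    tools=TOOL_DEFINITIONS,             # static tool schemas
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # marks the end of the cacheable prefix (tools + system prompt)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review the diff in src/app.py."}],
)
print(response.content[0].text)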

How Prompt Caching Works

When a request is sent to Claude, the model processes the prompt in two phases: the prefill phase, which reads every token of the full prompt, and the decode phase, which generates the response token by token. Prefill is the expensive part.

If successive requests share a common prefix, the prefill work for that prefix can be skipped: the processed prefix is cached, identified by a hash, and reused instead of being recomputed.

Prefix Matching

Claude creates a cryptographic hash of the request up to the point where the cache_control marker appears. On the next request, if the prefix is identical, the cached result is used and only 10% of the standard token price is charged for those tokens.

Any change—different timestamps, reordered tools, extra spaces—produces a different hash and breaks the cache.
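Anthropic has not published its internal hashing scheme, so the snippet below is only an illustration of the idea: hash the serialized prefix byte for byte, and any difference, even a trailing space, yields a different key.

import hashlib
import json

def prefix_key(tools, system_prompt):
    # Hash the serialized prefix; any byte-level change produces a different key.
    blob = json.dumps({"tools": tools, "system": system_prompt}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

base        = prefix_key([{"name": "read_file"}], "You are a code-review agent.")
same        = prefix_key([{"name": "read_file"}], "You are a code-review agent.")
extra_space = prefix_key([{"name": "read_file"}], "You are a code-review agent. ")

print(base == same)         # True  -> identical prefix, cache hit
print(base == extra_space)  # False -> one trailing space, cache miss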

Concrete Token Calculation

Consider a code‑review agent with a static context of 15,000 tokens (system prompt, tool definitions, project files). Without caching, 40 turns would cost 600,000 input tokens at full price.

15,000 × 40 turns = 600,000 tokens billed at full price

With auto‑caching, the first turn writes the cache (full price) and the remaining 39 turns hit the cache, costing only 10% of the static tokens.

Turn 1: 15,000 tokens at full price (cache write)
Turns 2–40: 15,000 tokens × 39 turns × 10% = 58,500 tokens
Total static token cost: 73,500 tokens
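The same arithmetic in a few lines of Python, so the figures above can be re-checked or adapted to other context sizes. The 10% cache-read rate is the discount described above; per-token dollar prices are deliberately left out.

STATIC_TOKENS   = 15_000
TURNS           = 40
CACHE_READ_RATE = 0.10   # cached tokens billed at 10% of the normal input price

without_cache = STATIC_TOKENS * TURNS                                          # 600,000
with_cache    = STATIC_TOKENS + STATIC_TOKENS * (TURNS - 1) * CACHE_READ_RATE  # 73,500

print(without_cache, int(with_cache))
print(f"static-token savings: {1 - with_cache / without_cache:.1%}")           # 87.8%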

This reduction can be the difference between a financially viable product and an unsustainable one when hundreds of concurrent users run similar sessions.

Manual vs. Automatic Caching

Before auto‑caching, developers had to insert cache_control breakpoints manually, telling Claude where to cache and how far back to search for a matching hash. This required moving breakpoints forward as the conversation grew, and missing a turn could cause a full‑price miss.

Automatic caching removes that burden: adding a top‑level cache_control field lets the API find the longest matching prefix, move the cache breakpoint automatically, and reuse it in subsequent turns.
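A sketch of the bookkeeping the manual approach required, under the assumption that conversation messages use content blocks (lists of dicts): each turn, the breakpoint marker has to be re-placed on the newest block so the growing prefix stays cached. The helper name is made up for illustration.

def mark_cache_breakpoint(messages):
    # Manual caching: move the cache_control marker to the newest content block
    # so the whole conversation so far becomes the cached prefix.
    for msg in messages:
        for block in msg["content"]:
            block.pop("cache_control", None)      # clear last turn's marker
    messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return messages

Forgetting to call something like this on a single turn is exactly the full-price cache miss described above; with automatic caching, the helper disappears entirely.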

Session Example

Turn 1: The API writes the cache for the system prompt, tools, and context.

Turn 2: The API hits the cache; only new user messages are billed at full price.

Turns 10, 20, 40: Same pattern, the cached prefix is reused and only new tokens are charged.

Keeping static content at the beginning of each request ensures every session starts with a cache hit, while dynamic content appears later and is billed normally.
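Put together, a multi-turn session can look like the sketch below, reusing the placeholder constants from the first example. The key property is that the static prefix is byte-identical on every call and only the messages list grows.

history = []

def run_turn(client, user_text):
    # One agent turn: the static prefix never changes, only the message tail grows.
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # model name is an assumption
        max_tokens=1024,
        tools=TOOL_DEFINITIONS,             # static, identical every turn
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=history,                   # dynamic tail, billed at the normal rate
    )
    history.append({"role": "assistant", "content": response.content})
    return response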

Best‑Practice Rules

Order prompts from static to dynamic: system instruction → tool definitions → project context → session state → conversation messages.

Never change tools mid‑session; altering them changes the cache prefix and forces a full rebuild.

When updates are needed, add them as <system-reminder> tags in the next user or tool message instead of editing the system prompt (see the sketch after these rules).

Avoid switching models mid‑conversation; each model has its own cache, and a switch discards the existing cache.

For long-running agents that hit the context window limit, use cache-safe forking: keep the same system prompt, tools, and context, retain the full conversation history, and append the compression prompt as the final user message.
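As referenced above, here is a sketch of the <system-reminder> pattern: mid-session updates ride along inside the next user message, so the cached system prompt and tool definitions stay untouched. The tag comes from the rule above; the helper function is hypothetical.

def add_user_message(history, user_text, reminder=None):
    # Mid-session updates travel inside the next user message as a <system-reminder>
    # tag, so the cached static prefix (system prompt, tools) never has to change.
    content = user_text
    if reminder:
        content = f"<system-reminder>{reminder}</system-reminder>\n\n{user_text}"
    history.append({"role": "user", "content": content})
    return history

history = add_user_message(
    [],
    "Now review tests/test_auth.py as well.",
    reminder="The team switched linters to ruff; flag style issues accordingly.",
)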

When Auto‑Caching Is Most Beneficial

It shines when static context exceeds ~1,000 tokens and sessions run many turns. Typical high‑impact scenarios include programming agents, document‑analysis pipelines, customer‑support bots, and research assistants. For short sessions with minimal static context, the benefit is limited.

Auto‑caching is now a built‑in feature of the Claude API, turning a complex manual optimization into a default capability for anyone building multi‑turn agents.
