How Claude’s New Prompt Caching Cuts Token Costs by 90% for Long‑Running Agents
Claude’s API now automatically caches static parts of prompts—system instructions, tool definitions, and context—so repeated calls reuse these sections at only 10% of the standard token price, dramatically reducing costs for multi‑turn agents, but developers must manage prefixes and avoid cache‑breaking changes.
How Prompt Caching Works
Claude processes each request in two phases:
Prefill phase – the model reads the entire prompt (system instruction, tool definitions, conversation history) and processes every input token before the first response token is generated.
Decode phase – the model generates the response token‑by‑token.
Prefill is the expensive part.
If a new request shares the same prefix as a previous one, the prefill computation can be skipped and only 10% of the standard input‑token price is charged.
Prefix Matching
When a block is marked with the cache_control parameter, the API creates a cryptographic hash of everything from the start of the request up to that point. On a subsequent request, if the prefix is byte-for-byte identical, the hash matches, the prefill is skipped, and the input cost drops to 10% of the normal rate.
Any character difference – a timestamp, reordered tool, extra space – changes the hash and breaks the cache.
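The sensitivity to single-character changes can be illustrated with a short sketch. The real API's hashing is internal and unspecified; SHA-256 over canonical JSON here is only a stand-in for the idea:

```python
import hashlib
import json

def prefix_hash(system_prompt: str, tools: list, messages: list) -> str:
    """Hash everything up to the cache breakpoint, byte for byte."""
    payload = json.dumps(
        {"system": system_prompt, "tools": tools, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

base = prefix_hash("You are a code reviewer.", ["read_file"], [])

# Identical prefix -> identical hash -> cache hit
assert prefix_hash("You are a code reviewer.", ["read_file"], []) == base

# A single trailing space changes the hash -> cache miss
assert prefix_hash("You are a code reviewer. ", ["read_file"], []) != base
```

The same effect applies to a reordered tool list or an embedded timestamp: any byte-level change anywhere in the prefix produces a different hash.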
Token‑Math Example
Consider a code‑review agent with the following static context:
System prompt (instructions & role): 8,000 tokens
Tool definitions (read file, search, run tests): 4,000 tokens
Project context (CLAUDE.md, coding standards): 3,000 tokens
Total static context: 15,000 tokens
Running a 40‑round review session:
Without cache:
15,000 tokens × 40 rounds = 600,000 input tokens billed at full price
With cache:
Round 1: 15,000 tokens at full price (cache write)
Rounds 2‑40: 15,000 tokens × 39 rounds × 10% = 58,500 tokens billed
Total static token cost: 15,000 + 58,500 = 73,500 token-equivalents
This reduces the billed cost for the static portion of a single session from 600,000 tokens to 73,500 tokens, roughly an 88% reduction.
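The arithmetic above generalizes to any session length. A small helper makes the pricing model explicit (full price on round 1 as the cache write, 10% on every later round, following the example's assumptions):

```python
def static_token_cost(static_tokens: int, rounds: int, cached_rate: float = 0.10) -> int:
    """Billed token-equivalents for the static prefix over a session.

    Round 1 writes the cache at full price; rounds 2..N read it at cached_rate.
    """
    first_round = static_tokens
    cached_rounds = round(static_tokens * (rounds - 1) * cached_rate)
    return first_round + cached_rounds

# The 40-round code-review session from the example:
assert static_token_cost(15_000, 40) == 73_500   # with cache
assert 15_000 * 40 == 600_000                    # without cache
```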
Manual Cache Breakpoints
Before automatic caching, developers inserted a cache_control breakpoint to indicate where Claude should cache.
```json
{
  "messages": [
    { "role": "user", "content": "Review this file" },
    { "role": "assistant", "content": "Here are my findings..." },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Now run tests",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
```

Claude caches everything up to and including that block.
On the next request it searches up to 20 previous blocks for a matching hash.
Missing a round or moving the breakpoint incorrectly forces a full‑price prefill.
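A helper that always pins the breakpoint to the newest user turn avoids that bookkeeping mistake. This is a sketch, not the official SDK: the model name is illustrative, and only the payload shape (cache_control on a content block) follows the example above:

```python
def build_request(history: list, new_user_text: str) -> dict:
    """Assemble a Messages API payload with the cache breakpoint on the last block."""
    last_turn = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": new_user_text,
                # Breakpoint: everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
    return {"model": "claude-sonnet-4", "max_tokens": 1024, "messages": history + [last_turn]}

req = build_request(
    [{"role": "user", "content": "Review this file"},
     {"role": "assistant", "content": "Here are my findings..."}],
    "Now run tests",
)
```

Because the breakpoint is recomputed from the growing history on every round, it can never lag behind or skip a turn.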
Automatic Caching
Automatic caching requires only a top‑level cache_control field.
```json
{
  "cache_control": { "type": "ephemeral" },
  "messages": [
    { "role": "user", "content": "Review this file" },
    { "role": "assistant", "content": "Here are my findings..." },
    { "role": "user", "content": "Now run tests" }
  ]
}
```

The API finds the longest matching prefix, moves the cache breakpoint to the last cacheable block, and reuses it in subsequent rounds.
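Conceptually, the longest-prefix lookup works like the toy model below. This is not the server's actual implementation; it only shows the scan from the longest candidate prefix down to the shortest:

```python
import hashlib
import json

def longest_cached_prefix(messages: list, cache: set) -> int:
    """Return the number of leading messages covered by a cached prefix.

    `cache` holds hashes of prefixes seen on earlier requests; we try the
    longest candidate first and stop at the first hit.
    """
    for n in range(len(messages), 0, -1):
        digest = hashlib.sha256(
            json.dumps(messages[:n], sort_keys=True).encode()
        ).hexdigest()
        if digest in cache:
            return n
    return 0

# Round 1 writes a one-message prefix into the cache...
round1 = [{"role": "user", "content": "Review this file"}]
cache = {hashlib.sha256(json.dumps(round1, sort_keys=True).encode()).hexdigest()}

# ...and round 2, which extends that prefix, reuses it.
round2 = round1 + [{"role": "assistant", "content": "Here are my findings..."}]
assert longest_cached_prefix(round2, cache) == 1
```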
Session Example Breakdown
Round 1 – cache is written for system prompt, tools, and static context.
Round 2 – cache hit; only new messages are billed at full price.
Rounds 10, 20, 40 – same pattern; only newly added tokens incur full cost.
Long‑running coding agents and document‑analysis pipelines become economically feasible because static content can represent 60–70% of total token usage.
Compatibility with Manual Breakpoints
Automatic caching does not eliminate fine‑grained control. You can still place explicit cache_control blocks for sections that need separate caching (e.g., a globally shared system prompt while keeping per‑project context separate).
Rules for Keeping the Cache Active
Claude’s cache is prefix‑based, so request ordering determines cache effectiveness. The recommended ordering is:
Base system instruction – global cache across all sessions
Tool definitions – global cache across all sessions
Project‑specific files (e.g., CLAUDE.md) – cache per project
Session state – cache per session
Conversation messages – grow each round
Static content at the start guarantees a cache hit for every session; dynamic content later means only new tokens are billed at full price.
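The recommended ordering can be enforced by a single assembly helper. The field names here are illustrative, not API fields; the point is that the dict is always built from most static to most dynamic:

```python
def assemble_prompt(base_system: str, tools: list, project_files: list,
                    session_state: dict, messages: list) -> dict:
    """Order blocks from most static to most dynamic so the shared prefix is maximal."""
    return {
        "system": base_system,             # global: identical across all sessions
        "tools": tools,                    # global: identical across all sessions
        "project_context": project_files,  # stable per project (e.g., CLAUDE.md)
        "session_state": session_state,    # stable per session
        "messages": messages,              # grows every round
    }
```

Centralizing assembly in one function also prevents call sites from accidentally interleaving dynamic data into the static prefix.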
Common patterns that break the prefix:
Embedded timestamps in the system prompt
Non‑deterministic loading order of tool definitions
Parameters that vary slightly between sessions
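Two of these pitfalls, non-deterministic tool order and incidental byte differences, are cheap to fix at serialization time. A minimal sketch:

```python
import json

def serialize_tools(tools: list) -> str:
    """Sort by name and use canonical JSON so the byte sequence is stable."""
    return json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)

# Same tools, different load order -> identical prefix bytes.
a = serialize_tools([{"name": "search"}, {"name": "read_file"}])
b = serialize_tools([{"name": "read_file"}, {"name": "search"}])
assert a == b
```

Timestamps need a different fix: keep them out of the static prefix entirely and deliver them in a later message, as the next section describes.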
Do Not Change Tools Mid‑Session
Adding or removing tools changes the prefix, wiping the cache and requiring a full rebuild. The Claude team avoids this by keeping tool definitions constant and using mode‑switching tools such as EnterPlanMode and ExitPlanMode to signal changes via system messages.
Use System Messages Instead of Editing the Prompt
When session data changes (file updates, timestamps, user preferences), embed the update in a <system-reminder> tag passed as part of the next user message or tool result. The model reads the tag and applies the update, while the cached prefix remains unchanged.
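Sketched as a helper (the <system-reminder> tag follows the convention described above; the function name is made up):

```python
def inject_reminder(messages: list, update: str, user_text: str) -> list:
    """Deliver a session update inside the next user turn.

    Earlier messages (the cached prefix) are left byte-identical.
    """
    turn = {
        "role": "user",
        "content": f"<system-reminder>{update}</system-reminder>\n{user_text}",
    }
    return messages + [turn]

history = [{"role": "user", "content": "Review this file"}]
out = inject_reminder(history, "main.py was modified since the last turn", "Re-run the review")
assert out[:1] == history  # cached prefix untouched
```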
Avoid Switching Models Within a Session
Cache entries are model‑specific. Switching from Opus to Haiku discards the existing cache and forces a full‑price rebuild of the entire context. If a sub‑task requires a cheaper model, use a sub‑agent: let the primary model hand off a concise context to the secondary model while preserving the main session cache.
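The sub-agent handoff can be sketched as follows. The model name and payload shape are illustrative; the point is that the cheap model gets a small, fresh request while the primary session's request (and therefore its cache) is never modified:

```python
def make_subagent_request(task_summary: str, cheap_model: str = "claude-haiku") -> dict:
    """Fresh, small request for the cheaper model.

    The primary model produces `task_summary` as a concise handoff; the
    primary session's own messages are not reused here, so its cache survives.
    """
    return {
        "model": cheap_model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": task_summary}],
    }
```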
Compression (Safe Fork)
Long‑running agents eventually hit the context‑window limit and must summarize the conversation. A naïve compression call with a different system prompt breaks the cache. The safe‑fork pattern preserves caching:
Use the exact same system prompt, tools, and static context as the parent conversation.
Prepend the full conversation history.
Append the compressed summary as the final user message.
The request is almost identical to the parent’s last round, so the cache is hit and only the compressed summary tokens are billed at full price.
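The safe-fork steps above can be sketched as follows (field names follow the earlier examples, not the literal API; the key property is that the parent's prefix is reused byte for byte):

```python
def safe_fork_request(parent: dict, summary: str) -> dict:
    """Build a compression request that reuses the parent's exact cached prefix."""
    fork_turn = {
        "role": "user",
        # Only these tokens are new; everything before them hits the cache.
        "content": summary,
    }
    return {
        "system": parent["system"],    # byte-identical system prompt
        "tools": parent["tools"],      # byte-identical tool definitions
        "messages": parent["messages"] + [fork_turn],  # full history + summary
    }
```

A naive alternative, sending the summary under a new "summarizer" system prompt, would produce a different prefix from the first byte and pay full price for the entire context.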
When Prompt Caching Is Most Effective
Prompt caching yields the greatest savings when:
The static context exceeds ~1,000 tokens.
The session runs many rounds (e.g., >5 rounds).
Typical use cases include:
Coding agents with large system prompts, many tool definitions, and multi‑round interactions.
Document‑analysis pipelines that repeatedly use the same instructions and context.
Customer‑support bots that rely on extensive, unchanging knowledge bases.
Research assistants that read, synthesize, and iterate over many rounds.
If static context is small and sessions are short, automatic caching provides little benefit.
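Under the article's pricing model (full price on round 1, 10% on each later round), the fraction of static-prefix cost saved depends only on the round count, which is why short sessions barely benefit:

```python
def savings_fraction(rounds: int, cached_rate: float = 0.10) -> float:
    """Fraction of static-prefix token cost saved versus no caching."""
    if rounds == 0:
        return 0.0
    billed = 1 + (rounds - 1) * cached_rate  # in units of one static prefix
    return 1 - billed / rounds

assert savings_fraction(1) == 0.0                 # single round: nothing to reuse
assert round(savings_fraction(40), 4) == 0.8775   # the 40-round example: ~88% saved
```

Savings climb steeply at first (a 2-round session already saves 45%) and approach the 90% ceiling as sessions get longer.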
Overall, automatic prompt caching in Claude's API reduces token costs for any developer building multi‑round agents, provided that cache‑breaking patterns (changing static ordering, editing system prompts, swapping tools, or switching models) are avoided.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.
