How Prompt Caching Supercharges Long‑Running AI Agents: 5 Practical Lessons

This article explains how Claude Code’s Prompt Caching technique dramatically reduces latency and cost for long‑running AI agents, and shares five hard‑won engineering practices—including prompt layout, message‑based updates, avoiding mid‑conversation model or tool changes, and safe context forking—to help developers build efficient, cache‑friendly AI applications.

In engineering, the rule "cache rules everything" also applies to AI agents. Claude Code demonstrates that Prompt Caching—reusing computation from previous turns—significantly cuts latency and cost for long‑running agents.

Prompt Caching works by prefix matching; any change after the shared prefix invalidates the cache, so designing the entire system around this constraint is essential. High cache‑hit rates lower costs and allow more generous rate limits, so Claude Code monitors hit rates and triggers alerts when they drop.
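The Anthropic API makes this measurable: each response's usage block reports cache_read_input_tokens (prefix reused) and cache_creation_input_tokens (prefix written) alongside the regular input_tokens. A minimal monitoring sketch, with an illustrative alert threshold rather than Claude Code's actual value:

```python
# Minimal cache-hit monitoring sketch for Anthropic Messages API responses.
LOW_HIT_RATE = 0.5  # illustrative threshold, not Claude Code's actual alert level

def cache_hit_rate(usage) -> float:
    """Fraction of prompt tokens served from cache for one response."""
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens
    total = cached + created + fresh
    return cached / total if total else 0.0

def check(usage) -> None:
    rate = cache_hit_rate(usage)
    if rate < LOW_HIT_RATE:
        print(f"WARN: cache hit rate {rate:.0%} is below {LOW_HIT_RATE:.0%}")
```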

Lay Out Your Prompts for Caching

Prompt Caching uses prefix matching—everything from the request start to each cache_control breakpoint is cached. Therefore, place static content first and dynamic content later to maximize shared prefixes, layered roughly from most to least stable:

Static system prompts and tools (global cache)

CLAUDE.md (project‑level cache)

Conversation context (in‑session cache)

Dialogue messages

This ordering maximizes the number of sessions that share a cache hit, but it can be surprisingly fragile; inserting timestamps, nondeterministic tool ordering, or updating tool parameters can break the cache.
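Concretely, with the Anthropic Messages API you mark the end of each stable segment with a cache_control breakpoint. A minimal sketch of the layering above, assuming the Anthropic Python SDK; the prompt text, file handling, and model id are placeholders:

```python
# Sketch of a cache-friendly request layout, assuming the Anthropic Python SDK.
# Each cache_control breakpoint caches everything before it, so the most
# stable blocks sit earliest and the growing dialogue comes last.
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "You are a coding agent..."  # placeholder
claude_md_contents = open("CLAUDE.md").read()       # project-level context
conversation_messages = [{"role": "user", "content": "Fix the failing test."}]

response = client.messages.create(
    model="claude-sonnet-4-5",      # placeholder model id
    max_tokens=1024,
    tools=[                          # tools form the very start of the prefix
        {
            "name": "read_file",
            "description": "Read a file from the workspace.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
        # ...more tools, always in a deterministic order...
    ],
    system=[
        {   # global system prompt: shared across all sessions
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
        {   # project-level context (CLAUDE.md): shared within a project
            "type": "text",
            "text": claude_md_contents,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=conversation_messages,  # dialogue grows at the end of the prefix
)
```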

Update via Message Passing

Stale information (e.g., timestamps or user‑edited files) may need updating, but changing the prompt can cause a cache miss and increase cost. Instead, embed updates in the next user or tool message using a <system-reminder> tag.
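A sketch of the pattern, assuming a running messages list; the helper and reminder wording are illustrative:

```python
# Sketch: deliver fresh context as part of the next message instead of editing
# the cached prefix. Everything before the appended message stays byte-identical.
from datetime import date

def with_system_reminder(user_text: str, reminder: str) -> dict:
    """Wrap the update in a <system-reminder> tag inside the next user message."""
    return {
        "role": "user",
        "content": f"<system-reminder>{reminder}</system-reminder>\n\n{user_text}",
    }

messages = []  # the running conversation; in practice this already holds prior turns
messages.append(
    with_system_reminder(
        user_text="Please continue the refactor.",
        reminder=f"Today's date is {date.today().isoformat()}. "
                 "The user edited src/app.py since your last read of it.",
    )
)
```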

Never Switch Models Mid‑Conversation

Cache entries are model‑specific, so switching models mid‑conversation is counter‑productive. For example, after 100 k tokens with Opus, asking one simple follow‑up question with Haiku can cost more than staying on Opus, because the entire prefix must be reprocessed and cached again for the new model. Use sub‑agents instead: have Opus hand off a short "transition" message to the other model in a separate request.
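A sketch of the hand-off, assuming the Anthropic Python SDK; the model id, brief wording, and test_log variable are placeholders:

```python
# Sketch of a sub-agent hand-off. The main Opus conversation (and its cache)
# is untouched; a short, self-contained brief goes to the cheaper model in a
# fresh request with its own small prefix.
import anthropic

client = anthropic.Anthropic()

def ask_subagent(brief: str) -> str:
    """Run a one-off task on a cheaper model without touching the parent context."""
    response = client.messages.create(
        model="claude-haiku-4-5",   # placeholder model id
        max_tokens=512,
        messages=[{"role": "user", "content": brief}],
    )
    return response.content[0].text

# The parent Opus agent writes the "transition" brief itself; the harness
# forwards it, so the 100k-token parent prefix never has to be re-cached.
test_log = "..."  # placeholder
answer = ask_subagent("List the failing test names from this log:\n" + test_log)
```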

Never Add or Remove Tools Mid‑Conversation

Changing the tool set during a dialogue is a common way to break the cache because tools are part of the cache prefix.

Planning Mode – Design Around Cache

Instead of swapping tools, keep all tools present and treat mode switches as tool calls. Define EnterPlanMode and ExitPlanMode as tools; when a user enters planning mode, the agent receives a system message explaining the mode and the allowed actions, while the tool definitions remain unchanged.

This also lets the model autonomously enter planning mode when needed, without breaking the cache.
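A sketch of the pattern; the tool names follow the article, while the schemas, helper, and reminder wording are illustrative:

```python
# Sketch: mode switches as tool calls, not tool-set changes. EnterPlanMode and
# ExitPlanMode stay in the tool list for the whole session, so the cached prefix
# never changes; behaviour changes via the tool_result the harness sends back.
PLAN_MODE_TOOLS = [
    {
        "name": "EnterPlanMode",
        "description": "Switch to planning mode: propose a plan before editing files.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "ExitPlanMode",
        "description": "Leave planning mode and resume making edits.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

def enter_plan_mode_result(tool_use_id: str) -> dict:
    """tool_result for EnterPlanMode; the mode explanation rides along as a message."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": "<system-reminder>Plan mode is active: read and analyze freely, "
                       "but do not modify files until ExitPlanMode is called.</system-reminder>",
        }],
    }
```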

Tool Search – Defer Loading

Claude Code can carry dozens of tools, and removing unused ones mid‑conversation would invalidate the cache. The solution is to register lightweight stubs with defer_loading: true; the model loads a full tool definition only when it selects that tool, keeping the cache prefix stable.
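A sketch of what such a tool list might look like; the defer_loading flag follows the article and Anthropic's tool-search beta, and the tool-search type string is assumed, so verify field names against current documentation:

```python
# Sketch: every tool stays in the prefix, but rarely-used ones are registered as
# lightweight stubs via defer_loading. The model expands a stub only when it
# actually selects that tool, so the cached prefix never shrinks or reorders.
tools = [
    # Tool-search entry; exact type string is assumed from the beta docs.
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search"},
    {
        "name": "query_database",
        "description": "Run a read-only SQL query against the analytics database.",
        "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
        "defer_loading": True,   # stub only; full definition loaded on selection
    },
    # ...dozens more deferred tools; the list itself never changes mid-conversation...
]
```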

Forked Context – Compaction

Compaction occurs when the context window is exhausted: the conversation is summarized and continued in a new session. If the summarization call uses a different system prompt or omits tools, the cache prefix no longer matches, forcing a full‑price recomputation.

Solution – Cache‑Safe Fork

When performing compaction, reuse the exact system prompt, user context, system context, and tool definitions from the parent conversation. Prepend the parent messages, then append the compaction prompt as a new user message. From the API perspective, the request looks almost identical to the parent’s last request, so the cache prefix is reused; only the compaction prompt tokens are new.

This requires maintaining a "compaction buffer" to ensure enough space for the new prompt and its summary output.
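A sketch of a cache-safe compaction call, assuming a parent object that carries the live session's exact request parameters; the compaction prompt wording is illustrative:

```python
# Sketch of a cache-safe fork for compaction. System blocks, tools, and prior
# messages are reused verbatim, so the parent's cached prefix is read rather
# than rebuilt; only the compaction prompt tokens are new.
COMPACTION_PROMPT = (  # illustrative wording, not Claude Code's actual prompt
    "Summarize this conversation: key decisions, open tasks, and current file state."
)

def compact(client, parent) -> str:
    """parent holds the exact model, system, tools, and messages of the live session."""
    response = client.messages.create(
        model=parent.model,                   # same model: cache entries are model-specific
        max_tokens=parent.compaction_buffer,  # reserved room for the summary output
        system=parent.system,                 # byte-for-byte identical system blocks
        tools=parent.tools,                   # identical tool definitions, same order
        messages=parent.messages
        + [{"role": "user", "content": COMPACTION_PROMPT}],
    )
    return response.content[0].text           # seeds the fresh post-compaction session
```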

Key Takeaways

Prompt caching is prefix matching. Any change anywhere in the prefix invalidates later content. Design the whole system around this constraint.

Use messages, not system prompts, for updates. Insert date changes or mode switches as dialogue messages.

Never change tools or models mid‑conversation. Simulate state transitions with tools and use sub‑agents for model switches; defer tool loading instead of deleting.

Monitor cache‑hit rates like uptime. Small percentage changes can dramatically affect cost and latency.

Forked operations must share the parent’s prefix. Use identical cache‑safe parameters for side‑computations such as compaction or summarization.

Claude Code was built around Prompt Caching from day one; if you are building an agent, you should do the same.
Tags: large language models, system design, cost optimization, context management, prompt caching
Written by AI Code to Success

Focused on hardcore practical AI technologies (OpenClaw, ClaudeCode, LLMs, etc.) and HarmonyOS development. No hype—just real-world tips, pitfall chronicles, and productivity tools. Follow to transform workflows with code.
