How Prompt Caching Supercharges Long‑Running AI Agents: 5 Practical Lessons
This article explains how Claude Code uses Prompt Caching to dramatically reduce latency and cost for long-running AI agents. It shares five hard-won engineering practices, covering prompt layout, message-based updates, avoiding mid-conversation model and tool changes, and cache-safe context forking, to help developers build efficient, cache-friendly AI applications.
In engineering, the adage "cache rules everything" applies to AI agents as well. Claude Code demonstrates that Prompt Caching, which reuses computation from previous turns, significantly cuts latency and cost for long-running agents.
Prompt Caching works by prefix matching; any change after the shared prefix invalidates the cache, so designing the entire system around this constraint is essential. High cache‑hit rates lower costs and allow more generous rate limits, so Claude Code monitors hit rates and triggers alerts when they drop.
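A minimal sketch of that kind of monitoring, assuming the usage fields (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) reported in Anthropic Messages API responses; the alert threshold is illustrative:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache in one API response."""
    read = usage.get("cache_read_input_tokens", 0)        # served from cache
    created = usage.get("cache_creation_input_tokens", 0)  # newly written to cache
    fresh = usage.get("input_tokens", 0)                   # uncached input
    total = read + created + fresh
    return read / total if total else 0.0

def should_alert(usage: dict, threshold: float = 0.8) -> bool:
    """True when the hit rate drops below the threshold."""
    return cache_hit_rate(usage) < threshold

usage = {"input_tokens": 50,
         "cache_read_input_tokens": 900,
         "cache_creation_input_tokens": 50}
print(cache_hit_rate(usage))  # 0.9
```

Tracking this ratio per request makes a regression (e.g., a timestamp accidentally inserted into the prefix) visible immediately rather than on the monthly bill.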
Layout Your Prompts for Caching
Prompt Caching uses prefix matching—everything from the request start to each cache_control breakpoint is cached. Therefore, place static content first and dynamic content later to maximize shared prefixes.
Static system prompts and tools (global cache)
CLAUDE.md (project‑level cache)
Conversation context (in‑session cache)
Dialogue messages
This ordering maximizes the number of sessions that share a cache hit, but it is surprisingly fragile: inserting a timestamp, emitting tools in a nondeterministic order, or updating a tool's parameters can all break the cache.
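The layering above can be sketched as a request builder, assuming the Messages API shape where system blocks carry `cache_control` breakpoints (the API caches the prefix in tools → system → messages order); the helper name `build_request` is my own:

```python
def build_request(tools, static_system, claude_md, session_context, messages):
    """Assemble a request with static content first so the cached prefix is maximal."""
    return {
        "tools": tools,  # stable, deterministically ordered tool definitions
        "system": [
            # Global cache: identical across all sessions.
            {"type": "text", "text": static_system,
             "cache_control": {"type": "ephemeral"}},
            # Project-level cache: shared by all sessions in one project.
            {"type": "text", "text": claude_md,
             "cache_control": {"type": "ephemeral"}},
            # In-session context: varies per session, so it comes last.
            {"type": "text", "text": session_context},
        ],
        "messages": messages,  # dialogue messages, append-only
    }

req = build_request([], "You are a coding agent.", "# CLAUDE.md ...", "cwd: /repo", [])
```

Anything volatile (the working directory, open files) sits after the breakpoints, so edits there never invalidate the global or project-level cache.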
Update via Message Passing
Stale information (e.g., timestamps or files the user has edited) sometimes needs to be refreshed, but editing the system prompt invalidates the cache and increases cost. Instead, embed the update in the next user or tool message, wrapped in a <system-reminder> tag.
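A minimal sketch of this pattern; `with_reminder` is a hypothetical helper, and the reminder text is illustrative. The system prompt and all earlier messages stay byte-identical, so the cached prefix still matches:

```python
def with_reminder(messages, user_text, note):
    """Append a new user turn carrying both the user's text and a state update."""
    turn = {
        "role": "user",
        "content": [
            {"type": "text", "text": user_text},
            # Fresh state rides along in the newest message, never in the prefix.
            {"type": "text",
             "text": f"<system-reminder>\n{note}\n</system-reminder>"},
        ],
    }
    return messages + [turn]

history = [{"role": "user", "content": "hi"}]
updated = with_reminder(history, "continue", "The user edited src/main.py.")
```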
Never Switch Models Mid‑Conversation
Cache entries are model‑specific, which makes model switches counterintuitively expensive. For example, after accumulating 100 k tokens of context with Opus, asking even a simple follow‑up question with Haiku can cost more than staying on Opus, because Haiku must rebuild the cache from scratch. Instead, use sub‑agents: have Opus hand off a short "transition" message to the other model.
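A minimal sketch of the sub-agent pattern; the helper name, the short handoff summary, and the model id are all placeholders of mine. The parent conversation keeps its Opus cache, while the cheaper model starts a small, fresh context:

```python
def delegate_to_subagent(handoff_summary, question, sub_model):
    """Build a fresh, small request for a cheaper model instead of switching
    the parent conversation (which would discard its model-specific cache)."""
    return {
        "model": sub_model,
        "messages": [{
            "role": "user",
            "content": f"Context handoff:\n{handoff_summary}\n\nTask: {question}",
        }],
    }

req = delegate_to_subagent("We are refactoring the auth module.",
                           "What does HTTP 407 mean?",
                           "haiku-model-id")
```

The sub-agent pays full price only for its tiny context; the answer is then returned to the parent as an ordinary tool result, leaving the parent's prefix untouched.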
Never Add or Remove Tools Mid‑Conversation
Changing the tool set during a dialogue is a common way to break the cache because tools are part of the cache prefix.
Planning Mode – Design Around Cache
Instead of swapping tools, keep all tools present and treat mode switches as tool calls. Define EnterPlanMode and ExitPlanMode as tools; when a user enters planning mode, the agent receives a system message explaining the mode and the allowed actions, while the tool definitions remain unchanged.
This also lets the model autonomously enter planning mode when needed, without breaking the cache.
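A minimal sketch of mode switching as tool calls, following the EnterPlanMode/ExitPlanMode idea above; the schemas and reminder text are illustrative:

```python
# The tool list is fixed for the whole conversation; it is part of the cache prefix.
TOOLS = [
    {"name": "EnterPlanMode", "description": "Switch to read-only planning.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "ExitPlanMode", "description": "Leave planning mode.",
     "input_schema": {"type": "object", "properties": {}}},
    # ...all other tools stay present in every request...
]

def on_tool_call(name, messages):
    """Handle a mode switch by appending a message; never edit TOOLS."""
    if name == "EnterPlanMode":
        note = ("<system-reminder>Plan mode is active: analyze and propose, "
                "but do not edit files.</system-reminder>")
        messages = messages + [{"role": "user", "content": note}]
    return messages

msgs = on_tool_call("EnterPlanMode", [])
```

The mode lives in the message history (the growing suffix), not in the tool definitions (the cached prefix), so switching modes costs only the tokens of one message.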
Tool Search – Defer Loading
Claude Code can carry dozens of tools. Removing unused ones mid‑conversation would invalidate the cache, so the solution is to send lightweight stubs marked defer_loading: true. The model loads a tool's full definition only when it selects that tool, keeping the cache prefix stable.
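A minimal sketch, taking the `defer_loading: true` flag at face value from the description above; the registry and helper names are hypothetical:

```python
FULL_TOOLS = {
    "grep": {"name": "grep", "description": "Search files by regex.",
             "input_schema": {"type": "object",
                              "properties": {"pattern": {"type": "string"},
                                             "path": {"type": "string"}}}},
    # ...dozens more...
}

def stub(name):
    """Lightweight entry sent in every request; the full schema is deferred."""
    return {"name": name,
            "description": FULL_TOOLS[name]["description"],
            "defer_loading": True}

def load_full(name):
    """Fetched only after the model selects this tool; the stub list, and
    therefore the cache prefix, never changes."""
    return FULL_TOOLS[name]

stubs = [stub(n) for n in sorted(FULL_TOOLS)]  # deterministic ordering matters
```

Note the `sorted()` call: even with stubs, a nondeterministic tool order would produce a different prefix on every request and silently zero out the hit rate.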
Forked Context – Compaction
Compaction occurs when the context window is exhausted: the conversation is summarized and continued in a new session. If the summarization call uses a different system prompt or omits tools, the cache prefix no longer matches, forcing a full‑price recomputation.
Solution – Cache‑Safe Fork
When performing compaction, reuse the exact system prompt, user context, system context, and tool definitions from the parent conversation. Prepend the parent messages, then append the compaction prompt as a new user message. From the API perspective, the request looks almost identical to the parent’s last request, so the cache prefix is reused; only the compaction prompt tokens are new.
This requires maintaining a "compaction buffer" to ensure enough space for the new prompt and its summary output.
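A minimal sketch of such a cache-safe fork; the function name, buffer size, and token-budget check are assumptions. Everything before the compaction prompt is byte-identical to the parent request, so the prefix is served from cache:

```python
COMPACTION_BUFFER = 4000  # tokens reserved for the prompt and its summary output

def fork_for_compaction(parent_request, compaction_prompt, tokens_used, window):
    """Reuse the parent's system prompt and tools verbatim; only append."""
    if window - tokens_used < COMPACTION_BUFFER:
        raise RuntimeError("compact earlier: no room left for the summary")
    return {
        "model": parent_request["model"],
        "system": parent_request["system"],  # identical, not re-rendered
        "tools": parent_request["tools"],    # identical, not filtered
        "messages": parent_request["messages"]
                    + [{"role": "user", "content": compaction_prompt}],
    }

parent = {"model": "m", "system": "s", "tools": [],
          "messages": [{"role": "user", "content": "hi"}]}
fork = fork_for_compaction(parent, "Summarize this conversation.", 100_000, 200_000)
```

The common mistake is re-rendering the system prompt (picking up a new timestamp) or dropping tools "because the summarizer doesn't need them"; either change forces a full-price recomputation of the entire parent context.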
Key Takeaways
Prompt caching is prefix matching. Any change anywhere in the prefix invalidates later content. Design the whole system around this constraint.
Use messages, not system prompts, for updates. Insert date changes or mode switches as dialogue messages.
Never change tools or models mid‑conversation. Simulate state transitions with tools and use sub‑agents for model switches; defer tool loading instead of deleting.
Monitor cache‑hit rates like uptime. Small percentage changes can dramatically affect cost and latency.
Forked operations must share the parent’s prefix. Use identical cache‑safe parameters for side‑computations such as compaction or summarization.
Claude Code was built around Prompt Caching from day one; if you are building an agent, you should do the same.
AI Code to Success
Focused on hardcore, practical AI technologies (OpenClaw, ClaudeCode, LLMs, etc.) and HarmonyOS development. No hype, just real-world tips, pitfall chronicles, and productivity tools. Follow along to transform your workflow with code.