Why Prompt Caching Is Critical: Lessons from Building Claude Code
Prompt caching, a prefix-matching technique that reuses computation from earlier LLM requests, proved essential to Claude Code's low latency and cost. This article details counter-intuitive practices: arranging static prompt content first, delivering updates via messages, avoiding mid-session model or tool changes, and forking context in a cache-safe way.
Engineers often say “cache rules everything,” and this principle applies equally to LLM agents. Prompt caching—implemented via prefix matching—stores all content from the request start up to each cache‑control breakpoint, making request order crucial for sharing prefixes across calls.
1. Arrange Prompt Content for Cache Efficiency
Static content should be placed before dynamic content. Claude Code organizes prompts as follows:
Global static system prompts and tools (global cache)
Project-level context, e.g. CLAUDE.md (project-level cache)
Session context (session‑level cache)
Conversation messages
This maximizes the probability that different sessions hit the same cached prefixes. The structure is fragile, however; common breakages include inserting timestamps into static prompts, randomizing tool order, or updating tool parameters.
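A minimal sketch of this ordering, assuming the Anthropic Messages API style of `cache_control` breakpoints; the prompt contents, tool, and model id here are illustrative placeholders, not Claude Code's actual prompts:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder content -- in a real agent these would be the actual prompts;
# here they only illustrate the static-before-dynamic ordering.
GLOBAL_SYSTEM_PROMPT = "You are a coding agent. <static instructions...>"
PROJECT_CLAUDE_MD = "<contents of the project's CLAUDE.md>"
SESSION_CONTEXT = "<per-session environment info, set once at session start>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",      # illustrative model id
    max_tokens=1024,
    # Tools are part of the cached prefix too: keep the list and its order fixed.
    tools=[{
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    }],
    system=[
        # 1) Global static prompt: identical across all sessions.
        {"type": "text", "text": GLOBAL_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        # 2) Project-level context (CLAUDE.md): stable within a project.
        {"type": "text", "text": PROJECT_CLAUDE_MD,
         "cache_control": {"type": "ephemeral"}},
        # 3) Session context: stable within one session.
        {"type": "text", "text": SESSION_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    # 4) Conversation messages: the only part that changes turn by turn.
    messages=[{"role": "user", "content": "Fix the failing test in utils.py"}],
)
```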
2. Update Information via Messages, Not Prompt Edits
When prompt information goes stale (e.g., the time advances or a file changes), editing the prompt invalidates the cache and raises costs. Instead, embed the update in the next user message or tool response inside a <system-reminder> tag (e.g., "Now it is Wednesday"), preserving the cache.
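For example, rather than rewriting the cached system prompt when the clock or a watched file changes, the update can ride along as new message tokens. The `<system-reminder>` wrapping follows the article's description; the helper function itself is illustrative:

```python
def append_update(messages: list[dict], reminder: str, user_text: str) -> list[dict]:
    """Deliver stale or changed context as new message tokens instead of
    editing the cached system prompt, so the existing prefix still matches."""
    return messages + [{
        "role": "user",
        "content": f"<system-reminder>{reminder}</system-reminder>\n{user_text}",
    }]

messages = append_update(
    messages=[],  # prior conversation turns would go here
    reminder="Now it is Wednesday. src/app.py was modified since the last turn.",
    user_text="Continue with the refactor.",
)
```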
3. Avoid Switching Models Mid‑Conversation
Caches are model-specific. Switching from a more expensive model (e.g., Opus) to a cheaper one (e.g., Haiku) after extensive interaction forces a full cache rebuild, so the "cheaper" request can end up costing more. If a model switch is genuinely needed, use a sub-agent: have the current model write a handoff message and delegate the task to the new model.
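A sketch of that sub-agent pattern under the same assumptions: the parent conversation and its cached prefix stay untouched, and the cheaper model receives a fresh, self-contained request built from the handoff summary (function name and prompt text are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

def delegate_to_subagent(handoff_summary: str, task: str) -> str:
    """Run a bounded task on a cheaper model as a separate sub-agent request.
    The parent (e.g. Opus) conversation and its cache are left intact; only
    the handoff summary crosses the boundary."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",   # illustrative cheaper model
        max_tokens=1024,
        system="You are a sub-agent. Complete only the delegated task.",
        messages=[{
            "role": "user",
            "content": f"Handoff from the main agent:\n{handoff_summary}\n\nTask:\n{task}",
        }],
    )
    return response.content[0].text
```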
4. Never Add or Remove Tools Mid‑Conversation
Changing the tool set breaks the cache because tools are part of the prompt prefix. Instead of removing tools, keep all tools in the request and use a deferred‑loading placeholder (tool name with defer_loading: true). The model loads the full tool schema only when it selects the tool, keeping the prefix stable.
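The deferred-loading placeholder might look roughly like this; the `defer_loading` flag is the one the article names, but the rest of the shape is an assumption, not a documented contract:

```python
# All tools stay in every request; rarely-used ones are sent as deferred
# placeholders so the cached prefix never changes.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    },
    {
        "name": "run_database_migration",
        "description": "Rarely needed; full schema is loaded only if selected.",
        "input_schema": {"type": "object", "properties": {}},
        "defer_loading": True,   # placeholder stays in the prefix; schema loads on demand
    },
]
# Wrong: dropping the migration tool from `tools` -> different prefix, cache miss.
# Right: keep the full list every turn        -> identical prefix, cache hit.
```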
5. Design Features Around Cache Constraints
Features such as a planning mode should be implemented as tools rather than by altering the tool list. Entering and exiting planning mode are handled by dedicated tools while the full tool set stays unchanged, so the cache remains intact.
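One way to express that as tool definitions (the tool names here are illustrative, not necessarily Claude Code's exact ones): the mode change becomes a tool call inside the conversation, so the tool list in the prefix never changes.

```python
# Planning mode is entered and exited by *calling* tools, not by swapping the
# tool list, so the cached prefix is identical in and out of planning mode.
PLAN_MODE_TOOLS = [
    {
        "name": "enter_plan_mode",
        "description": "Switch to read-only planning; propose a plan before editing.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "exit_plan_mode",
        "description": "Present the plan and ask the user to approve execution.",
        "input_schema": {
            "type": "object",
            "properties": {"plan": {"type": "string"}},
            "required": ["plan"],
        },
    },
]

# These two are appended to the agent's fixed tool list at startup and are
# present in every request, whether or not planning mode is active.
```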
6. Cache‑Safe Context Forks for Compression
When the context window is exhausted, compress the conversation by summarizing it and starting a new session. To avoid cache loss, the compression request must reuse the exact system prompt, user and system context, and tool definitions of the parent conversation, appending only the compression prompt as new tokens. Reserve a “compression buffer” to ensure enough space for the summary.
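A sketch of a cache-safe fork, assuming the same request shape as above (the prompt text, buffer size, and parameter names are illustrative): everything before the compression prompt is byte-identical to the parent conversation, so the summarization request is mostly a cache read.

```python
COMPRESSION_PROMPT = (
    "Summarize the conversation so far: goals, decisions, file changes, and "
    "open tasks, so a fresh session can continue the work."
)
COMPRESSION_BUFFER_TOKENS = 4096   # reserved so the summary itself has room to fit

def compress_conversation(client, system_blocks, tools, messages, model) -> str:
    """Fork the conversation for compression without losing the cache: reuse
    the parent's exact system blocks, tools, and messages, and append only the
    compression prompt as new tokens."""
    summary = client.messages.create(
        model=model,                      # same model as the parent conversation
        max_tokens=COMPRESSION_BUFFER_TOKENS,
        system=system_blocks,             # byte-identical to the parent request
        tools=tools,                      # byte-identical to the parent request
        messages=messages + [{"role": "user", "content": COMPRESSION_PROMPT}],
    )
    # The summary becomes the first user message of the new session, which
    # restarts with the same static prefix (system blocks and tools).
    return summary.content[0].text
```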
7. Key Takeaways
Update information via messages rather than editing system prompts.
Do not change tools or models mid‑conversation; use tools to signal state changes and sub‑agents for model switches.
Monitor cache hit rates like service availability; even a drop of a few percentage points can significantly affect cost and latency (a sketch of one way to compute this follows the list).
When forking the conversation (e.g., for summarization), reuse the parent’s cache‑safe parameters to hit the existing prefix.
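For the monitoring point above, a rough per-request hit rate can be derived from the usage fields the Messages API returns; the threshold and printout here are illustrative:

```python
def cache_hit_rate(usage) -> float:
    """Fraction of input tokens served from cache, from one response's usage block."""
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    fresh = usage.input_tokens or 0           # uncached input tokens
    total = read + written + fresh
    return read / total if total else 0.0

# `response` is any Messages API response from the calls sketched above.
rate = cache_hit_rate(response.usage)
if rate < 0.90:                               # illustrative alert threshold
    print(f"cache hit rate dropped to {rate:.1%}; investigate prefix changes")
```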
Claude Code was built from the ground up around prompt caching, and the same approach should be applied when developing any LLM‑based agent.