Why Prompt Caching Is Critical: Lessons from Building Claude Code
Prompt caching, a prefix-matching technique that reuses computation from earlier LLM requests, proved essential to Claude Code's low latency and cost. This article details counter-intuitive practices: arranging static prompt content first, delivering updates via messages, avoiding mid-session model or tool changes, and forking context in a cache-safe way.
Engineers often say “cache rules everything,” and this principle applies equally to LLM agents. Prompt caching—implemented via prefix matching—stores all content from the request start up to each cache‑control breakpoint, making request order crucial for sharing prefixes across calls.
1. Arrange Prompt Content for Cache Efficiency
Static content should be placed before dynamic content. Claude Code organizes prompts as follows:
Global static system prompts and tools (global cache)
Project-level context, e.g. CLAUDE.md (project-level cache)
Session context (session‑level cache)
Conversation messages
This maximizes the probability that different sessions hit the same cached prefixes. The structure is fragile, however; common breakages include inserting timestamps into static prompts, randomizing tool order, or updating tool parameters.
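A minimal sketch of this ordering, assuming the Anthropic Messages API style of `cache_control` breakpoints; the prompt contents, tool, and model id here are illustrative placeholders, not Claude Code's actual prompts:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder content -- in a real agent these would be the actual prompts;
# here they only illustrate the static-before-dynamic ordering.
GLOBAL_SYSTEM_PROMPT = "You are a coding agent. <static instructions...>"
PROJECT_CLAUDE_MD = "<contents of the project's CLAUDE.md>"
SESSION_CONTEXT = "<per-session environment info, set once at session start>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",      # illustrative model id
    max_tokens=1024,
    # Tools are part of the cached prefix too: keep the list and its order fixed.
    tools=[{
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    }],
    system=[
        # 1) Global static prompt: identical across all sessions.
        {"type": "text", "text": GLOBAL_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        # 2) Project-level context (CLAUDE.md): stable within a project.
        {"type": "text", "text": PROJECT_CLAUDE_MD,
         "cache_control": {"type": "ephemeral"}},
        # 3) Session context: stable within one session.
        {"type": "text", "text": SESSION_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    # 4) Conversation messages: the only part that changes turn by turn.
    messages=[{"role": "user", "content": "Fix the failing test in utils.py"}],
)
```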
2. Update Information via Messages, Not Prompt Edits
When prompt information goes stale (e.g., the time advances or a file changes), editing the prompt invalidates the cache and raises costs. Instead, embed the update in the next user message or tool response inside a <system-reminder> tag (e.g., "Now it is Wednesday"), preserving the cache.
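For example, rather than rewriting the cached system prompt when the clock or a watched file changes, the update can ride along as new message tokens. The `<system-reminder>` wrapping follows the article's description; the helper function itself is illustrative:

```python
def append_update(messages: list[dict], reminder: str, user_text: str) -> list[dict]:
    """Deliver stale or changed context as new message tokens instead of
    editing the cached system prompt, so the existing prefix still matches."""
    return messages + [{
        "role": "user",
        "content": f"<system-reminder>{reminder}</system-reminder>\n{user_text}",
    }]

messages = append_update(
    messages=[],  # prior conversation turns would go here
    reminder="Now it is Wednesday. src/app.py was modified since the last turn.",
    user_text="Continue with the refactor.",
)
```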
3. Avoid Switching Models Mid‑Conversation
Caches are model-specific. Switching from a more expensive model (e.g., Opus) to a cheaper one (e.g., Haiku) after extensive interaction forces a full cache rebuild, so the "cheaper" request can end up costing more. If a model switch is genuinely needed, use a sub-agent: have the current model write a handoff message and delegate the task to the new model.
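A sketch of that sub-agent pattern under the same assumptions: the parent conversation and its cached prefix stay untouched, and the cheaper model receives a fresh, self-contained request built from the handoff summary (function name and prompt text are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

def delegate_to_subagent(handoff_summary: str, task: str) -> str:
    """Run a bounded task on a cheaper model as a separate sub-agent request.
    The parent (e.g. Opus) conversation and its cache are left intact; only
    the handoff summary crosses the boundary."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",   # illustrative cheaper model
        max_tokens=1024,
        system="You are a sub-agent. Complete only the delegated task.",
        messages=[{
            "role": "user",
            "content": f"Handoff from the main agent:\n{handoff_summary}\n\nTask:\n{task}",
        }],
    )
    return response.content[0].text
```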
4. Never Add or Remove Tools Mid‑Conversation
Changing the tool set breaks the cache because tools are part of the prompt prefix. Instead of removing tools, keep all tools in the request and use a deferred‑loading placeholder (tool name with defer_loading: true). The model loads the full tool schema only when it selects the tool, keeping the prefix stable.
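The deferred-loading placeholder might look roughly like this; the `defer_loading` flag is the one the article names, but the rest of the shape is an assumption, not a documented contract:

```python
# All tools stay in every request; rarely-used ones are sent as deferred
# placeholders so the cached prefix never changes.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    },
    {
        "name": "run_database_migration",
        "description": "Rarely needed; full schema is loaded only if selected.",
        "input_schema": {"type": "object", "properties": {}},
        "defer_loading": True,   # placeholder stays in the prefix; schema loads on demand
    },
]
# Wrong: dropping the migration tool from `tools` -> different prefix, cache miss.
# Right: keep the full list every turn        -> identical prefix, cache hit.
```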
5. Design Features Around Cache Constraints
Features such as a planning mode should be implemented as tools rather than by altering the tool list. Entering and exiting planning mode are handled by dedicated tools while the full tool set stays unchanged, so the cache remains intact.
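One way to express that as tool definitions (the tool names here are illustrative, not necessarily Claude Code's exact ones): the mode change becomes a tool call inside the conversation, so the tool list in the prefix never changes.

```python
# Planning mode is entered and exited by *calling* tools, not by swapping the
# tool list, so the cached prefix is identical in and out of planning mode.
PLAN_MODE_TOOLS = [
    {
        "name": "enter_plan_mode",
        "description": "Switch to read-only planning; propose a plan before editing.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "exit_plan_mode",
        "description": "Present the plan and ask the user to approve execution.",
        "input_schema": {
            "type": "object",
            "properties": {"plan": {"type": "string"}},
            "required": ["plan"],
        },
    },
]

# These two are appended to the agent's fixed tool list at startup and are
# present in every request, whether or not planning mode is active.
```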
6. Cache‑Safe Context Forks for Compression
When the context window is exhausted, compress the conversation by summarizing it and starting a new session. To avoid cache loss, the compression request must reuse the exact system prompt, user and system context, and tool definitions of the parent conversation, appending only the compression prompt as new tokens. Reserve a “compression buffer” to ensure enough space for the summary.
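A sketch of a cache-safe fork, assuming the same request shape as above (the prompt text, buffer size, and parameter names are illustrative): everything before the compression prompt is byte-identical to the parent conversation, so the summarization request is mostly a cache read.

```python
COMPRESSION_PROMPT = (
    "Summarize the conversation so far: goals, decisions, file changes, and "
    "open tasks, so a fresh session can continue the work."
)
COMPRESSION_BUFFER_TOKENS = 4096   # reserved so the summary itself has room to fit

def compress_conversation(client, system_blocks, tools, messages, model) -> str:
    """Fork the conversation for compression without losing the cache: reuse
    the parent's exact system blocks, tools, and messages, and append only the
    compression prompt as new tokens."""
    summary = client.messages.create(
        model=model,                      # same model as the parent conversation
        max_tokens=COMPRESSION_BUFFER_TOKENS,
        system=system_blocks,             # byte-identical to the parent request
        tools=tools,                      # byte-identical to the parent request
        messages=messages + [{"role": "user", "content": COMPRESSION_PROMPT}],
    )
    # The summary becomes the first user message of the new session, which
    # restarts with the same static prefix (system blocks and tools).
    return summary.content[0].text
```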
7. Key Takeaways
Update information via messages rather than editing system prompts.
Do not change tools or models mid‑conversation; use tools to signal state changes and sub‑agents for model switches.
Monitor cache hit rates like service availability; even a drop of a few percentage points can significantly affect cost and latency (a sketch of one way to compute this follows the list).
When forking the conversation (e.g., for summarization), reuse the parent’s cache‑safe parameters to hit the existing prefix.
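For the monitoring point above, a rough per-request hit rate can be derived from the usage fields the Messages API returns; the threshold and printout here are illustrative:

```python
def cache_hit_rate(usage) -> float:
    """Fraction of input tokens served from cache, from one response's usage block."""
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    fresh = usage.input_tokens or 0           # uncached input tokens
    total = read + written + fresh
    return read / total if total else 0.0

# `response` is any Messages API response from the calls sketched above.
rate = cache_hit_rate(response.usage)
if rate < 0.90:                               # illustrative alert threshold
    print(f"cache hit rate dropped to {rate:.1%}; investigate prefix changes")
```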
Claude Code was built from the ground up around prompt caching, and the same approach should be applied when developing any LLM‑based agent.