Claude Code Prompt‑Caching Bug Drained Quotas—Anthropic’s Hotfix and Architecture Reveal
A prompt‑caching bug in Claude Code caused users' quota to deplete rapidly, prompting Anthropic to issue an emergency hotfix in version 2.1.62, reset rate limits, and publicly disclose the core architecture and five counter‑intuitive caching rules for building reliable AI agents.
Around February 26, users on GitHub, Hacker News, and Reddit reported that Claude Code’s rate limit was being consumed far faster than normal; some premium subscribers hit 90% usage after only a few exchanges. Anthropic engineer Thariq confirmed on February 27 that a bug in the prompt‑caching system was responsible and released a hotfix in version 2.1.62, resetting rate limits for all users.
The bug stemmed from how Claude Code’s prompt cache works: it relies on exact prefix matching. When the cache mistakenly treated a request that should have hit as brand‑new input, every token after the break was billed at the full rate, multiplying quota consumption.
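The cost cliff of a broken prefix can be sketched in a few lines. This is an illustrative model, not Anthropic’s actual implementation: it assumes cached tokens bill at 10% of the full rate, per the figures cited later in the article.

```python
def billable_cost(request_tokens, cached_prefix, full_rate=1.0, cached_rate=0.1):
    """Toy model of exact-prefix prompt caching.

    Tokens covered by the cached prefix bill at the discounted rate;
    everything after the longest exact match bills at full price.
    If the prefix match breaks, the entire request bills at full price.
    """
    n = len(cached_prefix)
    if request_tokens[:n] == cached_prefix:
        hit, miss = n, len(request_tokens) - n   # prefix reused cheaply
    else:
        hit, miss = 0, len(request_tokens)       # cache break: full re-bill
    return hit * cached_rate + miss * full_rate

request = list(range(1000))           # 1,000-token request (token IDs are stand-ins)
good_prefix = list(range(900))        # 900 tokens already cached
broken_prefix = [-1] + list(range(899))  # one divergent token at the front

cost_hit = billable_cost(request, good_prefix)      # 900*0.1 + 100*1.0
cost_miss = billable_cost(request, broken_prefix)   # 1000*1.0
```

With these assumed rates, a single divergent token near the front of the prompt roughly quintuples the bill for the same request, which is exactly the failure mode users observed.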
While the reset restored limits, some users complained that the blanket reset shifted their subscription renewal date, effectively losing up to 63% of remaining quota—a trade‑off that sparked debate.
On the same day, Anthropic published the technical blog post “Lessons from Building Claude Code: Prompt Caching Is Everything”, revealing the system’s architecture. The authors state that cache hit rate is monitored like uptime because a hit reduces token cost to about 10% of the normal price (e.g., a cached read on Claude Sonnet costs roughly US$0.30 per million tokens versus US$3 per million on a miss), making it a critical cost and latency factor for large‑scale agents.
Claude Code’s request structure is deliberately layered to maximize cache reuse:
Layer 1 : Global system prompt and tool definitions (global cache).
Layer 2 : Project‑level configuration (project cache).
Layer 3 : Session context such as environment and output format (session cache).
Layer 4 : Per‑turn user messages and tool results (grows with each turn).
Static content is placed before dynamic content so that the longest possible prefix can be cached.
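The four-layer ordering above can be sketched as a request builder. Anthropic’s public Messages API does support `cache_control` breakpoints of type `"ephemeral"`, but whether Claude Code structures its requests exactly this way is an assumption; the function and field contents here are illustrative.

```python
def build_request(global_system, tool_defs, project_config, session_ctx, turns):
    """Assemble a request with static layers first, so the longest
    possible prefix is byte-identical across turns and cache-eligible."""
    system = [
        # Layer 1: global system prompt (changes rarely -> global cache)
        {"type": "text", "text": global_system,
         "cache_control": {"type": "ephemeral"}},
        # Layer 2: project-level configuration (project cache)
        {"type": "text", "text": project_config,
         "cache_control": {"type": "ephemeral"}},
        # Layer 3: session context: environment, output format (session cache)
        {"type": "text", "text": session_ctx,
         "cache_control": {"type": "ephemeral"}},
    ]
    return {
        "system": system,
        "tools": tool_defs,          # tool definitions are part of the prefix
        # Layer 4: per-turn messages grow at the end, never before the prefix
        "messages": list(turns),
    }
```

The design point is that only the tail (`messages`) changes turn to turn; everything earlier must remain identical or the prefix match, and the cache discount, is lost.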
Anthropic also shared five counter‑intuitive rules derived from their experience:
Update information with messages, not by changing the system prompt; use the <system‑reminder> tag.
Avoid switching models mid‑session because caches are model‑specific; if needed, hand off via a sub‑agent.
Never add or remove tools during a session, as tool definitions are part of the cache prefix; instead keep all tools and toggle state with EnterPlanMode and ExitPlanMode.
Defer loading of heavy tools using the defer_loading flag and load full schemas only when required.
When performing context compaction, reuse the exact same system prompt, tool definitions, and conversation history as the parent session, appending only a compaction instruction so the cache remains hit.
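Rules 1 and 5 share one principle: never mutate the cached prefix, only append after it. A minimal sketch, with hypothetical helper names and message shapes:

```python
def with_update(messages, update_text):
    """Rule 1: deliver new information as an appended message wrapped in a
    <system-reminder> tag, instead of editing the system prompt (which
    would invalidate the cached prefix)."""
    return messages + [
        {"role": "user",
         "content": f"<system-reminder>{update_text}</system-reminder>"}
    ]

def compaction_request(parent_request, instruction="Summarize the conversation so far."):
    """Rule 5: reuse the parent session's exact system prompt, tool
    definitions, and history, appending only the compaction instruction
    so the prefix still hits the cache."""
    req = dict(parent_request)  # shallow copy; system/tools stay identical
    req["messages"] = parent_request["messages"] + [
        {"role": "user", "content": instruction}
    ]
    return req
```

Note that `compaction_request` copies rather than mutates the parent request: the parent’s message list is left intact, and only the new request carries the extra instruction at the tail.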
The article also references a prior incident on February 14 where a random billing header broke the cache, underscoring that prompt caching is the most fragile yet vital component of AI agents.
For Claude Code users, the immediate actions are to upgrade to version 2.1.62 or later (e.g., claude update or npm install -g @anthropic-ai/claude-code@latest), verify the reset rate limit, and consult the Anthropic blog for deeper architectural guidance. Developers building agent products should adopt the five rules to keep cache hit rates high, as cache efficiency is the lifeline of production‑grade AI agents.