Prompt Caching, Tool Design, and Agent Architecture: Insights from Claude Code
The article explains the stages of LLM inference, how the KV cache and vLLM's Paged Attention enable cross-request prompt caching, and shares practical guidelines for prompt ordering, append-only context management, and robust tool design that together shape efficient, reliable AI agent architectures.
1. How LLM Inference Works
LLM inference consists of two stages. Prefill processes the entire input prompt, computing Query, Key, and Value vectors for every token and producing the first output token; this step is compute‑intensive. Decode then generates tokens autoregressively, reading the KV cache from GPU memory each step; the computation per token is tiny but memory bandwidth becomes the bottleneck.
For programming agents the prompt can be very long (system prompt, tool definitions, conversation history, code files) while the output is short, so Prefill dominates the cost. Without a KV cache, generating each new token requires recomputing K/V for all previous tokens, giving O(n²) total complexity. With a cache, only the new token's K/V are computed and appended, so the K/V computation per step drops to O(1) (attention itself still reads the full cache each step).
Prefill: [The capital of France is]
→ compute and store K/V for all tokens
Decode step 1:
input: only [Paris]
cache: K/V for [The capital of France is] + new [Paris]
output: which
Decode step 2:
input: only [which]
cache: K/V for [... is Paris] + new [which]
output: has
Even with caching, agents still pay the full Prefill cost for repeated system prompts and tool definitions across different requests.
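The per-step difference can be sketched in Python; compute_kv, attend, and sample below are hypothetical helpers standing in for the real model, so this only illustrates the bookkeeping, not an actual implementation.

def decode_without_cache(prompt_tokens, steps, compute_kv, attend, sample):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        # K/V for every previous token is recomputed from scratch on each step
        ks, vs = zip(*(compute_kv(t) for t in tokens))
        tokens.append(sample(attend(tokens[-1], ks, vs)))
    return tokens

def decode_with_cache(prompt_tokens, steps, compute_kv, attend, sample):
    tokens = list(prompt_tokens)
    cache = [compute_kv(t) for t in tokens]          # Prefill: computed once
    for _ in range(steps):
        ks, vs = zip(*cache)
        tokens.append(sample(attend(tokens[-1], ks, vs)))
        cache.append(compute_kv(tokens[-1]))         # only the new token's K/V is added
    return tokens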
2. How Paged Attention Enables Cross‑Request Sharing
KV cache memory can explode: a 7B model needs ~0.5 MiB per token, so 1K-token contexts for 1,000 concurrent requests require ~500 GiB. Traditional allocation reserves a contiguous block per request, leading to fragmentation and duplicate storage of identical system prompts.
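The per-token figure can be sanity-checked with a back-of-envelope calculation; the layout below assumes a Llama-style 7B configuration (32 layers, 4096 hidden size, fp16), so the numbers are illustrative rather than exact.

layers, hidden, bytes_per_value = 32, 4096, 2                  # fp16
kv_bytes_per_token = 2 * layers * hidden * bytes_per_value     # one K and one V per layer
print(kv_bytes_per_token / 2**20)                              # ~0.5 MiB per token
total = kv_bytes_per_token * 1_000 * 1_000                     # 1K tokens x 1,000 requests
print(total / 2**30)                                           # ~500 GiB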
vLLM’s Paged Attention adopts OS‑style virtual memory paging: the KV cache is split into fixed‑size blocks (default 16 tokens). A block table maps logical addresses to physical GPU memory, and a content hash allows blocks with identical content to be shared.
hash(block_0) = sha256(NONE_HASH, tokens[0:16], extras)
hash(block_1) = sha256(hash(block_0), tokens[16:32], extras)
hash(block_2) = sha256(hash(block_1), tokens[32:48], extras)
Each block's hash incorporates its parent's hash, so a hit on block_2 guarantees that blocks 0 and 1 are identical, enabling a single lookup to verify an entire prefix.
When a new request arrives, the system walks the block hashes, finds the longest continuous hit prefix, and only Prefills the missing suffix:
[block 0] [block 1] [block 2] [block 3]
HIT HIT MISS MISS
skip skip compute compute
Thus caching is content-based, not session-based. A single token change breaks the whole downstream chain because all subsequent block hashes become invalid.
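A simplified illustration of the lookup in Python (not vLLM's actual code): chained block hashes are computed for the incoming prompt and compared against the set of blocks already resident in GPU memory; the number of leading hits decides how many blocks can skip Prefill.

import hashlib

BLOCK, NONE_HASH = 16, "none"

def block_hashes(tokens):
    hashes, parent = [], NONE_HASH
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):   # only full blocks are hashed
        parent = hashlib.sha256(f"{parent}|{tokens[i:i+BLOCK]}".encode()).hexdigest()
        hashes.append(parent)
    return hashes

def longest_prefix_hit(tokens, cached_block_hashes):
    hits = 0
    for h in block_hashes(tokens):
        if h not in cached_block_hashes:
            break                       # one miss invalidates everything downstream
        hits += 1
    return hits                         # leading blocks whose K/V can be reused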
3. System Design Around Caching
Prompt ordering. Because caching relies on prefix matches, stable prefixes should appear first. Claude Code orders prompts from most stable to most dynamic: static system prompt + tool definitions (shared across users) → project-level CLAUDE.md → session context → dialogue messages.
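A sketch of that ordering with placeholder content; the point is simply that the most stable material comes first so its blocks hash identically across users and sessions.

prompt_sections = [
    "static system prompt",          # identical for every user: always a cache hit
    "tool definitions",              # immutable for the whole session
    "project-level CLAUDE.md",       # stable within a repo
    "session context",               # changes per session
    "dialogue messages",             # changes every turn, append-only
]
prompt = "\n\n".join(prompt_sections)   # stable prefix first, dynamic suffix last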
Avoid modifying existing content. Deleting or editing earlier messages invalidates the cached block chain. Keep the message array append‑only. When serializing JSON, use sort_keys=True so that semantically identical objects produce identical hashes.
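A concrete example of the serialization point: without sort_keys the same object can serialize two ways, producing different bytes and therefore different block hashes.

import json, hashlib

a = {"path": "/repo/main.py", "line": 3}
b = {"line": 3, "path": "/repo/main.py"}       # same content, different insertion order

canonical = lambda obj: json.dumps(obj, sort_keys=True, separators=(",", ":"))
assert canonical(a) == canonical(b)            # identical bytes -> identical hashes
print(hashlib.sha256(canonical(a).encode()).hexdigest())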
Inject dynamic information via messages, not by changing the system prompt. Use a <system‑reminder> tag inside a user or tool message to convey updates while keeping the system prompt unchanged.
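A minimal sketch of that injection; the <system-reminder> wrapper follows the convention described above, while the message structure around it is only illustrative.

def append_reminder(messages, reminder_text):
    # Dynamic facts (current time, changed files, mode switches) ride inside a
    # normal user-role message, so the cached system-prompt blocks stay untouched.
    messages.append({
        "role": "user",
        "content": f"<system-reminder>{reminder_text}</system-reminder>",
    })
    return messages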
Keep the tool set immutable. All tools stay at the front of the request; adding or removing a tool breaks the block chain. Instead, state transitions (e.g., entering or exiting planning mode) are expressed as dedicated tools (EnterPlanMode, ExitPlanMode) that the model can invoke.
Deferred tool loading. Light‑weight stubs containing only tool names are sent initially; the model loads full schemas on demand via a ToolSearch tool, preserving a stable block hash.
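A sketch of the deferred-loading idea under assumed names (the stub format and ToolSearch behavior here are illustrative, not Claude Code's actual wire format): the first request carries only names and one-line descriptions, and full schemas arrive later as tool results.

TOOL_STUBS = [
    {"name": "Bash", "description": "Run a shell command"},
    {"name": "Edit", "description": "Edit a file in the workspace"},
    {"name": "ToolSearch", "description": "Look up the full schema of another tool"},
]   # sent with every request; this block's hash never changes

def tool_search(query, full_schemas):
    # Invoked as a tool call; the matching schemas come back as a tool-result
    # message appended after the cached prefix, instead of mutating the tool list.
    return [s for name, s in full_schemas.items() if query.lower() in name.lower()]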
Do not switch models mid‑session. KV cache is bound to a specific model; changing models forces a full rebuild of the block chain. Use a sub‑agent to hand off work when a model switch is required.
Context compression must be cache‑safe. When the context window is exhausted, generate a compression request that reuses the exact same system prompt, tool definitions, and message history, appending only the compression instruction. This “cache‑safe fork” allows the new request to share the existing block chain.
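A sketch of the cache-safe fork; the field names are placeholders, but the idea is that the compression request reuses the original request verbatim and only appends one instruction, so every existing block still hits.

def build_compaction_request(original_request, compaction_instruction):
    # Same system prompt, same tools, same message history -> same block hashes.
    forked = {
        "system": original_request["system"],
        "tools": original_request["tools"],
        "messages": list(original_request["messages"]),
    }
    # Only this appended suffix needs Prefill.
    forked["messages"].append({"role": "user", "content": compaction_instruction})
    return forked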
Anthropic monitors cache‑hit rate as a health metric; a drop triggers alerts because caching is essential for system viability, not just an optimization.
4. Tool Design
Tool design largely determines whether the model can use its tools correctly. The guiding principles: give the model tools matched to its current capabilities, and remember that each additional tool adds a decision point and makes the choice harder.
AskUserQuestion tool evolution. An initial attempt added a question parameter to ExitPlanMode, but the model mixed plan output with the question. A later attempt forced the model to emit a specific Markdown structure, which proved unstable. The final design introduced an independent AskUserQuestion tool that opens a modal dialog and blocks the agent loop until the user answers, giving the model a clear, structured output channel.
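A hypothetical schema-style definition showing why a dedicated tool gives a clean, structured output channel; the field names are illustrative, not Claude Code's actual schema.

ASK_USER_QUESTION = {
    "name": "AskUserQuestion",
    "description": "Ask the user a clarifying question and wait for the answer. "
                   "Use this instead of embedding questions in plan output.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["question"],
    },
}
# The agent loop blocks on this call: it renders a dialog, collects the answer,
# and returns it to the model as the tool result before the loop continues.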
Tool design must align with the model’s processing style rather than imposing human‑centric interaction patterns.
TodoWrite → Task tool transition. Early versions used TodoWrite with periodic reminders; as model capability grew (Opus 4.5), frequent reminders became restrictive. The new Task tool supports dependencies and cross‑agent state sharing, allowing the model to add, modify, or delete tasks autonomously.
Progressive disclosure. Instead of stuffing full documentation into the system prompt, Claude Code creates a sub-agent (Claude Code Guide) that performs targeted document search when the main model detects a knowledge gap. This evolves from simple RAG to a recursive skill-file system, shifting the model from passive context consumption to active information retrieval.
Tool definitions should be documented like API specs: include usage examples, boundary cases, and input format requirements. Defensive parameter design (e.g., enforcing absolute paths) eliminates whole classes of errors, and formats should match patterns familiar to the model’s pre‑training data.
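A small example of defensive parameter design in that spirit; the validation function and error wording are illustrative.

import os

def validate_edit_input(params):
    path = params.get("file_path", "")
    if not os.path.isabs(path):
        # Returned to the model as a tool error it can correct, instead of
        # silently resolving against an unpredictable working directory.
        raise ValueError(f"file_path must be an absolute path, got: {path!r}")
    return params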
5. Agent Architecture Choices
Anthropic recommends starting with the simplest solution and adding complexity only when necessary. Common patterns, ordered by increasing complexity:
Prompt chaining: break a task into sequential steps, feeding each step's output as the next step's input.
Routing: classify input and dispatch to different prompts or tools (e.g., simple queries to Haiku, complex ones to Sonnet).
Parallelism: either split a task into independent sub-tasks that run concurrently, or run the same task multiple times and aggregate results (voting).
Orchestrator-executor: a central LLM dynamically decomposes a request into sub-tasks and assigns them to child LLMs, then aggregates the results.
Evaluator-optimizer: one LLM generates output, another evaluates it against explicit criteria, and the loop iterates to improve quality.
Fully autonomous agents: the model drives the entire loop, deciding next actions based on environment feedback; this incurs higher cost and error accumulation, requiring sandbox testing and stop-condition safeguards.
Frameworks accelerate prototyping but add abstraction layers that can obscure debugging. Direct API calls often remain more controllable and require fewer lines of code.
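To make the routing pattern concrete with direct API calls, here is a minimal sketch using the Anthropic Python SDK; the model IDs and the one-line classifier prompt are placeholders, not a recommended configuration.

import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment

def route(user_query):
    # A cheap classification call decides which model handles the real work.
    verdict = client.messages.create(
        model="claude-haiku-placeholder",
        max_tokens=5,
        messages=[{"role": "user",
                   "content": f"Answer only 'simple' or 'complex': {user_query}"}],
    ).content[0].text.strip().lower()

    target = "claude-haiku-placeholder" if verdict == "simple" else "claude-sonnet-placeholder"
    reply = client.messages.create(
        model=target,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_query}],
    )
    return reply.content[0].text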
6. Common Pitfalls
Embedding timestamps or user‑specific data in the system prompt destroys cross‑user cache sharing; pass such data via messages instead.
Truncating or replacing conversation history mid‑session invalidates block hashes; keep the history append‑only.
Adding or removing tools after the request has started breaks the hash chain; model state transitions should be modeled as tool calls.
Using a separate API call for context compression creates a completely new block chain, forcing full recomputation; instead use a cache‑safe fork that reuses the original system prompt and tool definitions.
JSON serialization without deterministic key ordering yields different hashes for semantically identical objects; always serialize with json.dumps(x, sort_keys=True).
These observations are compiled from the following sources: Thariq’s “Prompt Caching Is Everything”, “Seeing Like an Agent — Tool Design in Claude Code”, “Todos → Tasks”; Lance Martin’s “Prompt Auto‑Caching with Claude”; Anthropic’s “Building Effective Agents” and official Prompt Caching documentation; Sankalp’s “How Prompt Caching Works — Paged Attention and Automatic Prefix Caching”; Kipply’s “Transformer Inference Arithmetic”; and Manus’s “Context Engineering for AI Agents”.