Artificial Intelligence 9 min read

Claude Adds Prompt Cache Diagnostics to Pinpoint Token Cost Spikes

Claude's new Prompt Cache Diagnostics feature lets developers see exactly why a cache miss occurs and how many tokens were wasted, providing beta‑header usage, Python examples, supported miss reasons, limitations, and privacy guarantees to help optimize token costs.

AI Engineering

May 19, 2026

Claude Adds Prompt Cache Diagnostics to Pinpoint Token Cost Spikes

Claude has launched a Prompt Cache Diagnostics feature in the Claude console, addressing the long‑standing difficulty of identifying why token usage suddenly spikes when a cache miss occurs.

Background

Prompt caching reduces API costs and latency by reusing the token‑heavy prefix of a prompt when two requests have identical byte‑level prefixes. Any minor change—such as an added timestamp, reordered tool list, or extra space—breaks the strict matching rule and forces a full re‑tokenization.

How the Feature Works

When a request does not hit the cache, the response now includes a diagnostics field that pinpoints the exact location of the change and estimates the token loss. To enable the feature, developers must add the beta header cache-diagnosis-2026-04-07 and pass the previous_message_id from the prior response.

The first request uses previous_message_id: null. Subsequent requests supply the ID returned in the previous response, allowing the API to compare the two request structures and return the first divergence point.

Python Usage Example

client = anthropic.Anthropic()

SYSTEM = "You are an AI assistant analyzing a large document. <document>...</document>"

# Turn 1: opt in with previous_message_id=None
r1 = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM,
    messages=[{"role": "user", "content": "Summarize section 1."}],
    diagnostics={"previous_message_id": None},
    betas=["cache-diagnosis-2026-04-07"],
)

# Turn 2: reference the previous response id
r2 = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM,
    messages=[
        {"role": "user", "content": "Summarize section 1."},
        {"role": "assistant", "content": r1.content},
        {"role": "user", "content": "Now summarize section 2."},
    ],
    diagnostics={"previous_message_id": r1.id},
    betas=["cache-diagnosis-2026-04-07"],
)

diagnostics = r2.diagnostics
if diagnostics is None:
    print("No divergence detected.")
elif diagnostics.cache_miss_reason is None:
    print("Comparison still pending.")
else:
    print(f"cache_miss_reason: {diagnostics.cache_miss_reason.type}")

For streaming responses, the diagnostics information appears in the message_start event.

Supported Miss Reasons

Model change – different model used between requests (e.g., routing, A/B test, or downgrade).

System prompt change – dynamic content such as timestamps or request IDs inserted.

Tool list change – addition, removal, reordering, or JSON‑format differences in tool parameters.

Message history change – alterations in content or order of the message list, including truncation or re‑serialization.

Typical Diagnostic Response

{
  "id": "msg_01Xyz...",
  "type": "message",
  "role": "assistant",
  "content": [{ "type": "text", "text": "..." }],
  "usage": {
    "input_tokens": 42,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 41850,
    "output_tokens": 210
  },
  "diagnostics": {
    "cache_miss_reason": {
      "type": "system_changed",
      "cache_missed_input_tokens": 41850
    }
  }
}

Developers can combine the diagnostics output with actual cache‑hit metrics to interpret the result:

Result null, high cache‑read tokens: All normal; prompt prefix is stable and cache hits.

Result null, low/zero cache‑read tokens: Structure unchanged but cache expired; shorten request interval.

Result *_changed, low/zero cache‑read tokens: Structural change caused miss; fix the indicated part.

Result *_changed, high cache‑read tokens: Change occurs later in the prompt; earlier sections still hit cache, so fixing is lower priority.

Limitations

Only supports Claude's native API; not available for Claude hosted on Amazon Bedrock or Google Vertex AI.

Request fingerprints are retained only within the same organization and workspace, with a limited retention period.

Deep differences in very long conversations may be unrecognizable, returning an unavailable status.

The diagnostic call is best‑effort and never blocks normal requests; if information cannot be gathered, an appropriate status code is returned.

The feature complies with a zero‑data‑retention policy: raw prompts and responses are not stored, only hashed fingerprints and token estimates, which are deleted shortly after.

Developer Feedback

Early adopters praise the update as the most useful Claude improvement, noting that previously they were "flying blind" with cached prompts and could not tell when hit rates dropped. The ability to see exact change points greatly aids cost optimization.

Some developers suggest extending the feature to automatically fallback to the unchanged cache portion, further reducing token waste.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI Development Claude Anthropic Prompt Caching Token Optimization API Diagnostics

Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.