Artificial Intelligence 13 min read

How Prompt Caching Works in LLMs and How to Write More Efficient Prompts

The article explains that LLM prompt caching reuses internal KV states rather than full answers, compares provider implementations, quantifies cost and latency savings, and provides concrete guidelines for structuring prompts to maximize cache hits, along with monitoring signals and a practical evaluation checklist.

AI Step-by-Step

May 26, 2026

How Prompt Caching Works in LLMs and How to Write More Efficient Prompts

1. What the cache actually stores

The cache does not hold the final answer or the original text; it stores the KV (Key‑Value) vectors generated by each Transformer layer while processing the input tokens. When a request shares the same prefix, the service reuses these pre‑computed KV states, saving the computation of the fixed part of the prompt.

2. Provider implementation differences

Different vendors expose distinct control modes, cache lifetimes, and minimum prefix lengths:

OpenAI : automatic matching, 5‑10 minutes without access, minimum prefix 1024 tokens.

Anthropic : explicit cache_control, 5 minutes with automatic renewal, minimum prefix 1024 tokens.

Google Gemini : explicit cache creation, 1‑48 hours, no hard minimum prefix.

DeepSeek : fully automatic, cache duration not disclosed, no published minimum.

AWS Bedrock : hybrid automatic + explicit, up to 1 hour, minimum prefix 1024 tokens.

Key takeaways: explicit control lets you decide which parts are cached (Anthropic, Google), while automatic modes require no prompt changes (OpenAI, DeepSeek). Cache lifetimes range from minutes to two days, influencing which workloads benefit most.

3. Cost and latency benefits

Cache savings are measured in two dimensions: reduced billing cost and lower Time‑to‑First‑Token (TTFT).

Cost comparison (Anthropic Claude 3.5 Sonnet)

Without cache: 4,500 input tokens × $3 / MTok = $0.0135 per request.

With 4,000‑token prefix cached: (4,000 × $0.3 / MTok) + (500 × $3 / MTok) = $0.0027 per request.

This represents an 80 % cost reduction for a single request; batch scenarios amplify the difference.

Benchmarks from Fireworks AI (2024) report 30‑80 % cost cuts for RAG and multi‑agent workflows, while DeepSeek cites 80‑90 % cache‑hit rates in typical multi‑turn conversations.

Latency benefits

Prefill (reading and processing the prompt) usually consumes 30‑70 % of total response time. Skipping this stage via prefix caching can reduce TTFT by up to tenfold, especially when the cached prefix dominates the input (e.g., a 100 k‑token contract where 90 k tokens are static).

4. Detecting cache hits from API responses

OpenAI : usage.prompt_tokens_details.cached_tokens > 0 Anthropic : usage.cache_creation_input_tokens / cache_read_input_tokens Google Gemini : presence of cachedContent field, no extra charge when hit

DeepSeek : usage.prompt_cache_hit_tokens > 0 Monitoring these fields lets you gauge cache effectiveness; a sustained cached‑token share below 30 % signals room for prompt redesign.

5. Prompt design for high cache hit rate

All major implementations favor "prefix caching" – the longest continuous prefix from the start of the prompt must be identical across requests.

✓ Place fixed content first : system prompt, role definition, tool specs, output format, few‑shot examples, long documents, business rules.

✗ Place dynamic content later : user query, timestamps, random IDs, per‑request retrieval results, conversation state.

Core principle: keep the fixed portion at the front and the variable portion at the back to maximise a stable, long prefix.

High‑hit example:

[系统提示词：你是一个合同审查助手...]
[审查标准：1. 付款条款... 2. 违约责任...]
[相关法律条文：第 X 条...]
[本次合同内容：...]

Low‑hit (counter‑example):

[当前时间：2026-05-25 14:32:18]
[请求 ID：req_7a3f9d2e]
[用户问题：审查这份合同]
[系统提示词 + 审查标准 + 法律条文...]

Because the timestamp and request ID differ each call, the prefix mismatches immediately, preventing cache reuse. Even tiny changes—reordering few‑shot examples or inserting a dynamic token inside the fixed block—can break the cache.

6. Common misconceptions

Cache hit ≠ returning the previous answer : only KV states are reused; the model still generates a fresh response.

Cache lives on the client : it resides on the provider’s inference servers; you cannot directly view or clear it.

Any repeated fragment hits : most systems only match a continuous prefix from the start; mid‑sentence repeats are ignored.

Cache is permanent : lifetimes vary from minutes to hours; after expiry the prefix must be recomputed.

7. When caching provides little benefit

Single‑turn interactions where the cache cannot be established before it expires.

Highly dynamic prompts where user‑specific or time‑dependent data changes the prefix each request.

Very short prompts (few hundred tokens); even full cache hits save negligible compute.

Self‑hosted models (e.g., vLLM, SGLang): the API‑level discount disappears, though TTFT improvements still apply.

8. Evaluation checklist

Prompt structure : Are system prompts, tool definitions, and format requirements placed at the very beginning?

Are timestamps, random IDs, and request identifiers moved to the end or removed?

Is the fixed portion templated to avoid minute variations?

Is the order and content of few‑shot examples stable?

Application architecture : In multi‑turn dialogs, is conversation history appended after the fixed prefix?

In RAG scenarios, are knowledge‑base snippets injected before the user query?

During batch processing, are similar requests grouped to maximise prefix reuse?

Is the proportion of cached_tokens in billing monitored?

Summary : Put immutable prompt parts first and volatile parts last; this maximises prefix length, leading to substantial cost and latency reductions. Monitor the cached‑token fields in API responses to validate prompt structure effectiveness.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM prompt engineering latency optimization AI inference cost reduction prompt caching

Written by

AI Step-by-Step

Sharing AI knowledge, practical implementation records, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

1. What the cache actually stores

2. Provider implementation differences

3. Cost and latency benefits

Cost comparison (Anthropic Claude 3.5 Sonnet)

Latency benefits

4. Detecting cache hits from API responses

5. Prompt design for high cache hit rate

6. Common misconceptions

7. When caching provides little benefit

8. Evaluation checklist

AI Step-by-Step

How this landed with the community

Was this worth your time?

0 Comments

Cost comparison (Anthropic Claude 3.5 Sonnet)