How to Build a Multi‑Layer Cache for Dynamic RAG Systems
This article explains why dynamic Retrieval‑Augmented Generation (RAG) requires a layered caching strategy rather than simple result caching, details a four‑level cache architecture—including embedding, search, answer, and pipeline caches—provides practical key‑generation and TTL guidelines, and outlines dirty‑data defenses to keep caches consistent and performant.
1. Caching stages, not final results
In dynamic RAG the most costly parts are I/O and retrieval, not the language model itself. Caching only the final answer risks serving stale data when upstream documents change, leading to incorrect responses. The cache must therefore be designed as a layered system: cache inputs → cache intermediate results → cache retrieval results → optionally cache final answers, much like a CDN for RAG.
2. Four‑level cache architecture
Embedding cache (most important) Embedding generation is expensive: API cost, network latency, and a high rate of duplicate requests. The recommended practice is to persist all embedding vectors in Redis with a TTL of at least 30 days. Cache keys are generated by hashing a normalized version of the source text: hash(normalized_text). Normalization steps include collapsing whitespace, lowercasing, standardizing punctuation, and denoising, which greatly improves the hit rate.
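A minimal sketch of this key scheme, assuming a `normalize` function with the steps above and a plain dict standing in for Redis (in production you would use a Redis client with a 30-day TTL, e.g. via SETEX; `embed_fn` is a hypothetical embedding API call):

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Normalize text before hashing: lowercase, collapse whitespace,
    unify full-width punctuation so near-duplicate queries share a key."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)                   # collapse runs of whitespace
    text = text.replace("？", "?").replace("！", "!")   # standardize punctuation
    return text

def embedding_cache_key(text: str) -> str:
    """Key = SHA-256 hash of the normalized text."""
    return "emb:" + hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

# Dict stand-in for Redis; a real deployment would SETEX with ttl=30*24*3600.
_store: dict[str, list[float]] = {}

def get_or_embed(text: str, embed_fn) -> list[float]:
    key = embedding_cache_key(text)
    if key not in _store:
        _store[key] = embed_fn(text)   # only pay for the API on a cache miss
    return _store[key]
```

Because the key is derived from normalized text, "What is  RAG?" and "what is rag?" resolve to the same entry, which is where most of the hit-rate improvement comes from.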
Search result cache Vector‑search (e.g., Milvus/HNSW) also incurs cost, especially under high concurrency and mixed hot/cold documents. Cache the top‑k document IDs for a given question:
cache_key = hash(question_text) + k
cache_value = list_of_top_k_doc_ids
Set a reasonable TTL (typically one hour to one day) and validate cached entries against the vector store's latest update timestamp to avoid stale results.
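The key scheme and the timestamp check above can be sketched as follows, with a dict standing in for Redis and `search_fn` a hypothetical call into the vector index (e.g. Milvus):

```python
import hashlib
import time

# In-memory stand-in for Redis; keys combine the question hash and k.
_search_cache: dict[str, tuple[float, list[str]]] = {}
SEARCH_TTL = 3600  # one hour

def search_cache_key(question: str, k: int) -> str:
    h = hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()
    return f"search:{h}:k={k}"

def cached_top_k(question: str, k: int, search_fn, index_updated_at: float):
    """Return cached doc IDs unless the entry expired or the vector
    store was updated after the entry was written."""
    key = search_cache_key(question, k)
    entry = _search_cache.get(key)
    now = time.time()
    if entry is not None:
        written_at, doc_ids = entry
        if now - written_at < SEARCH_TTL and written_at >= index_updated_at:
            return doc_ids                      # fresh hit
    doc_ids = search_fn(question, k)            # miss or stale: re-query the index
    _search_cache[key] = (now, doc_ids)
    return doc_ids
```

Note that `k` is part of the key: a top-5 result is not a valid answer for a top-10 request, so the two must never collide.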
Answer cache For high‑frequency FAQ or strongly structured queries, cache the final answer. The benefits: first‑token latency becomes extremely low, pressure on the vector store drops, and LLM inference cost falls. Typical use cases are static policies, procedures, and other deterministic answers. Only cache questions with high hit rates.
Pipeline (link) cache Reuse intermediate pipeline nodes (embedding → retrieval → re‑rank → prompt → generation) across concurrent requests. Implement with FastAPI plus a thread pool or asyncio so that identical sub‑tasks share a single computation, which can substantially increase system throughput when duplicate requests overlap.
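The sharing idea is essentially "single-flight": the first request for a given sub-task starts the work, and concurrent duplicates await the same in-flight task instead of recomputing. A minimal asyncio sketch (the `shared` helper and its key scheme are assumptions, not a specific library API):

```python
import asyncio

# Single-flight: concurrent requests for the same sub-task (e.g. embedding
# the same question) share one in-flight coroutine instead of each
# recomputing it.
_inflight: dict[str, asyncio.Task] = {}

async def shared(key: str, coro_fn):
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(coro_fn())  # first caller starts the work
        _inflight[key] = task
    try:
        return await task                        # later callers await the same task
    finally:
        _inflight.pop(key, None)                 # clear once the result is delivered
```

In a FastAPI handler, each pipeline node (embed, retrieve, re-rank) would be wrapped in `shared(...)` with a node-specific key, so a burst of identical questions triggers each expensive step only once.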
Dirty‑data defense (cache consistency checks)
Each cache layer must include validation to prevent stale data from contaminating retrieval:
Embedding validation – check the text version number before using a cached embedding.
Search cache validation – compare the cached result’s timestamp with the vector store’s latest update.
Answer cache validation – verify that the source document(s) referenced by the answer have not changed.
Pipeline cache validation – ensure intermediate nodes are still valid before reusing them.
If upstream data changes, invalidate the corresponding caches (embedding, search, answer, prompt) so that the system always returns up‑to‑date results and avoids hallucinations.
3. Interview tip for dynamic RAG caching
Dynamic RAG uses a layered cache rather than a simple result cache; consistency is maintained through text normalization, TTL limits, vector‑store timestamps, and document version checks.
When upstream data changes, the system automatically clears the affected cache layers, guaranteeing accurate recall.
These engineering practices are what interviewers look for when assessing real‑world RAG projects.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.