Why Your RAG System Slows Down Over Time and How to Fix It
The article explains why a production Retrieval‑Augmented Generation (RAG) system becomes slower as it runs—due to growing embedding costs, expanding vector databases, heavier re‑ranking, and larger prompts—and provides concrete engineering optimizations such as batching, async concurrency, caching, partitioned retrieval, HNSW tuning, replica scaling, answer caching, and prompt sparsification to keep performance stable.
Why RAG latency grows over time
A production RAG pipeline often starts with sub‑second response times, but after weeks or months the latency can increase dramatically. The slowdown originates from four tightly coupled stages:
Embedding computation becomes heavier as more documents are added.
The vector store grows, making similarity search slower.
Re‑ranking models are invoked more often and become a bottleneck.
Prompt construction accumulates tokens, stressing the LLM generation step.
1) Embedding cost escalation
Embedding is often the most expensive upstream operation in a RAG pipeline. Over time, new web pages, growing query volume, and repeated processing of near‑duplicate texts can push embedding API latency from ~200 ms to over 1500 ms. Caching embeddings (e.g., in Redis) can eliminate 50–90% of redundant calls.
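A minimal sketch of such a cache, using a plain dict as a stand‑in for Redis (in production the dict would be a Redis client; the normalize‑then‑hash key scheme is the part that matters, and `embed_fn` is a placeholder for the real embedding API call):

```python
import hashlib

# Stand-in for a Redis instance; swap for redis.Redis in production.
_cache: dict[str, list[float]] = {}

def _key(text: str) -> str:
    # Normalize whitespace and case so near-duplicate texts share one key.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def embed_with_cache(text: str, embed_fn) -> list[float]:
    key = _key(text)
    if key in _cache:
        return _cache[key]       # cache hit: no API call
    vector = embed_fn(text)      # cache miss: one embedding call
    _cache[key] = vector
    return vector
```

With Redis the same idea becomes `GET`/`SET` on the hash key, plus a TTL if embeddings should eventually expire.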
2) Vector‑store scaling
Search latency in Milvus/HNSW/IVF correlates strongly with vector count:
~100 k vectors → a few ms
~1 M vectors → tens of ms
~10 M vectors → hundreds of ms or seconds
Uncontrolled growth, especially in dynamic RAG pipelines that crawl pages daily, causes search latency to climb steadily as the index expands.
Key mitigation strategies:
Partitioned retrieval: filter by source or time to limit the search space.
Expired vector cleanup: regularly delete stale or irrelevant vectors.
HNSW parameter tuning: adjust efSearch and efConstruction for a balance of recall and speed.
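Partitioned retrieval amounts to routing each query to only the partitions that could contain relevant vectors. The month‑based partition naming below is a hypothetical convention; in Milvus the resulting list would be passed as `partition_names` to the search call so the rest of the collection is never scanned:

```python
from datetime import datetime

def recent_partitions(now: datetime, months: int = 3) -> list[str]:
    """Return partition names for the most recent `months` calendar months."""
    names = []
    year, month = now.year, now.month
    for _ in range(months):
        names.append(f"docs_{year}_{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return names
```

The same routing idea works for source‑based partitions (e.g., one partition per crawled domain), as long as the query carries a filterable attribute.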
3) Re‑rank overhead
Adding a cross‑encoder re‑ranker improves accuracy but also adds latency. As the vector store expands, more candidates pass the retrieval stage, so more documents are sent to the re‑ranker, and re‑rank latency grows roughly linearly with candidate count.
To keep the re‑ranker from becoming the bottleneck:
Reduce the number of documents passed to the re‑ranker (e.g., top‑3 to top‑5 instead of top‑20).
Improve retrieval‑stage recall so fewer candidates are needed.
Avoid unnecessary API calls.
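A small helper can enforce both a hard cap and a score floor on what reaches the re‑ranker; the threshold value here is illustrative, not a recommendation:

```python
def select_rerank_candidates(hits, max_candidates=5, min_score=0.3):
    """Trim retrieval hits before re-ranking.

    hits: list of (doc_id, vector_score) sorted descending by score.
    Drops low-scoring hits first, then caps the list length, so the
    cross-encoder sees at most `max_candidates` documents per query.
    """
    kept = [h for h in hits if h[1] >= min_score]
    return kept[:max_candidates]
```

Capping here keeps re‑rank cost constant even as the vector store (and thus raw recall) keeps growing.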
4) Prompt size and token throughput
Dynamic RAG prompts can become heavy when many retrieved chunks are concatenated, cache misses trigger re‑retrieval, and large documents increase input length. Larger inputs slow LLM generation and reduce overall throughput.
Effective strategy: keep prompts sparse—include only the most relevant evidence.
Systematic reverse optimization for RAG
Embedding optimization: batch + cache + async concurrency
Batch API calls: group multiple texts into a single request to reduce round‑trip latency.
Asynchronous concurrency (e.g., asyncio.Semaphore): limit parallel calls to 5–10 to avoid OpenAI latency spikes.
Embedding cache (Redis): normalize text, hash it, and store the embedding under the hash key. In dynamic RAG this alone can boost speed by >50%.
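The batching and semaphore ideas can be combined in one helper. This is a sketch: `embed_batch` stands in for whatever async client call performs one batched embedding request, and the defaults are illustrative:

```python
import asyncio

async def embed_all(texts, embed_batch, batch_size=16, max_concurrency=5):
    """Embed many texts: group into batches, run batches concurrently,
    and cap in-flight requests with a semaphore to avoid provider-side
    latency spikes under load."""
    sem = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def one(batch):
        async with sem:
            return await embed_batch(batch)

    results = await asyncio.gather(*(one(b) for b in batches))
    # Flatten per-batch results back into one vector list, in input order.
    return [vec for chunk in results for vec in chunk]
```

The cache from the embedding section slots in naturally before batching: only cache misses are collected into batches at all.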
Vector database optimization: HNSW + partition + cleanup + replicas
HNSW index: parameters M=16, efConstruction=128 give stable performance compared with IVF.
efSearch tuning: larger efSearch improves recall but slows queries; find a suitable trade‑off.
Partitioned retrieval: split the store by time or source so queries avoid scanning the entire collection.
Periodic cleanup: enforce an expiration policy to purge stale vectors; otherwise latency grows without bound.
Multiple replicas (e.g., replica_number=4): increase throughput under concurrent load.
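The index and search settings above can be collected into the dict shapes that pymilvus expects; treat the exact values as a starting point to benchmark, not a recommendation (the efSearch value of 96 is an assumption for illustration):

```python
# Index built once at collection-creation time.
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 128},
}

# Per-query setting: raise "ef" (efSearch) for recall, lower it for speed.
search_params = {"params": {"ef": 96}}
```

In pymilvus these would be passed to `create_index` and `search` respectively, and replicas are requested when loading the collection, e.g. `collection.load(replica_number=4)`.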
Answer cache
FAQ‑type questions can be cached, reducing latency from ~800 ms to ~20 ms.
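A minimal TTL answer cache might look like the following sketch; the normalization step lets trivially different phrasings of the same FAQ share one entry (semantic matching of paraphrases would need an embedding‑based lookup instead):

```python
import hashlib
import time

class AnswerCache:
    """Cache final answers for FAQ-style questions with a time-to-live."""

    def __init__(self, ttl_seconds=3600):
        self._store = {}
        self._ttl = ttl_seconds

    def _key(self, question):
        norm = " ".join(question.split()).lower()
        return hashlib.sha256(norm.encode("utf-8")).hexdigest()

    def get(self, question):
        entry = self._store.get(self._key(question))
        if entry is None:
            return None
        answer, expires = entry
        if time.time() > expires:
            return None          # stale entry: force a fresh RAG pass
        return answer

    def put(self, question, answer):
        self._store[self._key(question)] = (answer, time.time() + self._ttl)
```

A cache hit skips retrieval, re‑ranking, and generation entirely, which is where the ~800 ms to ~20 ms drop comes from.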
Prompt optimization: keep only the most useful evidence
Select top‑3 to top‑5 results instead of a large candidate set.
Summarize each retrieved chunk before insertion.
Use Chain‑of‑Thought prompting to force the model to analyze evidence before answering.
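Putting the three points together, prompt assembly might look like this sketch, where `chunks` are already summarized and scored, and the instruction wording is illustrative:

```python
def build_prompt(question, chunks, top_k=3):
    """Assemble a sparse prompt from pre-summarized, pre-scored chunks.

    chunks: list of (summary, score) sorted descending by score.
    Only the top_k summaries are inserted, keeping input tokens bounded
    no matter how many documents retrieval returned.
    """
    evidence = "\n".join(f"- {summary}" for summary, _ in chunks[:top_k])
    return (
        "Answer using only the evidence below. "
        "Reason through the evidence step by step before answering.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}"
    )
```

Because the prompt length is bounded by `top_k` summaries rather than raw chunk count, generation latency stays flat even as the corpus grows.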
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, whether you are switching careers, preparing for autumn campus recruitment, or seeking a stable large‑model role.
