Why Your RAG System Slows Down Over Time and How to Fix It

The article explains why a production Retrieval‑Augmented Generation (RAG) system becomes slower as it runs—due to growing embedding costs, expanding vector databases, heavier re‑ranking, and larger prompts—and provides concrete engineering optimizations such as batching, async concurrency, caching, partitioned retrieval, HNSW tuning, replica scaling, answer caching, and prompt sparsification to keep performance stable.


Why RAG latency grows over time

A production RAG pipeline often starts with sub‑second response times, but after weeks or months the latency can increase dramatically. The slowdown originates from four tightly coupled stages:

Embedding computation becomes heavier as more documents are added.

The vector store grows, making similarity search slower.

Re‑ranking models are invoked more often and become a bottleneck.

Prompt construction accumulates tokens, stressing the LLM generation step.

1) Embedding cost escalation

Embedding is one of the most expensive operations in a RAG pipeline. Over time, new web pages, increasing query volume, and repeated processing of near-duplicate texts can push embedding API latency from ~200 ms to >1500 ms. Caching embeddings (e.g., in Redis) can eliminate 50-90% of redundant calls.
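A minimal sketch of such a cache, assuming a local Redis instance, the openai Python client, and text-embedding-3-small as the model (all illustrative choices, not prescribed by the article):

```python
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def cached_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # Normalize first so trivially different copies of the same text share a cache key.
    normalized = " ".join(text.lower().split())
    key = "emb:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no embedding API call at all

    embedding = client.embeddings.create(model=model, input=normalized).data[0].embedding
    r.set(key, json.dumps(embedding), ex=7 * 24 * 3600)  # expire after a week
    return embedding
```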

2) Vector‑store scaling

Search latency in Milvus/HNSW/IVF correlates strongly with vector count:

~100 k vectors → a few ms

~1 M vectors → tens of ms

~10 M vectors → hundreds of ms or seconds

Uncontrolled growth, especially in dynamic RAG pipelines that crawl new pages daily, leads to steadily worsening search latency.

Key mitigation strategies:

Partitioned retrieval: filter by source or time to limit the search space (see the sketch after this list).

Expired vector cleanup: regularly delete stale or irrelevant vectors.

HNSW parameter tuning: adjust efSearch and efConstruction to balance recall and speed.
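A minimal sketch of partitioned, filtered search with a tunable ef, using the pymilvus client; the collection name, field names, partition names, and filter expression are all illustrative assumptions:

```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")  # assumed local Milvus
collection = Collection("rag_chunks")  # hypothetical collection name


def search_recent(query_vector: list[float], top_k: int = 20, ef: int = 64):
    # Search only recent partitions and one source instead of the whole collection;
    # ef (efSearch) trades recall for speed and can be tuned per query type.
    return collection.search(
        data=[query_vector],
        anns_field="embedding",                               # hypothetical vector field
        param={"metric_type": "COSINE", "params": {"ef": ef}},
        limit=top_k,
        expr='source == "news"',                              # hypothetical metadata filter
        partition_names=["2024_05", "2024_06"],               # hypothetical time partitions
    )
```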

3) Re‑rank overhead

Adding a cross‑encoder re‑ranker improves accuracy but also adds latency. As the vector store expands, retrieval surfaces more candidate documents, so more of them are sent to the re‑ranker, and re‑ranking latency grows roughly linearly with the candidate count.

To keep the re‑ranker from becoming the bottleneck:

Reduce the number of documents passed to the re‑ranker (e.g., the top 3-5 instead of the top 20); see the sketch after this list.

Improve retrieval‑stage recall so fewer candidates are needed.

Avoid unnecessary API calls.
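A minimal sketch of re-ranking a small candidate set with a cross-encoder via sentence-transformers; the model checkpoint and candidate counts are illustrative assumptions:

```python
from sentence_transformers import CrossEncoder

# Illustrative public checkpoint; any cross-encoder re-ranker could be substituted.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score only the retrieved candidates, then keep the top few for the prompt.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
```

Keeping the candidate list short (e.g., 20 documents from retrieval) and keep at 3-5 bounds re-ranking latency regardless of how large the vector store grows.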

4) Prompt size and token throughput

Dynamic RAG prompts can become heavy when many retrieved chunks are concatenated, cache misses trigger re‑retrieval, and large documents increase input length. Larger inputs slow LLM generation and reduce overall throughput.

Effective strategy: keep prompts sparse—include only the most relevant evidence.

A systematic, stage-by-stage optimization plan for RAG

Embedding optimization: batch + cache + async concurrency

Batch API calls: group multiple texts into a single request to reduce round‑trip latency.

Asynchronous concurrency (e.g., asyncio.Semaphore): limit parallel calls to 5‑10 to avoid OpenAI latency spikes.

Embedding cache (Redis): normalize the text, hash it, and store the embedding under the hash key. In dynamic RAG this alone can improve speed by more than 50%. A sketch combining all three techniques follows.
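A minimal sketch combining batched requests, a concurrency cap, and the Redis cache from above, assuming the AsyncOpenAI client; the model name, cache TTL, and semaphore size are illustrative choices:

```python
import asyncio
import hashlib
import json

import redis
from openai import AsyncOpenAI

r = redis.Redis(host="localhost", port=6379, db=0)
client = AsyncOpenAI()
semaphore = asyncio.Semaphore(8)  # cap concurrent embedding calls at roughly 5-10


def _key(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return "emb:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


async def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    results: dict[str, list[float] | None] = {}
    misses: list[str] = []
    for text in texts:
        cached = r.get(_key(text))          # serve cache hits without any API call
        results[text] = json.loads(cached) if cached is not None else None
        if cached is None:
            misses.append(text)

    if misses:
        async with semaphore:               # bound parallelism across concurrent callers
            resp = await client.embeddings.create(model=model, input=misses)
        for text, item in zip(misses, resp.data):
            results[text] = item.embedding
            r.set(_key(text), json.dumps(item.embedding), ex=7 * 24 * 3600)

    return [results[t] for t in texts]
```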

Vector database optimization: HNSW + partition + cleanup + replicas

HNSW index: parameters M=16 and efConstruction=128 give more stable performance than IVF.

efSearch tuning: a larger efSearch improves recall but slows queries; find a suitable trade‑off.

Partitioned retrieval: split the store by time or source to avoid scanning the entire collection.

Periodic cleanup: implement an expiration policy to purge stale vectors; otherwise latency grows indefinitely.

Multiple replicas (e.g., replica_number=4) increase throughput under concurrent load; a Milvus sketch covering these settings follows.
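A minimal pymilvus sketch of these settings; the collection name, field names, and retention window are illustrative assumptions, and delete-by-expression requires a reasonably recent Milvus version:

```python
import time

from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")  # assumed local Milvus
collection = Collection("rag_chunks")  # hypothetical collection name

# HNSW index with the parameters discussed above.
collection.create_index(
    field_name="embedding",  # hypothetical vector field
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 128},
    },
)

# Load several in-memory replicas to spread concurrent query load.
collection.load(replica_number=4)

# Periodic cleanup: purge vectors older than 30 days.
# Assumes an ingest_ts scalar field and a Milvus version with delete-by-expression.
cutoff = int(time.time()) - 30 * 24 * 3600
collection.delete(expr=f"ingest_ts < {cutoff}")
```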

Answer cache

FAQ‑type questions can be cached, reducing latency from ~800 ms to ~20 ms.
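A minimal sketch of an exact-match answer cache keyed on the normalized question, again assuming Redis; the TTL is an illustrative choice:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def _answer_key(question: str) -> str:
    normalized = " ".join(question.lower().split())
    return "ans:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def get_cached_answer(question: str) -> str | None:
    hit = r.get(_answer_key(question))
    return hit.decode("utf-8") if hit is not None else None  # fast path: no retrieval, no LLM call


def cache_answer(question: str, answer: str, ttl_seconds: int = 24 * 3600) -> None:
    # A short TTL keeps FAQ answers reasonably fresh while absorbing repeated questions.
    r.set(_answer_key(question), answer, ex=ttl_seconds)
```

This exact-match cache only catches repeats with the same normalized wording; paraphrased questions would need a semantic cache (e.g., matching on query embeddings) on top.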

Prompt optimization: keep only the most useful evidence

Select the top 3-5 results instead of a large candidate set (see the sketch after this list).

Summarize each retrieved chunk before insertion.

Use Chain‑of‑Thought prompting to force the model to analyse evidence before answering.
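A minimal sketch of assembling a sparse prompt from the top few re-ranked chunks; the truncation length and prompt wording are illustrative assumptions:

```python
def build_prompt(question: str, chunks: list[str], keep: int = 3, max_chars: int = 600) -> str:
    # Keep only the top few chunks and truncate each one so the prompt stays small.
    evidence = [chunk[:max_chars] for chunk in chunks[:keep]]
    evidence_block = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(evidence))
    return (
        "Answer the question using only the evidence below. "
        "First analyse which pieces of evidence are relevant, then give the final answer.\n\n"
        f"Evidence:\n{evidence_block}\n\nQuestion: {question}\nAnswer:"
    )
```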

Tags: performance optimization, RAG, AI engineering, embedding cache, retrieval-augmented generation
Written by

Wu Shixiong's Large Model Academy

We continuously share large-model know-how, helping you master core skills (LLM, RAG, fine-tuning, deployment) from zero to job offer, tailored for career-switchers, autumn-recruitment candidates, and anyone seeking a stable large-model position.
