Wu Shixiong's Large Model Academy
Nov 22, 2025 · Artificial Intelligence
Why Your RAG System Slows Down Over Time and How to Fix It
The article explains why a production Retrieval-Augmented Generation (RAG) system slows down the longer it runs: embedding costs grow, the vector database expands, re-ranking gets heavier, and prompts get larger. It then describes concrete engineering optimizations to keep performance stable, including batching, async concurrency, caching, partitioned retrieval, HNSW tuning, replica scaling, answer caching, and prompt sparsification.
AI engineering · Performance optimization · RAG
10 min read
