How to Slash RAG First‑Token Latency: Practical Engineering Strategies
This guide breaks down the three layers of a RAG pipeline—embedding, vector retrieval, and system architecture—and provides concrete engineering tactics such as batch embedding, async concurrency, caching, ANN indexing, partitioning, connection pooling, and async pipelines to dramatically reduce Time‑to‑First‑Token latency.
1. Where does first‑token latency come from?
The RAG workflow consists of four steps: embedding, vector retrieval, prompt assembly, and LLM generation. The main TTFT bottlenecks are the embedding API wait time, the vector search time, and the lack of concurrency or caching in the system. In other words, the delay occurs before the LLM is invoked.
2. Embedding stage: how to minimise OpenAI latency
Batch Embedding
OpenAI’s embedding API accepts multiple texts in a single request, so instead of one call per chunk you can send an array of chunks at once, e.g. ["text1", "text2", "text3", ...]. This:
- Reduces network round‑trip latency
- Increases throughput
- Lowers the risk of API rate‑limit errors
Remember the per‑request token limit (≈8k tokens) and split batches accordingly. In practice, batching can cut the amortised embedding time per chunk from hundreds of milliseconds to a few dozen.
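A minimal sketch of batch embedding, assuming the openai v1 Python SDK and an OPENAI_API_KEY in the environment; the model name and batch size of 100 are illustrative choices, and a production version would also count tokens before splitting:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batched(chunks: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed all chunks, sending batch_size texts per API request."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        # One request embeds the whole batch instead of batch_size round-trips.
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        # The API returns one embedding per input, in input order.
        vectors.extend(item.embedding for item in resp.data)
    return vectors
```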
Async concurrency (asyncio)
Instead of a single‑threaded loop that sends a request, waits, then sends the next, use asynchronous calls so the CPU can work on other requests while waiting on the API (see the sketch after this list).
- Overall throughput can increase 5‑10×
- A stable concurrency level is 5‑10 parallel calls; pushing higher risks 429 rate‑limit errors
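A minimal sketch using the async OpenAI client, assuming the same model as above; the semaphore caps in‑flight requests at 8, inside the 5‑10 range that stays clear of 429s:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_one(text: str, sem: asyncio.Semaphore) -> list[float]:
    async with sem:  # at most 8 requests are in flight at once
        resp = await client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

async def embed_all(texts: list[str]) -> list[list[float]]:
    sem = asyncio.Semaphore(8)
    # While one call waits on the network, the event loop runs the others.
    return await asyncio.gather(*(embed_one(t, sem) for t in texts))

# vectors = asyncio.run(embed_all(["query 1", "query 2", "query 3"]))
```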
Embedding cache
Cache the query‑to‑vector mapping in Redis or another KV store. Repeated queries hit the cache 30‑50% of the time, and pre‑computing corpus embeddings eliminates on‑the‑fly calculations.
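A minimal sketch of a Redis‑backed embedding cache using redis‑py; the key scheme, TTL, and cached_embedding helper name are illustrative, not from the original article:

```python
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379, db=0)
client = OpenAI()

def cached_embedding(query: str, ttl_s: int = 86_400) -> list[float]:
    """Return the cached vector for query, computing and storing it on a miss."""
    key = "emb:" + hashlib.sha256(query.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no embedding API call at all
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    r.set(key, json.dumps(vec), ex=ttl_s)  # expire entries after one day
    return vec
```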
3. Vector retrieval stage: make Milvus / Faiss return in milliseconds
Build ANN index (HNSW / IVF)
HNSW offers the best speed‑accuracy trade‑off for large‑scale vectors.
- M: number of links per node, controlling graph connectivity (higher M improves recall at the cost of memory)
- efConstruction: candidate‑list size while building, controlling index quality
- efSearch: candidate‑list size while querying, trading accuracy against speed
Typical settings: M=16, efConstruction=128, efSearch=64.
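A minimal pymilvus sketch applying these settings, assuming a running Milvus instance and an existing collection named "docs" with a 768‑dimensional float‑vector field "embedding" (names and metric are illustrative); note that Milvus exposes the search‑time knob as ef rather than efSearch:

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
col = Collection("docs")

col.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",  # inner product; use "L2" for Euclidean
        "params": {"M": 16, "efConstruction": 128},  # build-time knobs
    },
)
col.load()  # the index is only searchable once the collection is loaded

query_vector = [0.0] * 768  # placeholder; must match the field's dimension
results = col.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},  # search-time knob
    limit=5,
)
```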
Partition / sharding
Split the vector collection by topic, time, or source so queries only scan relevant partitions, reducing the search space by 50‑90%.
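Continuing the pymilvus sketch above, partitions restrict each search to one slice of the collection; the topic names are illustrative:

```python
# One-time setup: create topic partitions and route inserts into them.
if not col.has_partition("finance"):
    col.create_partition("finance")
if not col.has_partition("legal"):
    col.create_partition("legal")

# At query time, scan only the relevant partition instead of the whole space.
results = col.search(
    data=[query_vector],  # query_vector as in the previous snippet
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
    partition_names=["finance"],  # all other partitions are skipped
)
```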
Connection pool + batch query
Milvus supports sending multiple query vectors ([v1, v2, v3, …]) in a single request and using a pool of connections for concurrent queries. Batching queries cuts network round‑trips and is usually the quickest optimisation to apply.
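Still in the pymilvus setup above, a single search() call can carry several query vectors (v1‑v3 below stand for embeddings of concurrent user queries), so Milvus answers all of them in one round‑trip:

```python
query_vectors = [v1, v2, v3]  # embeddings gathered from concurrent requests

results = col.search(
    data=query_vectors,  # batched: one request, many query vectors
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
)
# results[i] holds the top-5 hits for query_vectors[i].
```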
GPU acceleration (optional)
GPU‑enabled vector databases can help when you have high‑frequency queries, millions of vectors, and strict latency requirements, but they add cost and operational complexity.
4. System‑level optimisation: pipeline the whole flow
Full‑link async pipeline
Traditional flow: Embedding → Retrieval → Prompt → LLM, executed strictly in sequence. With an async pipeline:
- While one request waits on the embedding API, another can run its retrieval
- Retrieval wait time overlaps with prompt preparation
- Multiple user requests no longer block each other
Result: higher QPS, lower first‑token latency, better CPU/IO utilisation.
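A minimal sketch of the request path as coroutines; embed_query, search_milvus, build_prompt, and call_llm are illustrative helpers (for example, thin wrappers over the clients shown earlier), not names from a specific library:

```python
import asyncio

async def answer(query: str) -> str:
    vec = await embed_query(query)      # awaiting I/O yields the event loop
    docs = await search_milvus(vec)     # another request can embed meanwhile
    prompt = build_prompt(query, docs)  # pure CPU work, no await needed
    return await call_llm(prompt)       # first token streams back from here

async def serve(queries: list[str]) -> list[str]:
    # Concurrent requests interleave their I/O waits instead of queueing.
    return await asyncio.gather(*(answer(q) for q in queries))
```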
Three‑layer cache (Embedding / Retrieval / Answer)
- Embedding cache avoids repeated vector computation
- Retrieval cache skips identical vector searches
- Answer cache returns static FAQ responses without invoking RAG at all
These caches together reduce API calls, Milvus queries, and LLM invocations by 30‑60%.
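A minimal sketch wiring the three layers together; it reuses cached_embedding and the Redis client r from the embedding‑cache snippet plus the illustrative search_milvus / build_prompt / call_llm helpers, and the key scheme and TTLs are arbitrary choices:

```python
import hashlib
import json

def _key(layer: str, text: str) -> str:
    return f"{layer}:" + hashlib.sha256(text.encode()).hexdigest()

async def answer_cached(query: str) -> str:
    # Layer 3: an identical FAQ-style question returns the final answer.
    if (ans := r.get(_key("ans", query))) is not None:
        return ans.decode()
    vec = cached_embedding(query)  # layer 1: embedding cache
    # Layer 2: an identical search reuses the previously retrieved chunks.
    if (hit := r.get(_key("ret", query))) is not None:
        docs = json.loads(hit)
    else:
        docs = await search_milvus(vec)
        r.set(_key("ret", query), json.dumps(docs), ex=3600)
    answer = await call_llm(build_prompt(query, docs))
    r.set(_key("ans", query), answer, ex=3600)
    return answer
```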
Horizontal scaling
- Deploy multiple query nodes
- Configure several replicas for the vector store
- Use load‑balanced LLM instances
This satisfies high QPS demands.
5. Concise interview answer
RAG’s first‑token latency is mainly caused by embedding and vector retrieval. Embedding can be accelerated with batch requests, async concurrency, and KV caching; retrieval can be sped up with HNSW indexes, partition filtering, and batch queries. System‑level async pipelines and three‑layer caching further cut latency by tens to hundreds of milliseconds.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how to help you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, whether you are switching careers, going through autumn campus recruitment, or seeking a stable large‑model role.