How to Slash RAG First‑Token Latency: Practical Engineering Strategies

This guide breaks down the three layers of a RAG pipeline—embedding, vector retrieval, and system architecture—and provides concrete engineering tactics such as batch embedding, async concurrency, caching, ANN indexing, partitioning, connection pooling, and async pipelines to dramatically reduce Time‑to‑First‑Token latency.

1. Where does first‑token latency come from?

The RAG workflow consists of four steps: embedding, vector retrieval, prompt assembly, and LLM generation. The main TTFT bottlenecks are the embedding API wait time, the vector search time, and the lack of concurrency or caching in the system. In other words, the delay occurs before the LLM is invoked.

2. Embedding stage: how to minimise OpenAI latency

Batch Embedding

OpenAI’s embedding API accepts multiple texts in a single request, so you can send an array of chunks at once, e.g. ["text1", "text2", "text3", ...]. This:

Reduces network round‑trip latency

Increases throughput

Lowers the risk of API rate‑limit errors

Remember the token limit (≈8k) and split batches accordingly. In practice, batch processing can drop embedding time from hundreds of milliseconds to a few dozen milliseconds.
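
A minimal sketch of batched embedding with the OpenAI Python SDK; the v1 client, model name, and batch size here are assumptions, not prescribed by the article:

```python
# Batch embedding: one request per batch of chunks instead of one per chunk.
# Assumes the OpenAI v1 Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks, model="text-embedding-3-small", batch_size=100):
    """Embed chunks in batches; keep each batch under the token limit."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors

# vectors = embed_chunks(["text1", "text2", "text3"])
```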

Async concurrency (asyncio)

Instead of a single‑threaded loop that sends a request, waits, then sends the next, use asynchronous calls so the CPU can work on other requests while waiting for the API.

Overall throughput can increase 5‑10×

A stable concurrency level is 5‑10 parallel calls; pushing higher risks 429 rate‑limit errors
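
A sketch of bounded async concurrency with asyncio; the AsyncOpenAI client and the cap of 8 parallel calls are assumptions within the 5‑10 range above:

```python
# Async embedding calls with a semaphore capping concurrency to avoid 429s.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
sem = asyncio.Semaphore(8)  # assumed cap, within the stable 5-10 range

async def embed_one(text, model="text-embedding-3-small"):
    async with sem:
        resp = await client.embeddings.create(model=model, input=text)
        return resp.data[0].embedding

async def embed_all(texts):
    # All requests are in flight together; the event loop overlaps their waits.
    return await asyncio.gather(*(embed_one(t) for t in texts))

# vectors = asyncio.run(embed_all(["q1", "q2", "q3"]))
```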

Embedding cache

Cache the query‑to‑vector mapping in Redis or another KV store. Repeated queries hit the cache 30‑50% of the time, and pre‑computing corpus embeddings eliminates on‑the‑fly calculations.
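
One way to sketch the cache with redis‑py; the key scheme, TTL, and the embed_fn callable are assumptions:

```python
# Redis-backed embedding cache: reuse vectors for repeated queries.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_embedding(query, embed_fn, ttl=3600):
    """Return a cached vector for the query, computing and storing it on a miss."""
    key = "emb:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed_fn(query)              # any single-text embedding call
    r.setex(key, ttl, json.dumps(vector))
    return vector
```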

3. Vector retrieval stage: make Milvus / Faiss return in milliseconds

Build ANN index (HNSW / IVF)

HNSW offers the best speed‑accuracy trade‑off for large‑scale vectors.

M: controls graph connectivity (maximum links per node)

efConstruction: controls index build quality

efSearch: trades search accuracy against query speed

Typical settings: M=16, efConstruction=128, efSearch=64.
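
A sketch of these settings with pymilvus; the collection and field names, metric type, and connection details are assumptions:

```python
# Build an HNSW index in Milvus with the parameters discussed above.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("rag_chunks")

collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",                          # inner product similarity
        "params": {"M": 16, "efConstruction": 128},   # connectivity / build quality
    },
)
collection.load()

# At query time, ef plays the role of efSearch: higher = more accurate, slower.
search_params = {"metric_type": "IP", "params": {"ef": 64}}
```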

Partition / sharding

Split the vector collection by topic, time, or source so queries only scan relevant partitions, reducing the search space by 50‑90%.
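
A sketch of partition‑scoped search in Milvus; the partition names and the placeholder query vector are illustrative:

```python
# Create topic/time-based partitions and search only the relevant one.
from pymilvus import Collection

collection = Collection("rag_chunks")
for name in ("docs_2024", "docs_2023"):
    if not collection.has_partition(name):
        collection.create_partition(name)

query_vector = [0.0] * 1536         # placeholder; use the real query embedding

results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
    partition_names=["docs_2024"],  # only this partition is scanned
)
```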

Connection pool + batch query

Milvus supports sending multiple query vectors in one request, e.g. [v1, v2, v3, …], and querying concurrently over a pool of connections. Batching queries cuts network round‑trips and is usually the quickest win.
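
A sketch of a batched search that sends several query vectors in one call; the vectors shown are placeholders:

```python
# One Milvus round-trip returns top-k neighbours for every query vector.
from pymilvus import Collection

collection = Collection("rag_chunks")

# Placeholders standing in for [v1, v2, v3, ...]; use real query embeddings.
query_vectors = [[0.0] * 1536 for _ in range(3)]

results = collection.search(
    data=query_vectors,
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
)
for hits in results:          # one hit list per query vector
    for hit in hits:
        print(hit.id, hit.distance)
```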

GPU acceleration (optional)

GPU‑enabled vector databases can help when you have high‑frequency queries, millions of vectors, and strict latency requirements, but they add cost and operational complexity.

4. System‑level optimisation: pipeline the whole flow

Full‑link async pipeline

The traditional flow runs Embedding → Retrieval → Prompt → LLM strictly in sequence. With an async pipeline:

Embedding wait time can be overlapped with retrieval

Retrieval wait time can be overlapped with prompt preparation

Multiple user requests no longer block each other

Result: higher QPS, lower first‑token latency, better CPU/IO utilisation.
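
A sketch of the request path as a coroutine so the waits of concurrent requests overlap; embed_async, retrieve_async, build_prompt, and generate_async are hypothetical helpers, not the article's code:

```python
# Full-link async pipeline: each request awaits its own I/O without blocking others.
import asyncio

async def answer(query):
    vector = await embed_async(query)        # hypothetical async embedding call
    chunks = await retrieve_async(vector)    # hypothetical async vector search
    prompt = build_prompt(query, chunks)     # CPU-only prompt assembly
    return await generate_async(prompt)      # hypothetical async LLM call

async def serve(queries):
    # Many requests share the event loop; while one awaits the embedding API,
    # another can run its retrieval or prompt assembly.
    return await asyncio.gather(*(answer(q) for q in queries))
```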

Three‑layer cache (Embedding / Retrieval / Answer)

Embedding cache avoids repeated vector computation

Retrieval cache skips identical vector searches

Answer cache returns static FAQ responses without invoking RAG

These caches together reduce API calls, Milvus queries, and LLM invocations by 30‑60%.
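
A sketch of the three lookups chained together; the Redis key scheme, TTLs, and the embed_fn, vector_search, and call_llm helpers are assumptions (cached_embedding is the earlier embedding‑cache sketch):

```python
# Three-layer cache: answer cache -> embedding cache -> retrieval cache.
import hashlib
import json
import redis

r = redis.Redis()

def _key(prefix, text):
    return f"{prefix}:{hashlib.sha256(text.encode('utf-8')).hexdigest()}"

def answer_query(query):
    # Layer 3: full answer cache; FAQ-style hits skip RAG entirely.
    cached = r.get(_key("ans", query))
    if cached is not None:
        return cached.decode()

    # Layer 1: embedding cache (see the cached_embedding sketch above).
    vector = cached_embedding(query, embed_fn)

    # Layer 2: retrieval cache keyed on the query text.
    hit = r.get(_key("ret", query))
    if hit is not None:
        chunks = json.loads(hit)
    else:
        chunks = vector_search(vector)        # hypothetical Milvus search helper
        r.setex(_key("ret", query), 600, json.dumps(chunks))

    answer = call_llm(query, chunks)          # hypothetical LLM call
    r.setex(_key("ans", query), 600, answer)
    return answer
```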

Horizontal scaling

Deploy multiple query nodes

Configure several replicas for the vector store

Use load‑balanced LLM instances

Together, these measures satisfy high‑QPS demands.

5. Concise interview answer

RAG’s first‑token latency is mainly caused by embedding and vector retrieval. Embedding can be accelerated with batch requests, async concurrency, and KV caching; retrieval can be sped up with HNSW indexes, partition filtering, and batch queries. System‑level async pipelines and three‑layer caching further cut latency by tens to hundreds of milliseconds.
Tags: RAG, Embedding, Async Pipeline, TTFT
Written by

Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, tailored for career‑switchers, autumn campus recruits, and anyone seeking a stable large‑model role.
