Designing a Complete RAG System from Zero: A Step‑by‑Step Interview Guide
This article outlines a full‑stack RAG architecture—offline parsing, query understanding, online retrieval, and context generation—explains six critical module interactions, and provides a concise interview framework for presenting the design from start to finish.
Four Core Modules of a Retrieval‑Augmented Generation (RAG) System
A production‑grade RAG pipeline can be divided into four sequential components that together transform raw documents into a conversational answer.
Offline Parsing (Knowledge‑Base Construction) – Extract text from source files (e.g., OCR with MinerU), split the text into semantic chunks of 300‑500 tokens with a 50‑token overlap, and enrich each chunk with metadata such as source, section, timestamp, and content type. Generate dense embeddings (e.g., BGE‑M3) and store them in a vector database (e.g., Milvus) while also building a BM25 inverted index for hybrid retrieval.
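As a concrete illustration, here is a minimal chunking sketch. It uses whitespace-split tokens as a rough proxy for real tokenizer counts; the metadata field names mirror the ones listed above, while the chunk-ID scheme and example values are otherwise illustrative. Each chunk's text would then be embedded (e.g., with BGE‑M3) and inserted into Milvus, with the same text also feeding the BM25 inverted index.

```python
def chunk_document(text: str, metadata: dict,
                   chunk_tokens: int = 400, overlap: int = 50) -> list[dict]:
    """Split text into overlapping windows; attach the same metadata to every chunk."""
    tokens = text.split()                          # crude stand-in for a real tokenizer
    chunks, start, idx = [], 0, 0
    while start < len(tokens):
        window = tokens[start:start + chunk_tokens]
        chunks.append({
            "chunk_id": f"{metadata.get('source', 'doc')}#{idx}",
            "text": " ".join(window),
            "metadata": metadata,                  # source, section, timestamp, content type
        })
        idx += 1
        start += chunk_tokens - overlap            # slide forward, keeping a 50-token overlap
    return chunks

# Example (values are illustrative):
# chunks = chunk_document(raw_text, {"source": "report.pdf", "section": "2.1",
#                                    "timestamp": "2024-06-01", "content_type": "text"})
```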
Query Understanding (Pre‑processing) – For every incoming user query, perform intent classification, named‑entity extraction, and query rewriting/expansion. The resulting intent determines the retrieval route (knowledge‑base, computation module, NL2SQL, or refusal) and the extracted entities become filter conditions (time range, document source, etc.).
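A minimal sketch of this step follows. The keyword rules and the year regex stand in for a real intent classifier, entity extractor, and query rewriter (typically an LLM call or a small fine-tuned model); the route names match the four retrieval routes above.

```python
import re
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    route: str                      # "knowledge_base" | "computation" | "nl2sql" | "refuse"
    rewritten_query: str
    filters: dict = field(default_factory=dict)

def understand_query(query: str) -> QueryPlan:
    # Intent: keyword rules stand in for a real classifier / LLM call.
    if any(w in query for w in ("sum", "average", "growth rate")):
        route = "computation"
    elif "table" in query or "SQL" in query:
        route = "nl2sql"
    else:
        route = "knowledge_base"

    # Entities become filter predicates; only a year is extracted in this sketch.
    filters = {}
    year = re.search(r"\b(20\d{2})\b", query)
    if year:
        filters["timestamp"] = f">= {year.group(1)}-01-01"

    # Rewriting: in production this expands abbreviations / adds synonyms; kept as-is here.
    return QueryPlan(route=route, rewritten_query=query.strip(), filters=filters)

# understand_query("What did the 2023 annual report say about supply-chain risk?")
#   -> QueryPlan(route="knowledge_base", ..., filters={"timestamp": ">= 2023-01-01"})
```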
Online Retrieval (Recall & Re‑ranking) – Execute a hybrid search: run a vector similarity search against Milvus, run a BM25 keyword search, and fuse the two result lists with Reciprocal Rank Fusion (RRF). Apply a cross‑encoder re‑ranker to the fused list and keep the top‑5 most relevant chunks, ordered by relevance and labeled with numeric identifiers for prompt construction.
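The fusion step can be illustrated with a short sketch. It assumes the vector search and the BM25 search each return a ranked list of chunk IDs; k = 60 is the constant commonly used with RRF, the chunk IDs are made-up placeholders, and the cross-encoder re-ranking that follows fusion is not shown.

```python
def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1/(k + rank) for each list a chunk appears in."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # best fused score first

fused = rrf_fuse(vector_hits=["doc1#3", "doc2#0", "doc5#7"],
                 bm25_hits=["doc2#0", "doc9#1", "doc1#3"])
# -> ["doc2#0", "doc1#3", "doc9#1", "doc5#7"]: chunks both searches agree on rise to the top
```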
Context Generation (LLM Answering) – Assemble a prompt that includes the user query, the selected chunks, and explicit instructions to cite sources and suppress hallucinations. Feed the prompt to a large language model, handle multi‑turn dialogue state, and return the final answer to the user.
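A minimal prompt-assembly sketch, assuming the top-5 chunks from the previous step arrive as dicts with a text field; the instruction wording is illustrative, and the actual LLM call is not shown.

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Number the retrieved chunks and instruct the model to cite them or refuse."""
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the numbered passages below.\n"
        "Cite the passage numbers you used, e.g. [1][3]. "
        "If the passages do not contain the answer, reply that you cannot find it.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```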
Six Key Inter‑module Linkages
1. Chunk Size ↔ LLM Context Window
The chunk length must fit within the LLM’s token window. Empirically, 300‑500 tokens per chunk with a 50‑token overlap yields a good balance between retrieval recall and generation quality. Oversized chunks reduce coverage; undersized chunks fragment semantics and increase the number of chunks needed for context.
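A back-of-envelope budget makes the trade-off concrete; the 8,192-token window and the overhead figures below are assumptions for illustration only.

```python
MAX_CONTEXT     = 8192   # assumed model context window
CHUNK_TOKENS    = 500    # upper end of the chunk size above
TOP_K           = 5      # chunks kept after re-ranking
PROMPT_OVERHEAD = 300    # rough allowance for instructions + query
ANSWER_HEADROOM = 1024   # tokens reserved for the model's answer

used = TOP_K * CHUNK_TOKENS + PROMPT_OVERHEAD + ANSWER_HEADROOM   # 3,824 tokens
assert used <= MAX_CONTEXT   # five 500-token chunks fit with plenty of headroom
```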
2. Query Understanding ↔ Retrieval Strategy
The intent output selects the retrieval chain (e.g., pure knowledge‑base vs. computational module). Extracted entities become filter predicates (e.g., timestamp >= 2023‑01‑01). Query rewriting defines the exact text sent to the vector store. Misalignment leads either to unused retrieval results or to completely misrouted queries.
3. Retrieval ↔ Context Generation
Returning the top‑10 raw fragments is not optimal. In practice, after re‑ranking we keep the top‑5 fragments, order them by relevance, and prepend a numeric label (e.g., [1] …) in the prompt. This provides sufficient information while limiting noise and avoiding the “Lost in the Middle” effect where the LLM focuses only on the beginning and end of a long prompt.
4. Generation Feedback ↔ Retrieval
If the LLM emits a “cannot find answer” response, the system automatically relaxes retrieval constraints (e.g., lower similarity threshold or switch to a pure BM25 search) and performs a second retrieval pass before re‑invoking the LLM. Retries are limited to 1‑2 attempts and each retry must use a different retrieval strategy to prevent infinite loops.
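A sketch of this fallback loop, assuming the retrieval and generation modules are passed in as callables; the strategy dictionaries and the "cannot find" string check are simplified stand-ins for real threshold configuration and refusal detection.

```python
from typing import Callable

RETRY_STRATEGIES = [
    {"mode": "hybrid", "min_score": 0.5},   # first pass: normal hybrid search
    {"mode": "hybrid", "min_score": 0.3},   # retry 1: relaxed similarity threshold
    {"mode": "bm25",   "min_score": 0.0},   # retry 2: pure keyword search
]

def answer_with_fallback(query: str,
                         retrieve: Callable[..., list[dict]],
                         generate: Callable[[str, list[dict]], str]) -> str:
    for strategy in RETRY_STRATEGIES:        # at most two retries, each with a new strategy
        chunks = retrieve(query, **strategy)
        reply = generate(query, chunks)
        if "cannot find" not in reply.lower():   # crude stand-in for real refusal detection
            return reply
    return "Sorry, the knowledge base does not seem to cover this question."
```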
5. End‑to‑End Monitoring
Instrument each module to emit logs:
Query Understanding – intent, entities, rewritten query.
Retrieval – fragment IDs, similarity scores, rank positions.
Generation – LLM answer, confidence score, cited sources.
Tracing these signals enables rapid pinpointing of the failure point (parsing, recall, or generation). A weekly audit of a sample of “bad cases” (e.g., 50) helps prioritize optimization efforts.
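A minimal structured-logging sketch along these lines: the field names mirror the signals listed above, a shared trace_id ties the three stages together, and the example values are illustrative.

```python
import json
import logging
import uuid

logger = logging.getLogger("rag")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_stage(trace_id: str, stage: str, **fields) -> None:
    """Emit one JSON log line per module so a full trace can be reconstructed."""
    logger.info(json.dumps({"trace_id": trace_id, "stage": stage, **fields}))

trace_id = uuid.uuid4().hex
log_stage(trace_id, "query_understanding", intent="kb_lookup",
          entities={"year": "2023"}, rewritten_query="2023 revenue of ...")
log_stage(trace_id, "retrieval", fragment_ids=["doc1#3", "doc7#0"],
          scores=[0.83, 0.79], ranks=[1, 2])
log_stage(trace_id, "generation", answer_len=182, confidence=0.74, cited=["[1]"])
```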
6. Cross‑Module Caching
Three‑layer caching reduces latency dramatically:
Embedding cache – Store query embeddings in Redis to avoid recomputation.
Retrieval result cache – Cache high‑frequency query‑to‑fragment mappings, bypassing vector lookup.
Answer cache – Cache FAQ‑style answers for instant response.
Appropriate TTLs ensure stale knowledge is evicted after knowledge‑base updates, allowing hot queries to be answered in <50 ms instead of seconds.
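A sketch of the answer-cache layer using Redis with a TTL, assuming a local Redis instance; the embedding and retrieval-result layers follow the same pattern with different key prefixes and TTLs. The key naming, TTL value, and generate_answer are illustrative placeholders.

```python
import hashlib
import redis   # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_answer(query: str, ttl_seconds: int = 3600) -> str:
    key = "answer:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:                     # hot query: returns in a few milliseconds
        return hit
    answer = generate_answer(query)         # placeholder for the full RAG pipeline
    r.setex(key, ttl_seconds, answer)       # TTL evicts stale answers after KB updates
    return answer
```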
Interview Blueprint: Designing a RAG System from Scratch
Overview (≈30 s) – State the four modules and their execution order.
Offline Process (≈1 min) – Describe document ingestion (e.g., parsing/OCR with MinerU), chunking (300‑500 tokens with a 50‑token overlap), metadata enrichment, embedding generation (e.g., BGE‑M3), and storage in Milvus plus a BM25 index.
Online Flow (≈2 min) – Explain query understanding (intent, entities, rewriting), dual vector + BM25 retrieval, RRF fusion, cross‑encoder re‑ranking to top‑5, prompt construction with source‑citation directives, and LLM inference.
Critical Linkages (≈1 min) – Emphasize chunk‑size vs. LLM window, intent‑driven retrieval routing, feedback‑driven re‑retrieval, and three‑layer caching that enables sub‑50 ms latency.
This structured answer demonstrates a holistic view of the system rather than isolated component knowledge.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, covering core skills such as LLMs, RAG, fine‑tuning, and deployment, to take you from zero to job offer. The content is tailored for career‑switchers, autumn campus‑recruitment candidates, and anyone seeking a stable large‑model position.