Comprehensive RAG Interview Q&A: 22 In-Depth Questions and Answers
This guide works through 22 core RAG interview questions: the definition and workflow of RAG, embedding and vector‑database selection, retrieval optimization, multi‑turn dialogue handling, context compression, evaluation metrics, knowledge‑graph integration, operational challenges, Agentic and hybrid RAG, document update strategies, similarity algorithms, and hallucination mitigation, with concrete examples and practical advice for AI interview preparation.
Question 1: What is RAG and what problem does it solve?
RAG (Retrieval‑Augmented Generation) addresses the knowledge limitation and hallucination problems of large language models by retrieving relevant information from an external knowledge base before generation, thereby expanding the model's knowledge scope and improving answer accuracy and traceability.
Question 2: What is the basic RAG workflow?
The workflow consists of two stages:
Indexing stage:
Split documents into appropriately sized chunks.
Encode each chunk with an embedding model to obtain vectors.
Store vectors in a vector database together with the original text as metadata.
Query stage:
Encode the user question into a vector.
Perform similarity search in the vector database to retrieve the most relevant chunks.
Combine the retrieved chunks with the question to form a prompt and let the LLM generate an answer.
Retrieval accuracy and recall are critical; poor retrieval degrades answer quality.
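To make the two stages concrete, here is a minimal end‑to‑end sketch, assuming the sentence‑transformers package; the model name, sample chunks, and in‑memory "index" are illustrative stand‑ins for a real chunker and vector database.

```python
# Minimal sketch of both RAG stages; the model name and in-memory
# "database" are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Indexing stage: chunk, embed, store ---
chunks = ["RAG retrieves context before generation.",
          "Vector databases store dense embeddings."]
index = model.encode(chunks, normalize_embeddings=True)   # (n_chunks, dim)

# --- Query stage: embed question, similarity search, build prompt ---
question = "How does RAG work?"
query_vec = model.encode([question], normalize_embeddings=True)
scores = index @ query_vec.T                 # cosine similarity (normalized)
best = np.argsort(-scores.ravel())[0]        # most relevant chunk
prompt = f"Context:\n{chunks[best]}\n\nQuestion: {question}"
# `prompt` would then be sent to an LLM for grounded generation.
```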
Question 3: Factors for selecting an embedding model
Four key factors:
Language support: Chinese models such as BGE, M3E, Text2Vec; English models include OpenAI's text-embedding-ada-002 and the sentence‑transformers family.
Vector dimension: Higher dimensions (e.g., 384, 768, 1024, 1536) improve expressive power but increase storage and compute cost.
Context length: Must cover the chunk size; common limits are 512, 1024, 2048, or 8192 tokens.
Performance metrics: Evaluate on standard benchmarks using MRR, NDCG, Recall@K, etc.
In practice, start with open‑source models; if performance is insufficient, consider commercial APIs.
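The two hard constraints, dimension and context length, are easy to check programmatically before committing to a model. A small sketch assuming sentence‑transformers, with an illustrative model name:

```python
# Inspect the two hard constraints before indexing; the checkpoint name
# is illustrative (any sentence-transformers model works the same way).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")
print(model.get_sentence_embedding_dimension())  # vector dimension, e.g. 512
print(model.max_seq_length)   # max input tokens; must cover your chunk size
```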
Question 4: Vector database options and characteristics
Faiss: Facebook's open‑source library, suitable for small‑scale local deployment; supports many index types but offers no built‑in persistence or distributed deployment.
Milvus: A dedicated vector DB with distributed deployment, multiple index types, and hybrid queries; ideal for large‑scale production.
Pinecone: A managed service, plug‑and‑play but at higher cost.
Weaviate: Supports vector and semantic search via GraphQL; feature‑rich but with a steeper learning curve.
pgvector: A PostgreSQL extension that leverages existing PG infrastructure.
Elasticsearch: Adds vector search to a traditional keyword engine, useful for combined keyword‑vector scenarios.
Selection should consider data scale, query QPS, and team tech stack.
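As a point of reference for the small‑scale local end of this spectrum, a minimal Faiss sketch; the random vectors stand in for real embeddings:

```python
# Minimal Faiss usage: exact inner-product search over normalized
# vectors, which is equivalent to cosine similarity.
import faiss
import numpy as np

dim = 768
vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(vectors)              # so inner product == cosine

index = faiss.IndexFlatIP(dim)           # exact (non-approximate) search
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)   # top-5 nearest chunks
```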
Question 5: Improving RAG retrieval accuracy
Three layers of optimization:
Query optimization: Query expansion with LLM‑generated synonyms, query decomposition into sub‑questions, and HyDE (Hypothetical Document Embeddings), which retrieves against the embedding of an LLM‑generated hypothetical answer (sketched below).
Index optimization: Semantic chunking, metadata enrichment, and storing document summaries for coarse ranking.
Reranking: Use a more precise model, such as a Cross‑Encoder, to rescore retrieved chunks.
Combining these techniques significantly boosts retrieval performance.
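As one concrete example from the query‑optimization layer, a sketch of HyDE assuming sentence‑transformers; `generate` stands in for any LLM call and is hypothetical:

```python
# HyDE sketch: embed an LLM-generated hypothetical answer instead of the
# raw query. `generate` is a hypothetical stand-in for any LLM call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_vector(question: str, generate) -> np.ndarray:
    # Ask the LLM for a plausible (possibly wrong) answer passage...
    hypothetical = generate(f"Write a short passage that answers: {question}")
    # ...then search with its embedding, which typically lies closer to
    # real answer chunks than the question's own embedding does.
    return embedder.encode([hypothetical], normalize_embeddings=True)
```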
Question 6: Handling multi‑turn dialogue in RAG
Strategies include:
History concatenation: Append previous turns to the query (may introduce noise and length issues).
Query rewriting: Use an LLM to rewrite the current question into an independent, complete query.
History summarization: Maintain a concise summary of the conversation as context.
Independent retrieval: Retrieve for both history and current question separately, then merge results.
Query rewriting often yields the best trade‑off between relevance and cost.
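A minimal sketch of that query‑rewriting strategy; `llm` stands in for any chat‑completion callable and is hypothetical:

```python
# Query rewriting: turn a context-dependent follow-up into a
# self-contained query. `llm` is a hypothetical completion callable.
def rewrite_query(history: list[str], question: str, llm) -> str:
    prompt = (
        "Rewrite the final user question so it is self-contained, "
        "resolving any pronouns or references to earlier turns.\n\n"
        "Conversation:\n" + "\n".join(history) +
        f"\n\nFinal question: {question}\nRewritten question:"
    )
    return llm(prompt)

# e.g., history = ["User: Tell me about Milvus.", "Assistant: ..."],
# question = "How does it scale?"  ->  "How does Milvus scale?"
```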
Question 7: Context compression for retrieved results
Compression mitigates prompt length limits and reduces irrelevant information. Methods:
Map‑Reduce: Process each document segment independently, keeping only the relevant parts.
Refine: Iteratively refine an initial answer by folding in additional documents one at a time.
Relevance filtering: A lightweight model scores each segment, retaining the high‑scoring ones.
LLM compression: Summarize chunks with a language model.
In practice, a coarse filter followed by LLM‑based summarization works well.
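A sketch of the LLM‑compression step, the second half of that pattern; `llm` again stands in for any completion callable and is hypothetical:

```python
# LLM-based context compression: extract only the sentences the
# question needs. `llm` is a hypothetical completion callable.
def compress_chunks(question: str, chunks: list[str], llm) -> str:
    prompt = (
        f"Question: {question}\n\n"
        "From the passages below, extract only the sentences needed to "
        "answer the question. Discard everything else.\n\n"
        + "\n---\n".join(chunks)
    )
    return llm(prompt)
```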
Question 8: Evaluating RAG systems
Evaluation separates retrieval and generation:
Retrieval metrics: Recall@K, MRR, NDCG.
Generation metrics: Answer relevance, factual fidelity, completeness.
Methods include manual test‑set annotation, LLM‑based automatic scoring, and end‑to‑end human evaluation. Continuous evaluation loops with user feedback are essential for production.
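For reference, minimal single‑query implementations of two of the retrieval metrics named above; production harnesses average these over a labeled test set:

```python
# Reference implementations of Recall@K and MRR for one query.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank   # reciprocal rank of first relevant hit
    return 0.0
```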
Question 9: Combining RAG with knowledge graphs
RAG excels at unstructured text, while knowledge graphs handle structured relationships. Integration patterns:
Graph‑enhanced retrieval: Entity linking and relation reasoning expand the query before text retrieval.
Structured RAG: Retrieve triples directly from the graph as additional context (sketched below).
Hybrid reasoning: Let the model attend to both text fragments and graph triples.
Implementation can use frameworks like GraphRAG; key challenges are accurate entity linking and graph coverage.
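A sketch of the structured‑RAG pattern, serializing triples into the prompt next to retrieved passages; the example triples and prompt format are illustrative:

```python
# Structured RAG sketch: graph triples and text chunks share one prompt.
def build_prompt(question: str, chunks: list[str],
                 triples: list[tuple[str, str, str]]) -> str:
    facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return ("Facts from the knowledge graph:\n" + facts +
            "\n\nPassages:\n" + "\n".join(chunks) +
            f"\n\nQuestion: {question}")

# Illustrative call:
# build_prompt("Who develops Milvus?",
#              ["Milvus is an open-source vector database..."],
#              [("Milvus", "developed_by", "Zilliz")])
```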
Question 10: Dealing with knowledge conflicts
When retrieved chunks contain contradictory information, strategies include:
Confidence ranking: Weight sources by authority, recency, and relevance.
Timeliness priority: Prefer the most recent documents for time‑sensitive queries.
Multi‑source verification: Require independent sources to agree before trusting.
Uncertainty expression: Let the model explicitly state that multiple viewpoints exist.
Conflict detection: Use an LLM to flag contradictions and either ask the user for clarification or present all viewpoints.
The appropriate approach depends on the application’s tolerance for ambiguity.
Question 11: RAG performance optimization
Optimization focuses on indexing, retrieval, and generation:
Indexing: Choose a suitable vector index (HNSW for high query volume, IVF for frequent updates; see the HNSW sketch below), shard large indexes, and apply quantization (e.g., PQ) to reduce storage.
Retrieval: Cache hot queries, pre‑filter with metadata, and run parallel searches.
Generation: Stream output for better UX, use model distillation to replace large models, and compress prompts to retain only the most relevant context.
Profiling typically shows retrieval as the bottleneck; prioritize vector index tuning.
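As a concrete example of the index choice, a Faiss HNSW sketch; the M and efSearch values are illustrative starting points, not tuned settings:

```python
# HNSW index in Faiss; parameter values are starting points to tune.
import faiss
import numpy as np

dim = 768
index = faiss.IndexHNSWFlat(dim, 32)      # M=32 graph neighbors per node
index.hnsw.efConstruction = 200           # build-time quality/speed knob
index.hnsw.efSearch = 64                  # query-time recall/latency knob

vectors = np.random.rand(100_000, dim).astype("float32")
index.add(vectors)
scores, ids = index.search(vectors[:1], k=10)
```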
Question 12: Enterprise deployment challenges
Challenges span data, technology, and operations:
Data: Inconsistent document quality, access‑control requirements, and frequent updates demand cleaning, standardization, and real‑time sync.
Technical: Integration with existing systems (e.g., enterprise chat, OA), high‑availability deployment, and security/compliance (keeping data in‑domain, redaction).
Operations: Continuous performance monitoring, cost control, and user training.
Success requires close collaboration among product, engineering, and ops teams.
Question 13: Core RAG principle
RAG retrieves relevant external knowledge before generation, allowing the LLM to ground its answers in factual content. This expands the model’s knowledge boundary and improves accuracy and traceability. The two‑step process (indexing → query) is now the mainstream solution for enterprise Q&A, intelligent assistants, and document analysis.
Question 14: Choosing document chunk size and top‑K
Chunk size is determined experimentally; a typical sweet spot is a few hundred tokens (256‑512). Too small loses context; too large dilutes relevance. Chunking should respect paragraph or semantic boundaries.
Top‑K balances recall and cost; start with 10‑20, then tune based on test‑set Recall and Precision. Dynamic adjustment per query complexity is also common.
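Returning to chunking, a minimal fixed‑size chunker with overlap, sketched in characters for simplicity; production code usually counts tokens and respects semantic boundaries:

```python
# Fixed-size chunking with overlap (character-based for simplicity).
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # overlap preserves context across cuts
    return chunks
```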
Question 15: What is Agentic RAG?
Agentic RAG combines an AI agent with RAG, allowing the agent to decide retrieval strategies, perform multi‑step searches, verify information, and ask clarification questions. Compared with traditional one‑shot RAG, Agentic RAG can handle more complex, multi‑hop queries at the expense of higher latency and cost.
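A toy sketch of the agentic loop, with the LLM deciding whether to keep retrieving or to answer; `decide`, `retrieve`, and `answer` are hypothetical callables supplied by the host application:

```python
# Toy agentic RAG loop; all three callables are hypothetical stand-ins
# (an LLM router, a retriever, and a grounded generator).
def agentic_rag(question: str, decide, retrieve, answer,
                max_steps: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        action, query = decide(question, context)  # "search" or "answer"
        if action == "answer":
            break                                  # enough evidence gathered
        context.extend(retrieve(query))            # one more retrieval hop
    return answer(question, context)
```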
Question 16: Benefits of hybrid RAG
Hybrid RAG mixes multiple retrieval modalities (vector + keyword, sparse + dense, internal KB + external search, multimodal). This complementary approach improves recall and precision, especially for queries like product model numbers where pure semantic search may miss exact matches. Fusion methods such as Reciprocal Rank Fusion (RRF) combine results.
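A minimal RRF implementation; k=60 is the constant from the original RRF paper and a common default:

```python
# Reciprocal Rank Fusion: merge several ranked lists into one.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:                     # one list per retriever
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf([vector_ranking, keyword_ranking]) fuses both modalities.
```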
Question 17: Handling document updates and incremental indexing
Two main strategies:
Full rebuild: Simple but only viable for low‑frequency updates.
Incremental update: Version documents, add new vectors, delete removed ones, and rebuild only the affected partitions. Real‑time or scheduled batch updates keep the service uninterrupted, often using dual‑index switching for a seamless transition.
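A sketch of the incremental path keyed on content hashes; `db` and `embed` are hypothetical stand‑ins for a vector store (with hash lookup, delete, and insert) and an embedding function:

```python
# Incremental sync keyed on a content hash; `db` and `embed` are
# hypothetical stand-ins, not a real vector-store API.
import hashlib

def sync_document(doc_id: str, text: str, embed, db) -> None:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if db.get_hash(doc_id) == digest:
        return                                 # unchanged: skip re-embedding
    db.delete(doc_id)                          # drop stale vectors
    db.insert(doc_id, embed(text), metadata={"hash": digest})
```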
Question 18: Why choose Milvus as the vector database?
Milvus is purpose‑built for vector search: it offers a distributed architecture for billion‑scale vectors, supports hybrid scalar filtering, provides rich SDKs (Python, Java, Go) plus LangChain/LlamaIndex integrations, and can be deployed on Kubernetes for cloud‑native operations. Compared with single‑node Faiss, it scales horizontally; compared with managed services like Pinecone, it allows private deployment and full data control.
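A minimal connection sketch, assuming the MilvusClient interface from recent pymilvus releases; the endpoint, collection name, and toy vectors are illustrative, and exact signatures vary by version:

```python
# Quick-start sketch with pymilvus's MilvusClient (recent versions);
# endpoint and data are illustrative placeholders.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=768)
client.insert(collection_name="docs",
              data=[{"id": 1, "vector": [0.1] * 768, "text": "chunk"}])
hits = client.search(collection_name="docs", data=[[0.1] * 768], limit=5)
```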
Question 19: Difference between recall and rerank
The recall stage quickly retrieves a large candidate set using approximate nearest‑neighbor search, prioritizing high recall.
The rerank stage applies a more accurate model (e.g., a Cross‑Encoder) to reorder the top candidates, achieving higher precision. The two‑stage design balances efficiency and effectiveness.
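A sketch of the two‑stage pipeline using common public checkpoints (illustrative, not a prescription); in production the document embeddings would come from the index rather than being recomputed per query:

```python
# Two-stage retrieval: bi-encoder recalls broadly, cross-encoder
# reorders the top candidates precisely.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, chunks: list[str],
           recall_k: int = 50, final_k: int = 5) -> list[str]:
    q = bi_encoder.encode([query], normalize_embeddings=True)
    docs = bi_encoder.encode(chunks, normalize_embeddings=True)
    recalled = np.argsort(-(docs @ q.T).ravel())[:recall_k]   # fast, broad
    candidates = [chunks[i] for i in recalled]
    scores = reranker.predict([(query, c) for c in candidates])  # slow, precise
    return [candidates[i] for i in np.argsort(-scores)[:final_k]]
```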
Question 20: What is an embedding?
An embedding maps high‑dimensional discrete data (text, image, audio) into a fixed‑dimensional dense vector where semantically similar items are close in space. In RAG, embeddings are used to vectorize documents and queries, enabling similarity search via cosine similarity or Euclidean distance. Common models include BGE, M3E, Text2Vec for Chinese and OpenAI’s text-embedding-ada-002 or sentence‑transformers for English.
Question 21: Similarity algorithms used in recall
Typical metrics:
Cosine similarity: Measures direction only, ignoring magnitude; robust for text semantics.
Euclidean distance: Straight‑line distance; sensitive to vector magnitude, so vectors are often normalized first.
Dot product: Fastest; combines direction and magnitude, and is equivalent to cosine similarity when vectors are normalized.
Vector databases usually default to cosine similarity after vector normalization.
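A short numpy check of the relationship stated above: once vectors are L2‑normalized, the dot product equals cosine similarity, and Euclidean distance becomes a fixed function of it:

```python
# After L2 normalization: dot == cosine, and ||a-b||^2 == 2 - 2*cosine.
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2 normalization

cosine = float(np.dot(a, b))            # == dot product once normalized
euclidean = float(np.linalg.norm(a - b))
assert np.isclose(euclidean**2, 2 - 2 * cosine)
```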
Question 22: Mitigating hallucinations in large models
Key techniques:
RAG grounding: Force the model to base answers on retrieved factual content.
Prompt constraints: Explicitly instruct the model to answer only from the provided context and to say "I don't know" when uncertain (see the prompt sketch below).
Fact verification: Use external tools or rule‑based checks on critical statements.
Citation: Require the model to cite source documents.
Confidence scoring: Let the model output a confidence level and fall back to human review when it is low.
Multi‑model cross‑validation: Compare answers from independent models and flag inconsistencies.
While hallucinations cannot be eliminated entirely, these measures dramatically reduce their occurrence and improve trustworthiness.
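A sketch combining the prompt‑constraint and citation techniques; the wording is illustrative and should be tuned per model:

```python
# Illustrative grounded-answer template; tune the wording per model.
GROUNDED_PROMPT = """Answer using ONLY the context below.
After each claim, cite the supporting source id in brackets, e.g. [doc-3].
If the context does not contain the answer, reply exactly: I don't know.

Context:
{context}

Question: {question}
Answer:"""
```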
AI Illustrated Series
Illustrated hardcore tech: AI, agents, algorithms, databases—one picture worth a thousand words.