40 Must‑Know GenAI Interview Questions: From RAG Pipelines to Multi‑Agent Orchestration

This comprehensive guide compiles 40 senior‑level GenAI interview questions covering LLM fundamentals, retrieval‑augmented generation, prompt engineering, multi‑agent orchestration, fine‑tuning, evaluation, system design, NL‑to‑SQL, and knowledge‑graph retrieval, providing concise, accurate answers and practical trade‑off insights.

AI Waka
AI Waka
AI Waka
40 Must‑Know GenAI Interview Questions: From RAG Pipelines to Multi‑Agent Orchestration

Part 1: LLM Fundamentals

Q1. What is the difference between a Base Model and an Instruction‑tuned Model? A base model is trained on massive corpora for next‑token prediction and can complete text but does not reliably follow instructions. Instruction‑tuned models (e.g., GPT‑4, Claude) undergo further fine‑tuning on curated instruction‑response pairs (often via RLHF or RLAIF) to align outputs with user intent. In production, instruction‑tuned variants are almost always preferred unless you need a highly specialized full‑model fine‑tune.

Q2. Explain the attention mechanism in Transformers and why it is crucial for LLMs. Attention lets each token weigh all other tokens in the sequence, computing a weighted sum of value vectors. The core innovation is learning query‑key dot‑product scores, enabling efficient capture of long‑range dependencies that RNNs cannot handle. Self‑attention empowers LLMs to resolve coreference, track context across thousands of tokens, and perform multi‑step reasoning.

Q3. What is a context window and what challenges arise with large windows? The context window is the maximum number of tokens a model can process in a single forward pass. Large windows (e.g., 128k+ in GPT‑4o, Claude 3.7) improve context learning but incur quadratic O(n²) attention cost, stressing memory and compute. Models also exhibit a “Lost in the Middle” problem where information in the middle of long contexts is less likely to be retrieved accurately.

Q4. What is temperature and how does it affect generation? Temperature scales logits before the softmax. At 0 the model always picks the highest‑probability token (greedy). At 1 the distribution is unchanged. Values >1 flatten the distribution, making output more random. Low temperature (0.0‑0.3) is best for factual tasks; higher temperature (0.7‑1.0) suits creative generation.

Q5. How do Top‑k and Top‑p (nucleus) sampling differ? Top‑k limits sampling to the k most probable tokens. Top‑p selects the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting to distribution entropy. Top‑p is generally preferred because it balances diversity and relevance across varying entropy conditions.

Part 2: Retrieval‑Augmented Generation (RAG)

Q6. What problem does RAG solve and what are its core components? LLMs have fixed knowledge cut‑offs and can hallucinate facts. RAG grounds generation in retrieved documents, combining language ability with up‑to‑date or domain‑specific knowledge. Core components: (1) document ingestion pipeline with chunking and embedding, (2) vector database for similarity search, (3) retriever, and (4) LLM generator that synthesizes answers from retrieved context.

Q7. How should you choose a chunking strategy? It depends on document type and query nature. Fixed‑size chunks (e.g., 512 tokens with 50‑token overlap) are simple but ignore semantic boundaries. Semantic chunking groups sentences by embedding similarity. Hierarchical chunking creates parent‑child relationships, allowing retrieval of small chunks while sending parent chunks for full context. For legal or structured docs, structure‑aware chunking that respects headings usually outperforms pure token‑based methods.

Q8. What is hybrid search and when does it outperform pure vector search? Hybrid search combines dense (vector) retrieval with sparse (BM25/TF‑IDF) retrieval, then re‑ranks using Reciprocal Rank Fusion (RRF) or a learned re‑ranker. Pure vector search excels at semantic similarity but struggles with exact keyword queries (e.g., product codes). In enterprise settings where queries mix semantic and keyword intent, hybrid search yields superior results.

Q9. Explain the difference between a reranker and a bi‑encoder. A bi‑encoder encodes query and document independently into fixed vectors and computes similarity via dot product—fast but coarse. A reranker (cross‑encoder) concatenates query and document, applying cross‑attention to score relevance—slow but much more accurate. Best practice: use a bi‑encoder for initial large‑scale retrieval, then apply a cross‑encoder reranker on the top‑k candidates.

Q10. How do you evaluate a RAG pipeline? Use the RAGAS framework, which assesses four dimensions: (1) Faithfulness – does the answer’s claim stem from retrieved context? (2) Answer Relevance – does the answer actually solve the question? (3) Context Precision – are the retrieved passages relevant? (4) Context Recall – does the retrieved context contain the needed information? In production, faithfulness and context precision are the most critical metrics for catching hallucinations and retrieval drift.

Q11. What is the “Lost in the Middle” issue in RAG? Studies show LLMs preferentially use information at the beginning or end of the context window, ignoring middle portions. Mitigations include re‑ordering chunks to place the most relevant ones first, using “boundary token padding,” or reducing the number of retrieved chunks.

Q12. What failure modes do native RAG pipelines exhibit in production? (1) Mismatched chunk granularity – chunks too large dilute signal, too small lose context. (2) Embedding model mismatch with query domain. (3) Missing re‑ranking – highest cosine similarity does not guarantee relevance. (4) Absence of guardrails for irrelevant queries – LLM may hallucinate. (5) Lack of citation tracking – answers cannot be audited. Explicit mitigations are required for each in a production system.

Part 3: Prompt Engineering

Q13. What is Chain‑of‑Thought (CoT) prompting and when is it helpful? CoT prompts ask the model to reason step‑by‑step before giving a final answer. This improves performance on arithmetic, multi‑step reasoning, and tasks requiring intermediate logic, especially for larger models (7B+ parameters). For simple classification or retrieval tasks, CoT adds latency without benefit.

Q14. Contrast zero‑shot, few‑shot, and fine‑tuning approaches. Zero‑shot provides only a task description. Few‑shot includes 2‑8 demonstration examples in the prompt. Fine‑tuning adjusts model weights on a task‑specific dataset. In most classification/extraction tasks, few‑shot can close 70‑80% of the gap between zero‑shot and full fine‑tuning. Fine‑tuning is justified when you need consistent formatting, domain‑specific terminology, or sub‑100 ms latency at scale.

Q15. What is prompt injection and how can you defend against it? Prompt injection occurs when user input overwrites system instructions (e.g., “ignore previous instructions”). Defenses: (1) Use XML tags or special tokens to clearly separate user input from system prompts. (2) Validate and sanitize user input. (3) Run a separate LLM classifier before the main chain to detect malicious inputs. (4) Apply output verification to catch unexpected behavior. Prompt injection is a primary security vulnerability in LLM applications.

Part 4: Multi‑Agent Systems

Q16. What is a multi‑agent LLM system and when should you use it? Multi‑agent orchestration connects several LLM‑driven agents, each with a specialized role (researcher, coder, reviewer, etc.) to collaboratively solve complex tasks. Use it when (1) the task exceeds a single context window, (2) parallel execution yields speed gains, or (3) specialization and verification are needed (e.g., one agent writes code, another tests it). Frameworks such as CrewAI, LangGraph, and AutoGen provide orchestration primitives.

Q17. Compare the ReAct and Plan‑and‑Execute agent architectures. ReAct interleaves reasoning and action in a single loop: think → act → observe → repeat. It works well for simple, single‑step tool usage. Plan‑and‑Execute separates planning from execution: a planner first generates a full task plan, then subordinate agents carry out each step. This architecture better handles long‑horizon tasks because the overall goal remains visible throughout execution.

Q18. How do you prevent agents from entering infinite loops? (1) Impose hard limits on iteration or tool‑call counts. (2) Implement loop‑detection heuristics (e.g., repeated state or action sequences). (3) Use a supervisory agent to monitor sub‑agents and intervene. (4) Design termination criteria explicitly in the task‑decomposition prompt. (5) Set budget constraints on token usage or API calls.

Q19. How do agents share state in LangGraph? LangGraph represents the workflow as a directed graph; state is passed via a shared TypedDict object. Each node (agent or tool) reads and writes to this state, and conditional edges route execution based on state values. Sub‑graphs can isolate their own state schemas, and messages across graph boundaries prevent cross‑contamination.

Q20. What role does memory play in agent systems? Agents use several memory types: (1) Context memory – the current conversation or task history within the context window. (2) External situational memory – a vector store of past interactions, retrieved on demand. (3) Semantic memory – a factual knowledge base, often the RAG corpus. (4) Procedural memory – learned workflows or tool‑use patterns, sometimes stored as fine‑tuned weights. Managing what resides in the context window versus external memory is a key performance lever.

Part 5: Fine‑Tuning & Alignment

Q21. When should you choose fine‑tuning over RAG? Use RAG when knowledge must be updatable, auditable, or when you want factual grounding without retraining. Choose fine‑tuning when you need consistent output format or style, the task requires skills absent from the base model, latency is critical and you cannot afford a retrieval step, or you have >1,000 high‑quality labeled examples. In practice, a hybrid architecture—fine‑tuning for format/style and RAG for factual grounding—often yields the best results.

Q22. What is LoRA and why is it preferable to full‑parameter fine‑tuning? Low‑Rank Adaptation (LoRA) freezes the pretrained weights and injects trainable low‑rank matrices into each Transformer layer. This reduces trainable parameters by roughly 10,000×, enabling 7B+ models to be fine‑tuned on a single GPU. QLoRA further quantizes weights to 4‑bit, making 65B models fine‑tunable on a 48 GB GPU. LoRA/QLoRA deliver >90% of full‑fine‑tuning performance at a fraction of the cost.

Q23. What is RLHF and what are its known limitations? Reinforcement Learning from Human Feedback trains a reward model on human preference data, then uses PPO to optimize the LLM. It produces today’s most aligned models. Limitations include (1) reward hacking – the model learns to game the reward model, (2) annotator bias and disagreement, (3) expensive and slow iteration cycles, and (4) distribution collapse if KL penalties are insufficient. Direct Preference Optimization (DPO) is emerging as a simpler alternative that avoids explicit RL.

Part 6: LLM Evaluation & Observability

Q24. How do you measure LLM output quality in production? Combine (1) automated LLM‑as‑judge scoring (using a stronger model to grade outputs), (2) task‑specific metrics (e.g., ROUGE/BLEU for summarization, F1 for extraction, pass@k for code), (3) RAGAS metrics for retrieval‑augmented pipelines, and (4) human evaluation for high‑risk use cases. Track metric distributions over time; sudden shifts often signal emerging issues.

Q25. What is LLM‑as‑Judge and what failure modes does it have? A powerful LLM (e.g., GPT‑4, Claude 3 Opus) judges the outputs of a weaker model. It correlates highly with human judgments at scale. Failure modes: (1) Position bias – preferring the first option in pairwise comparisons, (2) Self‑enhancement bias – rating its own outputs higher, (3) Length bias – longer outputs receive higher scores regardless of quality. Mitigations include randomizing option order, using ensemble judges, and normalizing for length.

Q26. Which observability tools are commonly used for LLM systems? LangSmith (for LangChain), Phoenix (Arize), Langfuse, and PromptLayer are popular. Key signals to log: prompt text, completion, token usage, latency, tool‑call results, retrieval scores, and final quality scores. Set automated alerts for latency spikes, high token consumption, or drops in LLM‑as‑judge scores.

Part 7: Architecture & System Design

Q27. Design a production‑grade RAG system for a corporate knowledge base of 10,000 documents. Critical choices: (1) Ingestion – use structure‑aware chunking (LlamaParse or Unstructured.io) on PDFs/Docx, embed with text‑embedding‑3‑large or a fine‑tuned E5 model. (2) Storage – select Pinecone, Weaviate, or pgvector based on scale. (3) Retrieval – hybrid search (BM25 + dense) with cross‑encoder re‑ranking. (4) Generation – LLM with citation‑aware structured output; fall back to “I don’t know” when context score falls below a threshold. (5) Observability – instrument queries in LangSmith and alert on faithfulness score drops.

Q28. How to handle multi‑tenant isolation in a RAG system? Use namespaces or metadata filtering at the vector‑database level (e.g., Pinecone namespaces or Weaviate tenant filters). Include a tenant_id metadata field and enforce it in every query. For highly sensitive data, provision separate vector indexes per tenant. Combine with API‑level RBAC for added security.

Q29. Strategies to reduce LLM latency in a production chatbot. (1) Streaming – send tokens to the client as they are generated. (2) Model size matching – route simple queries to a smaller, faster model, and complex ones to a larger model. (3) Prompt caching – reuse system‑prompt prefixes (supported by Anthropic and OpenAI). (4) Asynchronous tool calls – run independent tools in parallel. (5) Quantized inference – deploy 4‑bit or 8‑bit models for on‑premise serving.

Q30. What are guardrails in LLM applications? Guardrails are validation layers before and after LLM calls to enforce safety and policy compliance. Input guardrails include topic classifiers, PII detectors, and injection detectors. Output guardrails include hallucination detectors, format validators, and toxicity filters. Frameworks such as NeMo Guardrails and Guardrails AI provide ready‑made components, especially critical in regulated domains like healthcare or finance.

Part 8: NL‑to‑SQL & Structured Data

Q31. Main failure modes of NL‑to‑SQL systems. (1) Schema ambiguity – different tables use different names for the same concept. (2) Complex joins – LLM mis‑reasoning on multi‑table joins. (3) Hallucinated column/table names. (4) Incorrect aggregation (e.g., SUM vs COUNT). (5) Date/time arithmetic errors. Mitigations: provide schema examples, few‑shot demonstrations, a SQL validation layer, and execution‑feedback loops that let the LLM self‑correct.

Q32. How to handle very large schemas in NL‑to‑SQL? Use a two‑stage retrieval: first retrieve the most relevant tables based on the user query (using embeddings), then pass only those tables’ DDL and sample rows to the LLM for generation. Tools like Vanna.ai implement this pattern. For extremely large schemas, train a schema‑understanding module on query‑schema pairs.

Part 9: Knowledge Graphs & Structured Retrieval

Q33. When should you use a knowledge graph instead of a vector database? Knowledge graphs excel when (1) inter‑entity relationships are semantically important, (2) multi‑step reasoning across entities is required, or (3) precise, auditable provenance is needed. Microsoft’s GraphRAG combines graph traversal with vector search to excel on community‑level, multi‑entity queries.

Q34. What is GraphRAG and how does it differ from standard RAG? GraphRAG builds an entity‑relationship graph during indexing. At query time it retrieves relevant graph communities rather than individual chunks, then summarizes the community and feeds it as context. This improves performance on global, integrative questions but adds indexing complexity and higher computational cost.

Part 10: Concept Quickfire

Q35. Difference between semantic search and keyword search. Semantic search uses dense embeddings to find conceptually similar content, handling synonyms and paraphrases. Keyword (BM25) search matches exact or stemmed terms, excelling at proper nouns and precise identifiers.

Q36. What is function calling in LLMs? The model returns a structured JSON object specifying the function to invoke and its arguments, instead of (or in addition to) free‑form text. This enables reliable tool usage and underpins most agent frameworks.

Q37. What is Constitutional AI? Proposed by Anthropic, it aligns models by training them to critique and revise their own outputs against a set of principled “constitution” rules, reducing reliance on human labelers for harmful content.

Q38. What is speculative decoding? A latency‑reduction technique where a small draft model generates multiple tokens in parallel, and a larger target model validates them in a single forward pass. This can accelerate long‑form generation by 2‑3×.

Q39. Main differences among LangChain, LlamaIndex, and LangGraph. LangChain is a general‑purpose LLM application framework with many integrations. LlamaIndex focuses on data ingestion, indexing, and RAG pipelines. LangGraph provides a low‑level graph execution engine for stateful, multi‑step agent workflows, offering fine‑grained control over state and flow.

Q40. Difference between online and offline LLM evaluation. Offline evaluation runs on fixed benchmark datasets—fast, reproducible, but may not reflect production distribution. Online evaluation samples live traffic and uses LLM‑as‑judge scoring to capture real‑world user behavior—more representative but slower to iterate. Best practice: maintain a curated offline regression suite and continuously monitor online metrics after deployment.

Conclusion

The 40 questions above capture the core knowledge areas senior GenAI and LLM engineers are expected to master today. The field evolves rapidly—architectures that were cutting‑edge in 2023 (e.g., native RAG) are now baseline, and interviewers now probe deeper into agent memory, GraphRAG, and multi‑modal pipelines. Success hinges not only on definitions but on the ability to weigh trade‑offs: when to pick hybrid search over pure vector search, when fine‑tuning beats prompt engineering, or why LangGraph may be preferable to LangChain for a given workflow.

LLMprompt engineeringRAGInterview preparationmulti-agent systemsGenAI
AI Waka
Written by

AI Waka

AI changes everything

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.