Failed Alibaba Interview: The 4 RAG Modules and 6 Design Principles You Need

This article dissects a failed Alibaba second‑round interview in which the candidate, asked to design a RAG system, answered only “use vector‑search enhancement”. It then presents a systematic four‑module RAG architecture together with six design principles, walking through indexing, query understanding, multi‑path recall, and context generation in enough detail for candidates to demonstrate real technical depth.


Interview Background

A candidate in an Alibaba second‑round interview was asked to design a Retrieval‑Augmented Generation (RAG) system from scratch, specifying the four core modules and the design principles. The candidate answered only “use vector‑search enhancement”, which ignored the question’s explicit requirement for four modules and design principles and led to a rejection.

Core Lesson

Technical architecture must be tailored to the concrete scenario; merely listing advantages is insufficient.

Advanced RAG Architecture

The complete RAG system consists of four tightly coupled engines forming a closed loop:

Advanced knowledge‑base construction engine

Query‑understanding engine (the scheduling hub)

Multi‑path recall engine (intelligent parallel search)

Intelligent context‑generation engine (the brain)

All four engines must work together; missing any component makes the system fragile.

Module 1: Advanced Index Construction Engine

This engine is the foundation. For a knowledge base of 5,000 mixed documents (PDF, Word, scanned images, long texts), the indexing process includes three layers:

Metadata/keyword index: Extract title, timestamp, type, abstract, and keywords from each document and store them in Elasticsearch. This coarse filter reduces the candidate set from 5,000 to 50‑100 documents in milliseconds.

Vector index: Perform semantic chunking (structured semantic split, OCR‑based split for semi‑structured files, multi‑level split for long texts), embed chunks with the BGE‑M3 model, and store vectors in Milvus for semantic matching.

Knowledge‑graph index: Use an LLM to discover semantic and logical relations between chunks, build a graph stored in Neo4j, and solve the “retrieval fragmentation” problem.

These three indexes cooperate rather than run sequentially: metadata quickly narrows the scope, vector search improves semantic precision, and the knowledge graph restores completeness by linking related chunks across documents.
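
To make the cooperation concrete, here is a minimal sketch of indexing one document into all three layers. The clients, helper callables, field names, index names, and Cypher schema are illustrative assumptions, not any specific SDK's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    text: str

def index_document(doc: dict, es, milvus, graph, split_fn, embed_fn, relations_fn):
    """Index one document into all three layers (all dependencies injected)."""
    # Layer 1: metadata/keyword index -- the millisecond coarse filter.
    es.index(index="doc_meta", id=doc["id"], document={
        "title": doc["title"], "doc_year": doc["year"],
        "doc_source": doc["source"], "abstract": doc["abstract"],
        "keywords": doc["keywords"],
    })

    # Layer 2: vector index -- semantic chunks embedded (e.g., with BGE-M3).
    chunks = [Chunk(f"{doc['id']}-{i}", doc["id"], text)
              for i, text in enumerate(split_fn(doc["text"]))]
    vectors = embed_fn([c.text for c in chunks])
    milvus.insert(collection_name="chunks",
                  data=[{"id": c.chunk_id, "vector": v, "text": c.text}
                        for c, v in zip(chunks, vectors)])

    # Layer 3: knowledge-graph index -- LLM-discovered relations into Neo4j.
    for head_id, relation, tail_id in relations_fn(chunks):
        graph.run("MERGE (a:Chunk {id: $h}) MERGE (b:Chunk {id: $t}) "
                  "MERGE (a)-[:RELATED {type: $r}]->(b)",
                  h=head_id, t=tail_id, r=relation)
```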

Module 2: Query‑Understanding Engine

This engine parses user intent and translates it into retrieval instructions that respect the previously defined “index contract”. For the 5,000‑document scenario, the process includes intent detection, entity extraction, and generation of a structured query such as {filter:{doc_year:2023, doc_source:"JD"}}. The contract guarantees that the query can be satisfied by the existing indexes.
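
A minimal sketch of this step, assuming an llm callable that returns JSON; the contract fields follow the article's example (doc_year, doc_source), and the helper itself is hypothetical:

```python
import json

CONTRACT_FIELDS = {"doc_year", "doc_source", "doc_type"}  # assumed contract

def understand_query(question: str, llm) -> dict:
    """Turn a raw question into a contract-conformant retrieval instruction."""
    prompt = (
        "Extract retrieval filters from the question as JSON, with keys drawn "
        f"only from {sorted(CONTRACT_FIELDS)}, plus a 'semantic_query' string "
        "for vector search.\nQuestion: " + question
    )
    parsed = json.loads(llm(prompt))
    # Enforce the contract: drop any field the indexes cannot satisfy.
    filters = {k: v for k, v in parsed.items() if k in CONTRACT_FIELDS}
    return {"filter": filters,
            "semantic_query": parsed.get("semantic_query", question)}
```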

Module 3: Multi‑Path Recall Engine

The engine executes four parallel recall routes and then fuses the results (a sketch of the parallel execution follows the list):

Document‑level coarse recall using the metadata index (Elasticsearch) to shrink the candidate set.

Chunk‑level semantic recall using the vector index (Milvus) to retrieve the top‑20 most relevant chunks.

Chunk‑level keyword recall using the inverted index (Elasticsearch) to handle cases where semantics match but keywords differ.

Association‑level recall using the knowledge‑graph index (Neo4j) to fetch related chunks across documents.
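
A minimal sketch of the parallel execution, assuming each route is wrapped in a callable; the route functions are stand-ins for the index queries described above:

```python
from concurrent.futures import ThreadPoolExecutor

def multi_path_recall(query: dict, routes: dict) -> list[dict]:
    """Run every recall route in parallel, then merge and deduplicate.

    routes maps a route name ("meta", "vector", "keyword", "graph") to a
    callable that takes the structured query and returns candidate chunks.
    """
    with ThreadPoolExecutor(max_workers=len(routes)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in routes.items()}
        results = {name: f.result() for name, f in futures.items()}
    # Deduplicate by chunk id, remembering which routes found each chunk.
    merged: dict[str, dict] = {}
    for route, chunks in results.items():
        for chunk in chunks:
            entry = merged.setdefault(chunk["chunk_id"], {**chunk, "routes": []})
            entry["routes"].append(route)
    return list(merged.values())
```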

After retrieval, a cross‑encoder re‑ranks all candidates, removes redundancy, and performs a hierarchical sort based on combined scores, association priority, and document priority, achieving both speed and accuracy.
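
A hedged sketch of the re-ranking step using the CrossEncoder class from sentence-transformers; the model choice (BAAI/bge-reranker-base) and the priority field names are assumptions consistent with the description above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # assumed model choice

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Score every deduplicated candidate against the query, sort hierarchically."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    for c, s in zip(candidates, scores):
        c["score"] = float(s)
    # Hierarchical sort: fused score, then association and document priority.
    candidates.sort(key=lambda c: (c["score"],
                                   c.get("association_priority", 0),
                                   c.get("doc_priority", 0)),
                    reverse=True)
    return candidates[:top_k]
```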

Module 4: Intelligent Context‑Generation Engine

This engine acts as the system’s brain. It uses precise prompts, hallucination suppression, and multi‑turn dialogue stitching to generate coherent answers while preserving factual correctness.

The Six Core Design Principles

1. Chunk Size & Online Recall Budget

Allocate the LLM’s context window (e.g., 8K tokens) into fixed costs (system prompts, dialogue history) and a variable budget for recalled chunks. For example, with 2K tokens reserved for prompts, 6K tokens remain for retrieved material, guiding the number and size of chunks.
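
The arithmetic of this example, spelled out (the 500‑token chunk size is an added assumption):

```python
CONTEXT_WINDOW = 8_000   # total LLM context window (tokens)
FIXED_COST = 2_000       # system prompt + dialogue history
CHUNK_SIZE = 500         # assumed tokens per chunk

recall_budget = CONTEXT_WINDOW - FIXED_COST   # 6,000 tokens for retrieved material
max_chunks = recall_budget // CHUNK_SIZE      # -> at most 12 chunks online

print(f"{recall_budget} tokens of recall budget -> {max_chunks} chunks of {CHUNK_SIZE} tokens")
```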

2. Index‑Query Contract

Define an explicit contract during design that lists required index fields (e.g., doc_year, doc_source). The query‑understanding module must generate queries that conform to this contract, ensuring alignment between supply and demand.
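
A minimal runtime guard for the contract; the field list extends the article's example with doc_type, and the validator itself is an assumption:

```python
INDEX_CONTRACT = {"doc_year": int, "doc_source": str, "doc_type": str}

def validate_query(query: dict) -> dict:
    """Reject any generated query that the indexes cannot actually satisfy."""
    for fld, value in query.get("filter", {}).items():
        if fld not in INDEX_CONTRACT:
            raise ValueError(f"'{fld}' is not in the index contract")
        if not isinstance(value, INDEX_CONTRACT[fld]):
            raise TypeError(f"'{fld}' expects {INDEX_CONTRACT[fld].__name__}")
    return query

validate_query({"filter": {"doc_year": 2023, "doc_source": "JD"}})  # passes
```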

3. Decouple Intent Recognition from Result Fusion

Treat the query‑understanding module as an “advisor” that provides intent suggestions, while the fusion layer acts as the “decision maker”, aggregating signals from multiple recall paths and the advisor’s weights to produce a robust final ranking.
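
One way to realize this split is weighted reciprocal-rank fusion, where the advisor's intent only adjusts per-route weights and never disables a route; the weighting scheme here is an illustrative assumption:

```python
def fuse(ranked_by_route: dict[str, list[str]],
         advisor_weights: dict[str, float],
         k: int = 60) -> list[str]:
    """Weighted reciprocal-rank fusion: advice adjusts weights, never gates."""
    scores: dict[str, float] = {}
    for route, ranking in ranked_by_route.items():
        weight = advisor_weights.get(route, 1.0)  # a suggestion, not a switch
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Even when the advisor judges the query "keyword-like", the vector route
# still contributes; its weight is merely lower.
print(fuse({"vector": ["c1", "c2"], "keyword": ["c2", "c3"]},
           advisor_weights={"vector": 0.5, "keyword": 1.5}))
```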

4. Push Structured Context to the LLM

Move part of the information‑organization work from the LLM to the recall stage, delivering a semi‑structured context draft that reduces the LLM’s cognitive load.
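
A sketch of such a semi-structured draft, grouping recalled chunks by source document and labeling each snippet; the exact layout is an assumption:

```python
def build_context_draft(chunks: list[dict]) -> str:
    """Group recalled chunks by source document into a labeled draft."""
    by_doc: dict[str, list[dict]] = {}
    for chunk in chunks:
        by_doc.setdefault(chunk["doc_title"], []).append(chunk)
    sections = []
    for title, group in by_doc.items():
        lines = [f"Source: {title}"]
        lines += [f"  [{c['chunk_id']}] {c['text']}" for c in group]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)

print(build_context_draft([
    {"doc_title": "2023 JD report", "chunk_id": "d1-3", "text": "Revenue grew 8%."},
    {"doc_title": "2023 JD report", "chunk_id": "d1-7", "text": "Margins held steady."},
]))
```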

5. Use the Knowledge Graph for Fact‑Checking

Leverage the deterministic knowledge graph built during indexing to verify the LLM’s non‑deterministic outputs, preventing hallucinations by enforcing factual boundaries.
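
A hypothetical verification pass: extract (head, relation, tail) claims from the draft answer and confirm each against the graph. The extract_claims callable and the Cypher schema are assumptions:

```python
def fact_check(answer: str, graph_session, extract_claims) -> list[tuple]:
    """Return the claims in the answer that the knowledge graph cannot confirm."""
    unverified = []
    for head, relation, tail in extract_claims(answer):
        record = graph_session.run(
            "MATCH (a {name: $h})-[r {type: $r}]->(b {name: $t}) "
            "RETURN count(r) AS n",
            h=head, r=relation, t=tail,
        ).single()
        if record["n"] == 0:
            unverified.append((head, relation, tail))  # flag for regeneration
    return unverified
```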

6. Global Dialogue State Management

Maintain a lightweight, structured dialogue‑state object (separate from the LLM’s short‑term memory) that records the current entity, focus, and referenced knowledge. The context‑generation engine updates this state, and the query‑understanding engine reads it to enhance subsequent queries.
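
A minimal sketch of such a state object; the field names and the reference-resolution heuristic are assumptions based on the description above:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    current_entity: str | None = None    # e.g., the product under discussion
    current_focus: str | None = None     # e.g., "pricing" vs. "warranty"
    referenced_chunks: list[str] = field(default_factory=list)

    def enrich(self, query: str) -> str:
        """Query-understanding side: resolve bare references like 'it'."""
        if self.current_entity and {"it", "its"} & set(query.lower().split()):
            return f"{query} (refers to: {self.current_entity})"
        return query

# The context-generation engine updates the state after each turn:
state = DialogueState(current_entity="Model X laptop")
state.referenced_chunks.append("doc42-chunk3")
print(state.enrich("How much does it cost"))
```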

Interview‑Ready Narrative

When asked to design an advanced RAG system for 5,000 documents, the candidate should first outline the overall challenge and architecture, then explain how the six design principles permeate each of the four modules, and finally demonstrate how the system balances precision, latency, and completeness under resource constraints.

Common Pitfalls and How to Avoid Them

Listing technologies without top‑level resource planning – instead, compute the LLM’s token budget and allocate it to chunk size and recall count.

Treating index and query as isolated – enforce an index‑query contract to guarantee alignment.

Relying on a single intent path – use a decision‑layer that can fall back on other signals when intent detection fails.

Using the strongest models everywhere – balance cost and latency by selecting performant yet affordable models (e.g., BGE‑M3 for embeddings, lightweight LLMs for batch processing, Cross‑Encoder only for complex queries).

Conclusion

Designing an advanced RAG system requires moving from a “module‑assembly” mindset to a “principle‑driven integration” mindset. By adhering to the four core engines, the six design principles, and the detailed processes described above, candidates can showcase deep architectural insight and avoid the common traps that cause interview failures.
