RAG Series Recap: From Chunking to Prompt – A Complete Technical Roadmap

This article systematically reviews the nine‑stage RAG pipeline—from data cleaning and text chunking through embedding, vector indexing, retrieval, reranking, and finally prompt assembly—highlighting core concepts, practical code snippets, common pitfalls, and optimization tips for building production‑grade systems.

AI Architect Hub

Series Overview: Full RAG Process Map

The nine stages form a complete RAG workflow: data cleaning → text chunking → embedding → index building → retrieval and rerank (stages 5 to 8) → prompt assembly. The ASCII diagram below visualizes the chain.

┌────────────────────────────────────────────────────────────────────────────────────┐
│                                 Full RAG Pipeline                                  │
├────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                    │
│  Data source ─► Stage 1 ─► Stage 2 ─► Stage 3 ─► Stage 4 ─► Stages 5-8 ─► Stage 9  │
│                 Data       Text       Embedding  Index      Retrieval +   Prompt   │
│                 cleaning   chunking              building   rerank        assembly │
│                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────┘

Why This Order?

Many tutorials start with high‑level concepts (Transformer architecture, BERT attention), which leaves students hitting the same pitfalls once they start building projects. RAG is a system‑level engineering problem: a failure in any link degrades the whole pipeline. The series therefore starts with the most easily overlooked component and progresses layer by layer to the most complex.

Stage 1‑2 (Input side) : Data quality determines the system ceiling.

Stage 3‑4 (Storage side) : Vectorization and indexing are the retrieval foundation.

Stage 5‑8 (Query side) : Retrieval strategy and rerank decide whether the right document is found.

Stage 9 (Output side) : Prompt assembly determines the final answer quality.

Core Technical Points per Stage

Stage 1 – Data Cleaning: "Garbage In, Garbage Out"

Key knowledge :

Four categories of dirty data: format (headers/footers, watermarks), structure (HTML tags), content (garbled or duplicate text), business (ads, navigation).

Progressive cleaning pipeline: remove HTML → clean wiki elements → normalize blanks → handle special characters → delete template text.

Encoding traps: mixed UTF‑8/GBK, BOM markers, zero‑width spaces can corrupt the whole system.

Pitfall example (Wiki export cleaning): navigation buttons, edit history, and copyright info mix into the main text. Solution uses a regex + BeautifulSoup:

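# Real-world case: cleaning a Wiki export
# Problem: navigation buttons, edit history, and copyright notices leak into the body text
# Fix: regex matching plus BeautifulSoup parsing
# The pattern matches the Wiki action bar "编辑 | 删除 | 讨论 | 历史" (Edit | Delete | Discuss | History)
pattern = r'编辑\s*\|\s*删除\s*\|\s*讨论\s*\|\s*历史'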
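
The progressive pipeline described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the series' DocumentCleaner: the names clean_text, TAG_RE, and ZERO_WIDTH are invented here, and a plain regex stands in for BeautifulSoup when stripping tags.

```python
import html
import re
import unicodedata

# Zero-width characters and the BOM, mapped to None so str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper (assumption: stand-in for BeautifulSoup)
WIKI_NAV_RE = re.compile(r"编辑\s*\|\s*删除\s*\|\s*讨论\s*\|\s*历史")


def clean_text(raw: str) -> str:
    """Apply the cleaning steps in order: HTML, wiki elements, characters, blanks."""
    text = TAG_RE.sub(" ", raw)                 # 1. remove HTML tags
    text = html.unescape(text)                  #    and decode entities such as &amp;
    text = WIKI_NAV_RE.sub(" ", text)           # 2. drop the wiki action bar
    text = unicodedata.normalize("NFC", text)   # 3. canonical Unicode form
    text = text.translate(ZERO_WIDTH)           #    strip BOM and zero-width characters
    text = re.sub(r"[ \t\u3000]+", " ", text)   # 4. collapse spaces, tabs, ideographic spaces
    text = re.sub(r"\s*\n\s*", "\n", text)      #    collapse blank lines
    return text.strip()


raw = "\ufeff<p>退款\u200b政策</p>\n编辑 | 删除 | 讨论 | 历史\n<div>七天内  可退</div>"
cleaned = clean_text(raw)
```

Tags are stripped before entities are decoded so that an escaped "&lt;p&gt;" in the body text is not mistaken for markup.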

Personal insight : Many teams spend weeks tuning embeddings and prompts, only to discover the root cause was poor data. Data cleaning is mandatory, not optional.

Stage 2 – Text Chunking: "If you cut poorly, semantics break"

Chunking must be adapted to the domain; there is no universal best practice.

Chunking strategies :

Fixed length – fast prototyping.

Recursive character – works for generic text.

Document‑structure aware – for Markdown/HTML.

Code‑syntax aware – for source files.

Core parameters :

chunk_size: 500‑700 characters for Chinese; too large loses precision, too small loses context.

chunk_overlap: 50‑100 characters to keep semantic continuity across chunks.

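Chinese sentence‑ending punctuation (。, !, ?, plus the ASCII forms ! and ?) should be treated as sentence delimiters.

Pitfall example : Using RecursiveCharacterTextSplitter with the separators \n\n, \n, 。, !, ? keeps sentences intact and fixes cases where "生成能力" ("generation capability") was split into "生成能" + "力".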

Personal insight : When a legal contract was processed with fixed 500‑character chunks, the clause "违约金上限30%" ("penalty capped at 30%") was separated from "违约金计算公式" ("penalty calculation formula"), so retrieval could not return the complete answer.
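
To make the chunk_size and chunk_overlap trade-off concrete, here is a minimal sentence-aware chunker. It is a standard-library sketch, not LangChain's RecursiveCharacterTextSplitter: the names split_sentences and chunk_text are assumptions, and overlap is carried as whole sentences so that a word is never cut in half.

```python
import re
from typing import List

# A sentence is a run of non-delimiter characters plus an optional closing delimiter.
SENTENCE_RE = re.compile(r"[^。!?!?\n]+[。!?!?]?")


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 80) -> List[str]:
    """Pack whole sentences into chunks of roughly chunk_size characters."""
    chunks: List[str] = []
    current: List[str] = []  # sentences of the chunk being built
    for sentence in split_sentences(text):
        length = sum(len(s) for s in current)
        if current and length + len(sentence) > chunk_size:
            chunks.append("".join(current))
            # Carry trailing sentences forward while they fit in the overlap budget.
            tail: List[str] = []
            tail_len = 0
            for s in reversed(current):
                if tail_len + len(s) > chunk_overlap:
                    break
                tail.insert(0, s)
                tail_len += len(s)
            current = tail
        current.append(sentence)
    if current:
        chunks.append("".join(current))
    return chunks


text = "检索增强生成能力很强。它先检索文档!然后再生成答案?最后附上引用。"
chunks = chunk_text(text, chunk_size=20, chunk_overlap=8)
```

A sentence longer than chunk_size is kept whole in this sketch; a production splitter would fall back to a finer separator for such cases.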

Stage 3 – Embedding: "Turning Text into Numbers"

Evolution: One‑Hot → Word2Vec → BERT (context‑aware).

Sentence‑level embedding: mean pooling is often more stable than the raw CLS token, but use the pooling a model was trained with (BGE models, for example, use the CLS token).

Similarity metric: Cosine similarity focuses on direction, ignoring vector length.
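
Mean pooling and cosine similarity are small enough to write out by hand. The sketch below uses plain Python lists in place of real model outputs; the function names and the toy vectors are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average token vectors, skipping padding positions where the mask is 0."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; vector length cancels out."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The third token is padding and must not influence the sentence vector.
sentence_vec = mean_pool([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]], [1, 1, 0])
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Scaling a vector leaves its cosine similarity unchanged, which is why [1, 2] and [2, 4] count as identical in direction.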

Model comparison (Chinese MTEB scores) :

BGE‑large‑zh – 63.2 score, 40 ms latency – preferred when accuracy on Chinese‑only text is the priority.

BGE‑M3 – 64.8 score, 50 ms latency – good for multilingual and long texts.

m3e‑base – 60.5 score, 35 ms latency – suitable for latency‑sensitive scenarios.

Pitfalls :

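Higher dimensions are not always better (the 3072‑dim OpenAI model underperformed 1024‑dim BGE on the Chinese benchmark above).

Vector dimension must match the index: a Chroma collection fixes its dimension from the first vectors added, so inserting 1024‑dim BGE vectors into a collection built with 1536‑dim OpenAI vectors raises a dimension error.

Batch size 16‑32 balances speed and memory; larger batches can cause OOM on memory‑constrained GPUs.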
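
Two of these pitfalls can be guarded against with a few lines of code. The helpers below are a hypothetical sketch (batched and check_dimension are names invented here, not part of any embedding library): one feeds texts to an encoder in fixed-size batches, the other fails early when a vector does not match the index dimension.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def check_dimension(vectors: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise before insertion if any vector has the wrong dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dims, index expects {expected_dim}"
            )


batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
check_dimension([[0.0] * 1024], expected_dim=1024)  # passes silently
```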

Personal insight : OpenAI embeddings are popular, but on Chinese tasks the BGE series consistently outperforms them. Start with open‑source models as a baseline.

Stage 4 – Vector Indexing: "Store it right, find it fast"

Without an index, every query against a 100 k‑vector collection requires a linear scan that compares the query with all 100 k vectors.

Index types :

Flat – brute‑force, <10 k vectors, 100 % recall.

IVF – inverted file clustering, suitable for >100 k vectors, 90‑99 % recall.

HNSW – hierarchical graph, 10 k‑5 M vectors, 95‑99 % recall.

Key parameters :

HNSW: M (typically 16 or 32) controls the number of edges per node; larger M gives higher recall at the cost of more memory and slower builds and queries.

HNSW: efConstruction =200 for index building, efSearch ≥64 for production queries.

IVF: nlist =√(vector count) for bucket count; nprobe = 1‑10 % of nlist for query buckets.
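
The IVF rule of thumb above is simple arithmetic, sketched here as a helper. The function name suggest_ivf_params and the 5 % default are assumptions chosen from within the 1‑10 % range; treat the output as a starting point for tuning, not a guarantee of recall.

```python
import math
from typing import Dict


def suggest_ivf_params(n_vectors: int, probe_fraction: float = 0.05) -> Dict[str, int]:
    """Starting values: nlist = sqrt(N), nprobe = a fraction of nlist."""
    if n_vectors < 1:
        raise ValueError("n_vectors must be positive")
    if not 0.0 < probe_fraction <= 1.0:
        raise ValueError("probe_fraction must be in (0, 1]")
    nlist = max(1, round(math.sqrt(n_vectors)))
    nprobe = max(1, round(nlist * probe_fraction))
    return {"nlist": nlist, "nprobe": nprobe}


params = suggest_ivf_params(1_000_000)
```

For one million vectors this gives 1000 buckets with 50 probed per query, so each search scans roughly 5 % of the collection.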

Pitfalls :

IVF must be trained first: index.train(vectors) cannot be omitted.

HNSW default efSearch = 16 (as in FAISS) can yield only ~85 % recall; increase it to ≥64 and measure the result.

Choosing the wrong index type (Flat for 500 k vectors) leads to severe latency.

Personal insight : Teams often keep the default HNSW parameters, resulting in 85 % recall. Proper tuning can push recall to 99 %.
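
Recall figures such as 85 % or 99 % only mean something when measured against an exact baseline. The sketch below shows one way to do that: a brute-force (Flat) search provides ground truth, and recall_at_k compares an approximate result against it. Both function names are assumptions, and the approximate IDs are hard-coded to stand in for an HNSW or IVF result.

```python
from typing import List, Sequence


def exact_top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Brute-force inner-product search; returns the indices of the k best vectors."""
    scored = [
        (sum(q * v for q, v in zip(query, vector)), index)
        for index, vector in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [index for _, index in scored[:k]]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the true top-k that the approximate index returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)
approx = [0, 3]  # hypothetical output of an under-tuned approximate index
recall = recall_at_k(approx, truth)
```

Averaging this value over a few hundred sample queries before and after changing efSearch or nprobe shows whether the tuning actually helped.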

Stage 5 – Semantic Retrieval: "Let AI understand the user"

Semantic retrieval vs. keyword retrieval: the former captures intent, the latter merely matches tokens.

Three query‑rewriting strategies: synonym expansion, HyDE (hypothetical document generation), query decomposition.

Pre‑processing pipeline: text cleaning → intent detection → key‑info extraction → rewrite/expand.

Pitfall table (issue → solution):

Query too short (insufficient information) → expand the query.

Ambiguity (e.g., "Apple") → add intent detection and context.

Colloquial noise (e.g., "就是那个啥我想退一下", roughly "so, um, that thing, I want to return it") → extract the key information.

Negation (e.g., "不要苹果手机", "not an Apple phone") → detect negation words.
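
A rule-based first pass can already handle three of these pitfalls. The sketch below is illustrative only: the FILLERS and NEGATIONS word lists, the 6-character threshold, and the function name preprocess_query are all assumptions, and a production system would more likely use a classifier or an LLM for intent detection.

```python
import re
from typing import Dict, List, Union

# Hypothetical word lists; longer fillers come first so they are removed whole.
FILLERS = ("就是", "那个啥", "那个", "嗯", "啊")
NEGATIONS = ("不要", "不想", "除了")
MIN_QUERY_LENGTH = 6  # assumption: shorter queries are flagged for expansion


def preprocess_query(query: str) -> Dict[str, Union[str, List[str], bool]]:
    """Strip colloquial noise, collect negated terms, and flag short queries."""
    cleaned = query.strip()
    for filler in FILLERS:
        cleaned = cleaned.replace(filler, "")
    negated: List[str] = []
    for word in NEGATIONS:
        # Capture the term that directly follows the negation word.
        for match in re.finditer(re.escape(word) + r"([^\s,,。!?]+)", cleaned):
            negated.append(match.group(1))
    return {
        "cleaned": cleaned,
        "negated_terms": negated,
        "needs_expansion": len(cleaned) < MIN_QUERY_LENGTH,
    }


noisy = preprocess_query("就是那个啥我想退一下")
negation = preprocess_query("不要苹果手机")
```

The negated terms can then be passed to the retriever as an exclusion filter instead of being embedded along with the rest of the query.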

Personal insight : Many retrieval failures stem from poor query handling, not from the embedding model.

Stage 6 – Vector Store Selection: "Choosing the wrong store costs dearly"

Chroma : lightweight, zero‑config, LangChain native; not suitable for >500 k vectors (memory explosion).

Milvus : industrial‑grade, distributed, cloud‑native; higher deployment complexity; fits >1 M vectors.

Qdrant : Rust‑based, <10 ms latency; smaller community; good for performance‑sensitive, medium‑scale workloads.

Weaviate : GraphQL API, hybrid search; resource‑heavy; ideal for rapid prototyping with mixed retrieval.

Pitfalls :

Chroma cannot handle production‑scale data; memory blows up after 500 k vectors.

Choosing a store with a smaller community (e.g., the Rust‑based Qdrant) can make debugging harder because fewer answers and examples are available.

Failing to plan for scalability forces a costly migration (e.g., moving from Chroma to Milvus).

Personal insight : A team that built a 500 k‑vector Chroma store had to rewrite the entire storage layer when the dataset grew to 5 M vectors.

Stage 7 – Retrieval Strategies: "Finding a needle in a haystack"

Top‑K : retrieve the K most similar vectors; fast, and works well when relevant and irrelevant documents are clearly separated by score.

MMR : balances relevance and diversity; essential when many documents are semantically similar.

Hybrid search : combines vector similarity with keyword matching for precise term coverage.
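
MMR (Maximal Marginal Relevance) is easiest to understand as a greedy loop: at each step, pick the candidate with the best relevance minus a penalty for resembling what is already selected. The following is a minimal sketch with toy 2‑dimensional vectors; the names mmr_select and lambda_mult are assumptions, with lambda_mult = 1.0 reducing to plain Top‑K.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedy MMR: trade relevance to the query against similarity to earlier picks."""
    relevance = [cosine(query_vec, doc) for doc in doc_vecs]
    selected: List[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.11], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
plain_top_k = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With lambda_mult = 0.5 the near-duplicate document is skipped in favor of the less similar one, which is the behavior wanted when many chunks say almost the same thing.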

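RRF fusion formula (Reciprocal Rank Fusion): RRF_score = Σ 1/(k + rank_i), where rank_i is the document's rank in the i‑th result list and k is a smoothing constant (commonly 60).

Documents ranked high by multiple systems receive higher scores.

Pitfalls :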

Too many candidates increase latency; balance K and threshold.

Not every scenario needs rerank; simple vector search suffices for straightforward queries.

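Pick a reranker that covers the same language and domain as the embedding model (e.g., BGE‑Reranker alongside BGE embeddings for Chinese); a cross‑encoder reads raw text, so the two models need not share a vector space.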

Personal insight : Tuning MMR's relevance/diversity weight dramatically improves answer diversity compared with plain Top‑K.
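
Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it combines vector and keyword results so easily. The sketch below assumes each retriever returns an ordered list of document IDs; the function name rrf_fuse and the sample IDs are illustrative, and k = 60 is the commonly used constant.

```python
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Sum 1 / (k + rank) over all result lists and sort by the fused score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
keyword_hits = ["c", "a", "d"]  # hypothetical keyword-search ranking
fused = rrf_fuse([vector_hits, keyword_hits])
```

Documents "a" and "c" appear in both lists, so they outrank "b" and "d", which each appear in only one.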

Stage 8 – Rerank: "Coarse retrieval then fine‑grained ranking"

Bi‑Encoder : encodes query and document separately; offline computation, fast online.

Cross‑Encoder : encodes query‑document pair together; slower but more accurate.

Popular rerank models: BGE‑Reranker and Cohere Rerank (cross‑encoders), ColBERT (late interaction).

Two‑stage pipeline: vector search (top 100) → rerank (top 10) → LLM generation.
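
The two-stage shape is independent of the models plugged into it. The sketch below passes the scorers in as functions; word_overlap and phrase_match are toy stand-ins (assumptions) for a bi-encoder similarity and a cross-encoder score, chosen only so that the example runs without any model download.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    documents: Sequence[str],
    coarse_score: Scorer,
    fine_score: Scorer,
    recall_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Cheap scorer over everything, expensive scorer over the survivors only."""
    coarse = sorted(documents, key=lambda d: coarse_score(query, d), reverse=True)[:recall_k]
    return sorted(coarse, key=lambda d: fine_score(query, d), reverse=True)[:final_k]


def word_overlap(query: str, document: str) -> float:
    """Toy coarse scorer: number of shared words."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def phrase_match(query: str, document: str) -> float:
    """Toy fine scorer: 1.0 when the exact phrase occurs in the document."""
    return 1.0 if query.lower() in document.lower() else 0.0


docs = [
    "refund phone accessories and phone cases",
    "how to get a phone refund",
    "shipping policy",
    "weather today",
]
top = two_stage_search("phone refund", docs, word_overlap, phrase_match, recall_k=3, final_k=2)
```

The coarse stage cannot tell the first two documents apart (both share two words with the query), while the fine stage promotes the one that actually contains the phrase.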

Pitfalls :

Rerank adds latency; need to trade off precision vs. speed.

Simple queries may skip rerank entirely.

The rerank model should cover the same language and domain as the embedding model; sharing a model family is convenient but not required.

Personal insight : Rerank acts as "precision guidance"—it narrows 100 candidates to the 10 most relevant, a standard practice for industrial systems.

Stage 9 – Prompt Assembly: "The final mile that decides success"

Even with perfect retrieval, a malformed prompt yields irrelevant answers.

Four essential elements :

System prompt – role definition, behavior rules, output format.

Context – retrieved documents, source citations, reference IDs.

User question – original query and explicit requirements.

Output format – structured response with citation markers.

┌────────────────────────────────────────┐
│ 1️⃣ System prompt: role, rules, format │
│ 2️⃣ Context: docs, source, citation   │
│ 3️⃣ User question: original query      │
│ 4️⃣ Output format: structured, cited   │
└────────────────────────────────────────┘

Context management strategies (by token budget):

Simple fact lookup – direct concatenation, ≤ 1000 tokens.

Comparative analysis – split into chunks, each ≤ 500 tokens.

Complex analysis – summarize first, then attach detailed appendix; allocate tokens as needed.
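
The four elements and the token budget can be combined in a single assembly function. This is a sketch under stated assumptions, not the series' PromptAssembler: the SYSTEM_PROMPT wording, the doc dictionary keys ("source", "text"), and the function name are invented here, and count_tokens defaults to len (character count) as a stand-in for a real tokenizer.

```python
from typing import Callable, Dict, List, Sequence

SYSTEM_PROMPT = (
    "You are a careful assistant. Answer only from the numbered context. "
    "Cite sources as [n]. If the context does not contain the answer, say so."
)


def assemble_prompt(
    question: str,
    docs: Sequence[Dict[str, str]],
    max_context_tokens: int = 1000,
    count_tokens: Callable[[str], int] = len,  # assumption: swap in a real tokenizer
) -> str:
    """Build system prompt + numbered context + question + output format."""
    context_lines: List[str] = []
    used = 0
    for number, doc in enumerate(docs, start=1):
        entry = f"[{number}] (source: {doc['source']}) {doc['text']}"
        cost = count_tokens(entry)
        if used + cost > max_context_tokens:
            break  # budget exhausted: drop this and all lower-ranked documents
        context_lines.append(entry)
        used += cost
    sections = [
        SYSTEM_PROMPT,
        "Context:\n" + "\n".join(context_lines),
        f"Question: {question}",
        "Answer format: short answer first, then the supporting citations.",
    ]
    return "\n\n".join(sections)


docs = [
    {"source": "faq.md", "text": "Refunds are accepted within 7 days."},
    {"source": "policy.md", "text": "x" * 500},
]
prompt = assemble_prompt("How long is the refund window?", docs, max_context_tokens=100)
```

Because documents arrive in ranked order, truncating from the tail keeps the most relevant context when the budget is tight.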

Hallucination detection (three‑layer check) :

Explicit citation check – does the answer reference the provided context?

Implicit consistency check – does the claim contradict the context?

Confidence assessment – is the uncertainty reasonable?
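
Of the three layers, the explicit citation check is the only one that can be done with plain string processing. The sketch below covers that first layer only; the function name check_citations and the [n] citation format are assumptions, and the sentence splitter is deliberately naive (it would split on decimals such as 3.5). The consistency and confidence layers typically need an NLI model or an LLM judge.

```python
import re
from typing import Dict, List, Union

CITATION_RE = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, num_sources: int) -> Dict[str, Union[List[int], List[str]]]:
    """Find citations that point at no source and sentences that cite nothing."""
    cited = {int(number) for number in CITATION_RE.findall(answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Split after sentence-ending punctuation (ASCII and the Chinese full stop).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。])\s*", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION_RE.search(s)]
    return {"invalid_citations": invalid, "uncited_sentences": uncited}


answer = "Refunds are allowed within 7 days [1]. Shipping is free [3]. Returns are easy."
report = check_citations(answer, num_sources=2)
```

Here the answer cites a third source although only two were supplied, and its last sentence carries no citation at all; both are signals worth routing to the deeper checks.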

Pitfalls :

Prompt exceeds model window – enforce token budget, truncate dynamically.

Context conflict – when multiple documents disagree, explicitly label the chosen source.

Inconsistent citation format – use uniform numbering and guide the LLM to follow it.

Personal insight : Prompt design can swing performance dramatically; the same retrieved set can produce a perfect answer or a nonsensical one depending on assembly.

Codebase Summary

The nine‑stage code layout (each file corresponds to a stage) provides reusable components:

rag_system/
├── 第1关_document_cleaner.py      # Data cleaning pipeline
├── 第2关_text_splitter_demo.py   # Various chunking strategies
├── 第3关_embedding_comparison.py # Embedding model benchmark
├── 第4关_vector_index_builder.py # Index construction & tuning
├── 第5关_semantic_query.py       # Query preprocessing
├── 第6关_vector_store_compare.py # Vector store comparison
├── 第7关_retrieval_strategies.py # Retrieval implementations
├── 第8关_rerank_pipeline.py      # Rerank workflow
└── 第9关_prompt_assembler.py     # Prompt construction

Key reusable components:

DocumentCleaner – progressive cleaning with custom rules.

AdaptiveTextSplitter – selects chunking strategy based on document type.

HybridRetriever – combines vector, keyword, and RRF fusion.

CitationTracker – manages source IDs and displays them.

HallucinationDetector – evaluates answer confidence.

A minimal pipeline can run end‑to‑end in six calls (query is the user question and llm is an already configured LLM client):

# Run RAG end to end in six calls
docs = DocumentLoader.load("知识库")  # path to the knowledge-base directory
chunks = AdaptiveTextSplitter.split(docs)
vectors = BGEEmbedder.encode(chunks)
index = VectorIndexBuilder.build(vectors, type="hnsw")
results = index.search(query, top_k=5)
answer = PromptAssembler.assemble(results, query, llm)

Learning Paths

Beginner (0 foundation) : Run the full pipeline with provided code, then dive into each stage.

Intermediate (some background) : Fill gaps, focus on performance tuning (index, store, rerank).

Expert (senior) : Build evaluation metrics, track state‑of‑the‑art (GraphRAG, Agentic RAG), share knowledge.

Future Outlook

GraphRAG : Adds knowledge‑graph reasoning for multi‑hop questions.

Agentic RAG : Gives the system autonomous decision‑making (when to retrieve, how many iterations).

End‑to‑end joint training : Retrieval learns what documents help generation; generator learns to leverage retrieved info.

Multimodal RAG : Extends beyond text to images, video, and tables.

Conclusion

Completing all nine stages equips you with a production‑grade RAG system. Remember: "Garbage in, garbage out" and "Prompt assembly matters" are the two immutable laws. Continuous iteration and systematic optimization are the only path to reliable AI‑augmented applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
