Boosting RAG Performance with Milvus: Chunking, Hybrid Search, and Rerank Best Practices
This article analyzes why Retrieval‑Augmented Generation often underperforms, then walks through concrete engineering steps—optimal chunking, overlap settings, hybrid vector + BM25 retrieval, RRF fusion, and reranking—while providing code snippets, parameter tables, and a full pipeline diagram to turn a usable RAG system into a high‑quality one.
1. Why RAG Performance Lags
Many users spend time tweaking prompts only to see the large language model answer incorrectly; the root cause lies in the retrieval stage. The RAG flow can be visualized as:
User question
    │
    ▼
[ Retrieval layer ]
    │
    ├── Poor chunk quality → fragmented context → irrelevant chunks recalled
    ├── Vector-only retrieval → low keyword hit rate
    ├── No rerank → relevant chunks ranked low get cut off
    │
    ▼
[ Generation layer ]
    │
    └── Receives fragmented / irrelevant context → skewed answers / hallucinations
Three core problems lead to three corresponding solutions, covered in the following sections.
2. Chunking Strategy: The Foundation
Chunking (splitting a long document into smaller pieces) directly determines the upper bound of recall. A counterintuitive fact worth highlighting:
A chunk that is too large does not mean more information, and a chunk that is too small does not mean more accurate retrieval.
Chunk too large (1000+ tokens):
┌──────────────────────────────────────┐
│ irrelevant noise │ relevant content │ irrelevant noise │
└──────────────────────────────────────┘
    ↓ the embedding gets diluted, relevance drops
Chunk too small (50 tokens):
┌────┐┌────┐┌────┐
│ A1 ││ A2 ││ A3 │ a complete semantic unit is cut apart
└────┘└────┘└────┘
    ↓ context is missing, the LLM cannot make sense of it
Reasonable chunk (200-500 tokens + 20% overlap):
┌──────────────┐
│ complete semantic unit │ 20% overlap keeps boundary sentences intact
└──────────────┘
    ↓ precise recall, complete context
The most widely used engineering approach is the Parent-Child chunking pattern, which stores small child chunks for precise retrieval while keeping larger parent chunks to provide full context.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { InMemoryStore } from "@langchain/core/stores";
import { OpenAIEmbeddings } from "@langchain/openai"; // embedding model (any Embeddings implementation works)
// Parent chunk: large block, ensures complete context (2000 tokens)
const parentSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 2000,
chunkOverlap: 200,
});
// Child chunk: small block for precise retrieval (200 tokens)
const childSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 200,
chunkOverlap: 20,
});
// Store parent chunks (docstore) – can be in‑memory or Redis/MongoDB
const docstore = new InMemoryStore();
// Embedding model used to vectorize the child chunks
const embeddings = new OpenAIEmbeddings();
// Vectorize child chunks and load into an (initially empty) vector store
const vectorStore = await MemoryVectorStore.fromDocuments([], embeddings);
// Parent‑Child retriever: retrieve child chunks, automatically return the matching parent chunk
const retriever = new ParentDocumentRetriever({
vectorstore: vectorStore,
docstore: docstore,
parentSplitter: parentSplitter,
childSplitter: childSplitter,
});
await retriever.addDocuments(docs); // docs: your previously loaded Document[]
const results = await retriever.invoke("your question");
console.log(results[0].pageContent); // returns the full parent chunk
Four mainstream chunking strategies are compared below:
Fixed‑size – generic text; simple and fast; may cut semantic boundaries.
Recursive (character) – structured documents; respects natural boundaries; requires parameter tuning.
Semantic – knowledge‑dense documents; preserves semantics; computationally expensive (see the sketch after this list).
Parent‑Child – long documents with precise retrieval; combines accurate recall with full context; implementation is more complex.
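Of these four, semantic chunking is the least standardized. A minimal sketch of the underlying idea, embedding sentences and starting a new chunk when adjacent-sentence similarity drops, is shown below; the 0.75 threshold, the sentence-splitting regex, and the use of OpenAIEmbeddings are illustrative assumptions, not a fixed recipe.
import { OpenAIEmbeddings } from "@langchain/openai";
const embeddings = new OpenAIEmbeddings();
// Cosine similarity between two embedding vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Semantic chunking sketch: close the current chunk whenever similarity drops below the threshold
async function semanticChunk(text: string, threshold = 0.75): Promise<string[]> {
  const sentences = text.split(/(?<=[。!?.!?])\s*/).filter((s) => s.trim().length > 0);
  if (sentences.length === 0) return [];
  const vectors = await embeddings.embedDocuments(sentences);
  const chunks: string[] = [];
  let current = sentences[0];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(vectors[i - 1], vectors[i]) < threshold) {
      chunks.push(current); // semantic break → start a new chunk
      current = sentences[i];
    } else {
      current += sentences[i];
    }
  }
  chunks.push(current);
  return chunks;
}
In practice you would batch the embedding calls and cap chunk length, but the structure of the loop is the whole trick.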
3. Overlap Settings: Keeping Boundaries Intact
Many overlook the overlap (sliding window) parameter, yet it is the lifeline for sentences that span across chunks.
Without overlap:
Chunk 1: […Apple released the new MacBook Pro this year, powered by the M3 chip]
Chunk 2: […performance improves 40% over the previous generation, battery life extends to 22 hours…]
    ↑ "40% performance improvement" is cut off from its M3-chip context!
With overlap = 100:
Chunk 1: […Apple released the new MacBook Pro this year, powered by the M3 chip]
Chunk 2: [powered by the M3 chip, performance improves 40% over the previous generation, battery life extends to 22 hours…]
    ↑ the overlapping span carries the context across the boundary
Practical recommendations:
Chinese prose: overlap = chunkSize × 10%-15%
Technical documentation (including code): overlap = chunkSize × 20%
Pure code blocks: split at function or class boundaries instead of hard cuts.
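As a small illustration of those ratios (the helper name is made up for this sketch; note that RecursiveCharacterTextSplitter-style splitters count characters unless you supply a token-based length function):
// Derive overlap from chunk size using the ratios recommended above
type ContentKind = "chinese_prose" | "technical_doc";
function overlapFor(chunkSize: number, kind: ContentKind): number {
  const ratio = kind === "technical_doc" ? 0.2 : 0.15; // 20% for technical docs, 10-15% for Chinese prose
  return Math.round(chunkSize * ratio);
}
overlapFor(500, "chinese_prose"); // → 75
overlapFor(500, "technical_doc"); // → 100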
4. Hybrid Retrieval: Vector + BM25
Pure vector search struggles with exact keyword matching. For a query like “LangChain v0.3 breaking change”, vector similarity finds semantically related documents but misses the precise version token, whereas BM25 captures the exact term.
Pure vector retrieval:
Query: "LangChain v0.3 breaking change"
→ computes semantic similarity in vector space
→ recalls "LangChain update notes" (correct)
  but misses documents explicitly labeled "v0.3" (missed recall)
Pure BM25 retrieval:
→ exact hit on the "v0.3" keyword (correct)
  but misses documents that are semantically close yet phrased differently
Hybrid Search:
Vector results ──┐
                 ├── RRF fusion ─→ final ranking
BM25 results  ──┘
    ↑ the two complement each other, maximizing recall
LangChain's EnsembleRetriever implements this fusion succinctly:
import { BM25Retriever } from "@langchain/community/retrievers/bm25";
import { EnsembleRetriever } from "langchain/retrievers/ensemble";
// Vector retriever (Milvus / Chroma, etc.)
const vectorRetriever = vectorStore.asRetriever({ k: 10 }); // take more for later rerank
// BM25 keyword retriever
const bm25Retriever = BM25Retriever.fromDocuments(docs, { k: 10 });
// Hybrid retriever with RRF fusion (weights can be tuned per scenario)
const ensembleRetriever = new EnsembleRetriever({
retrievers: [vectorRetriever, bm25Retriever],
weights: [0.6, 0.4], // default: vector 60%, BM25 40%
});
const results = await ensembleRetriever.invoke("LangChain v0.3 breaking change");
// Returns the merged ranking after RRF re‑scoring
The RRF (Reciprocal Rank Fusion) algorithm works as follows:
RRF(D) = 1/(k + r_vector) + 1/(k + r_bm25)    (k = 60, a smoothing constant; r = the document's rank in each list)
Example:
Vector rank = 2, BM25 rank = 1 → RRF = 1/62 + 1/61 ≈ 0.0325
Vector rank = 8, BM25 rank = 3 → RRF = 1/68 + 1/63 ≈ 0.0306
Documents that rank near the top of both lists obtain the highest RRF score and move to the front.
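For intuition, the fusion itself takes only a few lines of plain TypeScript; the document IDs and example lists below are illustrative.
// Reciprocal Rank Fusion over two ranked lists of document IDs (k = 60 smoothing constant)
function rrfFuse(vectorRanked: string[], bm25Ranked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  const addList = (ranked: string[]) => {
    ranked.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  };
  addList(vectorRanked);
  addList(bm25Ranked);
  // Highest fused score first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
// "A" (vector rank 2, BM25 rank 1) beats "B" (vector rank 1, BM25 rank 3)
rrfFuse(["B", "A", "C"], ["A", "D", "B"]); // → ["A", "B", ...]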
5. Rerank: The Second‑Stage Filter
Hybrid retrieval expands recall, but in practice only a handful of chunks (3‑5) should be passed into the LLM's prompt. Rerank models re‑score the retrieved documents by attending jointly over the query and each document, achieving far higher precision than raw vector distances.
The problem without rerank:
Vector retrieval Top‑5:
✅ Document A (most relevant, ranked 3rd)
❌ Document B (similar but irrelevant, ranked 1st)
❌ Document C (semantically close but cannot answer the question, ranked 2nd)
✅ Document D (relevant, ranked 4th)
❌ Document E (irrelevant, ranked 5th)
The LLM only receives the Top‑3 → B, C, A → the first two are noise!
With rerank:
Document A → 0.95 ← most relevant
Document D → 0.88 ← second most relevant
Document B → 0.23 ← dropped
Document C → 0.18 ← dropped
Document E → 0.09 ← dropped
The LLM finally receives the Top‑3 → A, D, … → all signal
Code example (Cohere Rerank or an open‑source BGE‑Reranker):
import { CohereRerank } from "@langchain/cohere";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
// Cohere Rerank (requires API key)
const reranker = new CohereRerank({
apiKey: process.env.COHERE_API_KEY,
model: "rerank-multilingual-v3.0",
topN: 3,
});
// Two‑step retrieval: first recall 20 docs, then rerank to keep 3
const compressionRetriever = new ContextualCompressionRetriever({
baseCompressor: reranker,
baseRetriever: ensembleRetriever, // hybrid retriever from previous step
});
const rerankedDocs = await compressionRetriever.invoke("your question");
// rerankedDocs now contains the top‑3 most relevant chunks for LLM generation
For a fully local setup, the BGE‑Reranker model can be wrapped behind an HTTP endpoint and plugged into the same compression interface.
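A minimal sketch of that local route, assuming a hypothetical service at http://localhost:8000/rerank that accepts a query plus a list of passages and returns one relevance score per passage (the endpoint and response shape are assumptions, not a fixed BGE‑Reranker API):
import type { Document } from "@langchain/core/documents";
// Call the (assumed) local reranking endpoint and keep the top-N documents
async function rerankLocally(query: string, docs: Document[], topN = 3): Promise<Document[]> {
  const res = await fetch("http://localhost:8000/rerank", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, passages: docs.map((d) => d.pageContent) }),
  });
  const { scores } = (await res.json()) as { scores: number[] };
  return docs
    .map((doc, i) => ({ doc, score: scores[i] })) // pair each doc with its score
    .sort((a, b) => b.score - a.score)            // highest relevance first
    .slice(0, topN)
    .map(({ doc }) => doc);
}
// Usage: rerank the hybrid-retrieval candidates before prompting the LLM
const candidates = await ensembleRetriever.invoke("your question");
const topDocs = await rerankLocally("your question", candidates, 3);
To reuse the ContextualCompressionRetriever wiring shown above, wrap this function in a custom document compressor instead of calling it directly.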
6. Full RAG Pipeline
The complete end‑to‑end flow stitches together chunking, hybrid retrieval, Rerank, and LLM generation:
Raw documents
    │
    ▼
[Parent-Child chunking]
    ├── Child chunks (200 tokens) → embedded into the vector store
    └── Parent chunks (2000 tokens) → stored in the docstore
    │
    ▼
User question
    │
    ├── Vector retrieval (Top‑10)
    └── BM25 retrieval (Top‑10)
    │
    ▼
[EnsembleRetriever RRF fusion]
    │
    └── Fused results (Top‑20)
    │
    ▼
[Rerank selection]
    │
    └── Top‑3 most relevant chunks
    │
    ▼
[LLM generation]
    │
    └── Final answer
In code:
import { ChatOpenAI } from "@langchain/openai";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { ChatPromptTemplate } from "@langchain/core/prompts";
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const prompt = ChatPromptTemplate.fromTemplate(`
You are a technical assistant. Answer the question using the provided context.
If the context does not contain an answer, say "I don't know".
Context:
{context}
Question: {input}
`);
const combineDocsChain = await createStuffDocumentsChain({ llm, prompt });
const ragChain = await createRetrievalChain({
retriever: compressionRetriever, // hybrid + rerank
combineDocsChain,
});
const response = await ragChain.invoke({ input: "What is the difference between LangChain's LCEL and traditional Chains?" });
console.log(response.answer); // precise answer, no hallucination
7. Parameter Tuning: What Drives the Final Performance
After the system is built, the following parameters deserve careful adjustment:
Chunk size per document type
Document type                             Recommended chunk_size   Overlap
FAQ / Q&A                                 200-300 tokens           50
Technical docs (continuous paragraphs)    400-600 tokens           100
Contracts / legal documents               600-800 tokens           150
Code files (split by function)            300-500 tokens           50
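The table translates directly into splitter settings. A sketch of that mapping (the DOC_CHUNKING name and type keys are illustrative; the sizes count characters unless a token-based length function is supplied):
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Illustrative lookup mirroring the table above (mid-range values)
const DOC_CHUNKING = {
  faq: { chunkSize: 250, chunkOverlap: 50 },
  technical: { chunkSize: 500, chunkOverlap: 100 },
  legal: { chunkSize: 700, chunkOverlap: 150 },
  code: { chunkSize: 400, chunkOverlap: 50 },
};
function splitterFor(docType: keyof typeof DOC_CHUNKING) {
  return new RecursiveCharacterTextSplitter({ ...DOC_CHUNKING[docType] });
}
const legalSplitter = splitterFor("legal"); // 700 / 150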
Hybrid retrieval weight
Q&A (precise questions): vector 0.4 + BM25 0.6
Semantic search (fuzzy queries): vector 0.7 + BM25 0.3
General use (default): vector 0.6 + BM25 0.4
Initial k for the first‑stage retrieval
Initial k = Rerank topN × 5-10
Example: the final answer needs Top‑3 → the first stage should recall at least 15-30 candidates.
If k is too small, relevant documents never reach the rerank stage.
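Tying these knobs together, a sketch that picks a weight preset and derives the first-stage k from the rerank budget (the preset names are illustrative; vectorStore and docs are the objects built earlier):
import { BM25Retriever } from "@langchain/community/retrievers/bm25";
import { EnsembleRetriever } from "langchain/retrievers/ensemble";
// Weight presets following the recommendations above
const HYBRID_WEIGHTS = {
  qa: [0.4, 0.6],       // precise questions: favor BM25
  semantic: [0.7, 0.3], // fuzzy semantic queries: favor vectors
  general: [0.6, 0.4],  // default
};
const rerankTopN = 3;
const k = rerankTopN * 5; // first-stage recall: at least topN × 5 candidates per retriever
const tunedRetriever = new EnsembleRetriever({
  retrievers: [
    vectorStore.asRetriever({ k }),
    BM25Retriever.fromDocuments(docs, { k }),
  ],
  weights: HYBRID_WEIGHTS.general,
});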
8. Common Pitfalls and Self‑Check List
Typical mistakes and how to verify them:
Chunking without handling special characters (full‑width spaces, Chinese quotation marks, ellipsis) – they break BM25 tokenization (see the normalization sketch after this list).
Rerank adds latency; mitigate by caching frequent queries or using a lightweight local model (e.g., BGE‑Reranker‑base).
Parent‑Child docstore not persisted – InMemoryStore clears on restart; switch to RedisStore or MongoDBStore for production.
BM25 index not refreshed after document updates – vector stores support real‑time ingestion, but BM25 indexes are usually static.
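For the first pitfall, a minimal normalization pass run on raw text before both chunking and BM25 indexing might look like this; the character set is an assumption and should be extended for your own corpus.
// Normalize characters that commonly break BM25 tokenization for Chinese text
function normalizeForIndexing(text: string): string {
  return text
    .replace(/\u3000/g, " ") // full-width space → ASCII space
    .replace(/[“”]/g, '"')   // Chinese double quotes → ASCII quotes
    .replace(/[‘’]/g, "'")   // Chinese single quotes → ASCII quotes
    .replace(/…/g, "...")    // ellipsis character → three dots
    .replace(/\s+/g, " ")    // collapse runs of whitespace
    .trim();
}
const cleaned = normalizeForIndexing(rawText); // rawText: the raw document string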
A quick self‑check list:
const checklist = {
  chunking: {
    "chunk_size adjusted per document type": false,
    "overlap configured": false,
    "parent-child docstore persisted": false,
  },
  hybridSearch: {
    "BM25 tokenization handles Chinese": false,
    "RRF weights tuned per scenario": false,
    "first-stage k ≥ topN × 5": false,
  },
  rerank: {
    "rerank topN ≤ 5": false,
    "high-frequency queries cached": false,
    "chosen model supports Chinese": false,
  },
};
Conclusion
Chunking is the foundation – overly large chunks dilute semantics; overly small chunks lose context; parent‑child chunking offers the best engineering trade‑off.
Overlap cannot be omitted – a 10%‑15% overlap for Chinese prose (20% for technical docs) preserves cross‑chunk sentences.
Hybrid retrieval fills blind spots – vectors capture meaning, BM25 captures exact keywords, and RRF fuses the two rankings.
Rerank selects the essence – from dozens of candidates it picks the top‑few using cross‑attention, outperforming raw vector distance.
Parameters must be tuned per scenario – there is no universal setting; adjust chunk size, hybrid weights, and initial k according to document type and query style.
The next article will explore Milvus’s Memory module and the three kinds of “memory” in AI: truncation, summarization, and retrieval.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: "AI Agent Mastery Path," which systematically outlines core agent theory and practice, and "Claude Code Design Philosophy," which deeply analyzes the design thinking behind top AI tools, helping you build a solid foundation in the AI era.