Boost RAG Performance: Chunking Strategies, Rerank, and Hybrid Retrieval Explained

This article breaks down why RAG pipelines often underperform and shows how proper chunking, overlap settings, hybrid vector‑plus‑BM25 retrieval, and a Rerank step can dramatically improve recall and precision, with concrete code examples and tuning tips.


Why RAG Often Falls Short

The author observes that many users finish building a Milvus vector store but still see low recall because the bottleneck lies in the retrieval stage, not the LLM prompt.

Chunking Strategy – The Foundation

Chunking splits long documents into smaller pieces before indexing. Chunks that are too large (>1000 tokens) dilute semantics, while chunks that are too small (<50 tokens) break context. An optimal size of 200-500 tokens with roughly 20% overlap preserves complete semantic units.

Four common chunking methods are compared:

Fixed-size: simple and fast, but may cut across semantic units.

Recursive character: respects natural boundaries; needs parameter tuning.

Semantic: yields semantically complete chunks, at high computational cost.

Parent-Child: retrieves small child chunks for precise matching and returns the larger parent chunk for context; this is the most widely used engineering solution.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { InMemoryStore } from "@langchain/core/stores";
import { OpenAIEmbeddings } from "@langchain/openai";

// Parent chunks are large (2000 tokens, 200 overlap) to carry full context
const parentSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 200,
});

// Child chunks are small (200 tokens, 20 overlap) for precise matching
const childSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 200,
  chunkOverlap: 20,
});

const embeddings = new OpenAIEmbeddings();
const docstore = new InMemoryStore();
const vectorStore = await MemoryVectorStore.fromDocuments([], embeddings);

const retriever = new ParentDocumentRetriever({
  vectorstore: vectorStore,
  docstore,
  parentSplitter,
  childSplitter,
});
await retriever.addDocuments(docs);
const results = await retriever.invoke("Your question");
console.log(results[0].pageContent); // returns the full parent chunk
Parent‑Child chunking architecture

Overlap Settings – Keeping Boundaries Intact

Overlap (a sliding window between adjacent chunks) is crucial for preserving sentences that span chunk boundaries. Without overlap, for example, the phrase "performance improves 40%" is lost when a chunk boundary falls right after the mention of the M3 chip; setting overlap=100 tokens keeps the context intact.

Chinese prose: overlap ≈ 10%‑15% of chunkSize.

Technical documentation (including code): overlap ≈ 20% of chunkSize.

Pure code: split at function or class boundaries instead of fixed windows.
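The mechanism can be sketched with a toy character-based splitter (a simplified stand-in for RecursiveCharacterTextSplitter; sizes here count characters, not tokens):

```typescript
// Fixed-size chunker with a sliding-window overlap.
// Each chunk starts (chunkSize - overlap) characters after the previous one.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const text =
  "The M3 chip uses a 3nm process. Compared with M2, performance improves 40% while power draw falls.";

// No overlap: the boundary cuts inside "performance", so no chunk holds the full phrase.
const noOverlap = chunkWithOverlap(text, 60, 0);
// 20% overlap: the second chunk re-reads the boundary region and keeps the phrase whole.
const withOverlap = chunkWithOverlap(text, 60, 12);

console.log(noOverlap.some((c) => c.includes("performance improves 40%")));   // false
console.log(withOverlap.some((c) => c.includes("performance improves 40%"))); // true
```

The same trade-off applies at token granularity: a larger overlap raises the chance that any boundary-spanning sentence survives intact, at the cost of indexing redundant text.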

Hybrid Retrieval – Vector + BM25

Pure vector search misses exact keyword matches, while pure BM25 misses semantically similar passages. Combining both with Reciprocal Rank Fusion (RRF) yields the highest recall.

import { BM25Retriever } from "@langchain/community/retrievers/bm25";
import { EnsembleRetriever } from "langchain/retrievers/ensemble";

const vectorRetriever = vectorStore.asRetriever({ k: 10 });
const bm25Retriever = BM25Retriever.fromDocuments(docs, { k: 10 });

const ensembleRetriever = new EnsembleRetriever({
  retrievers: [vectorRetriever, bm25Retriever],
  weights: [0.6, 0.4], // adjust per scenario
});
const results = await ensembleRetriever.invoke("LangChain v0.3 breaking change");

RRF formula: RRF(D) = 1/(k + r1) + 1/(k + r2), where r1 and r2 are document D's ranks in the vector and BM25 result lists and k = 60 is a smoothing constant. Documents that rank high in both lists receive the highest combined score.
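The fusion step itself is small enough to implement by hand. A minimal sketch over generic document IDs, with k = 60 as in the formula above:

```typescript
// Reciprocal Rank Fusion: merge several ranked lists of document IDs.
// A document scores 1 / (k + rank) for each list it appears in (ranks are 1-based).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Highest combined score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// "B" is ranked 2nd by vector search and 1st by BM25, so it wins overall.
const vectorRanking = ["A", "B", "C"];
const bm25Ranking = ["B", "D", "A"];
console.log(reciprocalRankFusion([vectorRanking, bm25Ranking]));
// → ["B", "A", "D", "C"]
```

This is the same fusion that EnsembleRetriever performs internally; the weights array in the earlier example simply scales each list's 1/(k + r) contributions.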

Hybrid retrieval flowchart

Rerank – Selecting the Most Relevant Chunks

After hybrid retrieval, the LLM can only ingest a few chunks (3‑5). A Rerank model re‑scores the retrieved documents using cross‑attention, which is more accurate than raw vector similarity.

import { CohereRerank } from "@langchain/cohere";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";

const reranker = new CohereRerank({
  apiKey: process.env.COHERE_API_KEY,
  model: "rerank-multilingual-v3.0",
  topN: 3,
});

const compressionRetriever = new ContextualCompressionRetriever({
  baseCompressor: reranker,
  baseRetriever: ensembleRetriever,
});
const rerankedDocs = await compressionRetriever.invoke("Your question");

Open‑source alternative: BGE‑Reranker (served locally via HTTP) can be used without an API key.
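A minimal sketch of calling such a local service. The /rerank endpoint and the {index, score} response shape below follow Hugging Face's text-embeddings-inference server; both the URL and the response shape are assumptions to verify against your actual deployment:

```typescript
// Minimal document shape; stands in for LangChain's Document for this sketch.
type Doc = { pageContent: string };

// Assumed response item from the reranker service: which input document,
// and how relevant the cross-encoder judged it to be.
interface RerankHit {
  index: number;
  score: number;
}

// Pure helper: order documents by the scores the service returned, keep the top N.
function applyRerank(docs: Doc[], hits: RerankHit[], topN: number): Doc[] {
  return [...hits]
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((h) => docs[h.index]);
}

// Call the local BGE-Reranker service and reorder candidates by its scores.
async function bgeRerank(query: string, docs: Doc[], topN = 3): Promise<Doc[]> {
  const res = await fetch("http://localhost:8080/rerank", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, texts: docs.map((d) => d.pageContent) }),
  });
  const hits: RerankHit[] = await res.json();
  return applyRerank(docs, hits, topN);
}
```

Splitting the HTTP call from the pure reordering step keeps the scoring logic testable without a running service.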

Full RAG Pipeline

The complete chain links parent‑child chunking, hybrid retrieval, Rerank, and LLM generation.

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const prompt = ChatPromptTemplate.fromTemplate(`
You are a technical assistant. Answer the question based on the context.
If the context does not contain the answer, say "I don't know".

Context:
{context}

Question: {input}
`);

const combineDocsChain = await createStuffDocumentsChain({ llm, prompt });
const ragChain = await createRetrievalChain({
  retriever: compressionRetriever, // hybrid + rerank
  combineDocsChain,
});
const response = await ragChain.invoke({ input: "What is the difference between LangChain LCEL and traditional Chain?" });
console.log(response.answer);
Complete RAG pipeline diagram

Parameter Tuning – What Drives the Final Performance

Recommended chunk_size and overlap per document type:

Document type             Recommended chunk_size   Overlap (tokens)
-------------------------------------------------------------------
FAQ / Q&A pairs           200-300 tokens           50
Technical docs            400-600 tokens           100
Legal contracts           600-800 tokens           150
Code files (by function)  300-500 tokens           50
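These defaults can live next to the pipeline as a lookup table (a hypothetical helper, not from any library; chunkSize takes the midpoint of each recommended range):

```typescript
type DocType = "faq" | "technical" | "legal" | "code";

// Recommended splitter parameters per document type, from the table above.
// chunkSize is the midpoint of each recommended token range.
const CHUNK_PARAMS: Record<DocType, { chunkSize: number; chunkOverlap: number }> = {
  faq:       { chunkSize: 250, chunkOverlap: 50 },  // 200-300 tokens
  technical: { chunkSize: 500, chunkOverlap: 100 }, // 400-600 tokens
  legal:     { chunkSize: 700, chunkOverlap: 150 }, // 600-800 tokens
  code:      { chunkSize: 400, chunkOverlap: 50 },  // 300-500 tokens; prefer function boundaries
};

// Usage with LangChain's splitter:
// const splitter = new RecursiveCharacterTextSplitter(CHUNK_PARAMS["technical"]);
```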

Hybrid weight adjustments:

Exact‑answer Q&A: vector 0.4 + BM25 0.6

Semantic search: vector 0.7 + BM25 0.3

General use (default): vector 0.6 + BM25 0.4

Initial k for the first retrieval should be topN × 5‑10. For a final Top‑3, retrieve at least 15‑30 candidates.

Common Pitfalls and Self‑Check List

Typical mistakes include failing to clean special characters from Chinese text, ignoring Rerank latency, keeping the docstore only in memory (non-persistent), and serving stale BM25 indexes. The author provides a JavaScript checklist object to verify each step.

const checklist = {
  chunking: {
    "chunk_size adjusted per doc type": false,
    "overlap set": false,
    "parent-child docstore persisted": false,
  },
  hybrid: {
    "BM25 tokenization handles Chinese": false,
    "RRF weights tuned": false,
    "initial k >= topN x 5": false,
  },
  rerank: {
    "rerank topN <= 5": false,
    "cache frequent queries": false,
    "model supports Chinese": false,
  },
};
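A small helper (a hypothetical addition, not from the article) can flatten such a checklist and report what is still unchecked:

```typescript
type Checklist = Record<string, Record<string, boolean>>;

// Collect every unchecked item as "section: item".
function uncheckedItems(checklist: Checklist): string[] {
  const pending: string[] = [];
  for (const [section, items] of Object.entries(checklist)) {
    for (const [item, done] of Object.entries(items)) {
      if (!done) pending.push(`${section}: ${item}`);
    }
  }
  return pending;
}

// The pipeline is ready to ship only when every item is checked off.
const isReady = (c: Checklist): boolean => uncheckedItems(c).length === 0;
```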

Conclusion

Chunking is the foundation – size and overlap must be balanced; parent‑child is the engineering sweet spot.

Overlap prevents loss of cross‑chunk semantics; 10‑15% for Chinese prose, 20% for technical docs.

Hybrid retrieval (vector + BM25) fills each other's blind spots, with RRF merging the results.

Rerank selects the most relevant chunks from the hybrid pool, using cross‑attention for higher precision.

All parameters (chunk size, overlap, hybrid weights, initial k) need scenario‑specific tuning.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LangChain, RAG, Milvus, BM25, Chunking, Hybrid Retrieval, Rerank
Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
