Mastering Document Chunking for RAG: Strategies, Code & Best Practices
This article explores why proper document chunking is crucial for Retrieval‑Augmented Generation, explains core concepts like context windows and signal‑to‑noise, compares various chunking strategies—from simple fixed‑size splits to semantic and hybrid approaches—and provides practical Python code examples to help you build more effective RAG pipelines.
Why Chunking Matters in RAG
Even with powerful LLMs and carefully crafted prompts, a Retrieval‑Augmented Generation (RAG) system can underperform if the documents are poorly chunked before being embedded. Bad chunks act like noisy, incomplete data, limiting the system’s overall performance.
The Essence of Chunking
Two core constraints drive the need for chunking:
Model context window: LLMs cannot process arbitrarily long texts, so documents must be split into manageable pieces.
Retrieval signal‑to‑noise ratio: Overly large or irrelevant chunks dilute the core signal, making it harder for the retriever to match user intent.
The balance between context completeness and information density is controlled by chunk_size and chunk_overlap.
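To make that trade‑off concrete, here is a toy sliding‑window splitter (illustrative only; the real splitters below also respect separators, and this sketch assumes chunk_overlap is smaller than chunk_size):

```python
def fixed_chunks(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters; each step moves
    forward by chunk_size - chunk_overlap, so consecutive chunks
    share chunk_overlap characters. Assumes chunk_overlap < chunk_size."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Larger overlap means more repeated context and more chunks overall.
print(fixed_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
print(fixed_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# → ['abcd', 'efgh', 'ij']
```

Increasing chunk_overlap raises context completeness (boundary sentences appear in two chunks) at the cost of index size and redundancy; increasing chunk_size raises completeness at the cost of retrieval signal density.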
Basic Chunking Strategies
Fixed‑Length Chunking
Splits text by a preset character count, ignoring logical structure. Simple but can break semantic continuity.
from langchain_text_splitters import CharacterTextSplitter
sample_text = "LangChain was created by Harrison Chase in 2022..."
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=20, length_function=len)
docs = text_splitter.create_documents([sample_text])
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)

Recursive Character Chunking
LangChain’s default strategy recursively splits using a hierarchy of separators ("\n\n", "\n", " ", ""). It preserves paragraphs and sentences when possible.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = text_splitter.create_documents([sample_text])
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)

Sentence‑Based Chunking
Uses NLTK to split text into sentences, then groups them into chunks while maintaining overlap.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer model, required once

def chunk_by_sentences(text, max_chars=500, overlap_sentences=1):
    sentences = sent_tokenize(text)
    chunks = []
    current = ""
    for i, s in enumerate(sentences):
        if len(current) + len(s) <= max_chars:
            current += " " + s
        else:
            # Close the current chunk, then restart it with the last
            # overlap_sentences sentences plus the one that overflowed.
            chunks.append(current.strip())
            start = max(0, i - overlap_sentences)
            current = " ".join(sentences[start:i+1])
    if current:
        chunks.append(current.strip())
    return chunks

Structure‑Aware Chunking
Markdown Header Chunking
Splits documents based on Markdown heading levels, preserving logical sections.
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown = "# Chapter 1\n## Section 1.1\nContent..."
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")])
chunks = splitter.split_text(markdown)
for chunk in chunks:
    print(chunk.metadata)
    print(chunk.page_content)

Dialogue Chunking
Groups a fixed number of dialogue turns into a single chunk, useful for customer‑service logs.
dialogue = ["Alice: Hi...", "Bot: How can I help?", ...]

def chunk_dialogue(lines, max_turns=3):
    return ["\n".join(lines[i:i+max_turns]) for i in range(0, len(lines), max_turns)]

Semantic & Topic Chunking
Semantic Chunking
Computes embeddings for adjacent sentences/paragraphs and splits where semantic similarity drops below a threshold.
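Conceptually, the breakpoint logic reduces to comparing adjacent embeddings and cutting where similarity drops (a minimal numpy sketch with hypothetical sentences and embeddings; the real SemanticChunker adds percentile‑based thresholds and sentence buffering on top of this idea):

```python
import numpy as np

def split_on_similarity_drop(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever cosine similarity between adjacent
    sentence embeddings falls below the threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:          # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

With embeddings from any sentence encoder, near‑duplicate topics stay in one chunk while a sharp similarity drop opens a new one.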
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
semantic_splitter = SemanticChunker(emb, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=70)
chunks = semantic_splitter.create_documents([long_text])

Topic‑Based Chunking (LDA)
Uses LDA to discover dominant topics and splits when the topic changes.
import numpy as np, re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topic_chunking(text, n_topics=3):
    paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    cleaned = [re.sub(r'[^a-zA-Z\s]', '', p).lower() for p in paragraphs]
    vec = CountVectorizer(min_df=1, stop_words='english')
    X = vec.fit_transform(cleaned)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(X)
    # Dominant topic for each paragraph
    topics = np.argmax(lda.transform(X), axis=1)
    chunks, cur = [], []
    cur_topic = topics[0]
    for i, para in enumerate(paragraphs):
        if topics[i] == cur_topic:
            cur.append(para)
        else:
            # Topic changed: close the current chunk and start a new one
            chunks.append("\n".join(cur))
            cur = [para]
            cur_topic = topics[i]
    chunks.append("\n".join(cur))
    return chunks

Advanced Strategies
Hybrid Chunking (Structure + Recursion)
First splits by Markdown headers, then recursively splits any oversized chunk.
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
def hybrid_chunking(doc, coarse_thr=400, fine_size=100, fine_overlap=20):
    headers = [("#", "Header 1"), ("##", "Header 2")]
    coarse = MarkdownHeaderTextSplitter(headers_to_split_on=headers).split_text(doc)
    fine_splitter = RecursiveCharacterTextSplitter(chunk_size=fine_size, chunk_overlap=fine_overlap)
    final = []
    for chunk in coarse:
        if len(chunk.page_content) > coarse_thr:
            # Oversized section: split it further, preserving header metadata
            final.extend(fine_splitter.split_documents([chunk]))
        else:
            final.append(chunk)
    return final

Agentic Chunking
Uses an LLM agent to dynamically decide chunk boundaries; currently a placeholder for experimental use.
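One way the control flow could look, sketched with the model call stubbed out (fake_llm is a stand‑in of our own, not a real API; a real implementation would swap in an actual LLM client that returns boundary indices):

```python
def fake_llm(prompt):
    """Stand-in for an LLM call. Here it simply proposes a boundary
    after every second paragraph marker; a real model would reason
    about topic shifts in the text."""
    n = prompt.count("[P]")
    return ",".join(str(i) for i in range(2, n, 2))

def agentic_chunk(paragraphs):
    """Ask the 'agent' for boundary indices, then split there."""
    prompt = "Propose chunk boundaries for:\n" + "\n".join(f"[P]{p}" for p in paragraphs)
    boundaries = [int(x) for x in fake_llm(prompt).split(",") if x.strip()]
    chunks, start = [], 0
    for b in sorted(set(boundaries)):
        chunks.append("\n".join(paragraphs[start:b]))
        start = b
    chunks.append("\n".join(paragraphs[start:]))
    return chunks
```

The essential protocol is the same regardless of the model: send the text with identifiable segment markers, parse the proposed boundaries, and split deterministically in code so malformed model output cannot corrupt the chunks.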
def agentic_chunker(paragraph_text):
    # Placeholder for LLM-driven chunking logic
    return []

How to Choose the Best Strategy
Start with the simple baseline RecursiveCharacterTextSplitter. If the document has clear headings, switch to MarkdownHeaderTextSplitter. When retrieval precision is insufficient, consider semantic chunking or the small‑big (parent‑document) approach. For highly complex or mixed‑format documents, combine structural and recursive methods in a hybrid pipeline.
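The small‑big idea can be sketched without a vector store: index small child chunks for precise matching, but hand the full parent document to the LLM (a toy keyword matcher stands in for embedding retrieval here, and the fixed character split is an assumption for illustration):

```python
def build_small_big_index(parents, child_size=50):
    """Split each parent into small children and map each child
    back to its parent's index."""
    child_to_parent = {}
    for pid, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            child_to_parent[parent[i:i + child_size]] = pid
    return child_to_parent

def retrieve_parent(query_word, child_to_parent, parents):
    """Match against the small child (high precision), but return
    the full parent (complete context) for generation."""
    for child, pid in child_to_parent.items():
        if query_word in child:
            return parents[pid]
    return None
```

In production the child chunks are embedded and searched in a vector store (LangChain's ParentDocumentRetriever follows this pattern), but the core mapping of child hit to parent context is exactly this.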
Conclusion
Effective chunking is a foundational engineering step that directly impacts RAG performance. There is no universal “silver bullet”; practitioners should begin with simple, reliable splitters, iteratively evaluate retrieval quality, and progressively adopt more sophisticated or hybrid techniques as needed.