Why Chunking Can Make or Break Your RAG System – Practical Strategies & Code
This article explains how proper document chunking—choosing the right chunk size, overlap, and structure‑aware boundaries—directly impacts the relevance, factuality, and efficiency of Retrieval‑Augmented Generation pipelines, and provides multiple Python implementations ranging from simple fixed‑length splits to semantic and hybrid approaches.
Background
In Retrieval‑Augmented Generation (RAG) systems, even with powerful LLMs and well‑crafted prompts, missing context, factual errors, or incoherent stitching can occur. The real bottleneck often lies before data is stored: how the documents are chunked. Poor chunking breaks semantic boundaries, mixes noise, and presents fragmented pieces to the model, limiting performance.
What is Chunking?
Chunking is the process of breaking a large text into smaller, manageable segments, which makes embedding and retrieval more efficient and improves relevance.
Why Chunk Content?
Model context window limits: LLMs cannot process arbitrarily long inputs. Chunking that respects natural boundaries (titles, paragraphs, code blocks) keeps segments within the window without cutting important information mid-thought.
Signal‑to‑noise ratio: Chunks that are too large dilute the relevant passage with noise; chunks that are too small lack sufficient context. A well-chosen chunk size balances recall and precision.
Semantic continuity: Overlap windows preserve cues that span chunk boundaries, preventing the loss of definitions or conditions stated just before a split.
Chunking Strategies
Basic Chunking
Fixed‑length character splitting (e.g., 600 characters with 15% overlap) is simple and fast but often ignores document structure.
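For readers without LangChain, fixed-length splitting is a few lines of plain Python (a dependency-free sketch; the defaults mirror the 600-character / 90-character-overlap figures above, and `fixed_chunks` is a name of my own):

```python
def fixed_chunks(text: str, chunk_size: int = 600, overlap: int = 90) -> list[str]:
    """Fixed-length character splitting; overlap = 90 is 15% of 600."""
    step = chunk_size - overlap  # how far the window advances each chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

This makes the trade-off explicit: a larger overlap improves cross-chunk continuity but inflates the index and duplicates retrieved text.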
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="", chunk_size=600, chunk_overlap=90)
chunks = splitter.split_text(text)

Structure‑Aware Chunking
Uses headings, lists, code blocks, and tables as natural boundaries, then applies a small overlap between adjacent chunks.
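A minimal sketch of this idea for Markdown input (illustrative only; `structure_chunks` is my own name, and production code would also need to protect fenced code blocks and tables from being treated as boundaries):

```python
import re

def structure_chunks(md_text: str, overlap: int = 90) -> list[str]:
    """Split Markdown at heading boundaries, then prepend a small
    character overlap taken from the end of the previous section."""
    # Zero-width split before every heading line (# through ######).
    parts = re.split(r'(?m)^(?=#{1,6}\s)', md_text)
    sections = [p.strip() for p in parts if p.strip()]
    chunks = []
    for i, sec in enumerate(sections):
        prefix = sections[i - 1][-overlap:] if i > 0 else ""
        chunks.append((prefix + "\n" + sec) if prefix else sec)
    return chunks
```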
Sentence‑Level Chunking
First split by Chinese punctuation, then group sentences until a target chunk size is reached.
import re

def split_sentences_zh(text: str):
    # One sentence per match: text up to and including Chinese end
    # punctuation, plus any unterminated trailing fragment.
    pattern = re.compile(r'([^。!?;]*[。!?;]+|[^。!?;]+$)')
    return [m.group(0).strip() for m in pattern.finditer(text) if m.group(0).strip()]

def sentence_chunk(text, chunk_size=600, overlap=90):
    sents = split_sentences_zh(text)
    chunks, buf = [], ""
    for s in sents:
        if len(buf) + len(s) <= chunk_size:
            buf += s
        else:
            chunks.append(buf)
            # seed the next chunk with the tail of the previous one
            buf = (buf[-overlap:] if overlap > 0 and len(buf) > overlap else "") + s
    if buf:
        chunks.append(buf)
    return chunks

Recursive Character Chunking
Recursively splits by a hierarchy of separators (titles → paragraphs → lines → spaces → characters) while respecting a maximum chunk size.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separator preference: Markdown headings → numbered headings →
# blank lines → newlines → spaces → individual characters.
separators = [r"\n#{1,6}\s", r"\n\d+(?:\.\d+)*\s", "\n\n", "\n", " ", ""]
splitter = RecursiveCharacterTextSplitter(separators=separators,
                                          chunk_size=700,
                                          chunk_overlap=100,
                                          is_separator_regex=True)
chunks = splitter.split_text(text)

Semantic Chunking
Embeds each sentence, computes novelty scores, and cuts when semantic similarity drops below a dynamic threshold.
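The cut rule can be demonstrated in isolation. The helper below is a deliberate simplification and not the article's implementation: novelty is reduced to the cosine similarity between adjacent sentences (ignoring window_size), the dynamic threshold is assumed to take the form mean − λ·std of those similarities, and `novelty_cuts` is a name of my own:

```python
import numpy as np

def novelty_cuts(emb: np.ndarray, lambda_std: float = 0.8) -> list[int]:
    """Indices where a new chunk should start, given L2-normalized
    sentence embeddings of shape (n_sentences, dim)."""
    sims = np.sum(emb[:-1] * emb[1:], axis=1)          # adjacent cosine sims
    threshold = sims.mean() - lambda_std * sims.std()  # dynamic threshold
    return [i + 1 for i, s in enumerate(sims) if s < threshold]
```

With embeddings that shift topic halfway through, the single low adjacent similarity falls below the threshold and yields one cut at the topic boundary; with a uniform document, no similarity falls below it and no cut is made.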
from sentence_transformers import SentenceTransformer

def semantic_chunk(text, model_name="BAAI/bge-m3", window_size=2,
                   min_chars=350, max_chars=1100, lambda_std=0.8,
                   overlap_chars=80):
    sents = split_sentences_zh(text)   # sentence splitter defined above
    model = SentenceTransformer(model_name)
    emb = model.encode(sents, normalize_embeddings=True)
    # compute novelty and split...
    return chunks

Hybrid Chunking
Combines coarse structure‑aware splitting with finer‑grained strategies (sentence, semantic, or recursive) based on chunk length and content type, adding optional small overlaps between adjacent chunks.
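One concrete, deliberately simplified reading of the hybrid strategy (the helper names are hypothetical and the coarse pass assumes Markdown headings; real systems would dispatch on content type as well as length):

```python
import re

def split_sections(md: str) -> list[str]:
    """Coarse pass: split at Markdown heading boundaries."""
    parts = re.split(r'(?m)^(?=#{1,6}\s)', md)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(text: str) -> list[str]:
    """Fine pass: split on Chinese/Western sentence-ending punctuation."""
    pat = re.compile(r'[^。！？!?.;；]*[。！？!?.;；]+|[^。！？!?.;；]+$')
    return [m.group(0).strip() for m in pat.finditer(text) if m.group(0).strip()]

def hybrid_chunk(md: str, max_len: int = 600, overlap: int = 80) -> list[str]:
    """Sections that fit become one chunk; oversized sections fall back
    to sentence grouping with a small character overlap."""
    chunks = []
    for sec in split_sections(md):
        if len(sec) <= max_len:
            chunks.append(sec)
            continue
        buf = ""
        for s in split_sentences(sec):
            if len(buf) + len(s) <= max_len:
                buf += s
            else:
                if buf:
                    chunks.append(buf)
                buf = (buf[-overlap:] if len(buf) > overlap else "") + s
        if buf:
            chunks.append(buf)
    return chunks
```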
Conclusion
Effective chunking balances context completeness and information density. Proper chunk size and overlap, aligned with natural document boundaries, significantly improve retrieval relevance and answer factuality in RAG pipelines.
DeWu Technology
