How to Feed Massive Documents to a RAG System: Mastering the Art of Text Chunking
This article explains why proper text chunking is critical for Retrieval‑Augmented Generation, illustrates common pitfalls with real‑world examples, compares four chunking strategies (fixed length, recursive, structure‑aware, and code‑aware), and provides practical guidelines for chunk size, overlap, metadata handling, and a production‑ready pipeline.
Why Chunking Matters in RAG
Feeding hundreds of pages of technical documents into a RAG system often yields irrelevant or fragmented answers because the documents are split into thousands of tiny fragments that lose context. For example, a legal QA system that split contracts every 500 characters returned three unrelated chunks about breach penalties, plus a fourth chunk from an unrelated rental contract, producing a disjointed answer.
Chunking sits between data cleaning and vector retrieval: clean documents must be split into appropriately sized chunks before vectorization, and the quality of those chunks directly determines retrieval accuracy and the LLM's ability to generate coherent answers.
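As a sketch of where chunking sits, the ingestion flow can be expressed as a function chain. Every name below is an illustrative stand-in, not a real library API:

```python
# Illustrative ingestion flow: clean -> chunk -> vectorize -> index.
# Every callable here is a stand-in for your actual cleaning, splitting,
# and embedding components.
def ingest(raw_documents, clean, split, embed, store):
    for doc in raw_documents:
        text = clean(doc)             # data cleaning happens first
        for chunk in split(text):     # the step this article focuses on
            store.append((embed(chunk), chunk))
    return store

# Toy stand-ins to show the flow end to end
index = ingest(
    ["  Hello world.  "],
    clean=str.strip,
    split=lambda t: [t],              # trivial "splitter"
    embed=lambda c: [float(len(c))],  # fake 1-dimensional embedding
    store=[],
)
```

Whatever components you plug in, the quality of the `split` step caps the quality of everything downstream.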
Four Chunking Strategies
Strategy 1: Fixed‑Length Chunking – Simple but Brutal
Splits text purely by character or token count, e.g. 500 characters with a 50‑character overlap:

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,      # each chunk ~500 chars
    chunk_overlap=50,    # overlap to preserve some context
    separator="\n\n"     # split on paragraph breaks
)
chunks = splitter.split_text(text)
```

This method can cut sentences and terms in half, e.g. "生成能力" ("generation capability") becomes "生成能" and "力", destroying the term's meaning.
When to use:
Quick prototyping
Uniform documents with clear paragraph boundaries
Baseline for comparing other strategies
When to avoid: documents rich in technical terms, code files, or high semantic density.
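To see the cut‑off problem concretely, here is a minimal fixed‑length chunker written from scratch (no library needed); note how it slices straight through the middle of a word:

```python
def fixed_length_chunks(text, chunk_size, overlap=0):
    """Split purely by character count, sliding forward by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Retrieval-Augmented Generation improves answer quality."
chunks = fixed_length_chunks(text, chunk_size=25)
# The word "Generation" is cut in half across the first two chunks.
```

With `chunk_size=25` the first chunk ends in `"Gener"` and the second begins with `"ation"`, which is exactly the kind of semantic breakage described above.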
Strategy 2: Recursive Character Splitter – LangChain’s Default
Attempts separators in order (paragraph → sentence → word → character) to respect semantic boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=100,
    separators=["\n\n", "\n", "。", "!", "?", " ", ""]
)
chunks = splitter.split_text(text)
```

Applied to the example sentence from Strategy 1, the whole sentence remains intact after splitting.
When to use:
General text (news, blogs, plain documents)
Documents without special structure
Default strategy for most pipelines
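The idea behind the recursive splitter can be sketched in a few lines of plain Python. This is a simplified illustration of the approach, not LangChain's actual implementation (it drops separators and skips LangChain's chunk-merging logic):

```python
def recursive_split(text, separators, chunk_size):
    """Try separators in priority order; recurse on pieces that are still too big."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = [p for p in text.split(sep) if p]
    if len(pieces) == 1:
        # This separator didn't help; fall through to the next one
        return recursive_split(text, rest, chunk_size)
    out = []
    for p in pieces:
        out.extend(recursive_split(p, separators, chunk_size))
    return out

chunks = recursive_split(
    "First paragraph.\n\nSecond paragraph. It has two sentences.",
    separators=["\n\n", ". ", " ", ""],
    chunk_size=30,
)
```

The paragraph break is tried first; only the piece that is still over the limit gets split again at the next-finer boundary, so sentences survive whenever they fit.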
Strategy 3: Structure‑Aware Splitters
For Markdown:

```python
from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(markdown_text)
```

For HTML headers:

```python
from langchain.text_splitter import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "header1"), ("h2", "header2"), ("h3", "header3")]
)
chunks = splitter.split_text(html_text)
```

These splitters keep each heading together with its content, preventing the loss of context seen in the earlier legal‑contract example.
When to use:
Markdown documents – use MarkdownTextSplitter
HTML or other markup with a clear heading hierarchy – use HTMLHeaderTextSplitter
PDFs with a table of contents – use that structure to guide splitting
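A structure‑aware splitter is conceptually simple: group every line under its most recent heading, so the heading travels with its body text. A bare‑bones Markdown version, for illustration only:

```python
def split_markdown_by_heading(md_text):
    """Group body lines under their nearest heading; the heading stays in the chunk."""
    chunks, current = [], []
    for line in md_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

md = "# RAG\nIntro text.\n## Advantages\nFast and grounded."
chunks = split_markdown_by_heading(md)
```

Each resulting chunk starts with its heading, so a retrieved chunk always carries the context that tells the LLM what it is about.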
Strategy 4: Code‑Aware Splitters – Programmer’s Blessing
Splits code by language syntax to keep functions or classes intact:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0
)
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=300,
    chunk_overlap=0
)
chunks = py_splitter.split_text(python_code)
```

Supported languages include Python, JavaScript, Java, Go, Rust, C++, TypeScript, SQL, HTML, CSS, and more.
Special considerations for code:
Overlap is usually set to 0 because code boundaries are clear.
Chunk size 200‑500 tokens is enough for a typical function.
Preserve metadata such as file name and function name for traceability.
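For Python specifically, the standard library's ast module can locate top-level function and class boundaries, which is the essence of what a code‑aware splitter preserves. This is a sketch of the idea, not LangChain's actual implementation:

```python
import ast

def split_python_by_definition(source):
    """One chunk per top-level function/class, so each definition stays intact."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based, inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = split_python_by_definition(code)
```

Because chunk boundaries fall only between definitions, no chunk ever contains half a function, which is exactly why overlap can be set to 0 for code.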
Choosing Chunk Size – An Art
Recommended ranges based on document type (tokens) and typical overlap percentages:
General text: 500‑750 tokens, overlap 10‑20% (balances retrieval precision and context).
Technical documents: 512‑1024 tokens, overlap ~15% (high term density).
Code files: 200‑500 tokens, overlap 0 (clear syntax boundaries).
FAQ/Q&A: 300‑500 tokens, overlap 10% (each Q‑A pair forms a self‑contained chunk).
Legal/medical: 400‑600 tokens, overlap 25‑30% (very high semantic density).
Adjust chunk size according to the embedding model's context window (e.g., OpenAI's text‑embedding‑3‑small accepts 8191 tokens; keep chunks to roughly 1/8‑1/10 of that).
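The recommendations above can be encoded as a small lookup helper. The numbers come from the table; the function itself (its name and structure) is purely illustrative:

```python
# (chunk_size_tokens_range, overlap_fraction) per document type,
# taken from the recommendations above
CHUNK_PROFILES = {
    "general":   ((500, 750),  0.15),
    "technical": ((512, 1024), 0.15),
    "code":      ((200, 500),  0.0),
    "faq":       ((300, 500),  0.10),
    "legal":     ((400, 600),  0.275),
}

def chunk_params(doc_type, embed_context_window=8191):
    """Pick a mid-range chunk size, capped at ~1/8 of the embedding window."""
    (lo, hi), overlap = CHUNK_PROFILES[doc_type]
    size = min((lo + hi) // 2, embed_context_window // 8)
    return size, int(size * overlap)

size, overlap = chunk_params("technical")  # mid-range size plus ~15% overlap
```

A helper like this keeps the per-document-type defaults in one place, so A/B tests only need to tweak the table.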
Three Common Chunking Pitfalls
Pitfall 1: Semantic Cut‑off of Technical Terms
In a medical QA system, the term "糖化血红蛋白测定" (glycated hemoglobin assay, i.e. the HbA1c test) was split across two fragments, so the LLM never saw the complete term.
Mitigation:
Use recursive splitter to prioritize sentence boundaries.
Increase overlap to 25‑30% for terminology‑dense docs.
Apply a Sentence Window strategy that returns parent chunk context.
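The Sentence Window idea mentioned above works like this: index individual sentences for precise matching, but return the neighboring sentences as context at generation time. A minimal sketch of the data structure:

```python
def build_sentence_windows(sentences, window=1):
    """Index unit = one sentence; payload = that sentence plus its neighbors."""
    records = []
    for i, sent in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        records.append({"index_text": sent, "window": " ".join(sentences[lo:hi])})
    return records

sents = [
    "HbA1c reflects long-term glucose.",
    "It is measured quarterly.",
    "Targets vary by patient.",
]
records = build_sentence_windows(sents)
# A match on the middle sentence returns all three sentences as context.
```

Matching stays precise because only one sentence is embedded, while the LLM receives the surrounding window, so a term split across sentence boundaries is still recoverable.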
Pitfall 2: One‑Size‑Fits‑All Chunk Size
Applying a uniform chunk_size=500 to code, Markdown, PDF, and HTML broke code semantics and polluted vector stores.
Mitigation: select the splitter per file type, e.g.:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
    HTMLHeaderTextSplitter,
    Language,
)

def get_splitter_for_file(filepath):
    ext = filepath.split('.')[-1].lower()
    # Code files: syntax-aware splitting, no overlap
    if ext in ['py', 'js', 'java', 'go', 'rs']:
        lang_map = {'py': Language.PYTHON, 'js': Language.JS, 'java': Language.JAVA,
                    'go': Language.GO, 'rs': Language.RUST}
        return RecursiveCharacterTextSplitter.from_language(
            language=lang_map[ext],
            chunk_size=300,
            chunk_overlap=0
        )
    # Markdown: heading-aware
    elif ext in ['md', 'markdown']:
        return MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
    # HTML: header-aware
    elif ext in ['html', 'htm']:
        return HTMLHeaderTextSplitter(headers_to_split_on=[('h1', 'h1'), ('h2', 'h2')])
    # Everything else: recursive character splitting
    else:
        return RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)
```

Pitfall 3: Ignoring Document Structure
When headings are ignored, retrieved chunks lack surrounding context, as seen in a corporate Wiki where the heading "## 3.2 Redis集群配置" (Redis cluster configuration) was split away from its description.
Mitigation:
Use MarkdownTextSplitter for Markdown.
Use HTMLHeaderTextSplitter for HTML.
Leverage PDF TOC when available.
Apply NLP tools to infer paragraph boundaries for unstructured docs.
Metadata – Don’t Throw It Away
Retaining metadata (source file, chapter, timestamps, tags) enables filtering, ranking, and traceability of retrieved chunks.
```python
# Example: filter retrieval by chapter metadata
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"chapter": "第3章"}}  # "第3章" = "Chapter 3"
)
```

Typical metadata to keep:
Document source (filename, URL)
Chapter or heading information
Creation / update timestamps
Document type (code, FAQ, etc.)
Custom tags (important, verified)
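Even without a vector store, the value of metadata filtering is easy to demonstrate: restrict candidates before (or after) similarity scoring. A toy version with illustrative data:

```python
def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every given key/value pair."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]

chunks = [
    {"text": "Redis cluster setup...", "metadata": {"chapter": "3.2", "doc_type": "wiki"}},
    {"text": "def connect(): ...",     "metadata": {"chapter": "3.2", "doc_type": "code"}},
    {"text": "Holiday policy...",      "metadata": {"chapter": "7.1", "doc_type": "wiki"}},
]
hits = filter_chunks(chunks, chapter="3.2", doc_type="wiki")
```

Without the metadata, all three chunks would compete purely on vector similarity; with it, the code chunk and the unrelated chapter are excluded before they can pollute the answer.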
A Complete Chunking Pipeline
The following production‑ready pipeline automatically selects the proper splitter based on file extension and returns Document objects with enriched metadata.
```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
    HTMLHeaderTextSplitter,
    Language
)
from langchain.schema import Document
from typing import List, Dict, Optional
import os

class SmartSplitter:
    """Intelligent splitter that chooses a strategy based on file type"""

    def split_document(self, content: str, filepath: str,
                       metadata: Optional[Dict] = None) -> List[Document]:
        ext = os.path.splitext(filepath)[1].lower()
        metadata = metadata or {}
        metadata['source'] = filepath
        if ext in ['.py', '.js', '.java', '.go', '.rs', '.ts']:
            return self._split_code(content, ext, metadata)
        elif ext in ['.md', '.markdown']:
            return self._split_markdown(content, metadata)
        elif ext in ['.html', '.htm']:
            return self._split_html(content, metadata)
        else:
            return self._split_text(content, metadata)

    def _split_code(self, code: str, ext: str, metadata: Dict) -> List[Document]:
        lang_map = {'.py': Language.PYTHON, '.js': Language.JS, '.java': Language.JAVA,
                    '.go': Language.GO, '.rs': Language.RUST, '.ts': Language.TS}
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=lang_map.get(ext, Language.PYTHON),
            chunk_size=300,
            chunk_overlap=0
        )
        chunks = splitter.split_text(code)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

    def _split_markdown(self, text: str, metadata: Dict) -> List[Document]:
        splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
        chunks = splitter.split_text(text)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

    def _split_html(self, text: str, metadata: Dict) -> List[Document]:
        splitter = HTMLHeaderTextSplitter(
            headers_to_split_on=[('h1', 'h1'), ('h2', 'h2'), ('h3', 'h3')])
        chunks = splitter.split_text(text)  # returns Documents with header metadata
        for i, chunk in enumerate(chunks):
            chunk.metadata.update(metadata)
            chunk.metadata['chunk_id'] = i
        return chunks

    def _split_text(self, text: str, metadata: Dict) -> List[Document]:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=700,
            chunk_overlap=100,
            separators=["\n\n", "\n", "。", "!", "?", " ", ""]
        )
        chunks = splitter.split_text(text)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

# Usage example
if __name__ == "__main__":
    splitter = SmartSplitter()

    # Python code example
    py_code = """
def calculate_rag_score(query, documents):
    '''Compute RAG relevance scores'''
    scores = []
    for doc in documents:
        score = cosine_similarity(query, doc)
        scores.append(score)
    return sorted(scores, reverse=True)
"""
    chunks = splitter.split_document(py_code, "test.py")
    print(f"Python code produced {len(chunks)} chunks")

    # Markdown example
    md_text = """
# RAG Explained
## What is RAG
RAG stands for Retrieval-Augmented Generation...
## Advantages of RAG
RAG has the following advantages...
"""
    chunks = splitter.split_document(md_text, "test.md")
    print(f"Markdown produced {len(chunks)} chunks")
```

Key Takeaways
Match the chunking strategy to the document type – code uses syntax‑aware splitters, Markdown uses heading‑aware splitters, and generic text can rely on recursive splitting.
Start with 512 tokens + 15% overlap for most scenarios, then fine‑tune via A/B testing of retrieval precision and recall.
Prevent semantic cut‑offs by increasing overlap for terminology‑dense docs and by preserving structural metadata.
Next, the series will cover vectorization, model selection, and dimensionality considerations.
AI Architect Hub
Discuss AI and architecture; a ten-year veteran of major tech companies now transitioning to AI and continuing the journey.