How to Feed Massive Documents to a RAG System: Mastering the Art of Text Chunking

This article explains why proper text chunking is critical for Retrieval‑Augmented Generation, illustrates common pitfalls with real‑world examples, compares four chunking strategies (fixed length, recursive, structure‑aware, and code‑aware), and provides practical guidelines for chunk size, overlap, metadata handling, and a production‑ready pipeline.

AI Architect Hub
Why Chunking Matters in RAG

Feeding hundreds of pages of technical documents into a RAG system often yields irrelevant or fragmented answers because the documents are split into thousands of tiny fragments that lose context. For example, a legal QA system that split contracts every 500 characters returned three unrelated chunks about breach penalties plus a fourth chunk from an entirely unrelated rental contract, producing a disjointed answer.

Chunking sits between data cleaning and vector retrieval: clean documents must be split into appropriately sized chunks before vectorization, and the quality of those chunks directly determines retrieval accuracy and the LLM's ability to generate coherent answers.

Four Chunking Strategies

Strategy 1: Fixed‑Length Chunking – Simple but Brutal

Splits text purely by character or token count, e.g. 500 characters with 50‑character overlap:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # each chunk ~500 chars
    chunk_overlap=50,  # overlap to preserve some context
    separator="\n\n"   # split on paragraph breaks
)
chunks = splitter.split_text(text)

This method can cut sentences and terms in half, e.g. "生成能力" (generation capability) becomes "生成能" and "力", breaking semantic meaning.
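This failure mode is easy to reproduce without any library; the toy fixed window below (18 characters, deliberately small for illustration) lands a boundary in the middle of a word:

```python
text = "RAG systems combine retrieval with generation capability."
window = 18  # hypothetical, chosen to force a bad cut

# Pure character slicing, no regard for word or sentence boundaries
chunks = [text[i:i + window] for i in range(0, len(text), window)]
# The first boundary falls inside the word "combine":
# chunks[0] == "RAG systems combin"
```

Overlap softens but does not eliminate the problem: a boundary can still fall inside whichever term happens to sit at the cut point.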

When to use:

Quick prototyping

Uniform documents with clear paragraph boundaries

Baseline for comparing other strategies

When to avoid: documents rich in technical terms, code files, or high semantic density.

Strategy 2: Recursive Character Splitter – LangChain’s Default

Attempts separators in order (paragraph → sentence → word → character) to respect semantic boundaries:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=100,
    separators=["\n\n", "\n", "。", "!", "?", " ", ""]
)
chunks = splitter.split_text(text)

Applied to the earlier example, the sentence containing "生成能力" survives splitting intact instead of being cut mid-term.
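The cascade itself fits in a few lines of plain Python. This is a simplified model only (the real RecursiveCharacterTextSplitter also merges small pieces back together up to chunk_size and re-applies overlap), but it shows the fallback order:

```python
def recursive_split(text, chunk_size, separators):
    """Try the coarsest separator first; fall back to finer ones
    only when a piece is still larger than chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    out = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            out.extend(recursive_split(piece, chunk_size, rest))
    return out
```

Because paragraphs are tried before sentences, and sentences before words, semantic units are only broken when no coarser boundary fits within the size limit.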

When to use:

General text (news, blogs, plain documents)

Documents without special structure

Default strategy for most pipelines

Strategy 3: Structure‑Aware Splitters

For Markdown:

from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(markdown_text)

For HTML headers:

from langchain.text_splitter import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "header1"), ("h2", "header2"), ("h3", "header3")])
chunks = splitter.split_text(html_text)

These splitters keep each heading and its content together, preventing the loss of context seen in the earlier legal‑contract example.

When to use:

Markdown documents – use MarkdownTextSplitter

HTML or other markup with a clear heading hierarchy – use HTMLHeaderTextSplitter

PDFs with a table of contents – guide splitting with that structure

Strategy 4: Code‑Aware Splitters – Programmer’s Blessing

Splits code by language syntax to keep functions or classes intact:

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0
)
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=300,
    chunk_overlap=0
)
chunks = py_splitter.split_text(python_code)

Supported languages include Python, JavaScript, Java, Go, Rust, C++, TypeScript, SQL, HTML, CSS, etc.

Special considerations for code:

Overlap is usually set to 0 because code boundaries are clear.

Chunk size 200‑500 tokens is enough for a typical function.

Preserve metadata such as file name and function name for traceability.
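One way to get syntax-respecting boundaries and traceable metadata in a single pass is to walk the AST. The sketch below uses Python's standard ast module rather than LangChain (an assumption for illustration) and handles only top-level functions:

```python
import ast

def chunk_python_functions(source: str, filename: str):
    """Split Python source at top-level function boundaries, keeping
    file and function names as metadata. A sketch: class bodies,
    nested functions, and module-level code are ignored here."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "text": body,
                "metadata": {"source": filename, "function": node.name},
            })
    return chunks
```

A retrieved chunk can then be traced back to "function calculate_score in utils.py" rather than an anonymous blob of code.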

Choosing Chunk Size – An Art

Recommended ranges based on document type (tokens) and typical overlap percentages:

General text: 500‑750 tokens, overlap 10‑20% (balances retrieval precision against context).

Technical documents: 512‑1024 tokens, overlap ~15% (high term density).

Code files: 200‑500 tokens, overlap 0 (clear syntax boundaries).

FAQ/Q&A: 300‑500 tokens, overlap 10% (each Q‑A pair forms a self‑contained chunk).

Legal/medical: 400‑600 tokens, overlap 25‑30% (very high semantic density).

Adjust chunk size according to the embedding model’s context window (e.g., OpenAI text‑embedding‑3‑small has 8191 tokens; keep chunk size ≤ 1/8‑1/10 of that).
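A small guard at ingestion time can enforce that rule. The token count below is a crude chars/4 heuristic (an assumption for the sketch; a real tokenizer such as tiktoken should be used for production counts):

```python
EMBED_CONTEXT_LIMIT = 8191  # context window of text-embedding-3-small

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def max_chunk_tokens(limit: int = EMBED_CONTEXT_LIMIT) -> int:
    # Keep chunks at roughly 1/10 of the embedding context window.
    return limit // 10

chunks = ["a short, safe chunk", "x" * 5000]  # toy data
oversized = [c for c in chunks if estimate_tokens(c) > max_chunk_tokens()]
```

Flagging oversized chunks before embedding is cheaper than discovering truncated vectors after retrieval quality has already degraded.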

Three Common Chunking Pitfalls

Pitfall 1: Semantic Cut‑off of Technical Terms

In a medical QA system, the term "糖化血红蛋白测定" (glycated hemoglobin test) was split into two fragments, causing the LLM to miss the full term.

Mitigation:

Use recursive splitter to prioritize sentence boundaries.

Increase overlap to 25‑30% for terminology‑dense docs.

Apply a Sentence Window strategy that returns parent chunk context.
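The Sentence Window idea can be sketched as follows: match the query against individual sentences, but hand the LLM the hit plus its neighbors (a window of 1 on each side here is an arbitrary choice):

```python
def sentence_window(sentences, hit_index, window=1):
    """Return the matched sentence plus its neighbors, so a term
    that sits near a boundary keeps its surrounding context."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])
```

Retrieval stays precise because matching happens at sentence granularity, while generation sees the wider parent context.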

Pitfall 2: One‑Size‑Fits‑All Chunk Size

Applying a uniform chunk_size=500 to code, Markdown, PDF, and HTML broke code semantics and polluted vector stores.

Mitigation: select splitter per file type (see the get_splitter_for_file example below).

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
    HTMLHeaderTextSplitter,
    Language,
)

def get_splitter_for_file(filepath):
    ext = filepath.split('.')[-1].lower()
    # Code files
    if ext in ['py', 'js', 'java', 'go', 'rs']:
        lang_map = {'py': Language.PYTHON, 'js': Language.JS, 'java': Language.JAVA,
                    'go': Language.GO, 'rs': Language.RUST}
        return RecursiveCharacterTextSplitter.from_language(
            language=lang_map.get(ext, Language.PYTHON),
            chunk_size=300,
            chunk_overlap=0
        )
    # Markdown
    elif ext in ['md', 'markdown']:
        return MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
    # HTML
    elif ext in ['html', 'htm']:
        return HTMLHeaderTextSplitter(headers_to_split_on=[('h1','h1'),('h2','h2')])
    else:
        return RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)

Pitfall 3: Ignoring Document Structure

When headings are ignored, retrieved chunks lack surrounding context, as seen in a corporate wiki where the heading "## 3.2 Redis集群配置" (Redis cluster configuration) was split away from its description.

Mitigation:

Use MarkdownTextSplitter for Markdown.

Use HTMLHeaderTextSplitter for HTML.

Leverage PDF TOC when available.

Apply NLP tools to infer paragraph boundaries for unstructured docs.

Metadata – Don’t Throw It Away

Retaining metadata (source file, chapter, timestamps, tags) enables filtering, ranking, and traceability of retrieved chunks.

# Example: filter by chapter
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"chapter": "第3章"}}
)

Typical metadata to keep:

Document source (filename, URL)

Chapter or heading information

Creation / update timestamps

Document type (code, FAQ, etc.)

Custom tags (important, verified)

A Complete Chunking Pipeline

The following production‑ready pipeline automatically selects the proper splitter based on file extension and returns Document objects with enriched metadata.

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
    HTMLHeaderTextSplitter,
    Language
)
from langchain.schema import Document
from typing import List, Dict, Optional
import os

class SmartSplitter:
    """Intelligent splitter that chooses a strategy based on file type"""
    def __init__(self):
        self.splitter_map = {}

    def split_document(self, content: str, filepath: str, metadata: Optional[Dict] = None) -> List[Document]:
        ext = os.path.splitext(filepath)[1].lower()
        metadata = metadata or {}
        metadata['source'] = filepath
        if ext in ['.py', '.js', '.java', '.go', '.rs', '.ts']:
            chunks = self._split_code(content, ext, metadata)
        elif ext in ['.md', '.markdown']:
            chunks = self._split_markdown(content, metadata)
        elif ext in ['.html', '.htm']:
            chunks = self._split_html(content, metadata)
        else:
            chunks = self._split_text(content, metadata)
        return chunks

    def _split_code(self, code: str, ext: str, metadata: Dict) -> List[Document]:
        lang_map = {'.py': Language.PYTHON, '.js': Language.JS, '.java': Language.JAVA,
                    '.go': Language.GO, '.rs': Language.RUST, '.ts': Language.TS}
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=lang_map.get(ext, Language.PYTHON),
            chunk_size=300,
            chunk_overlap=0
        )
        chunks = splitter.split_text(code)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

    def _split_markdown(self, text: str, metadata: Dict) -> List[Document]:
        splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
        chunks = splitter.split_text(text)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

    def _split_html(self, text: str, metadata: Dict) -> List[Document]:
        splitter = HTMLHeaderTextSplitter(headers_to_split_on=[('h1','h1'),('h2','h2'),('h3','h3')])
        chunks = splitter.split_text(text)
        for i, chunk in enumerate(chunks):
            chunk.metadata.update(metadata)
            chunk.metadata['chunk_id'] = i
        return chunks

    def _split_text(self, text: str, metadata: Dict) -> List[Document]:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=700,
            chunk_overlap=100,
            separators=["\n\n", "\n", "。", "!", "?", " ", ""]
        )
        chunks = splitter.split_text(text)
        return [Document(page_content=chunk, metadata={**metadata, 'chunk_id': i})
                for i, chunk in enumerate(chunks)]

# Usage example
if __name__ == "__main__":
    splitter = SmartSplitter()
    # Python code example
    py_code = """
def calculate_rag_score(query, documents):
    '''计算RAG相关性得分'''
    scores = []
    for doc in documents:
        score = cosine_similarity(query, doc)
        scores.append(score)
    return sorted(scores, reverse=True)
"""
    chunks = splitter.split_document(py_code, "test.py")
    print(f"Python code produced {len(chunks)} chunks")
    # Markdown example
    md_text = """
# RAG技术详解
## 什么是RAG
RAG是检索增强生成的缩写...
## RAG的优势
RAG有以下优势...
"""
    chunks = splitter.split_document(md_text, "test.md")
    print(f"Markdown produced {len(chunks)} chunks")

Key Takeaways

Match the chunking strategy to the document type – code uses syntax‑aware splitters, Markdown uses heading‑aware splitters, and generic text can rely on recursive splitting.

Start with 512 tokens + 15% overlap for most scenarios, then fine‑tune via A/B testing of retrieval precision and recall.

Prevent semantic cut‑offs by increasing overlap for terminology‑dense docs and by preserving structural metadata.

Next, the series will cover vectorization, model selection, and dimensionality considerations.

Tags: LangChain · metadata · RAG · vectorization · AI retrieval · text chunking
Written by AI Architect Hub — discussing AI and architecture; a ten-year veteran of major tech companies now transitioning to AI and continuing the journey.