Mastering Chunk Splitting for RAG: From Fixed Length to Semantic Segmentation

Chunk splitting, a critical yet often overlooked step in RAG pipelines, dramatically impacts retrieval recall and LLM output quality; this guide walks through three evolution stages—from naive fixed‑length splits to sentence‑aware overlaps and finally semantic, structure‑driven segmentation—complete with code, experiments, and practical pitfalls.


Why Chunk Splitting Matters

In Retrieval‑Augmented Generation (RAG) systems, the way documents are broken into chunks directly influences two core metrics: retrieval recall and the quality of LLM‑generated answers. A poor split can cause relevant content to be missed or give the LLM incomplete information, leading to incorrect responses.

Three Generations of Chunking Strategies

V1 – Fixed‑Length Splitting

The simplest approach cuts text into 512‑token pieces with a 100‑token overlap. This quickly breaks down on insurance contracts: clauses get split mid‑thought, and recall drops to only 67%.
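For concreteness, here is a minimal sketch of the V1 baseline, approximating tokens by characters (a real pipeline would count with an actual tokenizer such as tiktoken; the function name is illustrative):

def fixed_length_split(text, chunk_size=512, overlap=100):
    # Naive V1: advance by (chunk_size - overlap) and cut blindly,
    # with no regard for sentence or clause boundaries
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks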

V2 – Sentence‑Boundary Splitting

Using punctuation as split points improves recall to 74%, but list items and headings can still be torn apart, breaking semantic units.
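A V2 sketch under the same character-as-token assumption: split after sentence-ending punctuation first, then pack whole sentences greedily up to the size limit:

import re

def sentence_split(text, chunk_size=512):
    # Split after Chinese or Western sentence-ending punctuation,
    # keeping the punctuation attached to its sentence
    sentences = re.split(r'(?<=[。！？.!?])', text)
    chunks, buf = [], ''
    for sent in sentences:
        if buf and len(buf) + len(sent) > chunk_size:
            chunks.append(buf)
            buf = ''
        buf += sent
    if buf:
        chunks.append(buf)
    return chunks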

V3 – Semantic, Structure‑Aware Splitting (Final Solution)

The core idea shifts from "shorten the text" to "recognize semantic units and split along them". The process identifies the document hierarchy (titles, subtitles, paragraphs, tables, images) and treats each logical unit as a chunk, achieving a recall of 91%.

Core Logic of the V3 Solution

Step 1: Identify Document Hierarchy

We combine three strategies:

import re

# Strategy 1: Regex patterns for common numbering styles
patterns = [
    r'^第[一二三四五六七八九十百]+条',       # 第三条 (Article 3)
    r'^\d+\.\d+\.\d+',                      # 1.1.1
    r'^[（(][一二三四五六七八九十]+[）)]',   # （一） (parenthesized numeral)
    r'^\d+\.',                              # 1.
]

# Strategy 2: Use style information (font size, bold, indent)
# Strategy 3: Train a lightweight XGBoost classifier on features such as numbering type, font size, boldness, indent level

Training on 500 annotated documents yields a hierarchy‑recognition accuracy of 94%, far above the 77% from pure rules.
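A minimal sketch of Strategy 3, assuming each line is described by a small feature vector (the feature layout, toy data, and hyperparameters below are illustrative, not the production setup):

import numpy as np
import xgboost as xgb

# Features per line: [numbering_type, font_size, is_bold, indent_level]
# Labels: 1 = heading, 0 = body text; real training used 500 annotated documents
X_train = np.array([
    [1, 16.0, 1, 0],   # '第三条', large and bold, no indent -> heading
    [0, 10.5, 0, 2],   # indented plain paragraph            -> body
    [2, 12.0, 1, 1],   # '1.1.1', bold, slight indent        -> heading
    [0, 10.5, 0, 0],   # plain paragraph                     -> body
])
y_train = np.array([1, 0, 1, 0])

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)

print(clf.predict(np.array([[1, 15.0, 1, 0]])))  # -> [1], classified as a heading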

Step 2: Semantic Unit Splitting

After hierarchy detection, splitting follows a recursive rule set (a code sketch follows the list):

If a chapter is ≤ 1024 tokens, keep it as a single chunk.

If longer, split by sub‑titles; each sub‑chapter is processed recursively.

If no sub‑titles, accumulate paragraphs until the token limit is reached.

If a single paragraph exceeds the limit, fall back to sentence boundaries.

Tables and code blocks are treated as indivisible units; large tables become separate chunks.
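A minimal sketch of the recursion, assuming Step 1 yields a section tree and that count_tokens wraps a real tokenizer (the Section class and helper names are illustrative; the sentence-level fallback of Rule 4 is noted but omitted for brevity):

from dataclasses import dataclass, field

MAX_TOKENS = 1024

@dataclass
class Section:
    title: str
    paragraphs: list = field(default_factory=list)  # body paragraphs of this section
    children: list = field(default_factory=list)    # sub-sections

def count_tokens(text):
    return len(text)  # placeholder: swap in a real tokenizer

def flatten(sec):
    parts = [sec.title] + sec.paragraphs
    parts += [flatten(child) for child in sec.children]
    return '\n'.join(parts)

def pack_paragraphs(title, paragraphs):
    # Rule 3: accumulate paragraphs until the token limit is reached
    chunks, buf = [], title
    for para in paragraphs:
        # Rule 4 would apply here: a single over-long paragraph falls
        # back to sentence-boundary splitting (omitted)
        if buf and count_tokens(buf + '\n' + para) > MAX_TOKENS:
            chunks.append(buf)
            buf = para
        else:
            buf += '\n' + para
    return chunks + [buf] if buf else chunks

def split_section(sec):
    # Rule 1: the whole section fits -> keep it as a single chunk
    if count_tokens(flatten(sec)) <= MAX_TOKENS:
        return [flatten(sec)]
    # Rule 2: split by sub-titles and recurse into each sub-section
    if sec.children:
        chunks = pack_paragraphs(sec.title, sec.paragraphs)  # intro text before the first sub-title
        for child in sec.children:
            chunks.extend(split_section(child))
        return chunks
    return pack_paragraphs(sec.title, sec.paragraphs)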

Step 3: Semantic Integrity Check

After the initial split, adjacent chunks are examined for improper breaks. The following predicate flags pairs that should be merged back together:

import re

def should_merge(chunk1, chunk2):
    # Scene 1: chunk1 ends with a colon or semicolon, a list likely follows
    if chunk1.strip().endswith(('：', ':', '；', ';')):
        return True
    # Scene 2: chunk2 starts with a list-item marker, e.g. (1), ①, a.
    if re.match(r'^[（(]\d+[）)]|^[①②③④⑤]|^[a-z]\.', chunk2.strip()):
        return True
    # Scene 3: chunk2 begins with a transition word ("but", "however",
    # "except", "excluding") that depends on the previous chunk
    if chunk2.strip().startswith(('但', '然而', '除外', '不包括')):
        return True
    return False

This check eliminated the "nuclear radiation" incident where the exclusion clause was mistakenly separated.
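In practice the predicate drives a single linear pass over the chunk list; a minimal sketch, where count_tokens is the same assumed helper as above and merged neighbors must still respect the token limit:

def merge_semantic_breaks(chunks, max_tokens=1024):
    if not chunks:
        return []
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        candidate = merged[-1] + '\n' + chunk
        # Merge only when the predicate fires and the result still fits
        if should_merge(merged[-1], chunk) and count_tokens(candidate) <= max_tokens:
            merged[-1] = candidate
        else:
            merged.append(chunk)
    return merged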

Step 4: Smart Overlap

Fixed overlap can cut sentences mid‑way. We compute overlap based on the nearest sentence boundary:

def add_smart_overlap(chunks, overlap_tokens=100):
    for i in range(1, len(chunks)):
        prev_text = chunks[i-1].content
        # Take the tail of the previous chunk (get_last_n_tokens is an
        # assumed helper returning roughly the last N tokens of a string)
        overlap_text = get_last_n_tokens(prev_text, overlap_tokens)
        # Trim to the nearest sentence boundary so the overlap starts cleanly
        sentence_ends = ['。', '？', '！', '.', '?', '!']
        last_end = max(overlap_text.rfind(e) for e in sentence_ends)
        if last_end >= 0:
            overlap_text = overlap_text[last_end+1:]
        chunks[i].content = overlap_text + chunks[i].content
    return chunks

Experiments showed that a 100‑token overlap offers the best trade‑off: recall rises to 0.91, storage grows by 10%, and latency by only 5 ms.

Overlap (tokens)   Recall   Storage   Latency
0                  0.81     0%        baseline
50                 0.86     +5%       +2 ms
100                0.91     +10%      +5 ms
200                0.92     +20%      +12 ms

Handling Special Elements

Tables: Small tables stay whole; large tables are split by logical groups or a fixed number of rows, always copying the header into each chunk so the LLM knows what the columns mean (see the sketch after this list).

Images: Diagrams are either described by a multimodal model or converted to structured data; decorative images are ignored.

List Items: The leading sentence is merged with all list entries; if the merged chunk exceeds the limit, the leading sentence is repeated at the start of each sub‑chunk.
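A minimal sketch of row-wise table splitting with header replication, assuming a table arrives as a header row plus a list of data-row strings (the part-annotation format and rows_per_chunk default are illustrative):

def split_table(header_row, data_rows, rows_per_chunk=20):
    # Large tables are cut by fixed row count; the header travels with
    # every part so the LLM knows what each column means
    total_parts = -(-len(data_rows) // rows_per_chunk)  # ceiling division
    chunks = []
    for part, start in enumerate(range(0, len(data_rows), rows_per_chunk), 1):
        body = data_rows[start:start + rows_per_chunk]
        chunks.append('\n'.join([f'[Table part {part}/{total_parts}]', header_row] + body))
    return chunks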

Chunk Metadata Design

Each chunk carries rich metadata to aid retrieval and presentation:

{
  "content": "保险金额为50万元...",
  "metadata": {
    "doc_id": "contract_123",
    "chunk_id": "chunk_045",
    "page_num": 5,
    "section_path": "第二章 保险条款 > 2.1 保险责任",
    "content_type": "text",
    "is_key_clause": true,
    "prev_chunk_id": "chunk_044",
    "next_chunk_id": "chunk_046",
    "confidence": 0.96
  }
}

Key fields:

section_path: Enables answer provenance.

is_key_clause: Gives higher weight to critical clauses during retrieval.

prev/next_chunk_id: Allows automatic context expansion when a retrieved chunk is semantically incomplete.
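A minimal sketch of that expansion at retrieval time, assuming chunks live in a dict keyed by chunk_id with the metadata shown above (the store layout and max_hops parameter are assumptions, not the article's exact implementation):

def expand_context(hit_id, chunk_store, max_hops=1):
    # Stitch neighboring chunks around a retrieved hit via prev/next ids
    hit = chunk_store[hit_id]
    texts = [hit['content']]
    prev_id = hit['metadata'].get('prev_chunk_id')
    next_id = hit['metadata'].get('next_chunk_id')
    for _ in range(max_hops):
        if prev_id in chunk_store:
            texts.insert(0, chunk_store[prev_id]['content'])
            prev_id = chunk_store[prev_id]['metadata'].get('prev_chunk_id')
        if next_id in chunk_store:
            texts.append(chunk_store[next_id]['content'])
            next_id = chunk_store[next_id]['metadata'].get('next_chunk_id')
    return '\n'.join(texts)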

Five Real‑World Pitfalls and Fixes

Pitfall 1: Chunk size too small

512‑token chunks split long insurance clauses, losing meaning. Fix: Increase to 1024 tokens, with up to 1536 for key sections.

Pitfall 2: Table header loss

Large tables were split without headers, confusing the LLM. Fix: Replicate the header in every table chunk and annotate part numbers.

Pitfall 3: List items isolated

Separate list items lost their introductory context. Fix: Merge the leading sentence with all list entries.

Pitfall 4: Cross‑page paragraph break

Paragraphs spanning pages were treated as two separate chunks. Fix: Detect missing chapter titles and merge across pages.

Pitfall 5: Overlap cuts mid‑sentence

Fixed 100‑token overlap sometimes split a sentence, leaving fragments. Fix: Use sentence‑boundary‑aware overlap as described above.

Interview Answer Blueprint

Explain why chunking matters (10 s): it affects recall and LLM answer quality.

Summarize the three generations (30 s): 67% → 74% → 91% recall.

Detail the key tactics (1 min): regex + style + XGBoost hierarchy detection, indivisible table/list handling, recursive splitting, the semantic integrity check, smart overlap (100 tokens).

Highlight common pitfalls and solutions (30 s).

Final Thoughts

Chunk splitting may lack flashy models or complex math, but it is the most "ground‑level" component of a RAG system. Optimizing it from 67 % to 91 % recall delivers the bulk of performance gains, and interviewers love probing this topic to assess a candidate’s depth of engineering insight.

Tags: LLM, RAG, retrieval, semantic segmentation, chunking
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.