How Smart Chunk Splitting Boosts RAG Recall from 67% to 91%

This article examines the critical role of chunk splitting in Retrieval‑Augmented Generation (RAG) systems. It compares three generations of methods—from fixed‑size token cuts to sentence‑aware and semantic‑aware strategies—and shows how refined chunking, overlap tuning, and metadata design raise Recall@5 from 0.67 to 0.91 while addressing table, list, and long‑section challenges.


Why Chunk Splitting Is the Foundation of RAG

Vector retrieval encodes a user query and each knowledge‑base chunk into vectors and ranks them by cosine similarity. If a chunk is semantically incomplete—e.g., it cuts a sentence, loses a table header, or isolates a list item—its embedding becomes an “information‑deficient” vector that rarely matches the correct query. In a corpus of 5,000 insurance PDFs (≈80 pages each), roughly 30% of chunks were deficient in this way, hiding a large share of the knowledge base and capping Recall@5 at around 0.67.
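A minimal sketch of this ranking step (assuming an already‑computed query vector and a matrix of L2‑normalized chunk embeddings; the embedding model itself is not specified in the original):

import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank chunks by cosine similarity to the query (all vectors L2-normalized)."""
    scores = chunk_vecs @ query_vec      # cosine similarity equals dot product for unit vectors
    return np.argsort(-scores)[:k]       # indices of the top-k most similar chunks

A chunk whose embedding encodes only half a sentence or a header‑less table lands low in this ranking, which is exactly the failure mode described above.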

Three Generations of Chunking Solutions

V1 – Fixed‑Length Token Splitting (Recall@5 = 0.67)

Split tokens into chunks of at most 512 tokens with a 50‑token overlap.

# `tokenizer` is assumed to be any encoder/decoder (e.g., a tiktoken or Hugging Face
# tokenizer) exposing encode() and decode(); it is shared by all snippets below.
def chunk_v1(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end]))
        start += chunk_size - overlap
    return chunks

This method breaks structured documents, cutting sentences and discarding headings, which explains the low recall.

V2 – Sentence‑Level Splitting (Recall@5 = 0.74)

Split only at natural sentence boundaries and keep a configurable number of overlapping sentences.

import re

def chunk_v2(text: str, max_size: int = 512, overlap_sentences: int = 2) -> list[str]:
    sentences = re.split(r'(?<=[。!?;\n])', text)
    chunks = []
    current_chunk = []
    current_len = 0
    for sent in sentences:
        sent_len = len(tokenizer.encode(sent))
        if current_len + sent_len > max_size and current_chunk:
            chunks.append(''.join(current_chunk))
            current_chunk = current_chunk[-overlap_sentences:]
            current_len = sum(len(tokenizer.encode(s)) for s in current_chunk)
        current_chunk.append(sent)
        current_len += sent_len
    if current_chunk:
        chunks.append(''.join(current_chunk))
    return chunks

Sentence‑aware splitting removes mid‑sentence cuts, raising recall to 0.74, but it still mixes hierarchy levels and mishandles tables and lists.

V3 – Semantic‑Aware, Hierarchy‑Preserving Splitting (Recall@5 = 0.91)

Core ideas:

Detect document hierarchy (chapters, sections, sub‑sections) and split only at semantic boundaries.

Recursively split overly long sections (default 1024 tokens, 1536 for key clauses).

Special handling for tables: keep small tables whole; split large tables by rows while copying the header to every chunk.

Merge list items with their preceding clause; retain the leading sentence in every resulting chunk.

Apply an experimentally validated overlap strategy.

3.1 Document Structure Detection

Insurance documents use mixed numbering schemes (e.g., "1 → 1.1 → 1.1.1", "第一条 → (一) → 1.", "第3条 保险责任 → 3.1 基本责任 → (1)身故保险金"). A multi‑pattern regex classifier is used to assign header levels.

import re
from enum import Enum

class HeaderLevel(Enum):
    H1 = 1  # 第X章/第X条
    H2 = 2  # X.X or (X)
    H3 = 3  # (1)/a) etc.

def detect_header_level(line: str) -> HeaderLevel | None:
    patterns = [
        (HeaderLevel.H1, r'^第[一二三四五六七八九十百\d]+[章条节]'),
        (HeaderLevel.H1, r'^\d+\.\s+[\u4e00-\u9fa5]'),
        (HeaderLevel.H2, r'^\d+\.\d+\s'),
        (HeaderLevel.H2, r'^[（(][一二三四五六七八九十]+[)）]'),   # (一) style sub-headers
        (HeaderLevel.H3, r'^([（(]\d+[)）]|[a-z]\))'),              # (1) or a) style items
    ]
    for level, pattern in patterns:
        if re.match(pattern, line.strip()):
            return level
    return None
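Applied to the heading styles listed above, the classifier behaves as follows (a hypothetical usage example):

for line in ["第3条 保险责任", "3.1 基本责任", "(一)", "(1)身故保险金"]:
    print(line, "->", detect_header_level(line))
# 第3条 保险责任 -> HeaderLevel.H1
# 3.1 基本责任  -> HeaderLevel.H2
# (一)          -> HeaderLevel.H2
# (1)身故保险金 -> HeaderLevel.H3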

All content belonging to the same detected section stays in a single chunk; cross‑section merging is prohibited.

3.2 Recursive Splitting of Over‑Long Sections

Key clauses often exceed the model’s context window. The algorithm first tries to split by sub‑headers; if none exist, it falls back to sentence‑aware splitting with a semantic completeness check.

def split_section(section_text: str, section_path: str, max_size: int = 1024) -> list[dict]:
    """Recursively split a single section.
    max_size: 1024 for normal sections, 1536 for critical clauses.
    """
    tokens = tokenizer.encode(section_text)
    if len(tokens) <= max_size:
        return [{"text": section_text, "section_path": section_path}]
    sub_sections = split_by_sub_headers(section_text)
    if len(sub_sections) > 1:
        result = []
        for sub in sub_sections:
            result.extend(split_section(
                sub["text"],
                f"{section_path} > {sub['title']}",
                max_size
            ))
        return result
    return sentence_aware_split(section_text, section_path, max_size)
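The helper split_by_sub_headers is not shown in the original; a minimal sketch consistent with the header detector from 3.1 could look like this:

def split_by_sub_headers(section_text: str) -> list[dict]:
    """Group a section's lines into sub-sections at detected header lines.
    Returns a single element when no sub-headers are found, which makes the
    caller fall back to sentence-aware splitting."""
    sub_sections = []
    current = {"title": "", "lines": []}
    for line in section_text.splitlines():
        if detect_header_level(line) is not None and current["lines"]:
            sub_sections.append(current)
            current = {"title": line.strip(), "lines": []}
        current["lines"].append(line)
    sub_sections.append(current)
    return [{"title": s["title"], "text": "\n".join(s["lines"])} for s in sub_sections]

A production version would also restrict splitting to the next hierarchy level down, so that an H3 item inside an H2 section does not fragment it prematurely.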

3.3 Table Handling

Two categories:

Small tables (≤300 tokens): keep as a single chunk.

Large tables (>300 tokens or spanning pages): split by rows, prepending the full table header to each chunk.

def split_table(table_text: str, table_title: str, max_size: int = 300) -> list[dict]:
    """Split large tables, preserving the header in each chunk."""
    # parse_table_rows (defined elsewhere) returns the table as a list of row strings;
    # the first two rows are assumed to be the header.
    rows = parse_table_rows(table_text)
    header_rows = rows[:2]
    data_rows = rows[2:]
    header_text = "\n".join(header_rows)
    header_tokens = len(tokenizer.encode(header_text))
    chunks = []
    current_rows = []
    current_tokens = header_tokens
    for row in data_rows:
        row_tokens = len(tokenizer.encode(row))
        if current_tokens + row_tokens > max_size and current_rows:
            chunks.append({"text": header_text + "\n" + "\n".join(current_rows),
                           "metadata": {"type": "table", "title": table_title}})
            current_rows = []
            current_tokens = header_tokens
        current_rows.append(row)
        current_tokens += row_tokens
    if current_rows:
        chunks.append({"text": header_text + "\n" + "\n".join(current_rows),
                       "metadata": {"type": "table", "title": table_title}})
    return chunks

Copying the header into every chunk is essential: without it, the LLM receives bare numbers with no column semantics and cannot interpret the table.

3.4 List Item Handling

Insurance clauses often contain a leading sentence followed by enumerated items. V3 merges the leading sentence with all list items; if the merged text exceeds the size limit, the leading sentence is retained in every resulting chunk.

The following situations are excluded from coverage under this policy:
(1) Nuclear radiation and nuclear contamination
(2) War and military conflict
(3) Intentional acts of the insured
...
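A minimal sketch of this merge-and-repeat rule (the function and argument names are illustrative, reusing the shared tokenizer from the earlier snippets):

def chunk_list_block(lead_sentence: str, items: list[str], max_size: int = 1024) -> list[str]:
    """Merge a leading clause with its enumerated items; when the merged text exceeds
    max_size, split the items but repeat the leading sentence in every chunk."""
    chunks = []
    current = [lead_sentence]
    current_len = len(tokenizer.encode(lead_sentence))
    for item in items:
        item_len = len(tokenizer.encode(item))
        if current_len + item_len > max_size and len(current) > 1:
            chunks.append("\n".join(current))
            current = [lead_sentence]          # carry the lead sentence into the next chunk
            current_len = len(tokenizer.encode(lead_sentence))
        current.append(item)
        current_len += item_len
    chunks.append("\n".join(current))
    return chunks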

3.5 Overlap Strategy Quantification

Experiments compared different overlap sizes. The best trade‑off was 100 tokens, which added ~10% storage while improving Recall@5 from 0.81 to 0.89. A sentence‑aware overlap that extends the overlap region to the nearest sentence end further raised Recall@5 to 0.91 by eliminating 87% of boundary‑cut sentences.

0 tokens – Recall@5 = 0.81 (baseline)

50 tokens – Recall@5 = 0.86 (+5% storage)

100 tokens – Recall@5 = 0.89 (+10% storage)

200 tokens – Recall@5 = 0.90 (+20% storage)

300 tokens – Recall@5 = 0.90 (+30% storage)
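The sentence‑aware variant can be sketched as follows: take roughly the last 100 tokens of the previous chunk, then extend the cut backwards to the nearest sentence end so the overlap never starts mid‑sentence (the character‑level back‑off is an illustrative simplification):

def sentence_aware_overlap(prev_chunk: str, overlap_tokens: int = 100) -> str:
    """Return the tail of the previous chunk to prepend to the next one."""
    tokens = tokenizer.encode(prev_chunk)
    if len(tokens) <= overlap_tokens:
        return prev_chunk
    # character position where a plain 100-token overlap would begin
    cut = len(prev_chunk) - len(tokenizer.decode(tokens[-overlap_tokens:]))
    # walk backwards until the overlap starts right after a sentence-ending character
    while cut > 0 and prev_chunk[cut - 1] not in '。！？；!?;\n':
        cut -= 1
    return prev_chunk[cut:]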

[Figure: RAG three‑generation recall comparison]

Chunk Metadata Design

Each chunk stores the following fields:

from dataclasses import dataclass

@dataclass
class ChunkMetadata:
    doc_id: str          # Original PDF identifier
    chunk_id: str        # Unique chunk identifier
    section_path: str    # Hierarchical path, e.g., "第3条 保险责任 > 3.2 责任免除"
    chunk_type: str      # "text" | "table" | "list"
    is_key_clause: bool  # Marks critical clauses (responsibility, exemption, rate)
    prev_chunk_id: str   # ID of the previous chunk
    next_chunk_id: str   # ID of the next chunk
    token_count: int     # Token count of the chunk
    page_range: str      # Source page range

section_path enables answer provenance; is_key_clause applies a 1.5× weight during retrieval (identified by keyword matching with 94% accuracy); prev/next_chunk_id supports automatic context expansion when a retrieved chunk is incomplete.
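A sketch of how these fields might be used at query time (the in-memory chunk_store and the incomplete-chunk flag are hypothetical; the original only states the behaviour):

def rerank_and_expand(hits: list[dict], chunk_store: dict[str, dict],
                      key_clause_boost: float = 1.5) -> list[dict]:
    """Boost key clauses and pull in neighbouring chunks for incomplete hits."""
    for hit in hits:
        meta = hit["metadata"]
        if meta["is_key_clause"]:
            hit["score"] *= key_clause_boost     # 1.5x weight for responsibility/exemption/rate clauses
        if hit.get("is_incomplete"):             # e.g., chunk ends mid-clause
            for neighbour_id in (meta["prev_chunk_id"], meta["next_chunk_id"]):
                neighbour = chunk_store.get(neighbour_id)
                if neighbour:
                    hit["context"] = hit.get("context", hit["text"]) + "\n" + neighbour["text"]
    return sorted(hits, key=lambda h: h["score"], reverse=True)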

Evaluating Chunk Quality

Build a QA test set: 200 insurance documents, 10 questions each (2,000 QA pairs). Each question is annotated with the ground‑truth chunk.

Run end‑to‑end retrieval and check whether the ground‑truth chunk appears in the top‑5 results.

Break down recall by content type (plain text, table, list, multi‑hop) to locate weaknesses.

This analysis showed V2’s table recall was only 0.51, prompting the table‑header fix in V3, which dramatically lifted overall recall.
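A minimal sketch of the Recall@5 computation over such a test set (the retrieve callable and the QA record layout are illustrative):

from collections import defaultdict

def recall_at_5(qa_pairs: list[dict], retrieve) -> dict[str, float]:
    """qa_pairs: [{"question": ..., "gold_chunk_id": ..., "type": "text" | "table" | "list" | "multi_hop"}]."""
    hits, totals = defaultdict(int), defaultdict(int)
    for qa in qa_pairs:
        totals[qa["type"]] += 1
        if qa["gold_chunk_id"] in retrieve(qa["question"], k=5):
            hits[qa["type"]] += 1
    per_type = {t: hits[t] / totals[t] for t in totals}
    per_type["overall"] = sum(hits.values()) / len(qa_pairs)
    return per_type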

Common Pitfalls

Pitfall 1: Chunk Size Too Small

Using 512 tokens for insurance clauses (average sentence length 1.5× that of generic documents) left many chunks semantically incomplete. Increasing to 1,024 tokens (1,536 for key clauses) added ~7 percentage points to recall.

Pitfall 2: Lost Table Headers

Without header duplication, LLMs cannot interpret raw numbers, resulting in a 43% correct‑answer rate for table queries. Duplicating the header raised this to 78%.

Pitfall 3: Isolated List Items

Standalone list items lose their contextual clause, dropping recall for negative‑query questions to 0.58. Merging the leading sentence raised it to 0.83.

[Figure: Impact of chunking strategies on different question types]

Frontier Directions

Semantic Chunking

Use embeddings to locate semantic jump points (sharp drops in similarity between adjacent sentences). This yields finer boundaries but incurs high inference cost and is unstable on short texts, so it remains a future candidate.
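A sketch of the jump-point idea (assuming an embed function that returns one vector per sentence; the threshold is illustrative):

import numpy as np

def semantic_boundaries(sentences: list[str], threshold: float = 0.6) -> list[int]:
    """Return sentence indices where similarity to the previous sentence drops sharply."""
    vecs = np.asarray([embed(s) for s in sentences])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)    # cosine similarity of adjacent sentences
    return [i + 1 for i, s in enumerate(sims) if s < threshold]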

Late Chunking

Encode whole documents (or large sections) with a long‑context model, then split in the embedding space. It preserves full‑context information but currently requires 8–10× more memory than rule‑based methods, limiting production use.

Dynamic Chunk Size

Adjust chunk size based on semantic density: key clauses → 1,536 tokens, dense sections → 1,024, brief introductions → 512. Early A/B tests show a 3–5% recall gain for multi‑hop queries, though the density‑scoring step itself consumes additional LLM inference time.
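Expressed as a simple policy (the density score itself would come from the extra LLM scoring pass mentioned above; the thresholds are illustrative):

def choose_chunk_size(is_key_clause: bool, density_score: float) -> int:
    """Map clause importance and semantic density to a target chunk size in tokens."""
    if is_key_clause:
        return 1536
    return 1024 if density_score >= 0.5 else 512   # dense sections get larger chunks, brief intros smaller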

Tags: LLM, RAG, Information Retrieval, Chunking, Recall Optimization
Written by Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.
