How Smart Chunk Splitting Boosts RAG Recall from 67% to 91%
This article examines the critical role of chunk splitting in Retrieval-Augmented Generation (RAG) systems. It compares three generations of methods, from fixed-size token cuts to sentence-aware and semantic-aware strategies, and shows how refined chunking, overlap tuning, and metadata design raise Recall@5 from 0.67 to 0.91 while addressing table, list, and long-section challenges.
Why Chunk Splitting Is the Foundation of RAG
Vector retrieval encodes a user query and each knowledge-base chunk into vectors and ranks them by cosine similarity. If a chunk is semantically incomplete—e.g., it cuts a sentence, loses a table header, or isolates a list item—its embedding becomes an "information-deficient" vector that rarely matches the correct query. In a corpus of 5,000 insurance PDFs (≈80 pages each), roughly 30% of chunks were information-deficient, hiding a large share of the knowledge base and capping Recall@5 at about 0.67.
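The ranking step can be sketched in plain Python (illustrative only; a production system uses an embedding model plus a vector index rather than brute-force scoring):

```python
import math

def cosine_top_k(query: list[float], chunks: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    # Rank all chunks by cosine similarity, highest first.
    ranked = sorted(range(len(chunks)), key=lambda i: cos(query, chunks[i]), reverse=True)
    return ranked[:k]
```

An information-deficient chunk simply scores low here for the queries it should answer, which is why no amount of retriever tuning can fully compensate for bad chunking.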
Three Generations of Chunking Solutions
V1 – Fixed‑Length Token Splitting (Recall@5 = 0.67)
Split tokens into chunks of at most 512 tokens with a 50‑token overlap.
```python
# `tokenizer` is assumed to be a pre-initialized tokenizer (e.g., tiktoken).
def chunk_v1(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end]))
        start += chunk_size - overlap
    return chunks
```
This method breaks structured documents: it cuts sentences mid-way and discards headings, which explains the low recall.
V2 – Sentence‑Level Splitting (Recall@5 = 0.74)
Split only at natural sentence boundaries and keep a configurable number of overlapping sentences.
```python
import re

def chunk_v2(text: str, max_size: int = 512, overlap_sentences: int = 2) -> list[str]:
    # Split after Chinese sentence-ending punctuation or newlines, keeping the delimiter.
    sentences = re.split(r'(?<=[。!?;\n])', text)
    chunks = []
    current_chunk = []
    current_len = 0
    for sent in sentences:
        sent_len = len(tokenizer.encode(sent))
        if current_len + sent_len > max_size and current_chunk:
            chunks.append(''.join(current_chunk))
            # Carry the last few sentences over as overlap.
            current_chunk = current_chunk[-overlap_sentences:]
            current_len = sum(len(tokenizer.encode(s)) for s in current_chunk)
        current_chunk.append(sent)
        current_len += sent_len
    if current_chunk:
        chunks.append(''.join(current_chunk))
    return chunks
```
Sentence-aware splitting removes mid-sentence cuts, raising recall to 0.74, but it still mixes hierarchy levels and mishandles tables and lists.
V3 – Semantic‑Aware, Hierarchy‑Preserving Splitting (Recall@5 = 0.91)
Core ideas:
Detect document hierarchy (chapters, sections, sub‑sections) and split only at semantic boundaries.
Recursively split overly long sections (default 1024 tokens, 1536 for key clauses).
Special handling for tables: keep small tables whole; split large tables by rows while copying the header to every chunk.
Merge list items with their preceding clause; retain the leading sentence in every resulting chunk.
Apply an experimentally validated overlap strategy.
3.1 Document Structure Detection
Insurance documents use mixed numbering schemes (e.g., "1 → 1.1 → 1.1.1", "第一条 → (一) → 1.", "第3条 保险责任 → 3.1 基本责任 → (1)身故保险金"). A multi‑pattern regex classifier is used to assign header levels.
```python
import re
from enum import Enum

class HeaderLevel(Enum):
    H1 = 1  # 第X章 / 第X条 (Chapter X / Article X)
    H2 = 2  # X.X or (一)
    H3 = 3  # (1) / a) etc.

def detect_header_level(line: str) -> HeaderLevel | None:
    patterns = [
        (HeaderLevel.H1, r'^第[一二三四五六七八九十百\d]+[章条节]'),
        (HeaderLevel.H1, r'^\d+\.\s+[\u4e00-\u9fa5]'),
        (HeaderLevel.H2, r'^\d+\.\d+\s'),
        (HeaderLevel.H2, r'^[((][一二三四五六七八九十]+[))]'),
        (HeaderLevel.H3, r'^([((]\d+[))]|[a-z]\))'),
    ]
    for level, pattern in patterns:
        if re.match(pattern, line.strip()):
            return level
    return None
```
All content belonging to the same detected section stays in a single chunk; cross-section merging is prohibited.
3.2 Recursive Splitting of Over‑Long Sections
Key clauses often exceed the model’s context window. The algorithm first tries to split by sub‑headers; if none exist, it falls back to sentence‑aware splitting with a semantic completeness check.
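Neither `split_by_sub_headers` nor `sentence_aware_split` is defined in the article. Here is a minimal sketch of the former, assuming sub-headers can be recognized line by line; the single inlined pattern stands in for the multi-pattern detector of section 3.1:

```python
import re

# Stand-in for detect_header_level: matches "X.X " and "(一)"-style sub-headers.
SUB_HEADER = re.compile(r'^(\d+\.\d+\s.*|[((][一二三四五六七八九十\d]+[))].*)$')

def split_by_sub_headers(section_text: str) -> list[dict]:
    """Return [{'title': ..., 'text': ...}] blocks; a single block if no sub-headers exist."""
    blocks, title, lines = [], '', []
    for line in section_text.splitlines():
        if SUB_HEADER.match(line.strip()):
            # A new sub-header closes the previous block.
            if lines:
                blocks.append({"title": title, "text": "\n".join(lines)})
            title, lines = line.strip(), [line]
        else:
            lines.append(line)
    if lines:
        blocks.append({"title": title, "text": "\n".join(lines)})
    return blocks
```

If the text contains no matching sub-headers, the function returns one block, which is exactly the condition that triggers the sentence-aware fallback below.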
```python
def split_section(section_text: str, section_path: str, max_size: int = 1024) -> list[dict]:
    """Recursively split a single section.

    max_size: 1024 for normal sections, 1536 for critical clauses.
    """
    tokens = tokenizer.encode(section_text)
    if len(tokens) <= max_size:
        return [{"text": section_text, "section_path": section_path}]
    # Prefer splitting at sub-header boundaries.
    sub_sections = split_by_sub_headers(section_text)
    if len(sub_sections) > 1:
        result = []
        for sub in sub_sections:
            result.extend(split_section(
                sub["text"],
                f"{section_path} > {sub['title']}",
                max_size
            ))
        return result
    # No sub-headers: fall back to sentence-aware splitting.
    return sentence_aware_split(section_text, section_path, max_size)
```
3.3 Table Handling
Two categories:
Small tables (≤300 tokens): keep as a single chunk.
Large tables (>300 tokens or spanning pages): split by rows and prepend the full table header to each chunk.
```python
def split_table(table_text: str, table_title: str, max_size: int = 300) -> list[dict]:
    """Split large tables, preserving the header in each chunk."""
    rows = parse_table_rows(table_text)
    header_rows = rows[:2]
    data_rows = rows[2:]
    header_text = "\n".join(header_rows)
    header_tokens = len(tokenizer.encode(header_text))
    chunks = []
    current_rows = []
    current_tokens = header_tokens
    for row in data_rows:
        row_tokens = len(tokenizer.encode(row))
        if current_tokens + row_tokens > max_size and current_rows:
            chunks.append({"text": header_text + "\n" + "\n".join(current_rows),
                           "metadata": {"type": "table", "title": table_title}})
            current_rows = []
            current_tokens = header_tokens
        current_rows.append(row)
        current_tokens += row_tokens
    if current_rows:
        chunks.append({"text": header_text + "\n" + "\n".join(current_rows),
                       "metadata": {"type": "table", "title": table_title}})
    return chunks
```
Copying the header into every chunk is essential: without it, LLMs cannot interpret the raw numbers.
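`parse_table_rows` is left undefined in the article; a minimal stand-in, assuming tables are serialized as Markdown-style pipe rows (a real pipeline would reuse the PDF extractor's own table structure instead):

```python
def parse_table_rows(table_text: str) -> list[str]:
    """Hypothetical helper: return one string per table row.

    Assumes Markdown-style pipe tables ("| col | col |"); non-table
    lines interleaved by the extractor are dropped.
    """
    return [line for line in table_text.splitlines()
            if line.strip().startswith('|')]
```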
3.4 List Item Handling
Insurance clauses often contain a leading sentence followed by enumerated items. V3 merges the leading sentence with all list items; if the merged text exceeds the size limit, the leading sentence is retained in every resulting chunk.
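A minimal sketch of this merge; the function name is illustrative, and character counts stand in for token counts:

```python
def merge_list_items(lead_sentence: str, items: list[str],
                     max_size: int = 1024) -> list[str]:
    """Merge a leading clause with its enumerated items.

    If everything fits, emit one chunk; otherwise emit several,
    each repeating the leading sentence so no item loses its context.
    """
    merged = lead_sentence + "\n" + "\n".join(items)
    if len(merged) <= max_size:
        return [merged]
    chunks, batch = [], []
    for item in items:
        batch.append(item)
        # Flush a batch once lead + items would exceed the limit.
        if len(lead_sentence) + sum(len(i) for i in batch) > max_size:
            chunks.append(lead_sentence + "\n" + "\n".join(batch))
            batch = []
    if batch:
        chunks.append(lead_sentence + "\n" + "\n".join(batch))
    return chunks
```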
本保险以下情况不在承保范围之内: (The following are not covered by this insurance:)
(1)核辐射及核污染 (nuclear radiation and contamination)
(2)战争、军事冲突 (war and military conflict)
(3)被保险人故意行为 (intentional acts of the insured)
...
3.5 Overlap Strategy Quantification
Experiments compared different overlap sizes. The best trade‑off was 100 tokens, which added ~10% storage while improving Recall@5 from 0.81 to 0.89. A sentence‑aware overlap that extends the overlap region to the nearest sentence end further raised Recall@5 to 0.91 by eliminating 87% of boundary‑cut sentences.
0 token – Recall@5 = 0.81 (baseline)
50 tokens – Recall@5 = 0.86 (+5% storage)
100 tokens – Recall@5 = 0.89 (+10% storage)
200 tokens – Recall@5 = 0.90 (+20% storage)
300 tokens – Recall@5 = 0.90 (+30% storage)
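The sentence-aware overlap can be sketched as follows; this illustrative helper uses character counts in place of token counts and a generic sentence-end pattern:

```python
import re

def sentence_aware_overlap(prev_chunk: str, target_overlap: int = 100) -> str:
    """Take roughly `target_overlap` trailing characters from the previous
    chunk, extended backward to the nearest sentence start so the overlap
    never begins mid-sentence."""
    sentences = re.split(r'(?<=[。!?;.])', prev_chunk)
    overlap, size = [], 0
    for sent in reversed(sentences):
        if size >= target_overlap:
            break
        overlap.insert(0, sent)
        size += len(sent)
    return ''.join(overlap)
```

The returned text is prepended to the next chunk, which is what eliminated most boundary-cut sentences in the experiments above.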
Chunk Metadata Design
Each chunk stores the following fields:
```python
from dataclasses import dataclass

@dataclass
class ChunkMetadata:
    doc_id: str          # Original PDF identifier
    chunk_id: str        # Unique chunk identifier
    section_path: str    # Hierarchical path, e.g., "第3条 保险责任 > 3.2 责任免除"
    chunk_type: str      # "text" | "table" | "list"
    is_key_clause: bool  # Marks critical clauses (liability, exemption, rate)
    prev_chunk_id: str   # ID of the previous chunk
    next_chunk_id: str   # ID of the next chunk
    token_count: int     # Token count of the chunk
    page_range: str      # Source page range
```
section_path enables answer provenance; is_key_clause applies a 1.5× weight during retrieval (key clauses are identified by keyword matching with 94% accuracy); prev_chunk_id/next_chunk_id support automatic context expansion when a retrieved chunk is incomplete.
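The context expansion can be sketched as follows; the in-memory chunk store here is a stand-in (a real system would read neighbors from the vector database's payload store):

```python
def expand_context(chunk: dict, store: dict[str, dict], max_hops: int = 1) -> str:
    """Concatenate a chunk with up to `max_hops` neighbors on each side.

    `store` maps chunk_id -> {"text": ..., "prev_chunk_id": ..., "next_chunk_id": ...}.
    """
    texts = [chunk["text"]]
    prev_id = chunk.get("prev_chunk_id")
    next_id = chunk.get("next_chunk_id")
    for _ in range(max_hops):
        if prev_id and prev_id in store:
            texts.insert(0, store[prev_id]["text"])
            prev_id = store[prev_id].get("prev_chunk_id")
        if next_id and next_id in store:
            texts.append(store[next_id]["text"])
            next_id = store[next_id].get("next_chunk_id")
    return "\n".join(texts)
```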
Evaluating Chunk Quality
Build a QA test set: 200 insurance documents, 10 questions each (2,000 QA pairs). Each question is annotated with the ground‑truth chunk.
Run end‑to‑end retrieval and check whether the ground‑truth chunk appears in the top‑5 results.
Break down recall by content type (plain text, table, list, multi‑hop) to locate weaknesses.
This analysis showed V2’s table recall was only 0.51, prompting the table‑header fix in V3, which dramatically lifted overall recall.
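The evaluation loop reduces to a Recall@k computation; a minimal sketch with a hypothetical `retrieve(question, k)` API that returns ranked chunk IDs:

```python
def recall_at_k(qa_pairs: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose ground-truth chunk appears in the top-k.

    `qa_pairs`: [{"question": ..., "gold_chunk_id": ...}];
    `retrieve` is the end-to-end retriever under test (assumed interface).
    """
    hits = sum(
        1 for qa in qa_pairs
        if qa["gold_chunk_id"] in retrieve(qa["question"], k)
    )
    return hits / len(qa_pairs)
```

Running this per content type (filtering `qa_pairs` by the gold chunk's `chunk_type`) is what surfaced the table-recall weakness.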
Common Pitfalls
Pitfall 1: Chunk Size Too Small
Using 512 tokens for insurance clauses (average sentence length 1.5× that of generic documents) left many chunks semantically incomplete. Increasing to 1,024 tokens (1,536 for key clauses) added ~7 percentage points to recall.
Pitfall 2: Lost Table Headers
Without header duplication, LLMs cannot interpret raw numbers, resulting in a 43% correct‑answer rate for table queries. Duplicating the header raised this to 78%.
Pitfall 3: Isolated List Items
Standalone list items lose their contextual clause, dropping recall for negative‑query questions to 0.58. Merging the leading sentence raised it to 0.83.
Frontier Directions
Semantic Chunking
Use embeddings to locate semantic jump points (sharp drops in similarity between adjacent sentences). This yields finer boundaries but incurs high inference cost and is unstable on short texts, so it remains a future candidate.
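The jump-point idea can be sketched as follows, with a stand-in `embed` function (any sentence-embedding model) and an assumed similarity-drop threshold:

```python
import math

def find_jump_points(sentences: list[str], embed, threshold: float = 0.5) -> list[int]:
    """Return indices i where the similarity between sentence i and i+1
    falls below `threshold` — candidate chunk boundaries.

    `embed(sentence)` is assumed to return a vector per sentence.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    vecs = [embed(s) for s in sentences]
    return [i for i in range(len(vecs) - 1) if cos(vecs[i], vecs[i + 1]) < threshold]
```

The per-sentence embedding pass is exactly the inference cost that keeps this approach out of the production pipeline for now.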
Late Chunking
Encode whole documents (or large sections) with a long‑context model, then split in the embedding space. It preserves full‑context information but currently requires 8–10× more memory than rule‑based methods, limiting production use.
Dynamic Chunk Size
Adjust chunk size based on semantic density: key clauses → 1,536 tokens, dense sections → 1,024, brief introductions → 512. Early A/B tests show a 3–5% recall gain for multi‑hop queries, though the density‑scoring step itself consumes additional LLM inference time.
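A sketch of such a size-selection rule, with hypothetical thresholds; the density scorer itself (the expensive LLM step) is assumed upstream:

```python
def choose_chunk_size(is_key_clause: bool, density_score: float) -> int:
    """Map section importance and semantic density to a target chunk size.

    `density_score` in [0, 1] is assumed to come from an upstream
    LLM- or heuristic-based scorer (not shown); thresholds are illustrative.
    """
    if is_key_clause:
        return 1536
    if density_score >= 0.6:  # information-dense section
        return 1024
    return 512                # brief introduction or boilerplate
```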
Wu Shixiong's Large Model Academy
We continuously share practical large-model know-how covering LLMs, RAG, fine-tuning, and deployment, helping career-switchers, autumn-recruitment candidates, and engineers seeking stable large-model positions build core skills from zero to job offer.