Mastering Text Chunking: 21 Strategies to Supercharge Your RAG Pipelines
This comprehensive guide presents 21 practical text‑chunking techniques—from simple line‑based splits to advanced embedding‑ and LLM‑driven methods—covering their implementations, code examples, and ideal use‑cases to help you build efficient Retrieval‑Augmented Generation systems while avoiding common pitfalls.
Why Chunking Matters for RAG
When building Retrieval‑Augmented Generation (RAG) pipelines, the way you split documents into chunks directly affects retrieval relevance and generation quality. Chunks that are too large introduce noise; chunks that are too small lose context. This article systematically reviews 21 chunking strategies, provides ready‑to‑run Python code, and offers guidance on when to choose each method.
Basic Chunking Strategies (6)
1. Naïve Line Chunking
Split text at every newline character.
def naive_chunking(text: str):
    """Split text by line breaks."""
    chunks = text.split('\n')
    chunks = [c.strip() for c in chunks if c.strip()]
    return chunks

sample_text = """Neural networks consist of input, hidden, and output layers.
Back‑propagation is the key training algorithm.
Gradient descent optimises the weights."""

for i, chunk in enumerate(naive_chunking(sample_text), 1):
    print(f"Chunk {i}: {chunk}")

When to use: Documents already organized by line (notes, FAQs, chat logs).
2. Fixed‑Size Chunking
Divide text into equal‑sized word windows, optionally with overlap.
def fixed_size_chunking(text: str, chunk_size: int = 100, overlap: int = 0):
    words = text.split()
    chunks = []
    step = max(1, chunk_size - overlap)  # guard against overlap >= chunk_size
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

When to use: Raw dumps, scanned documents, or any unstructured text without clear delimiters.
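A quick run with toy values makes the overlap arithmetic concrete. The function is repeated here (with the same step guard) so the snippet is self‑contained; the small `chunk_size` and `overlap` are purely illustrative:

```python
def fixed_size_chunking(text, chunk_size=100, overlap=0):
    words = text.split()
    step = max(1, chunk_size - overlap)  # how far each window advances
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), step)
            if words[i:i + chunk_size]]

text = ' '.join(f'w{i}' for i in range(20))  # "w0 w1 ... w19"
chunks = fixed_size_chunking(text, chunk_size=8, overlap=3)
# step = 8 - 3 = 5: each chunk starts 5 words after the previous one,
# so consecutive chunks share their last/first 3 words.
print(chunks[0])  # w0 w1 w2 w3 w4 w5 w6 w7
print(chunks[1])  # w5 w6 w7 w8 w9 w10 w11 w12
```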
3. Sliding Window Chunking
Same as fixed‑size but each window overlaps with the previous one, preserving context.
def sliding_window_chunking(text: str, chunk_size: int = 100, overlap: int = 20):
    return fixed_size_chunking(text, chunk_size, overlap)

When to use: Long narratives where continuity between chunks matters.
4. Sentence‑Based Chunking
Split at sentence boundaries using a regular expression.
import re

def sentence_chunking(text: str):
    sentences = re.split(r'(?<=[。!?.!?])\s+', text.strip())
    return [s for s in sentences if s]

When to use: Well‑written prose where each sentence conveys a complete idea.
5. Paragraph‑Based Chunking
Split on double newlines, keeping each paragraph intact.
def paragraph_chunking(text: str):
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p.strip()]

When to use: Articles, manuals, or reports where paragraphs are logical units.
6. Page‑Based Chunking (PDF)
Extract each physical page from a PDF using PyPDF2 and keep page metadata.
import PyPDF2

def page_based_chunking(pdf_path: str, start_page: int = 0, end_page: int = None):
    chunks = []
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        if end_page is None:
            end_page = len(reader.pages)
        for i in range(start_page, end_page):
            page = reader.pages[i]
            text = page.extract_text()
            if text:
                chunks.append({
                    'type': 'page',
                    'content': text.strip(),
                    'metadata': {'page_number': i + 1, 'total_pages': len(reader.pages)}
                })
    return chunks

When to use: Legal contracts, academic papers, or any PDF where page references matter.
Structured Chunking Strategies (7)
7. Structured (JSON/XML/CSV) Chunking
Recursively walk hierarchical data structures and emit chunks that respect the inherent hierarchy.
import json, xml.etree.ElementTree as ET

def structured_json_chunking(json_data):
    if isinstance(json_data, str):
        json_data = json.loads(json_data)
    chunks = []
    def walk(node, path=''):
        if isinstance(node, dict):
            for k, v in node.items():
                new_path = f"{path}.{k}" if path else k
                if isinstance(v, (dict, list)):
                    walk(v, new_path)
                else:
                    chunks.append({'type': 'json', 'path': new_path, 'content': str(v)})
        elif isinstance(node, list):
            for i, item in enumerate(node):
                walk(item, f"{path}[{i}]")
        else:
            # leaf reached through a list (e.g. a scalar array element)
            chunks.append({'type': 'json', 'path': path, 'content': str(node)})
    walk(json_data)
    return chunks

When to use: Config files, API responses, logs.
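A short self‑contained run shows how the emitted paths read. `json_leaf_chunks` below is a compact variant of the walker (with an explicit leaf branch so scalar list items are emitted too), and `config` is a made‑up example:

```python
import json

def json_leaf_chunks(data, path=''):
    """Compact walker: every leaf value becomes one chunk keyed by its path."""
    chunks = []
    if isinstance(data, dict):
        for k, v in data.items():
            chunks.extend(json_leaf_chunks(v, f"{path}.{k}" if path else k))
    elif isinstance(data, list):
        for i, item in enumerate(data):
            chunks.extend(json_leaf_chunks(item, f"{path}[{i}]"))
    else:
        chunks.append({'path': path, 'content': str(data)})
    return chunks

config = json.loads('{"server": {"host": "localhost", "ports": [80, 443]}}')
for c in json_leaf_chunks(config):
    print(c['path'], '->', c['content'])
# server.host -> localhost
# server.ports[0] -> 80
# server.ports[1] -> 443
```

Because each chunk carries its full path, a retrieved value like `server.ports[0]` stays interpretable out of context.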
8. Document‑Structure Chunking
Use Markdown or HTML headings as split points.
import re

def document_structure_chunking(text: str):
    sections = re.split(r'(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    chunks = []
    for sec in sections:
        if not sec.strip():
            continue
        first_line = sec.split('\n')[0]
        title_match = re.match(r'^(#{1,3})\s+(.+)$', first_line)
        if title_match:
            level = len(title_match.group(1))
            title = title_match.group(2).strip()
            body = '\n'.join(sec.split('\n')[1:]).strip()
            chunks.append({'level': level, 'title': title, 'content': body})
        else:
            chunks.append({'level': 0, 'title': 'Untitled', 'content': sec.strip()})
    return chunks

When to use: Technical documentation, books, articles with clear headings.
9. Keyword‑Based Chunking
Split whenever a predefined keyword appears.
import re

def keyword_based_chunking(text: str, keywords: list):
    pattern = '|'.join(map(re.escape, keywords))
    # Zero-width lookahead: split just before each keyword so the keyword
    # stays at the start of its chunk (a capturing group here would inject
    # spurious duplicate parts into the result).
    parts = re.split(f'(?=(?:{pattern}))', text)
    return [p.strip() for p in parts if p.strip()]

When to use: Meeting minutes, logs where specific markers denote new sections.
10. Entity‑Based Chunking
Run a Named Entity Recogniser (e.g., spaCy) and group sentences by shared entities.
import spacy

def entity_based_chunking(text: str):
    nlp = spacy.load('zh_core_web_sm')  # replace with a model for your language, e.g. 'en_core_web_sm'
    doc = nlp(text)
    entity_map = {}
    for ent in doc.ents:
        entity_map.setdefault(ent.text, []).append(ent.sent.text)
    chunks = []
    for entity, sentences in entity_map.items():
        chunks.append({'entity': entity, 'content': ' '.join(sentences)})
    return chunks

When to use: News articles, contracts, or any text where entities are central.
11. Token‑Based Chunking
Count tokens using a tokenizer (e.g., tiktoken) and enforce a maximum token limit per chunk.
import tiktoken

def token_based_chunking(text: str, model_name: str = 'gpt-4', max_tokens: int = 100):
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

When to use: When you must stay within LLM token limits.
12. Table‑Aware Chunking
Detect ASCII‑style tables, keep them intact, and treat surrounding text as separate chunks.
import re

def table_aware_chunking(text: str):
    # [\s\S]*? lets the table body span multiple lines (a plain .*? would
    # stop at the first newline)
    table_pat = r'(\+[-]+\+[\s\S]*?)(?=\n\n|\Z)'
    tables = re.findall(table_pat, text)
    non_table = re.sub(table_pat, '', text)
    chunks = [{'type': 'text', 'content': p.strip()}
              for p in re.split(r'\n\s*\n', non_table) if p.strip()]
    for tbl in tables:
        md = tbl.replace('+', '|')  # rough ASCII-to-Markdown border conversion
        chunks.append({'type': 'table', 'content': md})
    return chunks

When to use: Financial reports, data tables, specifications.
13. Content‑Aware Chunking
Detect the type of each block (list, code, heading, table, quote, paragraph) and store metadata.
import re

def content_aware_chunking(text: str):
    blocks = re.split(r'\n\s*\n', text)
    chunks = []
    for blk in blocks:
        blk = blk.strip()
        if not blk:
            continue
        if re.match(r'^\s*[\d•\-\*]\s+', blk):
            typ = 'list'
        elif blk.startswith('```') or re.search(r'^ {4,}', blk, re.MULTILINE):
            typ = 'code'
        elif re.match(r'^#{1,3}\s+', blk):
            typ = 'heading'
        elif re.search(r'^\|.+\|$', blk, re.MULTILINE) or re.search(r'^\+[-]+\+$', blk, re.MULTILINE):
            typ = 'table'
        elif blk.startswith('>'):
            typ = 'quote'
        else:
            typ = 'paragraph'
        chunks.append({'type': typ, 'content': blk})
    return chunks

When to use: Mixed‑format documents where preserving format matters.
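The block‑type rules can be exercised in isolation. `classify_block` below is a simplified, hypothetical extract of the detection logic, run over a tiny made‑up document:

```python
import re

def classify_block(blk: str) -> str:
    """Simplified block classifier: checks rules in priority order."""
    if re.match(r'^\s*[\d•\-\*]\s+', blk):
        return 'list'
    if blk.startswith('```') or re.match(r'^ {4,}', blk):
        return 'code'
    if re.match(r'^#{1,3}\s+', blk):
        return 'heading'
    if blk.startswith('>'):
        return 'quote'
    return 'paragraph'

doc = ("# Setup\n\n"
       "- install deps\n- run tests\n\n"
       "    pip install foo\n\n"
       "Everything else is plain prose.")
blocks = [b for b in re.split(r'\n\s*\n', doc) if b.strip()]
print([classify_block(b) for b in blocks])
# ['heading', 'list', 'code', 'paragraph']
```

Note the blocks are deliberately not stripped before classification, since the indented‑code rule depends on leading whitespace.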
Intelligent Chunking Strategies (8)
14. Topic‑Based Chunking
Apply LDA or clustering to group sentences by latent topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import re

def topic_based_chunking(texts: list, n_topics: int = 3):
    sentences = []
    for txt in texts:
        sentences.extend([s.strip() for s in re.split(r'[。!?.!?]+', txt) if s.strip()])
    vec = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vec.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,
                                    learning_method='online', random_state=42)
    lda.fit(X)
    topics = lda.transform(X)
    groups = {i: [] for i in range(n_topics)}
    for i, probs in enumerate(topics):
        groups[int(np.argmax(probs))].append(sentences[i])
    chunks = []
    for tid, sents in groups.items():
        if sents:
            chunks.append({'topic_id': tid, 'content': ' '.join(sents), 'sentence_count': len(sents)})
    return chunks

When to use: Documents covering multiple themes without explicit headings.
15. Contextual Chunking (LLM‑Enhanced)
Prompt an LLM to generate concise context metadata for each chunk.
import openai, json

def contextual_chunking(texts: list, prompt: str = None):
    if not prompt:
        prompt = ("Provide for each text block: 1) core keywords, 2) possible link to the previous block, "
                  "3) key entities, 4) sentiment. Return JSON.")
    enriched = []
    for txt in texts:
        try:
            resp = openai.ChatCompletion.create(
                model='gpt-3.5-turbo',
                messages=[{'role': 'system', 'content': 'You are a professional text analyst.'},
                          {'role': 'user', 'content': f"{prompt}\nText block:\n{txt}"}],
                temperature=0.3)
            ctx = resp.choices[0].message.content
            try:
                ctx_json = json.loads(ctx)
            except json.JSONDecodeError:
                ctx_json = {'raw_context': ctx}
            enriched.append({'original_text': txt,
                             'context': ctx_json,
                             'enhanced_text': f"Context: {ctx}\nOriginal: {txt}"})
        except Exception:
            enriched.append({'original_text': txt, 'context': {}, 'enhanced_text': txt})
    return enriched

When to use: Complex legal or scientific documents where nuanced context improves retrieval.
16. Semantic Chunking
Group sentences whose embeddings have cosine similarity above a threshold.
from sentence_transformers import SentenceTransformer
import numpy as np, re

def semantic_chunking(text: str, threshold: float = 0.7):
    sentences = [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]
    if len(sentences) <= 1:
        return [text]
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    emb = model.encode(sentences)
    chunks = []
    cur = [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(emb[i], emb[i-1]) / (np.linalg.norm(emb[i]) * np.linalg.norm(emb[i-1]))
        if sim > threshold:
            cur.append(sentences[i])
        else:
            chunks.append(' '.join(cur))
            cur = [sentences[i]]
    if cur:
        chunks.append(' '.join(cur))
    return chunks

When to use: Long narratives where thematic continuity is not captured by simple delimiters.
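The boundary decision itself needs no model to understand. With hand‑made 2‑D "embeddings" (the vectors and sentences below are purely illustrative), the same loop places a break exactly where cosine similarity drops under the threshold:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# The first two toy vectors point the same way; the third points elsewhere,
# so a chunk boundary falls before the third sentence.
sentences = ["Cats purr.", "Cats knead blankets.", "Bond yields rose."]
embeddings = [(0.9, 0.1), (0.85, 0.2), (0.1, 0.95)]

chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    if cosine(embeddings[i], embeddings[i - 1]) > 0.7:
        current.append(sentences[i])
    else:
        chunks.append(' '.join(current))
        current = [sentences[i]]
chunks.append(' '.join(current))
print(chunks)
# ['Cats purr. Cats knead blankets.', 'Bond yields rose.']
```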
17. Recursive Chunking
Apply a hierarchy of separators (paragraph → sentence → fixed size) until all chunks satisfy a size limit.
def recursive_chunking(text: str, separators=None, max_len: int = 100):
    if separators is None:
        separators = ['\n', '。', '!', '?', '.', ' ']
    def split(chunk, idx=0):
        if len(chunk) <= max_len or idx >= len(separators):
            return [chunk.strip()] if chunk.strip() else []
        sep = separators[idx]
        parts = chunk.split(sep)
        if sep != ' ':
            parts = [p + sep for p in parts[:-1]] + [parts[-1]]
        result = []
        for part in parts:
            if len(part) > max_len:
                result.extend(split(part, idx + 1))
            elif part.strip():
                result.append(part.strip())
        return result
    return split(text)

When to use: Interviews, speeches, or any free‑form text with unpredictable length.
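To see the cascade in action, here is a compact standalone variant (`recursive_split`, with simplified separators) that guarantees every returned chunk fits the limit — at the cost of single‑word chunks when a sentence‑level split still overshoots:

```python
def recursive_split(text, separators=('\n\n', '. ', ' '), max_len=40):
    """Try each separator in turn; recurse only into oversized parts."""
    text = text.strip()
    if len(text) <= max_len or not separators:
        return [text] if text else []
    head, *rest = separators
    out = []
    for part in text.split(head):
        if len(part) > max_len:
            out.extend(recursive_split(part, tuple(rest), max_len))
        elif part.strip():
            out.append(part.strip())
    return out

para = ("Chunking splits long documents into retrieval-sized pieces. "
        "Each separator level is tried in turn. "
        "Only oversized parts recurse to the next finer separator.")
chunks = recursive_split(para)
```

A production version would typically re‑merge undersized fragments with their neighbours instead of emitting them as-is.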
18. Embedding‑Based Chunking (Similarity Merge, Clustering, Sliding Window)
A class that loads a sentence‑transformer model and offers three merging strategies.
from sentence_transformers import SentenceTransformer
import numpy as np, re
from sklearn.cluster import AgglomerativeClustering

class EmbeddingChunker:
    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        self.model = SentenceTransformer(model_name)

    def _split_units(self, text):
        return [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]

    def _cosine(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def similarity_merge(self, text, threshold=0.7, max_size=500):
        units = self._split_units(text)
        emb = self.model.encode(units)
        chunks = []
        cur_units, cur_emb = [units[0]], [emb[0]]
        for i in range(1, len(units)):
            # compare the candidate sentence with the centroid of the current chunk
            sim = self._cosine(emb[i], np.mean(cur_emb, axis=0))
            cur_text = ' '.join(cur_units + [units[i]])
            if sim > threshold and len(cur_text) <= max_size:
                cur_units.append(units[i])
                cur_emb.append(emb[i])
            else:
                chunks.append({'content': ' '.join(cur_units)})
                cur_units, cur_emb = [units[i]], [emb[i]]
        if cur_units:
            chunks.append({'content': ' '.join(cur_units)})
        return chunks

    def clustering_merge(self, text, min_size=30):
        units = self._split_units(text)
        emb = self.model.encode(units)
        n_clusters = max(2, len(units) // 5)
        clustering = AgglomerativeClustering(n_clusters=n_clusters,
                                             metric='cosine',  # 'affinity=' in scikit-learn < 1.2
                                             linkage='average')
        labels = clustering.fit_predict(emb)
        chunks = []
        for cid in range(n_clusters):
            idx = np.where(labels == cid)[0]
            # keep only clusters whose combined text reaches min_size characters
            if sum(len(units[i]) for i in idx) >= min_size:
                content = ' '.join(units[i] for i in idx)
                chunks.append({'cluster_id': cid, 'content': content})
        return chunks

    def sliding_window_merge(self, text, threshold=0.6, window=3):
        units = self._split_units(text)
        emb = self.model.encode(units)
        i = 0
        chunks = []
        while i < len(units):
            end = min(i + window, len(units))
            # average pairwise similarity inside the window
            sims = [self._cosine(emb[j], emb[k]) for j in range(i, end) for k in range(j + 1, end)]
            avg = np.mean(sims) if sims else 0
            if avg > threshold or end - i == 1:
                # greedily extend the window while the next sentence still fits
                while end < len(units):
                    new_sim = np.mean([self._cosine(emb[end], emb[j]) for j in range(i, end)])
                    if new_sim > threshold * 0.9:
                        end += 1
                    else:
                        break
                chunks.append({'content': ' '.join(units[i:end])})
                i = end
            else:
                i += 1
        return chunks

When to use: When you need data‑driven chunk boundaries and have compute resources.
19. Agentic / LLM‑Based Chunking
Let a large language model decide where to split the text.
import openai, json

def llm_based_chunking(text: str, api_key: str, model: str = 'gpt-3.5-turbo'):
    openai.api_key = api_key
    prompt = f"""Split the following text into semantically complete chunks of about 100-200 words. Return a JSON object with a "chunks" array.
Text:
{text[:1000]}"""
    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{'role': 'system', 'content': 'You are a text‑analysis assistant.'},
                      {'role': 'user', 'content': prompt}],
            temperature=0.3)
        result = json.loads(resp.choices[0].message.content)
        return result.get('chunks', [text])
    except Exception as e:
        print(f"LLM chunking failed: {e}")
        return recursive_chunking(text)  # fall back to the recursive splitter from strategy 17

When to use: Highly unstructured or domain‑specific documents where human‑like judgment is required, and cost is acceptable.
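In practice the model's reply often wraps the JSON in a code fence or surrounding prose, so a tolerant parser avoids needless fallbacks. `extract_json` below is a hypothetical helper, not part of the OpenAI API:

```python
import json
import re

def extract_json(reply: str):
    """Best-effort parse of a JSON object from an LLM reply that may
    wrap it in a ```json fence or extra prose."""
    fenced = re.search(r'```(?:json)?\s*([\s\S]*?)```', reply)
    candidate = fenced.group(1) if fenced else reply
    # trim to the outermost braces if any prose remains around them
    start, end = candidate.find('{'), candidate.rfind('}')
    if start != -1 and end > start:
        candidate = candidate[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

reply = 'Sure! Here you go:\n```json\n{"chunks": ["part one", "part two"]}\n```'
print(extract_json(reply))
# {'chunks': ['part one', 'part two']}
```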
20. Hierarchical Chunking
Build a multi‑level tree (document → chapters → sections → paragraphs → sentences) to support multi‑granularity retrieval.
import re

class HierarchicalChunker:
    def __init__(self):
        self.hierarchy = {'document': None, 'chapters': [], 'sections': [], 'paragraphs': [], 'sentences': []}

    def build_hierarchy(self, text: str, max_depth: int = 4):
        self.hierarchy['document'] = {'content': text[:500] + ('...' if len(text) > 500 else ''),
                                      'full_content': text}
        self._split_chapters(text, max_depth)
        return self.hierarchy

    def _split_chapters(self, text, depth):
        if depth <= 0:
            return
        patterns = [r'^(#{1,3})\s+(.+)$', r'^第[一二三四五六七八九十\d]+章\s+.+$', r'^Chapter\s+\d+[:.-]?\s+.+$']
        matches = []
        for pat in patterns:
            m = list(re.finditer(pat, text, re.MULTILINE))
            if m and (not matches or len(m) > len(matches)):
                matches = m
        if not matches:
            self._split_paragraphs(text, depth)
            return
        chapters = []
        last = 0
        for i, m in enumerate(matches):
            start = m.start()
            if start > last:
                chapters.append({'title': f'Part {i}', 'content': text[last:start].strip()})
            last = start
        chapters.append({'title': f'Part {len(matches) + 1}', 'content': text[last:].strip()})
        self.hierarchy['chapters'] = chapters
        for chap in chapters:
            self._split_sections(chap['content'], depth - 1, chap['title'])

    def _split_sections(self, text, depth, parent_title):
        if depth <= 0:
            return
        matches = list(re.finditer(r'^(#{2,4})\s+(.+)$', text, re.MULTILINE))
        if not matches:
            self._split_paragraphs(text, depth)
            return
        sections = []
        last = 0
        for i, m in enumerate(matches):
            start = m.start()
            if start > last:
                sections.append({'parent': parent_title, 'title': f'Section {i}', 'content': text[last:start].strip()})
            last = start
        sections.append({'parent': parent_title, 'title': f'Section {len(matches) + 1}', 'content': text[last:].strip()})
        self.hierarchy['sections'].extend(sections)
        for sec in sections:
            self._split_paragraphs(sec['content'], depth - 1)

    def _split_paragraphs(self, text, depth):
        paras = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
        for p in paras:
            para_obj = {'id': f'para_{len(self.hierarchy["paragraphs"]) + 1}',
                        'content': p,
                        'sentence_count': len(re.split(r'[。!?.!?]+', p))}
            self.hierarchy['paragraphs'].append(para_obj)
            if depth > 3:
                self._split_sentences(p)

    def _split_sentences(self, text):
        sents = [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]
        for s in sents:
            self.hierarchy['sentences'].append({'id': f'sent_{len(self.hierarchy["sentences"]) + 1}', 'content': s})

    def get_chunks_at_level(self, level: str, min_length: int = 0):
        if level not in self.hierarchy:
            return []
        return [c for c in self.hierarchy[level] if len(c.get('content', '')) >= min_length]

When to use: Books, encyclopedias, knowledge bases that need multi‑granular access.
21. Modality‑Aware Chunking
Detect and separately process text, images, tables, and code within a document.
import re, json
from PIL import Image   # used by the image helper (omitted in this excerpt)
import pytesseract      # OCR backend for images (helper omitted)
import pandas as pd     # table handling in the omitted helpers

class MultiModalChunker:
    def __init__(self):
        self.chunks = []

    def process(self, doc_path: str):
        if doc_path.lower().endswith('.pdf'):
            return self._process_pdf(doc_path)    # implementation omitted for brevity
        elif doc_path.lower().endswith(('.png', '.jpg', '.jpeg')):
            return self._process_image(doc_path)  # implementation omitted for brevity
        elif doc_path.lower().endswith('.docx'):
            return self._process_docx(doc_path)   # implementation omitted for brevity
        else:
            return self._process_text(doc_path)

    def _process_text(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        return self._chunk_by_type(text)

    def _chunk_by_type(self, text):
        lines = text.split('\n')
        cur, cur_type = [], None
        for line in lines:
            line = line.rstrip()
            if not line:
                continue
            if re.match(r'^\s*[\d•\-\*]\s+', line):
                typ = 'list'
            elif line.startswith('```') or re.match(r'^ {4,}', line):
                typ = 'code'
            elif re.match(r'^#{1,6}\s+', line):
                typ = 'heading'
            elif re.match(r'^\|.+\|$', line) or re.match(r'^\+[-]+\+$', line):
                typ = 'table'
            elif line.startswith('>'):
                typ = 'quote'
            else:
                typ = 'text'
            if typ != cur_type and cur:
                self._save_chunk(cur, cur_type)
                cur = []
            cur_type = typ
            cur.append(line)
        if cur:
            self._save_chunk(cur, cur_type)
        return self.chunks

    def _save_chunk(self, lines, typ):
        content = '\n'.join(lines)
        meta = {'line_count': len(lines), 'type': typ}
        if typ == 'table':
            content = self._process_table(content)
        self.chunks.append({'type': typ, 'content': content, 'metadata': meta})

    def _process_table(self, txt):
        try:
            rows = [r.strip() for r in txt.split('\n') if '|' in r]
            data = [[c.strip() for c in r.split('|') if c.strip()] for r in rows]
            if len(data) >= 2:
                headers = data[0]
                body = data[1:]
                return json.dumps({'headers': headers, 'rows': body,
                                   'row_count': len(body), 'col_count': len(headers)})
        except Exception:
            pass
        return txt

When to use: Technical manuals, product datasheets, or any document mixing text, code, tables, and images.
Choosing the Right Strategy – Decision Guide
Ask four key questions:
Document type: Structured (Markdown/HTML/JSON) → use document‑structure or structured chunking; semi‑structured (reports, papers) → paragraph or recursive chunking; unstructured (scans, chats) → fixed‑size, sliding‑window, or semantic chunking.
Query characteristics: Fact‑seeking → sentence‑level chunks; analytical → paragraph‑level; mixed → hierarchical or multi‑granular chunks.
Resource constraints: Limited compute → rule‑based (fixed‑size, line, sentence); ample compute → embedding‑based, LLM‑based.
Real‑time requirement: Avoid heavy models (LLM, clustering) for latency‑critical pipelines.
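The four questions above can be collapsed into a toy lookup. `choose_strategy` and its labels are illustrative only, not an exhaustive taxonomy:

```python
def choose_strategy(doc_type: str, query_style: str, low_latency: bool) -> str:
    """Toy decision helper mirroring the four questions above."""
    if low_latency:
        return 'fixed_size'            # latency-critical: rule-based only
    if doc_type == 'structured':
        return 'document_structure'    # Markdown/HTML/JSON
    if doc_type == 'semi_structured':
        return 'recursive'             # reports, papers
    # unstructured text: pick granularity by query style
    return 'sentence' if query_style == 'fact' else 'semantic'

print(choose_strategy('unstructured', 'fact', low_latency=False))  # sentence
```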
Common Pitfalls and How to Avoid Them
Too fragmented: Increase chunk size or add overlap.
Too large: Reduce size or switch to fixed‑size/window methods.
Ignoring structure: Preserve tables, code blocks, and headings using structure‑aware or table‑aware chunking.
Multilingual punctuation: Use language‑specific sentence splitters (e.g., spaCy multilingual models).
Practical Tips
Start with simple recursive chunking; iterate to more sophisticated methods only if retrieval quality suffers.
Validate chunking by running real queries against your vector store and inspecting results.
Combine strategies: e.g., apply document‑structure chunking first, then semantic merging within each section.
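As a sketch of such a combination (structure first, then a size cap; `structure_then_cap` and its parameters are hypothetical):

```python
import re

def structure_then_cap(markdown: str, max_words: int = 50):
    """Two-stage pipeline: split on headings first, then cap each
    section with a fixed-size pass so no chunk exceeds max_words."""
    sections = [s for s in re.split(r'(?=^#{1,3}\s)', markdown, flags=re.MULTILINE)
                if s.strip()]
    chunks = []
    for sec in sections:
        words = sec.split()
        for i in range(0, len(words), max_words):
            chunks.append(' '.join(words[i:i + max_words]))
    return chunks

doc = "# Intro\nShort section.\n\n# Details\n" + "word " * 120
chunks = structure_then_cap(doc)
print(len(chunks))  # 4: the intro yields 1 chunk, the long section 3
```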
Conclusion
Effective chunking is the foundation of a performant RAG system. By understanding the 21 strategies outlined above, you can select, combine, and fine‑tune the approach that best fits your data, resources, and use‑case, ultimately delivering more accurate and context‑aware generative AI applications.
About the Authors
Data‑Pai THU is a data‑science community backed by Tsinghua University’s Big Data Research Center. The article was edited by Yu Teng‑Kai and proof‑read by Lin Yi‑Lin.