Mastering Text Chunking: 21 Strategies to Supercharge Your RAG Pipelines

This guide presents 21 practical text-chunking techniques, from simple line-based splits to advanced embedding- and LLM-driven methods, with implementations, runnable code examples, and guidance on when to use each, so you can build efficient Retrieval-Augmented Generation systems while avoiding common pitfalls.


Why Chunking Matters for RAG

When building Retrieval-Augmented Generation (RAG) pipelines, the way you split documents into chunks directly affects retrieval relevance and generation quality. Chunks that are too large introduce noise; chunks that are too small lose context. This article systematically reviews 21 chunking strategies, provides ready-to-run Python code, and offers guidance on when to choose each method.

Basic Chunking Strategies (6)

1. Naïve Line Chunking

Split text at every newline character.

def naive_chunking(text: str):
    """Split text by line breaks"""
    chunks = text.split('\n')
    chunks = [c.strip() for c in chunks if c.strip()]
    return chunks

sample_text = """Neural networks consist of input, hidden, and output layers.
Back‑propagation is the key training algorithm.
Gradient descent optimises the weights."""
for i, chunk in enumerate(naive_chunking(sample_text), 1):
    print(f"Chunk {i}: {chunk}")

When to use: Documents already organized by line (notes, FAQs, chat logs).

2. Fixed‑Size Chunking

Divide text into equal‑sized word windows, optionally with overlap.

def fixed_size_chunking(text: str, chunk_size: int = 100, overlap: int = 0):
    words = text.split()
    chunks = []
    step = max(1, chunk_size - overlap)  # guard against overlap >= chunk_size
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

When to use: Raw dumps, scanned documents, or any unstructured text without clear delimiters.

3. Sliding Window Chunking

Same as fixed‑size but each window overlaps with the previous one, preserving context.

def sliding_window_chunking(text: str, chunk_size: int = 100, overlap: int = 20):
    return fixed_size_chunking(text, chunk_size, overlap)
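
A quick check, assuming the two functions above are defined, shows that consecutive windows share words at their boundaries:

sample = ' '.join(f'word{i}' for i in range(250))
windows = sliding_window_chunking(sample, chunk_size=100, overlap=20)
for w in windows:
    words = w.split()
    print(len(words), words[0], '...', words[-1])
# Each window starts 80 words after the previous one, so adjacent windows share 20 words.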

When to use: Long narratives where continuity between chunks matters.

4. Sentence‑Based Chunking

Split at sentence boundaries using a regular expression.

import re

def sentence_chunking(text: str):
    sentences = re.split(r'(?<=[。!?.!?])\s+', text.strip())
    return [s for s in sentences if s]

When to use: Well‑written prose where each sentence conveys a complete idea.

5. Paragraph‑Based Chunking

Split on double newlines, keeping each paragraph intact.

def paragraph_chunking(text: str):
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p.strip()]

When to use: Articles, manuals, or reports where paragraphs are logical units.

6. Page‑Based Chunking (PDF)

Extract each physical page from a PDF using PyPDF2 and keep page metadata.

import PyPDF2

def page_based_chunking(pdf_path: str, start_page: int = 0, end_page: int = None):
    chunks = []
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        end_page = end_page or len(reader.pages)
        for i in range(start_page, end_page):
            page = reader.pages[i]
            text = page.extract_text()
            if text:
                chunks.append({
                    'type': 'page',
                    'content': text.strip(),
                    'metadata': {'page_number': i + 1, 'total_pages': len(reader.pages)}
                })
    return chunks

When to use: Legal contracts, academic papers, or any PDF where page references matter.

Structured Chunking Strategies (7)

7. Structured (JSON/XML/CSV) Chunking

Recursively walk hierarchical data structures and emit chunks that respect the inherent hierarchy.

import json, xml.etree.ElementTree as ET

def structured_json_chunking(json_data, max_items: int = 10):
    if isinstance(json_data, str):
        json_data = json.loads(json_data)
    chunks = []
    def walk(node, path=''):
        if isinstance(node, dict):
            for k, v in node.items():
                new_path = f"{path}.{k}" if path else k
                if isinstance(v, (dict, list)):
                    walk(v, new_path)
                else:
                    chunks.append({'type': 'json', 'path': new_path, 'content': str(v)})
        elif isinstance(node, list):
            for i, item in enumerate(node):
                if isinstance(item, (dict, list)):
                    walk(item, f"{path}[{i}]")
                else:
                    # keep scalar list items instead of silently dropping them
                    chunks.append({'type': 'json', 'path': f"{path}[{i}]", 'content': str(item)})
    walk(json_data)
    return chunks
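
For example, a small made-up config object produces one chunk per leaf value:

config = {"server": {"host": "localhost", "port": 8080}, "features": ["search", "export"]}
for chunk in structured_json_chunking(config):
    print(chunk['path'], '->', chunk['content'])
# server.host -> localhost
# server.port -> 8080
# features[0] -> search
# features[1] -> export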

When to use: Config files, API responses, logs.

8. Document‑Structure Chunking

Use Markdown or HTML headings as split points.

import re

def document_structure_chunking(text: str):
    sections = re.split(r'(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    chunks = []
    for sec in sections:
        if not sec.strip():
            continue
        title_match = re.match(r'^(#{1,3})\s+(.+)$', sec.split('\n')[0])
        if title_match:
            level = len(title_match.group(1))
            title = title_match.group(2).strip()
            body = '\n'.join(sec.split('\n')[1:]).strip()
            chunks.append({'level': level, 'title': title, 'content': body})
        else:
            chunks.append({'level': 0, 'title': 'Untitled', 'content': sec.strip()})
    return chunks
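
On a small Markdown sample (illustrative), headings become chunk titles:

md = "# Setup\nInstall the package.\n\n## Configuration\nEdit the config file."
for c in document_structure_chunking(md):
    print(c['level'], c['title'], '|', c['content'])
# 1 Setup | Install the package.
# 2 Configuration | Edit the config file.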

When to use: Technical documentation, books, articles with clear headings.

9. Keyword‑Based Chunking

Split whenever a predefined keyword appears.

def keyword_based_chunking(text: str, keywords: list):
    pattern = '|'.join(map(re.escape, keywords))
    parts = re.split(f'(?=(?:{pattern}))', text)  # non-capturing group avoids keyword-only fragments
    chunks = []
    current = ''
    for part in parts:
        if any(part.startswith(k) for k in keywords):
            if current:
                chunks.append(current.strip())
            current = part
        else:
            current += part
    if current:
        chunks.append(current.strip())
    return chunks
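
For example, with hypothetical meeting-note markers:

notes = "Agenda: review Q3 metrics. Discussion: retrieval latency is up. Action items: tune chunk size."
for c in keyword_based_chunking(notes, ["Agenda", "Discussion", "Action items"]):
    print(c)
# Agenda: review Q3 metrics.
# Discussion: retrieval latency is up.
# Action items: tune chunk size.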

When to use: Meeting minutes, logs where specific markers denote new sections.

10. Entity‑Based Chunking

Run a Named Entity Recogniser (e.g., spaCy) and group sentences by shared entities.

import spacy

def entity_based_chunking(text: str):
    nlp = spacy.load('en_core_web_sm')  # swap in the model for your language
    doc = nlp(text)
    entity_map = {}
    for ent in doc.ents:
        entity_map.setdefault(ent.text, []).append(ent.sent.text)
    chunks = []
    for entity, sentences in entity_map.items():
        chunks.append({'entity': entity, 'content': ' '.join(sentences)})
    return chunks

When to use: News articles, contracts, or any text where entities are central.

11. Token‑Based Chunking

Count tokens using a tokenizer (e.g., tiktoken) and enforce a maximum token limit per chunk.

import tiktoken

def token_based_chunking(text: str, model_name: str = 'gpt-4', max_tokens: int = 100):
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i+max_tokens]
        chunks.append(enc.decode(chunk_tokens))
    return chunks
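
If you also want continuity across boundaries, a small variation (not part of the original snippet) steps by max_tokens minus an overlap:

def token_chunking_with_overlap(text: str, model_name: str = 'gpt-4', max_tokens: int = 100, overlap: int = 20):
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)
    step = max(1, max_tokens - overlap)  # guard against overlap >= max_tokens
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]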

When to use: When you must stay within LLM token limits.

12. Table‑Aware Chunking

Detect ASCII‑style tables, keep them intact, and treat surrounding text as separate chunks.

def table_aware_chunking(text: str):
    table_pat = r'(\+[-]+\+.*?)(?=\n\n|\Z)'
    tables = re.findall(table_pat, text, re.DOTALL)  # DOTALL lets a table span multiple lines
    non_table = re.sub(table_pat, '', text, flags=re.DOTALL)
    chunks = [{'type': 'text', 'content': p.strip()} for p in re.split(r'\n\s*\n', non_table) if p.strip()]
    for tbl in tables:
        md = tbl.replace('+', '|')
        chunks.append({'type': 'table', 'content': md})
    return chunks

When to use: Financial reports, data tables, specifications.

13. Content‑Aware Chunking

Detect the type of each block (list, code, heading, table, quote, paragraph) and store metadata.

def content_aware_chunking(text: str):
    blocks = re.split(r'\n\s*\n', text)
    chunks = []
    for blk in blocks:
        blk = blk.strip()
        if not blk:
            continue
        if re.match(r'^\s*[\d•\-\*]\s+', blk):
            typ = 'list'
        elif blk.startswith('```') or re.match(r'^ {4,}', blk, re.MULTILINE):
            typ = 'code'
        elif re.match(r'^#{1,3}\s+', blk):
            typ = 'heading'
        elif re.match(r'^\|.+\|$', blk, re.MULTILINE) or re.match(r'^\+[-]+\+$', blk, re.MULTILINE):
            typ = 'table'
        elif blk.startswith('>'):
            typ = 'quote'
        else:
            typ = 'paragraph'
        chunks.append({'type': typ, 'content': blk})
    return chunks

When to use: Mixed‑format documents where preserving format matters.

Intelligent Chunking Strategies (8)

14. Topic‑Based Chunking

Apply LDA or clustering to group sentences by latent topics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import numpy as np

def topic_based_chunking(texts: list, n_topics: int = 3):
    sentences = []
    for txt in texts:
        sentences.extend([s.strip() for s in re.split(r'[。!?.!?]+', txt) if s.strip()])
    vec = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vec.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10, learning_method='online', random_state=42)
    lda.fit(X)
    topics = lda.transform(X)
    groups = {i: [] for i in range(n_topics)}
    for i, probs in enumerate(topics):
        groups[np.argmax(probs)].append(sentences[i])
    chunks = []
    for tid, sents in groups.items():
        if sents:
            chunks.append({'topic_id': tid, 'content': ' '.join(sents), 'sentence_count': len(sents)})
    return chunks

When to use: Documents covering multiple themes without explicit headings.

15. Contextual Chunking (LLM‑Enhanced)

Prompt an LLM to generate concise context metadata for each chunk.

import openai, json

def contextual_chunking(texts: list, prompt: str = None):
    if not prompt:
        prompt = """Provide for each text block: 1) core keywords, 2) possible link to previous block, 3) key entities, 4) sentiment. Return JSON."""
    enriched = []
    for i, txt in enumerate(texts):
        try:
            resp = openai.ChatCompletion.create(
                model='gpt-3.5-turbo',
                messages=[{'role': 'system', 'content': 'You are a professional text analyst.'},
                          {'role': 'user', 'content': f"{prompt}

Text block:
{txt}"}],
                temperature=0.3)
            ctx = resp.choices[0].message.content
            try:
                ctx_json = json.loads(ctx)
            except json.JSONDecodeError:
                ctx_json = {'raw_context': ctx}
            enriched.append({'original_text': txt, 'context': ctx_json, 'enhanced_text': f"Context: {ctx}\n\nOriginal: {txt}"})
        except Exception as e:
            enriched.append({'original_text': txt, 'context': {}, 'enhanced_text': txt})
    return enriched

When to use: Complex legal or scientific documents where nuanced context improves retrieval.

16. Semantic Chunking

Group sentences whose embeddings have cosine similarity above a threshold.

from sentence_transformers import SentenceTransformer
import numpy as np, re

def semantic_chunking(text: str, threshold: float = 0.7):
    sentences = [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]
    if len(sentences) <= 1:
        return [text]
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    emb = model.encode(sentences)
    chunks = []
    cur = [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(emb[i], emb[i-1]) / (np.linalg.norm(emb[i]) * np.linalg.norm(emb[i-1]))
        if sim > threshold:
            cur.append(sentences[i])
        else:
            chunks.append(' '.join(cur))
            cur = [sentences[i]]
    if cur:
        chunks.append(' '.join(cur))
    return chunks

When to use: Long narratives where thematic continuity is not captured by simple delimiters.

17. Recursive Chunking

Apply a hierarchy of separators (paragraph → sentence → fixed size) until all chunks satisfy a size limit.

def recursive_chunking(text: str, separators=None, max_len: int = 100):
    if separators is None:
        separators = ['\n\n', '。', '!', '?', '.', ' ']
    def split(chunk, idx=0):
        if len(chunk) <= max_len or idx >= len(separators):
            return [chunk.strip()] if chunk.strip() else []
        sep = separators[idx]
        parts = chunk.split(sep)
        if sep != ' ':
            parts = [p + sep for p in parts[:-1]] + [parts[-1]]
        result = []
        for part in parts:
            if len(part) > max_len:
                result.extend(split(part, idx+1))
            elif part.strip():
                result.append(part.strip())
        return result
    return split(text)
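
A quick check on a made-up paragraph confirms that every returned chunk respects max_len:

long_text = ("Retrieval quality depends on chunk boundaries. " * 6).strip()
for c in recursive_chunking(long_text, max_len=100):
    print(len(c), c)
# Prints six sentence-sized chunks, each well under 100 characters.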

When to use: Interviews, speeches, or any free‑form text with unpredictable length.

18. Embedding‑Based Chunking (Similarity Merge, Clustering, Sliding Window)

A class that loads a sentence‑transformer model and offers three merging strategies.

from sentence_transformers import SentenceTransformer
import numpy as np, re
from sklearn.cluster import AgglomerativeClustering

class EmbeddingChunker:
    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        self.model = SentenceTransformer(model_name)

    def _split_units(self, text):
        return [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]

    def _cosine(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def similarity_merge(self, text, threshold=0.7, max_size=500):
        units = self._split_units(text)
        emb = self.model.encode(units)
        chunks = []
        cur_units, cur_emb = [units[0]], [emb[0]]
        for i in range(1, len(units)):
            sim = self._cosine(emb[i], np.mean(cur_emb, axis=0))
            cur_text = ' '.join(cur_units + [units[i]])
            if sim > threshold and len(cur_text) <= max_size:
                cur_units.append(units[i]); cur_emb.append(emb[i])
            else:
                chunks.append({'content': ' '.join(cur_units)})
                cur_units, cur_emb = [units[i]], [emb[i]]
        if cur_units:
            chunks.append({'content': ' '.join(cur_units)})
        return chunks

    def clustering_merge(self, text, min_size=30):
        units = self._split_units(text)
        emb = self.model.encode(units)
        n_clusters = max(2, len(units)//5)
        clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine', linkage='average')  # older scikit-learn versions use affinity='cosine'
        labels = clustering.fit_predict(emb)
        chunks = []
        for cid in range(n_clusters):
            idx = np.where(labels == cid)[0]
            if len(idx) * len(units[0]) >= min_size:
                content = ' '.join([units[i] for i in idx])
                chunks.append({'cluster_id': cid, 'content': content})
        return chunks

    def sliding_window_merge(self, text, threshold=0.6, window=3):
        units = self._split_units(text)
        emb = self.model.encode(units)
        i = 0
        chunks = []
        while i < len(units):
            end = min(i+window, len(units))
            sims = [self._cosine(emb[j], emb[k]) for j in range(i, end) for k in range(j+1, end)]
            avg = np.mean(sims) if sims else 0
            if avg > threshold or end - i == 1:
                while end < len(units):
                    new_sim = np.mean([self._cosine(emb[end], emb[j]) for j in range(i, end)])
                    if new_sim > threshold * 0.9:
                        end += 1
                    else:
                        break
                chunks.append({'content': ' '.join(units[i:end])})
                i = end
            else:
                i += 1
        return chunks
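
Typical usage (the sentence-transformer model is downloaded on first run; the sample text is invented):

chunker = EmbeddingChunker()
doc = ("Transformers rely on self-attention. Attention weights relate every token to every other token. "
       "Cooking pasta starts with salted boiling water. Drain the pasta once it is al dente.")
for c in chunker.similarity_merge(doc, threshold=0.5):
    print(c['content'])
# Sentences about the same topic tend to land in the same chunk; the exact split depends on the model and threshold.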

When to use: When you need data‑driven chunk boundaries and have compute resources.

19. Agentic / LLM‑Based Chunking

Let a large language model decide where to split the text.

import openai, json

def llm_based_chunking(text: str, api_key: str, model: str = 'gpt-3.5-turbo'):
    openai.api_key = api_key
    prompt = f"""Split the following text into semantically complete chunks of about 100‑200 words. Return a JSON object with a \"chunks\" array.

Text:
{text[:1000]}"""
    try:
        resp = openai.ChatCompletion.create(model=model, messages=[{'role':'system','content':'You are a text‑analysis assistant.'},{'role':'user','content':prompt}], temperature=0.3)
        result = json.loads(resp.choices[0].message.content)
        return result.get('chunks', [text])
    except Exception as e:
        print(f"LLM chunking failed: {e}")
        return recursive_chunking(text)

When to use: Highly unstructured or domain‑specific documents where human‑like judgment is required, and cost is acceptable.

20. Hierarchical Chunking

Build a multi‑level tree (document → chapters → sections → paragraphs → sentences) to support multi‑granularity retrieval.

class HierarchicalChunker:
    def __init__(self):
        self.hierarchy = {'document': None, 'chapters': [], 'sections': [], 'paragraphs': [], 'sentences': []}

    def build_hierarchy(self, text: str, max_depth: int = 4):
        self.hierarchy['document'] = {'content': text[:500] + ('...' if len(text) > 500 else ''), 'full_content': text}
        self._split_chapters(text, max_depth)
        return self.hierarchy

    def _split_chapters(self, text, depth):
        if depth <= 0:
            return
        patterns = [r'^(#{1,3})\s+(.+)$', r'^第[一二三四五六七八九十\d]+章\s+.+$', r'^Chapter\s+\d+[:.-]?\s+.+$']
        matches = []
        for pat in patterns:
            m = list(re.finditer(pat, text, re.MULTILINE))
            if m and (not matches or len(m) > len(matches)):
                matches = m
        if not matches:
            self._split_paragraphs(text, depth)
            return
        chapters = []
        last = 0
        for i, m in enumerate(matches):
            start = m.start()
            if start > last:
                chapters.append({'title': f'Part {i}', 'content': text[last:start].strip()})
            last = start
        chapters.append({'title': f'Part {len(matches)+1}', 'content': text[last:].strip()})
        self.hierarchy['chapters'] = chapters
        for chap in chapters:
            self._split_sections(chap['content'], depth-1, chap['title'])

    def _split_sections(self, text, depth, parent_title):
        if depth <= 0:
            return
        matches = list(re.finditer(r'^(#{2,4})\s+(.+)$', text, re.MULTILINE))
        if not matches:
            self._split_paragraphs(text, depth)
            return
        sections = []
        last = 0
        for i, m in enumerate(matches):
            start = m.start()
            if start > last:
                sections.append({'parent': parent_title, 'title': f'Section {i}', 'content': text[last:start].strip()})
            last = start
        sections.append({'parent': parent_title, 'title': f'Section {len(matches)+1}', 'content': text[last:].strip()})
        self.hierarchy['sections'].extend(sections)
        for sec in sections:
            self._split_paragraphs(sec['content'], depth-1)

    def _split_paragraphs(self, text, depth):
        paras = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
        for i, p in enumerate(paras, 1):
            para_obj = {'id': f'para_{len(self.hierarchy["paragraphs"]) + 1}', 'content': p, 'sentence_count': len(re.split(r'[。!?.!?]+', p))}
            self.hierarchy['paragraphs'].append(para_obj)
            if depth > 3:
                self._split_sentences(p)

    def _split_sentences(self, text):
        sents = [s.strip() for s in re.split(r'[。!?.!?]+', text) if s.strip()]
        for i, s in enumerate(sents, 1):
            self.hierarchy['sentences'].append({'id': f'sent_{len(self.hierarchy["sentences"]) + 1}', 'content': s})

    def get_chunks_at_level(self, level: str, min_length: int = 0):
        if level not in self.hierarchy:
            return []
        return [c for c in self.hierarchy[level] if len(c.get('content','')) >= min_length]
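
Typical usage on a short made-up document, retrieving paragraph-level chunks from the tree:

sample = "# Chapter 1\n\nNeural networks learn layered representations.\n\n# Chapter 2\n\nRetrieval augments generation with external context."
chunker = HierarchicalChunker()
tree = chunker.build_hierarchy(sample)
print(len(tree['chapters']), 'chapters')
for p in chunker.get_chunks_at_level('paragraphs', min_length=20):
    print(p['id'], p['content'])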

When to use: Books, encyclopedias, knowledge bases that need multi‑granular access.

21. Modality‑Aware Chunking

Detect and separately process text, images, tables, and code within a document.

import re, json
from PIL import Image
import pytesseract
import pandas as pd

class MultiModalChunker:
    def __init__(self):
        self.chunks = []

    def process(self, doc_path: str):
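        # Note: the PDF, image, and docx handlers referenced below are not shown
        # in this excerpt; only the plain-text path (_process_text) is reproduced.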
        if doc_path.lower().endswith('.pdf'):
            return self._process_pdf(doc_path)
        elif doc_path.lower().endswith(('.png', '.jpg', '.jpeg')):
            return self._process_image(doc_path)
        elif doc_path.lower().endswith('.docx'):
            return self._process_docx(doc_path)
        else:
            return self._process_text(doc_path)

    def _process_text(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        return self._chunk_by_type(text)

    def _chunk_by_type(self, text):
        lines = text.split('\n')
        cur, cur_type = [], None
        for line in lines:
            line = line.rstrip()
            if not line:
                continue
            if re.match(r'^\s*[\d•\-\*]\s+', line):
                typ = 'list'
            elif line.startswith('```') or re.match(r'^ {4,}', line, re.MULTILINE):
                typ = 'code'
            elif re.match(r'^#{1,6}\s+', line):
                typ = 'heading'
            elif re.match(r'^\|.+\|$', line) or re.match(r'^\+[-]+\+$', line):
                typ = 'table'
            elif line.startswith('>'):
                typ = 'quote'
            else:
                typ = 'text'
            if typ != cur_type and cur:
                self._save_chunk(cur, cur_type)
                cur = []
            cur_type = typ
            cur.append(line)
        if cur:
            self._save_chunk(cur, cur_type)
        return self.chunks

    def _save_chunk(self, lines, typ):
        content = '\n'.join(lines)
        meta = {'line_count': len(lines), 'type': typ}
        if typ == 'table':
            content = self._process_table(content)
        self.chunks.append({'type': typ, 'content': content, 'metadata': meta})

    def _process_table(self, txt):
        try:
            rows = [r.strip() for r in txt.split('\n') if '|' in r]
            data = [ [c.strip() for c in r.split('|') if c.strip()] for r in rows]
            if len(data) >= 2:
                headers = data[0]
                body = data[1:]
                return json.dumps({'headers': headers, 'rows': body, 'row_count': len(body), 'col_count': len(headers)})
        except Exception:
            pass
        return txt
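
Typical usage on a plain-text or Markdown file (the path is illustrative):

chunker = MultiModalChunker()
for chunk in chunker.process('manual.md'):  # hypothetical path; routed to _process_text
    print(chunk['type'], chunk['metadata']['line_count'])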

When to use: Technical manuals, product datasheets, or any document mixing text, code, tables, and images.

Choosing the Right Strategy – Decision Guide

Ask four key questions (a minimal dispatcher sketch follows the list):

Document type: Structured (Markdown/HTML/JSON) → use document‑structure or structured chunking; semi‑structured (reports, papers) → paragraph or recursive chunking; unstructured (scans, chats) → fixed‑size, sliding‑window, or semantic chunking.

Query characteristics: Fact lookup → sentence-level chunks; analytical → paragraph-level; mixed → hierarchical or multi-granular chunks.

Resource constraints: Limited compute → rule‑based (fixed‑size, line, sentence); ample compute → embedding‑based, LLM‑based.

Real‑time requirement: Avoid heavy models (LLM, clustering) for latency‑critical pipelines.
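
As a rough illustration only, reusing functions defined earlier (the mapping is a simplification, not a fixed rule):

def choose_chunker(doc_type: str, latency_critical: bool = False):
    """Map a coarse document type to one of the chunkers above (illustrative)."""
    if doc_type in ('markdown', 'html'):
        return document_structure_chunking
    if doc_type == 'json':
        return structured_json_chunking
    if doc_type in ('report', 'paper'):
        return paragraph_chunking if latency_critical else recursive_chunking
    # Unstructured text: cheap rule-based splitting when latency matters, semantic otherwise.
    return fixed_size_chunking if latency_critical else semantic_chunking

chunks = choose_chunker('markdown')(document_text)  # document_text: your raw document string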

Common Pitfalls and How to Avoid Them

Too fragmented: Increase chunk size or add overlap.

Too large: Reduce size or switch to fixed‑size/window methods.

Ignoring structure: Preserve tables, code blocks, and headings using structure‑aware or table‑aware chunking.

Multilingual punctuation: Use language‑specific sentence splitters (e.g., spaCy multilingual models).

Practical Tips

Start with simple recursive chunking; iterate to more sophisticated methods only if retrieval quality suffers.

Validate chunking by running real queries against your vector store and inspecting results.

Combine strategies: e.g., apply document‑structure chunking first, then semantic merging within each section.
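
For instance, a minimal sketch reusing the functions above, splitting by headings first and then merging semantically within each section:

def structure_then_semantic(text: str, threshold: float = 0.7):
    """Heading-based split first, then semantic merging of sentences inside each section."""
    combined = []
    for section in document_structure_chunking(text):
        for chunk in semantic_chunking(section['content'], threshold=threshold):
            combined.append({'title': section['title'], 'content': chunk})
    return combined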

Conclusion

Effective chunking is the foundation of a performant RAG system. By understanding the 21 strategies outlined above, you can select, combine, and fine-tune the approach that best fits your data, resources, and use case, ultimately delivering more accurate and context-aware generative AI applications.

About the Authors

Data Party THU is a data-science community backed by Tsinghua University's Big Data Research Center. The article was edited by Yu Teng-Kai and proofread by Lin Yi-Lin.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.