Mastering Chunking Strategies for Effective RAG: Fixed, Recursive, Semantic, Structured, and Delayed

This article walks through the core RAG pipeline, explains why chunking is the linchpin of retrieval quality, and provides detailed definitions, trade‑offs, and implementation examples for five chunking techniques—fixed, recursive, semantic, structure‑aware, and delayed—so you can choose the right approach for any document‑heavy AI application.


RAG Workflow Overview

RAG (Retrieval‑Augmented Generation) consists of four stages: document ingestion and chunking, vector embedding and storage, query‑time retrieval, and LLM‑driven answer generation. Retrieval quality directly determines the final output because the generator only sees the retrieved context.
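To make the flow concrete, here is a minimal sketch of those four stages; the function names are placeholders for illustration, not any particular library's API:

def rag_pipeline(documents, query):
    # 1. Ingest raw documents and split them into chunks
    chunks = chunk_documents(documents)
    # 2. Embed each chunk and store the vectors in an index
    index = embed_and_store(chunks)
    # 3. At query time, retrieve the chunks most relevant to the query
    context = retrieve(index, query)
    # 4. Generate the answer conditioned only on the retrieved context
    return generate_answer(query, context)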

Why Chunking Matters

LLMs have limited context windows, so raw documents must be split into manageable pieces that preserve semantic completeness. Over‑splitting creates noise; under‑splitting hides fine‑grained details. Effective chunking balances token limits, semantic integrity, and computational cost.

Five Main Chunking Techniques

1. Fixed Chunking

Split text into equal‑length blocks (by tokens, words, or characters) with a configurable overlap.

def fixed_chunk(text, max_tokens=512, overlap=50):
    """Split text into equal-length, overlapping blocks of word-level tokens."""
    assert overlap < max_tokens, "overlap must be smaller than max_tokens"
    tokens = text.split()  # simple whitespace tokenizer; swap in a model tokenizer if needed
    chunks = []
    i = 0
    while i < len(tokens):
        chunk = tokens[i : i + max_tokens]
        chunks.append(" ".join(chunk))  # rejoin tokens into a text chunk
        i += max_tokens - overlap  # advance, keeping `overlap` tokens of shared context
    return chunks

Best for unstructured or monotonic data such as logs; serves as a solid baseline.

2. Recursive Chunking

First split at high‑level boundaries (paragraphs or sections). If a block exceeds the size limit, recursively split it further (e.g., by sentences) until all chunks fit.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]  # try paragraphs, then lines, words, characters
)
chunks = text_splitter.split_text(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n{'-' * 40}")

Preserves logical units and adapts chunk size to content structure.

3. Semantic Chunking

Compute sentence‑level embeddings and cut when semantic similarity drops below a threshold, ensuring each chunk groups semantically related sentences.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(sentence_list, sim_threshold=0.7):
    """Group consecutive sentences; start a new chunk where similarity drops."""
    if not sentence_list:
        return []
    embeddings = model.encode(sentence_list)
    chunks = []
    current = [sentence_list[0]]
    for i in range(1, len(sentence_list)):
        # cosine similarity between adjacent sentences
        sim = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if sim < sim_threshold:
            chunks.append(" ".join(current))
            current = [sentence_list[i]]
        else:
            current.append(sentence_list[i])
    chunks.append(" ".join(current))
    return chunks

Ideal for high‑precision domains (legal, scientific) but incurs extra embedding cost.

4. Structure‑Aware Chunking

Leverage inherent document markup (HTML headings, tables, lists) as natural boundaries. Each heading level becomes a root node; large sections can fall back to recursive splitting. Tables and images may become separate chunks or be summarized.

Parse HTML/Markdown/PDF to extract hierarchical elements.

Use <h1>, <h2>, etc., as chunk anchors.

Recursively split overly long sections.

Optionally treat tables/images as independent chunks.

In practice this strategy yields the best results when combined with recursive splitting.
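As a minimal sketch of that combination, assuming a Markdown source (the sample markdown_text here is made up for illustration), LangChain's MarkdownHeaderTextSplitter can anchor chunks at headings while the recursive splitter handles oversized sections:

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

markdown_text = "# Guide\n## Setup\nInstall the package...\n## Usage\nCall the API..."

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]  # heading levels as chunk anchors
)
sections = header_splitter.split_text(markdown_text)  # one Document per section

# Fall back to recursive splitting for sections that exceed the size limit
fallback = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = fallback.split_documents(sections)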

5. Delayed (Dynamic) Chunking

Store large passages or whole documents in the index. At query time, retrieve the most relevant large segment, then dynamically split only the relevant portion using semantic or overlap methods. This “reverse” order reduces upfront processing but requires long‑context models and higher query‑time compute.

Index large paragraphs or full documents.

When a query arrives, retrieve the top‑k relevant large segments.

Within those segments, perform on‑the‑fly fine‑grained splitting.

Filter/sort the resulting fine chunks and feed them to the LLM.

Suitable for massive, frequently updated corpora or high‑risk applications where preserving full context is critical.
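A minimal sketch of this query-time flow, assuming a LangChain vector store (vector_store) already indexed over large segments; the re-ranking step shown is one plausible way to implement the filter/sort stage, reusing the sentence-embedding model from the semantic example:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
fine_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)

def delayed_chunk(query, vector_store, k=3, top_n=5):
    # 1. Coarse retrieval: pull the top-k large segments from the index
    segments = vector_store.similarity_search(query, k=k)
    # 2. Fine-grained splitting, applied only to what was retrieved
    fine_chunks = []
    for seg in segments:
        fine_chunks.extend(fine_splitter.split_text(seg.page_content))
    # 3. Re-rank fine chunks by similarity to the query; keep the best few
    q_emb = model.encode(query)
    c_embs = model.encode(fine_chunks)
    scores = util.cos_sim(q_emb, c_embs)[0].tolist()
    ranked = sorted(zip(fine_chunks, scores), key=lambda t: t[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]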

Final Thoughts

Chunking may seem trivial, yet it caps the performance of any RAG system. Fixed chunking offers speed; recursive chunking balances semantics and size; semantic chunking maximizes relevance at higher cost; structure‑aware chunking excels on markup‑rich documents; delayed chunking provides the most flexibility when resources allow. In real projects, a hybrid approach—e.g., structure‑aware baseline plus recursive fallback—often yields the best trade‑off.

Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.