Mastering Text Splitting in LangChain: From Theory to Code

This guide explains why large documents must be broken into semantic chunks for LLMs, introduces core parameters like chunk_size and chunk_overlap, compares LangChain's various splitters, and walks through a complete Python example that loads a long text, configures a RecursiveCharacterTextSplitter, and inspects the resulting chunks.

BirdNest Tech Talk

When feeding external data into large language models, documents often exceed the model's context window (e.g., 4k, 8k, or even 128k tokens). To keep the input within limits, the document must be divided into smaller, semantically related "chunks" using a text splitter.

Why Split Text?

Fit the context window: a long document may not fit in a single request; smaller chunks keep the input within limits and let the model focus on the portion most relevant to the user query.

Improve retrieval quality: in Retrieval‑Augmented Generation (RAG), each chunk is embedded and stored in a vector database; concise, focused chunks produce more accurate similarity matches, whereas overly large chunks dilute semantics.

Core Splitting Concepts

A good strategy preserves semantic integrity while producing appropriately sized pieces. Instead of cutting at an arbitrary character count, the splitter should look for natural boundaries such as paragraphs or sentences.

chunk_size and chunk_overlap

chunk_size: the maximum size of a chunk, measured in characters or tokens.

chunk_overlap: the number of characters (or tokens) that adjacent chunks share. A small overlap (e.g., 100‑200 characters) maintains continuity across chunk borders, so that an idea cut at a boundary still appears intact in the neighboring chunk.
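The arithmetic behind these two parameters can be illustrated with a naive fixed-stride sketch (plain Python, not LangChain code, which additionally prefers natural boundaries): each chunk starts chunk_size − chunk_overlap characters after the previous one.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-stride chunking: each chunk begins chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share exactly
    chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
# chunks[0] == "abcdefghij"; chunks[1] starts 7 characters later ("hijklmnopq"),
# so the last 3 characters of chunk 0 ("hij") reappear at the start of chunk 1.
```

Real splitters cut at separators rather than at fixed offsets, but the same stride/overlap relationship governs how much adjacent chunks share.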

Text Splitters Provided by LangChain

RecursiveCharacterTextSplitter: the recommended default. It recursively applies a list of separators (default ["\n\n", "\n", " ", ""]), trying paragraph breaks first, then line breaks, then spaces, so splits occur at the highest‑level semantic boundary available.

CharacterTextSplitter: simpler; splits on a single user‑specified separator (e.g., "\n\n").

TokenTextSplitter: splits directly by token count, which is useful for precise token budgeting but requires a tokenizer.

Language‑specific splitters (Python, JavaScript, Markdown, etc.): understand the syntax of those formats and produce smarter chunks, e.g., keeping a function body or a Markdown section together.
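The recursive strategy is easy to picture with a toy sketch (plain Python, not LangChain's actual implementation, which also merges small pieces back together toward chunk_size and keeps overlap): try the highest-level separator first, and re-split any piece that is still too large with the next separator in the list.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Toy version of the recursive strategy: split on the first separator;
    any piece still larger than chunk_size is re-split with the next one."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)  # "" means split per character
    out = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            out.extend(recursive_split(piece, rest, chunk_size))
    return out

doc = "Paragraph one.\n\nA much longer second paragraph\nwith two lines."
print(recursive_split(doc, ["\n\n", "\n", " ", ""], chunk_size=20))
```

The first paragraph survives intact because it already fits; only the oversized second paragraph falls through to the lower-level separators.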

Step‑by‑Step Example Using RecursiveCharacterTextSplitter

# example_1_recursive_splitter.py
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter  # in newer releases: langchain_text_splitters
import os

def main():
    """Demonstrates splitting a long document into smaller chunks using the
    recommended RecursiveCharacterTextSplitter."""
    # 1. Load the long document
    file_path = os.path.join(os.path.dirname(__file__), "long_text_sample.txt")
    if not os.path.exists(file_path):
        print(f"Error: sample file '{file_path}' does not exist.")
        return
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()
    original_doc = documents[0]
    print("--- 1. Loaded original document ---")
    print(f"Original document length: {len(original_doc.page_content)}")
    print("-" * 30)

    # 2. Create the splitter
    # chunk_size: max characters per chunk
    # chunk_overlap: characters shared between adjacent chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=150,
        chunk_overlap=20,
        length_function=len,
        add_start_index=True  # store original start index in metadata
    )
    print("\n--- 2. Created RecursiveCharacterTextSplitter ---")
    print("Chunk Size: 150, Chunk Overlap: 20")
    print("-" * 30)

    # 3. Split the document
    split_docs = text_splitter.split_documents(documents)
    print("\n--- 3. Splitting completed ---")
    print(f"Original document count: {len(documents)}")
    print(f"Number of chunks: {len(split_docs)}")
    print("-" * 30)

    # 4. Inspect the chunks
    print("\n--- 4. Inspecting chunks ---")
    for i in range(min(3, len(split_docs))):
        print(f"\n--- Chunk {i} ---")
        print(split_docs[i].page_content)
        print(f"Metadata: {split_docs[i].metadata}")
        print(f"Length: {len(split_docs[i].page_content)}")
    print("\nAnalysis: The start of chunk 1 overlaps with the end of chunk 0,"
          " demonstrating how chunk_overlap=20 preserves context continuity.")

if __name__ == "__main__":
    # pip install langchain langchain-community
    main()

The script first checks that long_text_sample.txt exists, loads it with TextLoader, and prints the original character count. It then instantiates RecursiveCharacterTextSplitter with chunk_size=150 and chunk_overlap=20, enabling start‑index metadata. After splitting, it reports how many chunks were produced and prints the content, metadata, and length of the first three chunks, highlighting the overlapping region that ensures semantic continuity.
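The overlap the script highlights in step 4 can also be checked programmatically. A small helper (hypothetical, not part of LangChain) finds the text shared at a chunk boundary by looking for the longest suffix of one chunk that is a prefix of the next:

```python
def shared_boundary(prev: str, nxt: str, max_overlap: int) -> str:
    """Return the longest suffix of `prev` (up to max_overlap characters)
    that is also a prefix of `nxt` -- the text the two chunks share."""
    for k in range(min(len(prev), len(nxt), max_overlap), 0, -1):
        if prev.endswith(nxt[:k]):
            return nxt[:k]
    return ""

prev_chunk = "a small overlap maintains continuity"
next_chunk = "continuity across chunk borders is preserved"
print(shared_boundary(prev_chunk, next_chunk, 20))  # -> "continuity"
```

Applying this to consecutive elements of split_docs would confirm that each boundary carries at most chunk_overlap shared characters.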

References

How to: recursively split text – https://python.langchain.com/docs/how_to/recursive_text_splitter

How to: split HTML – https://python.langchain.com/docs/how_to/html_splitter

How to: split by character – https://python.langchain.com/docs/how_to/character_splitter

How to: split code – https://python.langchain.com/docs/how_to/code_splitter

How to: split Markdown by headers – https://python.langchain.com/docs/how_to/markdown_splitter

How to: recursively split JSON – https://python.langchain.com/docs/how_to/json_splitter

How to: split text into semantic chunks – https://python.langchain.com/docs/how_to/semantic_chunking

How to: split by tokens – https://python.langchain.com/docs/how_to/token_splitter

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LangChain, RAG, Embedding, Text Splitting, Recursive Splitter
Written by BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.