Mastering Text Splitting in LangChain: From Theory to Code
This guide explains why large documents must be broken into semantic chunks for LLMs, introduces core parameters like chunk_size and chunk_overlap, compares LangChain's various splitters, and walks through a complete Python example that loads a long text, configures a RecursiveCharacterTextSplitter, and inspects the resulting chunks.
When feeding external data into large language models, documents often exceed the model's context window (e.g., 4k, 8k, or even 128k tokens). To keep the input within limits, the document must be divided into smaller, semantically related "chunks" using a text splitter.
Why Split Text?
Fit the context window: Each chunk (together with the prompt) must fit within the model's token limit; smaller chunks also let the model focus on the portion most relevant to the user query.
Improve retrieval quality: In Retrieval-Augmented Generation (RAG), each chunk is embedded and stored in a vector database; concise, focused chunks produce more accurate similarity matches, whereas overly large chunks dilute semantics.
Core Splitting Concepts
A good strategy preserves semantic integrity while producing appropriately sized pieces. Instead of cutting at an arbitrary character count, the splitter should look for natural boundaries such as paragraphs or sentences.
chunk_size and chunk_overlap
chunk_size: Maximum size of a chunk, measured in characters or tokens.
chunk_overlap: Number of characters (or tokens) that adjacent chunks share. A small overlap (e.g., 100-200 characters) maintains continuity across chunk borders, so a sentence cut at a boundary still appears in full in at least one chunk.
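As a rough illustration of what these two parameters mean, here is a toy fixed-stride splitter (not LangChain's actual implementation, which also prefers natural boundaries):

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so adjacent chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghijklmnopqrst", chunk_size=8, chunk_overlap=3)
print(chunks)  # ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```

Note how the first three characters of each chunk repeat the last three of the previous chunk; that shared region is exactly what chunk_overlap buys you.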
Text Splitters Provided by LangChain
RecursiveCharacterTextSplitter: The recommended default. It recursively applies a list of separators (default ["\n\n", "\n", " ", ""]) starting with paragraph breaks, then line breaks, then spaces, ensuring splits occur at the highest-level semantic boundaries.
CharacterTextSplitter: Simpler; uses a single user-specified separator (e.g., "\n\n").
TokenTextSplitter: Splits directly by token count, useful for precise token budgeting but requires a tokenizer.
Language‑specific splitters (Python, JavaScript, Markdown, etc.) understand the syntax of those languages and produce smarter chunks.
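To make the separator-priority idea concrete, here is a much-simplified sketch of the recursive strategy (illustrative only; LangChain's real implementation additionally handles chunk overlap, custom length functions, and regex separators):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Sketch of the recursive idea: try the coarsest separator first,
    recurse into finer ones only for oversized pieces, then greedily
    merge small pieces back together up to chunk_size (no overlap here)."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size and rest:
            pieces.extend(recursive_split(part, chunk_size, tuple(rest)))
        else:
            pieces.append(part)
    # Greedily merge adjacent pieces so chunks approach chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "Alpha beta gamma delta.\n\nEpsilon zeta eta theta iota kappa."
chunks = recursive_split(text, chunk_size=25)
print(chunks)
# ['Alpha beta gamma delta.', 'Epsilon zeta eta theta', 'iota kappa.']
```

The first paragraph fits as-is, so the paragraph break is respected; the second is too long, so the splitter falls through to space-level splitting and re-merges words, which is why the result never cuts mid-word.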
Step‑by‑Step Example Using RecursiveCharacterTextSplitter
# example_1_recursive_splitter.py
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os


def main():
    """Demonstrates splitting a long document into smaller chunks using the
    recommended RecursiveCharacterTextSplitter."""
    # 1. Load the long document
    file_path = os.path.join(os.path.dirname(__file__), "long_text_sample.txt")
    if not os.path.exists(file_path):
        print(f"Error: sample file '{file_path}' does not exist.")
        return

    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()
    original_doc = documents[0]
    print("--- 1. Loaded original document ---")
    print(f"Original document length: {len(original_doc.page_content)}")
    print("-" * 30)

    # 2. Create the splitter
    # chunk_size: max characters per chunk
    # chunk_overlap: characters shared between adjacent chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=150,
        chunk_overlap=20,
        length_function=len,
        add_start_index=True,  # store original start index in metadata
    )
    print("\n--- 2. Created RecursiveCharacterTextSplitter ---")
    print("Chunk Size: 150, Chunk Overlap: 20")
    print("-" * 30)

    # 3. Split the document
    split_docs = text_splitter.split_documents(documents)
    print("\n--- 3. Splitting completed ---")
    print(f"Original document count: {len(documents)}")
    print(f"Number of chunks: {len(split_docs)}")
    print("-" * 30)

    # 4. Inspect the chunks
    print("\n--- 4. Inspecting chunks ---")
    for i in range(min(3, len(split_docs))):
        print(f"\n--- Chunk {i} ---")
        print(split_docs[i].page_content)
        print(f"Metadata: {split_docs[i].metadata}")
        print(f"Length: {len(split_docs[i].page_content)}")

    print("\nAnalysis: The start of chunk 1 overlaps with the end of chunk 0,"
          " demonstrating how chunk_overlap=20 preserves context continuity.")


if __name__ == "__main__":
    # pip install langchain langchain-community
    main()

The script first checks that long_text_sample.txt exists, loads it with TextLoader, and prints the original character count. It then instantiates RecursiveCharacterTextSplitter with chunk_size=150 and chunk_overlap=20, enabling start-index metadata. After splitting, it reports how many chunks were produced and prints the content, metadata, and length of the first three chunks, highlighting the overlapping region that ensures semantic continuity.
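The overlap called out in the analysis step can also be verified programmatically. This small helper (hypothetical, not part of LangChain) finds the longest text shared between two consecutive chunk strings:

```python
def shared_overlap(prev_chunk: str, next_chunk: str, max_len: int = 20) -> str:
    # Longest suffix of prev_chunk that is also a prefix of next_chunk,
    # capped at max_len (here matching the script's chunk_overlap=20).
    for n in range(min(max_len, len(prev_chunk), len(next_chunk)), 0, -1):
        if prev_chunk.endswith(next_chunk[:n]):
            return next_chunk[:n]
    return ""

print(shared_overlap("overlap keeps context", "keeps context intact"))
# keeps context
```

Applied to split_docs[0].page_content and split_docs[1].page_content from the script above, it reveals the shared region; because the splitter prefers natural boundaries, the actual overlap may be shorter than the configured 20 characters.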
References
How to: recursively split text – https://python.langchain.com/docs/how_to/recursive_text_splitter
How to: split HTML – https://python.langchain.com/docs/how_to/html_splitter
How to: split by character – https://python.langchain.com/docs/how_to/character_splitter
How to: split code – https://python.langchain.com/docs/how_to/code_splitter
How to: split Markdown by headers – https://python.langchain.com/docs/how_to/markdown_splitter
How to: recursively split JSON – https://python.langchain.com/docs/how_to/json_splitter
How to: split text into semantic chunks – https://python.langchain.com/docs/how_to/semantic_chunking
How to: split by tokens – https://python.langchain.com/docs/how_to/token_splitter
BirdNest Tech Talk
Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.