How to Load and Split Documents for RAG: First Step to Building a Knowledge Base
This tutorial explains why document loading and splitting are critical for RAG pipelines, introduces LangChain's Document format, demonstrates loaders for various file types, details the RecursiveCharacterTextSplitter and alternative splitters, and provides practical tips on parameter tuning, metadata preservation, Chinese text handling, and common pitfalls.
01 Role of Document Processing in RAG
Before tackling vector search or prompt engineering, the first step of any RAG project is to load and split documents correctly; garbage in means garbage out, and poor input data caps the performance of every downstream stage.
02 Document Object: LangChain Standard Format
All documents in LangChain are represented by a Document object with two fields: pageContent (the text) and metadata (auxiliary information). The loader outputs an array of Document objects, the splitter consumes that array, and the embedder extracts vectors from pageContent while metadata is used for filtering and provenance.
import { Document } from "@langchain/core/documents";
const doc = new Document({
  pageContent: "This is the document text loaded by a Loader",
  metadata: {
    source: "report.pdf",
    page: 3,
    author: "James",
    createdAt: "2025-01-15"
  }
});
console.log(doc.pageContent);
console.log(doc.metadata);
03 Document Loader: Loading Various Formats
3.1 Loading Plain Text Files
import { TextLoader } from "langchain/document_loaders/fs/text";
const loader = new TextLoader("./data/readme.txt");
const docs = await loader.load();
console.log(docs.length); // 1 (the whole file as one Document)
console.log(docs[0].pageContent);
console.log(docs[0].metadata.source); // "./data/readme.txt"
3.2 Loading PDF Files
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// default: one Document per page
const loader = new PDFLoader("./data/ai-report.pdf");
const docs = await loader.load();
console.log(docs.length); // number of pages
console.log(docs[0].metadata); // { source: "./data/ai-report.pdf", pdf: {...}, loc: { pageNumber: 1 } }
// merge all pages into a single Document
const loaderMerged = new PDFLoader("./data/ai-report.pdf", { splitPages: false });
const mergedDocs = await loaderMerged.load();
console.log(mergedDocs.length); // 1
3.3 Loading CSV Files
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
const loader = new CSVLoader("./data/products.csv");
const docs = await loader.load();
console.log(docs[0].pageContent); // "name: iPhone 15\nprice: 7999\ncategory: 手机"
console.log(docs[0].metadata); // { source: "./data/products.csv", line: 1 }
// use a specific column as pageContent
const loaderWithColumn = new CSVLoader("./data/products.csv", { column: "description" });
3.4 Loading JSON Files
import { JSONLoader } from "langchain/document_loaders/fs/json";
// default: extract all string values
const loader = new JSONLoader("./data/faq.json");
const docs = await loader.load();
// extract only the "answer" field using a JSON Pointer
const loaderWithPointer = new JSONLoader("./data/faq.json", "/answer");
const answerDocs = await loaderWithPointer.load();
console.log(answerDocs[0].pageContent); // first FAQ answer
3.5 Loading Web Pages
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader("https://docs.langchain.com/docs/get_started/introduction");
const docs = await loader.load();
console.log(docs[0].pageContent.substring(0, 200)); // first 200 characters of the page
console.log(docs[0].metadata.source); // URL
3.6 Bulk Loading Directories
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
const loader = new DirectoryLoader("./data/knowledge-base", {
  ".txt": path => new TextLoader(path),
  ".pdf": path => new PDFLoader(path),
  ".csv": path => new CSVLoader(path)
});
const allDocs = await loader.load();
console.log(`Total loaded ${allDocs.length} Documents`);
const sourceCount = allDocs.reduce((acc, doc) => {
  const ext = doc.metadata.source.split('.').pop();
  acc[ext] = (acc[ext] || 0) + 1;
  return acc;
}, {});
console.log(sourceCount); // e.g., { txt: 12, pdf: 45, csv: 8 }
04 Text Splitter: Why Split Documents
Loading long documents (e.g., a 10,000‑character PDF) as a single chunk leads to three problems:
Why must we split documents?
Problem 1: Poor vector quality – a single vector represents the average semantics of the whole text, so specific queries match poorly.
Problem 2: Context waste – retrieving the whole document consumes many tokens, most of which are irrelevant.
Problem 3: Model limits – many models have a context window (e.g., 128K tokens); a single document may exceed it.
The goal of splitting is to create semantically coherent chunks of manageable size so that retrieval can precisely match user questions.
05 RecursiveCharacterTextSplitter: Most Common Splitter
The default LangChain splitter recursively tries a list of separators ("\n\n", "\n", " ", "") until each chunk is ≤ chunkSize. This preserves paragraphs and sentences as much as possible.
RecursiveCharacterTextSplitter splitting logic
Input text
│
▼
Try "
" (paragraph)
│
├── chunk ≤ chunkSize? ✅ keep
└── chunk > chunkSize? continue ↓
▼
Try "
" (line break)
│
├── chunk ≤ chunkSize? ✅ keep
└── chunk > chunkSize? continue ↓
▼
Try " " (space)
│
├── chunk ≤ chunkSize? ✅ keep
└── chunk > chunkSize? continue ↓
▼
Hard split by single characters
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
  separators: ["\n\n", "\n", " ", ""]
});
const text = `Chapter 1 Product Overview
This product is an intelligent customer-service system built on large language models, supporting multi-turn dialogue, knowledge-base retrieval, automatic ticket creation, and more.
Chapter 2 Core Features
2.1 Multi-turn Dialogue
The system supports context-aware multi-turn dialogue: it remembers the questions the user asked earlier in the current session and links them together when answering.
2.2 Knowledge-Base Retrieval
Built on a RAG architecture, the system automatically retrieves relevant document fragments from the enterprise knowledge base and combines them with the LLM to generate accurate answers. Multiple document formats are supported, including PDF, Word, and web pages.`;
const chunks = await splitter.createDocuments([text]);
chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i} (${chunk.pageContent.length} chars) ---`);
  console.log(chunk.pageContent);
});
chunkSize: the maximum number of characters per chunk. It is an upper bound ("no more than"), not a target length; chunks are often shorter.
chunkOverlap: the number of characters shared between adjacent chunks, preserving context across chunk boundaries.
Effect of chunkOverlap
chunkOverlap = 0 (no overlap)
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Chunk 0 │ │ Chunk 1 │ │ Chunk 2 │
└──────────┘ └──────────┘ └──────────┘
↑ may cut a sentence in half
chunkOverlap = 50
┌───────────────┐
│ Chunk 0 │
└─────┬─────────┘
│Overlap│
┌────┴─────┐
│ Chunk 1 │
└────┬─────┘
│Overlap│
┌────┴─────┐
│ Chunk 2 │
└───────────┘
Overlap preserves continuity.
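If you want to see the overlap on real output, you can locate the text shared between consecutive chunks. Below is a minimal sketch (sharedOverlap is a hypothetical helper, and chunks is the array produced in the example above); note that the measured overlap is often smaller than chunkOverlap, because the splitter only cuts at separator boundaries:
// Hypothetical helper: longest suffix of the previous chunk that is
// also a prefix of the next chunk, i.e., the text the two chunks share.
function sharedOverlap(prev, next) {
  for (let len = Math.min(prev.length, next.length); len > 0; len--) {
    if (next.startsWith(prev.slice(prev.length - len))) {
      return prev.slice(prev.length - len);
    }
  }
  return "";
}
for (let i = 1; i < chunks.length; i++) {
  const overlap = sharedOverlap(chunks[i - 1].pageContent, chunks[i].pageContent);
  console.log(`Chunk ${i - 1} -> Chunk ${i}: ${overlap.length} overlapping chars`);
}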
06 Other Splitters: CharacterTextSplitter and TokenTextSplitter
CharacterTextSplitter
Splits only by a single separator without recursion, suitable for highly regular texts such as logs or CSV‑converted text.
import { CharacterTextSplitter } from "@langchain/textsplitters";
const splitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 500,
  chunkOverlap: 0
});
const text = "Paragraph one...\n\nParagraph two...\n\nParagraph three...";
const chunks = await splitter.createDocuments([text]);
// If a paragraph exceeds 500 characters, it is kept as-is (no further splitting).
TokenTextSplitter
Splits by token count instead of characters, useful when you need precise token budgeting.
import { TokenTextSplitter } from "@langchain/textsplitters";
const splitter = new TokenTextSplitter({
  chunkSize: 200, // max 200 tokens per chunk
  chunkOverlap: 20,
  encodingName: "cl100k_base" // OpenAI tokenizer encoding
});
const docs = await splitter.createDocuments([longText]);
// Each chunk respects the token limit, ideal for LLM input control.
Comparison table:
┌────────────┬──────────────┬───────────────────────┬────────────┐
│ Splitter   │ Splitting    │ Suitable Scenarios    │ Rating     │
├────────────┼──────────────┼───────────────────────┼────────────┤
│ Recursive  │ Multi-level  │ General documents     │ ⭐⭐⭐⭐⭐ │
│ Character  │ Single sep.  │ Structured logs/CSV   │ ⭐⭐⭐     │
│ Token      │ Token count  │ Precise token control │ ⭐⭐⭐⭐   │
└────────────┴──────────────┴───────────────────────┴────────────┘
07 Parameter Tuning: Choosing chunkSize and chunkOverlap
There is no silver bullet; the optimal values depend on document type, embedding model, and retrieval needs.
chunkSize selection strategy
chunkSize too small (< 200 chars): a chunk like "System supports multi-turn dialogue" is semantically incomplete and lacks context.
chunkSize too large (> 2000 chars): a chunk like "Chapter 1 Overview... Chapter 2 Features... Chapter 3 Architecture..." mixes several topics, so its vector averages their semantics and retrieval precision drops.
Ideal size (500-1000 chars): a chunk like "2.2 Knowledge-Base Retrieval ... supports PDF, Word, web pages ..." is semantically complete, topically focused, and token-efficient.
// Scenario-specific configurations
// Technical docs / product manuals
const techDocSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100
});
// FAQ / Q&A pairs (short texts)
const faqSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 300,
  chunkOverlap: 30
});
// Legal contracts / long reports
const legalSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
chunkOverlap guidelines
Typically set to 10‑20 % of chunkSize. Zero overlap cuts sentences in half, harming retrieval; a modest overlap preserves continuity.
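One way to keep that ratio consistent across configurations is to derive the overlap from the chunk size. A minimal sketch, assuming a 15% ratio (makeSplitter is an illustrative helper, not a LangChain API):
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Hypothetical helper: compute chunkOverlap as a fixed fraction of chunkSize,
// so every splitter in the project keeps the same 10-20% overlap ratio.
function makeSplitter(chunkSize, overlapRatio = 0.15) {
  return new RecursiveCharacterTextSplitter({
    chunkSize,
    chunkOverlap: Math.round(chunkSize * overlapRatio)
  });
}
const shortTextSplitter = makeSplitter(300); // chunkOverlap: 45
const manualSplitter = makeSplitter(800);    // chunkOverlap: 120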
Recommended configurations
┌────────────────┬────────────┬──────────────┬──────────────────────────────────┐
│ Document type  │ chunkSize  │ chunkOverlap │ Notes                            │
├────────────────┼────────────┼──────────────┼──────────────────────────────────┤
│ FAQ / short    │ 200-400    │ 20-50        │ Small chunks                     │
│ Technical      │ 500-1000   │ 50-100       │ Balance precision & completeness │
│ Legal/Academic │ 800-1500   │ 100-200      │ Need full context                │
│ Code           │ 500-800    │ 50-100       │ Prefer function-level splitting  │
└────────────────┴────────────┴──────────────┴──────────────────────────────────┘
08 Metadata Retention: Preserve Source Information After Splitting
Each chunk automatically inherits the original Document's metadata and adds its own location info, enabling source‑based filtering and provenance during retrieval.
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const loader = new PDFLoader("./data/product-manual.pdf");
const docs = await loader.load();
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });
const chunks = await splitter.splitDocuments(docs);
console.log(chunks[0].metadata);
// { source: "./data/product-manual.pdf", pdf: {...}, loc: { pageNumber: 1, lines: { from: 0, to: 15 } } }
You can also enrich metadata before splitting:
const enrichedDocs = docs.map(doc => ({
  ...doc,
  metadata: {
    ...doc.metadata,
    department: "Product",
    docType: "Manual",
    version: "v2.1",
    indexedAt: new Date().toISOString()
  }
}));
const enrichedChunks = await splitter.splitDocuments(enrichedDocs);
console.log(enrichedChunks[0].metadata);
// { source: "...", department: "Product", docType: "Manual", version: "v2.1", ... }
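With metadata in place, chunks can be filtered by source attributes before indexing, or cited for provenance afterwards. A minimal sketch using plain array filtering on the enrichedChunks from above (production setups usually apply such filters through the vector store's metadata-filter options):
// Build a department-specific subset, e.g., for a "Product" index.
const productChunks = enrichedChunks.filter(
  chunk => chunk.metadata.department === "Product"
);
// Provenance: each chunk can cite its source file and page.
for (const chunk of productChunks.slice(0, 3)) {
  console.log(`${chunk.metadata.source} (page ${chunk.metadata.loc?.pageNumber})`);
}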
09 Special Handling for Chinese Documents
Chinese text lacks whitespace between words, so the default separator list ("\n\n", "\n", " ", "") is insufficient; add Chinese punctuation to the separator list so chunks break at sentence boundaries.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const chineseSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
  separators: ["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""]
});
// Sample text (Chinese): paragraph 1 covers AI in healthcare and finance; paragraph 2 covers challenges such as privacy, bias, and job displacement.
const chineseText = `人工智能正在深刻改变各行各业。在医疗领域,AI辅助诊断系统已经能够识别X光片中的异常,准确率超过资深医生。在金融领域,智能风控系统每天处理数百万笔交易,实时识别欺诈行为。
然而,AI技术的发展也带来了新的挑战。数据隐私保护、算法偏见、就业替代等问题,都需要社会各界共同面对和解决。`;
const chunks = await chineseSplitter.createDocuments([chineseText]);
chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i} ---`);
  console.log(chunk.pageContent);
});
10 Common Pitfalls
Pitfall 1: PDF loader returns garbled or empty text
Scanned PDFs contain images; PDFLoader relies on pdf-parse, which only extracts text. Use an OCR step or a loader that supports OCR for image‑based PDFs.
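A cheap safeguard is to check whether extraction actually produced text. A minimal sketch (the 20-character threshold and the file path are illustrative choices):
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// Flag PDFs whose extracted text is (nearly) empty: a strong hint
// that the file is a scanned image that needs OCR.
const loader = new PDFLoader("./data/maybe-scanned.pdf");
const docs = await loader.load();
const extractedChars = docs.reduce((n, d) => n + d.pageContent.trim().length, 0);
if (extractedChars < 20) {
  console.warn("⚠️ Almost no text extracted; likely a scanned PDF. Route it to an OCR pipeline instead.");
}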
Pitfall 2: Treating chunkSize as token count
RecursiveCharacterTextSplitter measures characters, not tokens. For token-accurate limits, either use TokenTextSplitter or provide a custom lengthFunction that counts tokens via a tokenizer.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { encoding_for_model } from "tiktoken";
const enc = encoding_for_model("gpt-4o");
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500, // now interpreted as 500 tokens
  chunkOverlap: 50,
  lengthFunction: text => enc.encode(text).length
});
Pitfall 3: Zero overlap removes context
Setting chunkOverlap = 0 can split a sentence in half, making each chunk semantically incomplete. Keep overlap at least 10 % of chunkSize.
Pitfall 4: Losing metadata during cleaning
When you create a new Document after cleaning text, remember to copy the original metadata; otherwise you lose source information needed for later filtering.
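A minimal sketch of metadata-safe cleaning (cleanText is a hypothetical normalization helper, and docs is a previously loaded Document array):
import { Document } from "@langchain/core/documents";
// Hypothetical helper: strip control characters (keeping newlines) and collapse runs of spaces/tabs.
const cleanText = text =>
  text
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "")
    .replace(/[ \t]+/g, " ")
    .trim();
// Rebuild each Document with cleaned text while copying metadata verbatim.
const cleanedDocs = docs.map(doc => new Document({
  pageContent: cleanText(doc.pageContent),
  metadata: doc.metadata // keep source, page, etc. for later filtering
}));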
Pitfall 5: Loading huge files causes OOM
DirectoryLoader recursively loads every matching file. Filter out files larger than a safe threshold before loading.
import * as fs from "fs";
import * as path from "path";
const MAX_FILE_SIZE = 50 * 1024 * 1024; // 50 MB
const files = fs.readdirSync("./knowledge-base");
const safeFiles = files.filter(f => {
  const filePath = path.join("./knowledge-base", f);
  const stats = fs.statSync(filePath);
  if (stats.size > MAX_FILE_SIZE) {
    console.warn(`⚠️ Skipping large file: ${f} (${(stats.size / 1024 / 1024).toFixed(1)} MB)`);
    return false;
  }
  return true;
});
Conclusion
Document is LangChain's unified data format with pageContent and metadata, used throughout the RAG pipeline.
Loader standardizes input from PDFs, TXT, CSV, JSON, web pages, or whole directories into Document[].
Splitting is mandatory for efficient retrieval; chunks of 500‑1000 characters strike a good balance.
RecursiveCharacterTextSplitter is the default choice for most scenarios, preserving semantic boundaries.
Parameter tuning depends on document type: adjust chunkSize and chunkOverlap accordingly.
Chinese text requires custom separators (Chinese punctuation) so that chunks break at sentence boundaries rather than mid-sentence.
Next, we will dive deeper into LangChain splitters such as MarkdownSplitter and CodeSplitter, and show how to perform semantic chunking based on document structure to further improve RAG retrieval accuracy.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines the core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. My goal is to help you build a solid foundation in the AI era.