How to Load Documents into LangChain: From Files to APIs
Learn how to use LangChain's Document Loaders to import data from files, web pages, databases, and APIs; understand the Document object's structure; compare load() versus lazy_load(); and follow a step-by-step Python example that loads, inspects, and optionally processes documents with an LLM.
This tutorial explains how to bring external data into a LangChain application using the built-in Document Loaders. When building a Q&A bot over internal documents, the first step is loading those documents into the program.
Document object
LangChain represents a piece of text as a Document object, which contains two core fields:
page_content (str): the main text of the document.
metadata (dict): auxiliary information such as source (file name, URL, or table name), page (PDF page number), and row (CSV row number).
Example class definition:
class Document:
    page_content: str
    metadata: dict = {}

Why use Document Loaders?
Document loaders abstract away the tedious work of extracting text from many complex sources. The LangChain community provides hundreds of loaders covering common formats: plain text (.txt), CSV, JSON, Markdown, PDF, Word (.docx), Excel (.xlsx), web pages, databases, and collaboration tools like Notion or Confluence.
Using a loader typically requires only the source path or URL and a call to .load(), which returns a list of Document objects ready for downstream components such as text splitters or embedding models.
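The path-in, Document-list-out pattern can be sketched in plain Python. The SimpleTextLoader below is a hypothetical, stdlib-only stand-in for a real LangChain loader, shown only to illustrate the interface; the real classes live in langchain_community.document_loaders.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Mirrors LangChain's Document: main text plus auxiliary metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical stand-in for a LangChain text loader (stdlib only)."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[Document]:
        # Like TextLoader, the whole file becomes a single Document,
        # with the file path recorded in metadata["source"].
        with open(self.file_path, encoding=self.encoding) as f:
            text = f.read()
        return [Document(page_content=text, metadata={"source": self.file_path})]
```

A real loader is used the same way: construct it with a source path or URL, call .load(), and hand the resulting list to a text splitter or embedding model.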
load() vs lazy_load()
load(): reads the entire source into memory at once and returns a List[Document]. Suitable for small to medium datasets, but can exhaust memory on large collections.
lazy_load(): returns an iterator that yields documents one at a time, so very large files or datasets can be processed with a minimal memory footprint.
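The difference can be sketched with a hypothetical loader that implements both methods (stdlib only, not real LangChain code): lazy_load() is a generator that yields one item at a time, while load() materializes the full list.

```python
from typing import Iterator


class LineLoader:
    """Hypothetical loader that treats each line of a file as one document."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def lazy_load(self) -> Iterator[str]:
        # Generator: only one line is held in memory at a time.
        with open(self.file_path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

    def load(self) -> list[str]:
        # Eager: materializes every document in memory at once.
        return list(self.lazy_load())
```

LangChain's own base loader follows the same shape, with load() typically implemented on top of lazy_load(), which is why lazy_load() is the one to reach for on large sources.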
The examples in this tutorial demonstrate both approaches with the most common loaders (web pages, plain-text files, CSV).
Example 2: Using TextLoader to read a local text file
The script example_2_file_loader.py shows how to convert a local .txt file into LangChain Document objects and optionally feed them into a simple LLM workflow.
Prepare the file path: the script builds an absolute path to sample.txt in the same directory as the script and prints a friendly error if the file is missing.
Load documents: it creates a TextLoader(file_path, encoding="utf-8") instance and calls .load(). By default the whole file becomes a single Document, with metadata automatically containing the source field.
Inspect the results: the script prints the number of loaded documents, the metadata dictionary, and the page content so you can get familiar with the Document structure.
Optional LLM interaction: if the optional package langchain-openai is installed and the environment variable OPENAI_API_KEY is set, the script builds a simple chain that first summarizes the document and then answers a follow-up question about its recommended use case. If either is missing, the script skips this step gracefully.
# example_2_file_loader.py
import os
from pprint import pprint

from langchain_community.document_loaders import TextLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

try:
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None

try:
    from langchain_openai import ChatOpenAI
except ImportError:
    ChatOpenAI = None


def main():
    """Demonstrate loading a local text file with TextLoader."""
    file_path = os.path.join(os.path.dirname(__file__), "sample.txt")
    if not os.path.exists(file_path):
        print(f"Error: sample file '{file_path}' does not exist.")
        return

    print(f"--- 1. Preparing to load: {file_path} ---")
    loader = TextLoader(file_path, encoding="utf-8")

    print("\n--- 2. Loading documents... ---")
    docs = loader.load()

    print("\n--- 3. Checking results ---")
    print(f"Loaded {len(docs)} document(s).")
    if not docs:
        return

    doc = docs[0]
    pprint(doc.metadata)
    print("\n--- Document content ---")
    print(doc.page_content)
    print("\nSummary: TextLoader is the standard way to load .txt, .py, .md, etc.")

    if load_dotenv:
        load_dotenv()
    interact_with_llm(doc)


def interact_with_llm(doc):
    if ChatOpenAI is None:
        print("\nTip: Install `langchain-openai` to see LLM interaction.")
        return
    if not os.getenv("OPENAI_API_KEY"):
        print("\nTip: Set OPENAI_API_KEY to enable LLM demo.")
        return

    print("\n--- 4. Using LLM on the document ---")
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a concise Chinese assistant summarizing documents."),
        ("human", "Summarize the following document in three sentences:\n{document}"),
    ])
    model_name = os.getenv("OPENAI_MODEL_NAME", "deepseek-v3")
    llm = ChatOpenAI(model=model_name, temperature=0)
    chain = prompt | llm | StrOutputParser()
    try:
        summary = chain.invoke({"document": doc.page_content})
    except Exception as exc:
        print(f"\n⚠️ LLM call failed: {exc}")
        return
    print("\n--- LLM Summary ---")
    print(summary.strip())

    follow_up_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful Chinese assistant."),
        ("human", "Answer the question based on the document: {question}\nDocument:\n{document}"),
    ])
    follow_up_chain = follow_up_prompt | llm | StrOutputParser()
    question = "What is the main usage scenario suggested by this document?"
    try:
        answer = follow_up_chain.invoke({"question": question, "document": doc.page_content})
    except Exception as exc:
        print(f"\n⚠️ LLM call failed: {exc}")
        return
    print("\n--- LLM Q&A ---")
    print(f"Question: {question}")
    print(f"Answer: {answer.strip()}")


if __name__ == "__main__":
    # pip install langchain-community langchain-core langchain-openai python-dotenv
    main()

After running the script, observe the printed document metadata and content. To try the LLM part, export OPENAI_API_KEY (or place it in a .env file) and ensure the required packages are installed.
References
How to: load PDF files [1] – https://python.langchain.com/docs/how_to/load_pdf
How to: load web pages [2] – https://python.langchain.com/docs/how_to/load_web_pages
How to: load CSV data [3] – https://python.langchain.com/docs/how_to/load_csv
How to: load data from a directory [4] – https://python.langchain.com/docs/how_to/load_directory
How to: load HTML data [5] – https://python.langchain.com/docs/how_to/load_html
How to: load JSON data [6] – https://python.langchain.com/docs/how_to/load_json
How to: load Markdown data [7] – https://python.langchain.com/docs/how_to/load_markdown
How to: load Microsoft Office data [8] – https://python.langchain.com/docs/how_to/load_microsoft_office
How to: write a custom document loader [9] – https://python.langchain.com/docs/how_to/custom_document_loader
BirdNest Tech Talk
Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.