How to Load Documents into LangChain: From Files to APIs

Learn how to use LangChain's Document Loaders to import data from files, web pages, databases, and APIs; understand the Document object structure; compare load() versus lazy_load(); and follow a step‑by‑step Python example that demonstrates loading, inspecting, and optionally processing documents with an LLM.

BirdNest Tech Talk

This tutorial explains how to bring external data into a LangChain application using the built‑in Document Loaders. When building a Q&A bot over internal documents, the first step is to load those documents into the program.

Document object

LangChain represents a piece of text as a Document object, which contains two core fields:

page_content (str): the main text of the document.

metadata (dict): a dictionary of auxiliary information such as source (file name, URL, or table name), page (PDF page number), and row (CSV row number).

Example class definition:

class Document:
    page_content: str
    metadata: dict = {}
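The real class lives in langchain_core.documents, but its shape can be mirrored with a plain dataclass. The sketch below is a minimal stand-in (the sample strings are placeholders), using default_factory to avoid the shared-mutable-default pitfall of `metadata: dict = {}`:

```python
from dataclasses import dataclass, field

# Minimal stand-in for LangChain's Document: two fields, text plus metadata.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(
    page_content="LangChain ships hundreds of document loaders.",
    metadata={"source": "sample.txt", "page": 1},
)
print(doc.metadata["source"])  # sample.txt
```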

Why use Document Loaders?

Document loaders abstract away the tedious work of extracting text from many complex sources. The LangChain community provides hundreds of loaders covering common formats: plain text (.txt), CSV, JSON, Markdown, PDF, Word (.docx), Excel (.xlsx), web pages, databases, and collaboration tools like Notion or Confluence.

Using a loader typically requires only the source path or URL and a call to .load(), which returns a list of Document objects ready for downstream components such as text splitters or embedding models.
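For instance, CSVLoader emits one Document per CSV row, with the row's columns joined as "key: value" lines. The stdlib-only sketch below mimics that behavior; the helper name load_csv_rows and the dict-based documents are illustrative, not LangChain's actual API:

```python
import csv
import io

def load_csv_rows(text: str, source: str = "data.csv"):
    """Mimic CSVLoader: one document per CSV row, columns as 'key: value' lines."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{k}: {v}" for k, v in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs

docs = load_csv_rows("name,role\nAda,engineer\nLin,analyst")
print(docs[0]["page_content"])  # two 'key: value' lines for the first row
print(docs[1]["metadata"])      # source plus the row index
```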

load() vs lazy_load()

load(): reads the entire source into memory at once and returns a List[Document]. Suitable for small to medium datasets, but can exhaust memory on large collections.

lazy_load(): returns an iterator that yields documents one by one, allowing very large files or datasets to be processed with a minimal memory footprint.
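The memory trade-off can be sketched in plain Python. Here eager_load and lazy_load are generator/list stand-ins for the two loader methods, not the real implementations:

```python
def eager_load(lines):
    # Like load(): materializes every document at once (O(n) memory).
    return [{"page_content": ln, "metadata": {"line": i}} for i, ln in enumerate(lines)]

def lazy_load(lines):
    # Like lazy_load(): yields documents one at a time (O(1) memory).
    for i, ln in enumerate(lines):
        yield {"page_content": ln, "metadata": {"line": i}}

lines = ["alpha", "beta", "gamma"]
all_docs = eager_load(lines)       # full list held in memory
first = next(lazy_load(lines))     # only one document materialized so far
```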

The chapter’s example demonstrates both approaches with the most common loaders (web pages, plain‑text files, CSV).

Example 2: Using TextLoader to read a local text file

The script example_2_file_loader.py shows how to convert a local .txt file into LangChain Document objects and optionally feed them into a simple LLM workflow.

Prepare the file path: the script builds an absolute path to sample.txt in the same directory as the script and prints a friendly error if the file is missing.

Load documents: it creates a TextLoader(file_path, encoding="utf-8") instance and calls .load(). By default the whole file becomes a single Document whose metadata automatically contains the source field.

Inspect results: the script prints the number of loaded documents, the metadata dictionary, and the page content so you can get familiar with the Document structure.

Optional LLM interaction: if the optional langchain-openai package is installed and the OPENAI_API_KEY environment variable is set, the script builds a simple chain that first summarizes the document and then answers a follow‑up question about its recommended use case. If the dependencies are missing, this step is skipped gracefully.

# example_2_file_loader.py
import os
from pprint import pprint

from langchain_community.document_loaders import TextLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

try:
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None

try:
    from langchain_openai import ChatOpenAI
except ImportError:
    ChatOpenAI = None


def main():
    """Demonstrate loading a local text file with TextLoader."""
    file_path = os.path.join(os.path.dirname(__file__), "sample.txt")
    if not os.path.exists(file_path):
        print(f"Error: sample file '{file_path}' does not exist.")
        return

    print(f"--- 1. Preparing to load: {file_path} ---")
    loader = TextLoader(file_path, encoding="utf-8")

    print("\n--- 2. Loading documents... ---")
    docs = loader.load()

    print("\n--- 3. Checking results ---")
    print(f"Loaded {len(docs)} document(s).")
    if docs:
        doc = docs[0]
        pprint(doc.metadata)
        print("\n--- Document content ---")
        print(doc.page_content)

    print("\nSummary: TextLoader is the standard way to load .txt, .py, .md, etc.")
    if not docs:
        return
    if load_dotenv:
        load_dotenv()
    interact_with_llm(docs[0])


def interact_with_llm(doc):
    if ChatOpenAI is None:
        print("\nTip: Install `langchain-openai` to see LLM interaction.")
        return
    if not os.getenv("OPENAI_API_KEY"):
        print("\nTip: Set OPENAI_API_KEY to enable the LLM demo.")
        return

    print("\n--- 4. Using the LLM on the document ---")
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a concise Chinese assistant summarizing documents."),
        ("human", "Summarize the following document in three sentences:\n\n{document}"),
    ])
    model_name = os.getenv("OPENAI_MODEL_NAME", "deepseek-v3")
    llm = ChatOpenAI(model=model_name, temperature=0)
    chain = prompt | llm | StrOutputParser()
    try:
        summary = chain.invoke({"document": doc.page_content})
    except Exception as exc:
        print(f"\n⚠️ LLM call failed: {exc}")
        return
    print("\n--- LLM Summary ---")
    print(summary.strip())

    follow_up_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful Chinese assistant."),
        ("human", "Answer the question based on the document: {question}\n\nDocument:\n{document}"),
    ])
    follow_up_chain = follow_up_prompt | llm | StrOutputParser()
    question = "What is the main usage scenario suggested by this document?"
    try:
        answer = follow_up_chain.invoke({"question": question, "document": doc.page_content})
    except Exception as exc:
        print(f"\n⚠️ LLM call failed: {exc}")
        return
    print("\n--- LLM Q&A ---")
    print(f"Question: {question}")
    print(f"Answer: {answer.strip()}")


if __name__ == "__main__":
    # pip install langchain-community langchain-core langchain-openai python-dotenv
    main()

After running the script, observe the printed document metadata and content. To experience the LLM part, export OPENAI_API_KEY (or place it in a .env file) and ensure the required packages are installed.
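If you use python-dotenv, a .env file placed next to the script might look like this (both values are placeholders; the model name is only a default the script falls back to):

```
OPENAI_API_KEY=your-key-here
OPENAI_MODEL_NAME=deepseek-v3
```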


Written by BirdNest Tech Talk: author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
