How to Turn LLM Text into Structured Data with LangChain Output Parsers

This article explains why LLMs output plain text, introduces LangChain output parsers as the bridge to structured data, details their workflow, reviews built‑in parsers, and walks through a complete Python example that builds a prompt‑model‑parser chain to generate a JSON‑based joke.

BirdNest Tech Talk

Why LLM Output Needs Parsing

Large language models (LLMs) emit raw text strings that are easy for humans to read but hard for programs to consume directly; developers often need JSON objects, lists, dates, or custom class instances for downstream logic.

What Output Parsers Do

Output parsers are components that convert the raw textual response of an LLM into structured data, acting as the bridge between the model and application code.

Typical Parser Workflow

Provide format instructions: Most parsers expose a get_format_instructions() method that returns a textual directive (e.g., "return your answer as a markdown JSON code block"). This directive must be inserted into the prompt so the model knows the required format.

Parse the output: After the model returns a string, the parser's parse() method attempts to convert it into the desired structure, raising OutputParserException if the conversion fails.
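The two-method contract can be illustrated with a minimal hand-rolled parser. This is a toy sketch of the pattern, not LangChain's actual base class (which raises OutputParserException rather than ValueError):

```python
import json

class ToyJsonParser:
    """Toy illustration of the output-parser contract: one method tells
    the model how to format its reply, the other converts the reply."""

    def get_format_instructions(self) -> str:
        # This text gets inserted into the prompt.
        return "Return your answer as a JSON object, with no extra text."

    def parse(self, text: str) -> dict:
        # Convert the raw model reply into structured data,
        # raising if the reply does not match the expected format.
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            raise ValueError(f"Could not parse model output: {e}")

parser = ToyJsonParser()
print(parser.get_format_instructions())
print(parser.parse('{"setup": "Why do programmers prefer dark mode?"}'))
```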

Common Built‑In Parsers in LangChain

StrOutputParser: Returns the model output unchanged as a string.

JsonOutputParser: Parses a JSON string into a Python dict.

PydanticOutputParser: Takes a Pydantic model definition and produces an instance of that model, adding validation and type hints.

CommaSeparatedListOutputParser: Turns a comma-separated string like "red, green, blue" into a Python list.

DatetimeOutputParser: Extracts date-time information from the text.

Using an Output Parser in an LCEL Chain

Because an output parser implements the Runnable interface, it can be placed as the final step of a LangChain Expression Language (LCEL) chain:

chain = prompt | model | output_parser

When the chain runs, the prompt and model generate a string, which is then handed to the parser to produce a structured object.

Example: Generating a Structured Joke with JsonOutputParser

The following script demonstrates the full process.

Environment preparation: Load .env and read OPENAI_API_KEY; abort with a clear error if the key is missing.

Define the data structure: Create a Pydantic model Joke with fields setup and punchline, each annotated with a Chinese description for readability.

Build the parser: Instantiate JsonOutputParser(pydantic_object=Joke), which automatically generates format instructions that enforce the expected JSON schema.

Create the prompt template: Use PromptTemplate and inject the parser's format_instructions via partial_variables, ensuring every model call receives the same structural constraint.

Assemble the LCEL chain: Combine prompt | ChatOpenAI | parser. The model (e.g., deepseek-v3 with temperature=0.7) outputs a JSON string that the parser converts to a Python dict matching the Joke schema, eliminating the need for manual json.loads.

Run and handle robustness: Invoke the chain with chain.invoke({"query": "给我讲一个关于程序员的笑话"}), print the result type and content, and access fields like result["setup"]. Wrap the call in try/except to catch OutputParserException and report that the model may have ignored the format instructions.

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.exceptions import OutputParserException
from pydantic import BaseModel, Field

# Environment preparation: load .env and verify the API key is present.
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in a .env file.")

# Data structure: the JSON schema the model must follow.
class Joke(BaseModel):
    setup: str = Field(description="笑话的铺垫部分")      # the joke's setup
    punchline: str = Field(description="笑话的点睛之笔")  # the punchline

# Parser: derives format instructions from the Joke schema.
parser = JsonOutputParser(pydantic_object=Joke)
format_instructions = parser.get_format_instructions()

# Prompt template: {format_instructions} is filled once via partial_variables,
# so every call carries the same structural constraint.
prompt_template_str = """
根据用户的问题,生成一个笑话。
{format_instructions}

用户问题: {query}
"""

prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["query"],
    partial_variables={"format_instructions": format_instructions},
)

model = ChatOpenAI(model="deepseek-v3", temperature=0.7)
chain = prompt | model | parser

query = "给我讲一个关于程序员的笑话"  # "Tell me a joke about programmers"
try:
    result = chain.invoke({"query": query})
    print("输出类型:", type(result))  # <class 'dict'>
    print("笑话的铺垫:", result.get('setup'))
    print("笑话的点睛:", result.get('punchline'))
except OutputParserException as e:
    print("执行链时出错:", e)
    print("这可能是因为模型没有严格遵循格式指令,导致JSON解析失败。")

References

How to: parse text from message objects – LangChain docs

How to: use output parsers to parse an LLM response into structured format – LangChain docs

How to: parse JSON output – LangChain docs

How to: parse XML output – LangChain docs

How to: parse YAML output – LangChain docs

How to: retry when output parsing errors occur – LangChain docs

How to: try to fix errors in output parsing – LangChain docs

Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
