Unlock Structured Data from Any Text with LangExtract – A Free Python LLM Tool
LangExtract is an open-source Python library that uses LLMs to turn messy documents, such as medical records, contracts, novels, or news articles, into structured data with just a few lines of code and an optional visualisation.
I recently came across this very useful tool on GitHub. It is a text-extraction library released under the Apache 2.0 license.
What is LangExtract?
LangExtract is a Python library that extracts structured information (names, dates, relationships, etc.) from unstructured text such as medical records, contracts, news, or novels. It uses Google's Gemini models (or a local model served through Ollama): you describe the task in a natural-language prompt and supply a few examples.
Super simple: No regex or model training required.
Precise provenance: Returns the original text positions, useful for audit scenarios.
Handles long documents: Processes millions of characters with smart chunking and multithreading.
Cool visualisation: Generates an HTML page with highlighted results.
Free and open source: Community-driven and customizable.
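To make the "chunking and multithreading" point concrete, here is a minimal sketch of the general idea. It is not LangExtract's implementation: `chunk_text`, `fake_extract`, and `extract_parallel` are hypothetical names, and `fake_extract` is a plain string search standing in for the model call.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, size: int, overlap: int = 0) -> list:
    """Split text into (start_offset, chunk) pairs of at most `size` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, len(text), step)]


def fake_extract(chunk: str, needle: str = "Romeo") -> list:
    """Stand-in for a model call: return chunk-local spans of `needle`."""
    spans, pos = [], chunk.find(needle)
    while pos != -1:
        spans.append((pos, pos + len(needle)))
        pos = chunk.find(needle, pos + 1)
    return spans


def extract_parallel(text: str, size: int = 40, overlap: int = 5, workers: int = 4) -> list:
    """Run the stand-in extractor over all chunks in a thread pool."""
    chunks = chunk_text(text, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: fake_extract(c[1]), chunks))
    # Shift chunk-local offsets back to document offsets; the set drops
    # duplicates found twice inside the overlapping region.
    found = {
        (start + s, start + e)
        for (start, _), spans in zip(chunks, per_chunk)
        for s, e in spans
    }
    return sorted(found)
```

The overlap only guarantees that a match is not cut in half if it is at least as long as the longest thing being searched for, which is one reason real chunkers tend to split on sentence boundaries instead.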
Code examples
Four scenarios are demonstrated: extracting medication and dosage from medical records, clauses and dates from contracts, characters and relationships from a novel, and company names with industries from news articles.
Example 1 – Medications from medical records
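Scenario: Extract drug names and dosages from a batch of doctor notes.
import langextract as lx

# Define the task
prompt = "Extract medication and dosage from the text."

# Few-shot examples are passed as ExampleData objects via `examples`
# (the form used in the LangExtract README), not embedded in the prompt.
examples = [
    lx.data.ExampleData(
        text="Patient takes Lisinopril 10mg daily, Metformin 500mg twice a day.",
        extractions=[
            lx.data.Extraction(extraction_class="medication", extraction_text="Lisinopril", attributes={"dosage": "10mg"}),
            lx.data.Extraction(extraction_class="medication", extraction_text="Metformin", attributes={"dosage": "500mg"}),
        ],
    )
]

text = "The patient takes Lisinopril 10mg daily for high blood pressure and Metformin 500mg twice a day for diabetes."
result = lx.extract(text, prompt_description=prompt, examples=examples, model_id="gemini-2.5-flash")

# Save the annotated result, then render the HTML visualisation from it
lx.io.save_annotated_documents([result], output_name="meds.jsonl", output_dir=".")
html = lx.visualize("meds.jsonl")
with open("meds.html", "w") as f:
    f.write(html)
print(result.extractions)
Result (illustrative, simplified from the JSONL output):
[
  {"entity": "Lisinopril", "type": "medication", "dosage": "10mg", "start_char": 18, "end_char": 28},
  {"entity": "Metformin", "type": "medication", "dosage": "500mg", "start_char": 68, "end_char": 77}
]
Opening meds.html shows the extracted entities highlighted in the original text.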
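Since provenance is the selling point, it is worth sanity-checking that each reported span really points at the entity. The sketch below assumes records shaped like the simplified listing above (`entity`, `start_char`, `end_char`); `locate` and `check_offsets` are hypothetical helper names, not part of LangExtract.

```python
from typing import Optional


def locate(text: str, entity: str) -> Optional[tuple]:
    """Return (start, end) of the first occurrence of `entity`, or None."""
    start = text.find(entity)
    return None if start == -1 else (start, start + len(entity))


def check_offsets(text: str, records: list) -> list:
    """Return the records whose [start_char, end_char) slice is not the entity."""
    return [
        r for r in records
        if text[r["start_char"]:r["end_char"]] != r["entity"]
    ]


sample = "Patient takes Lisinopril 10mg daily, Metformin 500mg twice a day."
records = [
    {"entity": "Lisinopril", "start_char": 14, "end_char": 24},
    {"entity": "Metformin", "start_char": 37, "end_char": 46},
]
mismatches = check_offsets(sample, records)  # an empty list means every span lines up
```

A check like this is cheap to run over a whole batch and catches off-by-one spans before they reach an audit report.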
Example 2 – Clauses and dates from contracts
Scenario: Pull clause titles and effective dates from legal contracts.
import langextract as lx

prompt = "Extract clause names and effective dates."

# Few-shot example passed as an ExampleData object
examples = [
    lx.data.ExampleData(
        text="Rental clause effective 2023-01-01, renewal clause effective 2024-01-01.",
        extractions=[
            lx.data.Extraction(extraction_class="clause", extraction_text="Rental clause", attributes={"effective_date": "2023-01-01"}),
            lx.data.Extraction(extraction_class="clause", extraction_text="renewal clause", attributes={"effective_date": "2024-01-01"}),
        ],
    )
]

text = "The confidentiality clause takes effect on March 1, 2025, and the breach clause takes effect on April 1, 2025."
result = lx.extract(text, prompt_description=prompt, examples=examples, model_id="gemini-2.5-flash")

lx.io.save_annotated_documents([result], output_name="contract.jsonl", output_dir=".")
html = lx.visualize("contract.jsonl")
with open("contract.html", "w") as f:
    f.write(html)
print(result.extractions)
Result (illustrative, simplified from the JSONL output):
[
  {"entity": "confidentiality clause", "type": "clause", "effective_date": "March 1, 2025", "start_char": 4, "end_char": 26},
  {"entity": "breach clause", "type": "clause", "effective_date": "April 1, 2025", "start_char": 66, "end_char": 79}
]
Opening contract.html highlights the clauses and dates.
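Extracted dates come back in whatever form the contract used, so a small normalisation step is usually needed before they can be sorted or compared. Here is a minimal sketch, assuming only three spellings (ISO, "March 1, 2025", and the Chinese 年/月/日 form); `to_iso` is a hypothetical helper, and the month-name format relies on the default English locale.

```python
import re
from datetime import date, datetime


def to_iso(raw: str) -> str:
    """Normalise a handful of date spellings to ISO 8601 (YYYY-MM-DD)."""
    raw = raw.strip()
    # Chinese-style dates such as 2025年3月1日
    match = re.fullmatch(r"(\d{4})年(\d{1,2})月(\d{1,2})日", raw)
    if match:
        year, month, day = map(int, match.groups())
        return date(year, month, day).isoformat()
    # ISO dates and English month-name dates
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")
```

Anything outside those three formats raises rather than guessing, which is the safer default for legal text.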
Example 3 – Characters and relationships from a novel
Scenario: Analyse character relationships in "Romeo and Juliet".
import langextract as lx

prompt = "Extract characters and their relationships."

# Few-shot example passed as an ExampleData object
examples = [
    lx.data.ExampleData(
        text="Romeo loves Juliet, Juliet is a Capulet.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="Romeo", attributes={"relation": "loves", "target": "Juliet"}),
            lx.data.Extraction(extraction_class="character", extraction_text="Juliet", attributes={"relation": "belongs to", "target": "Capulet family"}),
        ],
    )
]

text = (
    "Romeo deeply loves Juliet, but the Montague and Capulet families are sworn enemies. "
    "Juliet's cousin Tybalt got into a fight with Romeo."
)
result = lx.extract(
    text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    max_workers=4,          # parallel workers
    extraction_passes=2,    # parameter name used in the LangExtract README
)

lx.io.save_annotated_documents([result], output_name="romeo.jsonl", output_dir=".")
html = lx.visualize("romeo.jsonl")
with open("romeo.html", "w") as f:
    f.write(html)
print(result.extractions)
Result (illustrative, simplified from the JSONL output):
[
  {"entity": "Romeo", "type": "character", "relation": "loves", "target": "Juliet", "start_char": 0, "end_char": 5},
  {"entity": "Juliet", "type": "character", "relation": "belongs to", "target": "Capulet family", "start_char": 19, "end_char": 25},
  {"entity": "Tybalt", "type": "character", "relation": "conflict", "target": "Romeo", "start_char": 100, "end_char": 106}
]
Opening romeo.html visualises the characters and their relationships.
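Relationship records like these become more useful once they are turned into a graph. The following is a small sketch that assumes records with `entity`, `relation`, and `target` keys, as in the simplified listing above; `build_graph` and `connected_to` are hypothetical helpers.

```python
from collections import defaultdict


def build_graph(records: list) -> dict:
    """Map each character to a list of (relation, target) edges."""
    graph = defaultdict(list)
    for record in records:
        graph[record["entity"]].append((record["relation"], record["target"]))
    return dict(graph)


def connected_to(records: list, name: str) -> list:
    """Everyone linked to `name`, regardless of edge direction."""
    linked = set()
    for record in records:
        if record["entity"] == name:
            linked.add(record["target"])
        elif record["target"] == name:
            linked.add(record["entity"])
    return sorted(linked)


relations = [
    {"entity": "Romeo", "relation": "loves", "target": "Juliet"},
    {"entity": "Juliet", "relation": "belongs to", "target": "Capulet family"},
    {"entity": "Tybalt", "relation": "conflict", "target": "Romeo"},
]
graph = build_graph(relations)
```

For a whole novel the same structure can be handed to a graph library for layout, but a plain dictionary is enough to answer "who is connected to whom".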
Example 4 – Companies and industries from news
Scenario: Extract company names and their industries from financial news.
import langextract as lx

prompt = "Extract company names and industries."

# Few-shot example passed as an ExampleData object
examples = [
    lx.data.ExampleData(
        text="Apple releases new iPhone, Tesla expands battery production.",
        extractions=[
            lx.data.Extraction(extraction_class="company", extraction_text="Apple", attributes={"industry": "Technology"}),
            lx.data.Extraction(extraction_class="company", extraction_text="Tesla", attributes={"industry": "Automotive/Energy"}),
        ],
    )
]

text = "Huawei releases a new chip, and ByteDance moves into AI education."
result = lx.extract(text, prompt_description=prompt, examples=examples, model_id="gemini-2.5-flash")

lx.io.save_annotated_documents([result], output_name="news.jsonl", output_dir=".")
html = lx.visualize("news.jsonl")
with open("news.html", "w") as f:
    f.write(html)
print(result.extractions)
Result (illustrative, simplified from the JSONL output):
[
  {"entity": "Huawei", "type": "company", "industry": "technology", "start_char": 0, "end_char": 6},
  {"entity": "ByteDance", "type": "company", "industry": "technology/AI", "start_char": 32, "end_char": 41}
]
Opening news.html highlights the companies and their industries.
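For a feel of what the highlighted page boils down to, here is a rough sketch of span highlighting with nothing but the standard library. This is not how `lx.visualize` is implemented; `highlight` is a hypothetical function that wraps each character span in a `<mark>` tag.

```python
import html


def highlight(text: str, spans: list) -> str:
    """Wrap each (start, end, label) span in <mark>; spans must not overlap."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        # Escape the untouched text before the span, then the span itself
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}</mark>'
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


news = "Apple releases new iPhone, Tesla expands battery production."
marked = highlight(news, [(0, 5, "Technology"), (27, 32, "Automotive/Energy")])
```

Escaping both the text and the label matters here, because source documents can contain characters such as `<` or `&` that would otherwise break the page.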
How to get started in 5 minutes
Clone the repository:
git clone https://github.com/google/langextract.git && cd langextract
Install dependencies:
pip install -e ".[dev]"   # for development
pip install -e ".[test]"  # for testing
Set the API key (required for Gemini):
echo 'LANGEXTRACT_API_KEY=YOUR_KEY' >> .env && echo '.env' >> .gitignore
Run any of the examples above, adjust the text and prompt, and view the generated HTML visualisation.
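If you want to confirm the key is actually visible to your script before spending an API call, a tiny check helps. The sketch below is an assumption-laden illustration, not LangExtract's own loading logic: `parse_env` and `resolve_api_key` are hypothetical helpers that read `KEY=value` lines and prefer the shell environment over the file.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(environ: dict, dotenv_text: str = "",
                    name: str = "LANGEXTRACT_API_KEY") -> str:
    """Prefer the process environment, then fall back to .env contents."""
    key = environ.get(name) or parse_env(dotenv_text).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in the environment or in .env")
    return key
```

In a real script you would pass `os.environ` and the contents of the `.env` file, for example `resolve_api_key(os.environ, open(".env").read())`.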
Tips and caveats
Medical use: LangExtract is a demonstration tool, not a diagnostic system.
Cost: Using Gemini may incur API fees; local Ollama models are free.
Large documents: A Tier 2 Gemini quota is recommended to avoid rate limiting.
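Whatever quota you have, rate limits can still bite on big batches, and the usual client-side answer is to retry with exponential backoff. Here is a generic sketch of that pattern; `RateLimitError` is a hypothetical stand-in for whatever exception your API client raises, and `with_backoff` is not a LangExtract function.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises when throttled."""


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call `call()` and retry on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus a little jitter to avoid retry bursts
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The `sleep` parameter is injectable so the behaviour can be tested without actually waiting.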
Conclusion
LangExtract acts as a smart assistant that converts chaotic text into tidy tables and visualisations, making it valuable for healthcare, legal, literary, and financial workflows.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.
