How Large Language Models Unlock Field‑Level Data Lineage at Scale
This talk describes how a data platform tackled massive, heterogeneous enterprise data by using large language models and prompt engineering to automatically extract field-level lineage from SQL scripts, reaching over 80% coverage and above 95% accuracy while dramatically cutting impact-analysis time.
Background
Enterprise data platforms must manage multi‑source, massive, and rapidly changing datasets. Field‑level semantics become opaque, making impact analysis manual and error‑prone. Traditional lineage tools that only expose table‑level relationships cannot keep up with the scale.
Why Large Language Models
LLMs provide strong natural‑language understanding and generation, enabling them to interpret ambiguous column names and infer semantic relationships directly from SQL scripts without a separate schema‑extraction step.
System Architecture
Data ingestion & preprocessing: Collect raw SQL files from the data warehouse, normalize line endings, and strip comments.
LLM parsing engine: Calls an LLM (e.g., GPT‑4 or Claude) with a structured prompt that asks the model to output a list of lineage edges (source column → target column) together with business‑meaning tags.
Prompt‑engineering layer: Uses a few‑shot template that includes (a) a short schema excerpt, (b) an example SQL statement, and (c) the desired JSON output format. The prompt also instructs the model to “only emit valid JSON” and to “skip columns that cannot be resolved”.
Task orchestration: A lightweight scheduler reads SQL files, batches them, and invokes the LLM API with rate‑limiting to control cost.
Post‑processing & graph construction: Parses the JSON response, validates each edge against a whitelist of known tables, deduplicates, and stores the result in a Neo4j/JanusGraph instance for downstream queries.
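The pipeline stages above can be sketched end to end. This is a minimal illustration, not the system's actual code: `call_llm` is a hypothetical stub standing in for a real LLM API client, the catalog is a plain set of table names, and a fixed delay stands in for real rate limiting.

```python
import json
import time

def call_llm(prompt: str) -> str:
    # Hypothetical stub for an LLM API call (e.g., GPT-4 or Claude).
    # For illustration it always returns one lineage edge as JSON.
    return json.dumps([{"source": "orders.id",
                        "target": "payments.order_id",
                        "meaning": "order identifier"}])

PROMPT_TEMPLATE = (
    "Extract column-level lineage from the following SQL and output JSON. "
    "Only emit valid JSON; skip columns that cannot be resolved.\n\nSQL:\n{sql}"
)

def parse_lineage(sql_text: str, known_tables: set) -> list:
    """Call the LLM, then keep only edges whose tables exist in the catalog."""
    edges = json.loads(call_llm(PROMPT_TEMPLATE.format(sql=sql_text)))
    valid = []
    for edge in edges:
        src_table = edge["source"].split(".")[0]
        tgt_table = edge["target"].split(".")[0]
        if src_table in known_tables and tgt_table in known_tables:
            valid.append(edge)
    return valid

def run_batch(sql_texts: list, known_tables: set, delay_s: float = 0.0) -> list:
    """Naive orchestration: iterate statements with a fixed inter-call
    delay as a stand-in for rate limiting, then deduplicate edges."""
    all_edges = []
    for sql_text in sql_texts:
        all_edges.extend(parse_lineage(sql_text, known_tables))
        time.sleep(delay_s)
    seen, deduped = set(), []
    for e in all_edges:
        key = (e["source"], e["target"])
        if key not in seen:
            seen.add(key)
            deduped.append(e)
    return deduped
```

In a production setting the scheduler would also batch files per request and retry on API errors; the stub omits both for brevity.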
Implementation Details
Example prompt (simplified):
{
  "instruction": "Extract column-level lineage from the following SQL and output JSON.",
  "example": {
    "sql": "SELECT a.id, b.amount FROM orders a JOIN payments b ON a.id = b.order_id",
    "output": [{"source": "orders.id", "target": "payments.order_id", "meaning": "order identifier"}]
  },
  "sql": "{{SQL_CONTENT}}"
}
To mitigate hallucinations, the pipeline applies two validation layers:
Schema validation: Each reported column is checked against the catalog; mismatches are flagged.
Human‑in‑the‑loop review: A small percentage of parsed statements (or any flagged record) is presented to data engineers for manual confirmation before being committed to the graph.
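The two validation layers can be combined into a single gate: reject edges whose columns are absent from the catalog, and divert a random sample of the rest to human review. A minimal sketch, assuming the catalog is a dict mapping table names to column sets (the function and parameter names here are illustrative, not the system's actual API):

```python
import random

def validate_edges(edges, catalog, review_rate=0.05, rng=None):
    """Split LLM-reported edges into accepted vs. flagged-for-review.

    catalog: dict mapping table name -> set of column names.
    Edges referencing unknown tables or columns are flagged; a random
    fraction (review_rate) of valid edges is also flagged so that data
    engineers can spot-check the model's output.
    """
    rng = rng or random.Random(0)  # seeded for reproducible sampling
    accepted, flagged = [], []
    for edge in edges:
        ok = True
        for col_ref in (edge["source"], edge["target"]):
            table, _, column = col_ref.partition(".")
            if column not in catalog.get(table, set()):
                ok = False  # unknown table or column -> schema mismatch
                break
        if not ok or rng.random() < review_rate:
            flagged.append(edge)
        else:
            accepted.append(edge)
    return accepted, flagged
```

Only accepted edges would be committed to the graph; flagged ones wait for manual confirmation, matching the human-in-the-loop step described above.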
With these safeguards the system achieved >95% correctness on a test set of 5,000 SQL statements.
Operational Results
Field‑level lineage coverage reached 80% of core business reports.
Impact‑analysis time dropped from multi‑day manual effort to a few minutes of automated query.
Data‑asset reliability improved for petabyte‑scale tables, enabling faster feature development.
Challenges and Mitigations
Cost vs. latency: Batch size and model temperature were tuned to balance API spend with acceptable parsing throughput (≈200 SQL files/min).
Accuracy measurement: Precision and recall were computed against a manually curated ground‑truth lineage set.
Hallucination handling: Invalid edges are automatically discarded by schema validation; persistent errors trigger prompt refinement.
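The accuracy measurement reduces to set arithmetic once both predicted and ground-truth lineage are represented as (source, target) edge pairs. A minimal sketch:

```python
def precision_recall(predicted, ground_truth):
    """Compute precision and recall over lineage edges.

    Both arguments are iterables of (source_column, target_column) pairs.
    Precision = correct predicted edges / all predicted edges;
    recall    = correct predicted edges / all ground-truth edges.
    """
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)  # true positives: edges in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```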
Future Directions
Extend the model from lineage discovery to reasoning, e.g., predicting downstream impact of schema changes.
Integrate automated risk alerts into CI/CD pipelines so that any detected breaking change triggers a ticket.
Expose a natural‑language query interface on top of the graph (e.g., “How is metric X calculated?”).
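The core query such an interface would issue is an upstream walk over the lineage graph: given a metric's column, find every column that feeds it. In the real system this would be a Cypher or Gremlin traversal against Neo4j/JanusGraph; as a self-contained illustration, here is the same walk over an in-memory edge list (a sketch, not the platform's implementation):

```python
from collections import defaultdict

def upstream_columns(edges, target):
    """Walk lineage edges backwards to find every column that feeds
    `target` -- i.e., answer "how is this metric calculated?".

    edges: iterable of (source_column, target_column) pairs.
    Returns the set of all direct and transitive source columns.
    """
    parents = defaultdict(list)
    for src, tgt in edges:
        parents[tgt].append(src)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for src in parents[node]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen
```

A natural-language layer would translate a question like “How is metric X calculated?” into this traversal and render the resulting columns back as prose.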
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
