How Qichacha Uses Large Language Models for Field‑Level Data Lineage

This article details Qichacha's technical journey of applying large language models to resolve field‑level data lineage challenges in a complex, multi‑source data environment, describing the motivation, architecture, practical implementation, engineering trade‑offs, and measurable outcomes.

DataFunSummit

Background: Why Field Lineage Matters

Data governance can be likened to a war map, with lineage serving as the battlefield diagram. Qichacha faces three typical scenarios: assessing the impact of a change before deployment, investigating abnormal metrics, and auditing sensitive data. Each one requires visible field‑level lineage; without it, engineers fall back on costly manual tracing.

Why Traditional Tools Fall Short

Conventional lineage tools handle SQL well but struggle with UDFs, custom code, and semantic reasoning. They require manual annotation, produce uncertain results, and generate outputs unreadable by business users, making them unsuitable for Qichacha's heterogeneous, non‑standard data sources.

LLM‑Powered Opportunity

Large language models excel at natural‑language and code understanding, supporting multi‑language parsing (Java, Python, etc.) and reasoning over undocumented fields. They can infer lineage by reading comments, naming conventions, and surrounding code, effectively acting as an expert reviewer.

Core Architecture Overview

The solution consists of four layers:

Data collection & preprocessing: ingest task metadata, SQL scripts, code, and ETL logs from the data platform.

LLM parsing engine: feed collected artifacts to a prompt‑engineered LLM (or specialized Skill) that produces structured lineage output.

Post‑processing & graph construction: validate, deduplicate, align with the metadata dictionary, and store field‑level lineage for graph queries.

Verification & fallback: automatic rule checks, human spot‑checks, and a confidence‑score mechanism (MCP) to mitigate hallucinations.
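The post‑processing layer described above can be sketched in a few lines. This is a minimal illustration, not Qichacha's implementation: the `FieldEdge` structure, the `normalize` and `postprocess` helpers, the dictionary keys, and the sample column names are all hypothetical, and the metadata dictionary is assumed to be a plain set of qualified column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEdge:
    """One field-level lineage edge (hypothetical structure)."""
    source: str   # qualified source column, e.g. "ods.company.reg_capital"
    target: str   # qualified target column
    task_id: str  # job that performs the transformation


def normalize(name: str) -> str:
    """Trim and lower-case each part of a qualified column name."""
    return ".".join(part.strip().lower() for part in name.split("."))


def postprocess(raw_edges, known_columns):
    """Deduplicate edges and align them with the metadata dictionary.

    Edges whose endpoints are not in `known_columns` are set aside for
    review instead of being stored in the lineage graph.
    """
    accepted, rejected, seen = [], [], set()
    for raw in raw_edges:
        edge = FieldEdge(normalize(raw["source"]),
                         normalize(raw["target"]),
                         raw["task_id"])
        if edge in seen:
            continue  # same edge reported more than once
        seen.add(edge)
        if edge.source in known_columns and edge.target in known_columns:
            accepted.append(edge)
        else:
            rejected.append(edge)
    return accepted, rejected


# Assumed sample data: one duplicate and one edge with an unknown column.
known = {"ods.company.reg_capital", "dw.company_profile.capital"}
raw = [
    {"source": "ODS.company.reg_capital",
     "target": "dw.company_profile.capital", "task_id": "job_1"},
    {"source": "ods.company.reg_capital ",
     "target": "dw.company_profile.capital", "task_id": "job_1"},
    {"source": "ods.company.ghost",
     "target": "dw.company_profile.capital", "task_id": "job_1"},
]
accepted, rejected = postprocess(raw, known)
```

Keeping rejected edges in a separate list, rather than discarding them, gives the verification layer something concrete to route to human spot‑checks.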

Practical Implementation & Challenges

Challenge 1 – Cost & Efficiency: Batch parsing consumes a large number of tokens. The team mitigates this by prioritizing critical dimensions, trimming irrelevant code during preprocessing, using Flink UI‑assisted graph hints, running requests through asynchronous queues, caching results, and re‑parsing only code that has changed.
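One way to combine caching with "update only changed code" is to key results on a content hash of each task's code. The sketch below is an assumption about how such a cache might look, not the team's actual design; `LineageCache`, `parse_fn`, and the `llm_calls` counter are hypothetical names, and `parse_fn` stands in for whatever function calls the model.

```python
import hashlib


class LineageCache:
    """Skip the LLM call when a task's code has not changed (illustrative)."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn  # hypothetical callable: code -> lineage
        self._store = {}           # task_id -> (digest, lineage result)
        self.llm_calls = 0         # counts how often the model was invoked

    @staticmethod
    def _digest(code: str) -> str:
        # Ignore trailing whitespace so that edits which only touch
        # line endings do not force a re-parse.
        canonical = "\n".join(line.rstrip() for line in code.splitlines())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_lineage(self, task_id: str, code: str):
        digest = self._digest(code)
        cached = self._store.get(task_id)
        if cached is not None and cached[0] == digest:
            return cached[1]       # unchanged code: reuse stored result
        self.llm_calls += 1
        result = self._parse_fn(code)
        self._store[task_id] = (digest, result)
        return result


def fake_parse(code: str):
    """Stand-in for the LLM call; returns a fixed edge list."""
    return [("src.a", "dst.b")]


cache = LineageCache(fake_parse)
cache.get_lineage("job_1", "SELECT a AS b FROM src")
cache.get_lineage("job_1", "SELECT a AS b FROM src   ")   # whitespace only
cache.get_lineage("job_1", "SELECT a + 1 AS b FROM src")  # real change
```

In this sketch only the first and third calls reach the model, so token spend scales with the number of changed tasks rather than the total number of tasks.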

Challenge 2 – Accuracy & Evaluation: Model outputs vary from run to run, so quality has to be measured with a quantitative framework. A rule‑based validator intercepts obvious errors (e.g., missing fields, type mismatches), human reviewers audit core paths, and a confidence score plus multi‑model A/B validation improve reliability.
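A rule‑based validator of the kind mentioned above might look like the following sketch. The mapping keys (`source`, `target`, `transform`), the `schema` dictionary of column types, and the rule that a type mismatch only matters for direct copies are assumptions made for illustration; the source does not specify the actual rules.

```python
def validate_mapping(mapping: dict, schema: dict) -> list:
    """Return a list of rule violations for one proposed field mapping.

    `schema` maps qualified column names to type names (assumed format).
    An empty list means the mapping passed every rule.
    """
    errors = []

    # Rule 1: required fields must be present and non-empty.
    for key in ("source", "target"):
        if not mapping.get(key):
            errors.append(f"missing field: {key}")
    if errors:
        return errors

    # Rule 2: both columns must exist in the schema.
    src_type = schema.get(mapping["source"])
    tgt_type = schema.get(mapping["target"])
    if src_type is None:
        errors.append(f"unknown source column: {mapping['source']}")
    if tgt_type is None:
        errors.append(f"unknown target column: {mapping['target']}")

    # Rule 3: a direct copy (no transform) should preserve the type.
    if (src_type and tgt_type and not mapping.get("transform")
            and src_type != tgt_type):
        errors.append(f"type mismatch: {src_type} -> {tgt_type}")
    return errors


schema = {"ods.t.amount": "decimal", "dw.t.amount": "decimal",
          "dw.t.name": "string"}
ok = validate_mapping({"source": "ods.t.amount",
                       "target": "dw.t.amount"}, schema)
bad_type = validate_mapping({"source": "ods.t.amount",
                             "target": "dw.t.name"}, schema)
missing = validate_mapping({"source": "ods.t.amount"}, schema)
```

Cheap checks like these run before any human review, so reviewers only see mappings that are at least structurally plausible.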

Challenge 3 – Hallucinations: Models may fabricate mappings or link semantically similar but unrelated fields. The workflow forces the model to cite specific code lines, repeats parsing for consistency, and flags divergent results for manual confirmation.
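The two safeguards above, cited code lines and repeated parsing, can be checked mechanically. The sketch below assumes each model run returns a list of edges with `source`, `target`, and a 1‑based `cited_lines` list; those keys, both function names, and the substring‑based grounding test are hypothetical simplifications rather than the team's actual checks.

```python
from collections import Counter


def citation_is_grounded(edge: dict, code_lines: list) -> bool:
    """Check that cited lines exist and mention one of the edge's fields.

    Uses a plain substring match on the short column names, which is a
    deliberately crude heuristic.
    """
    cited = edge.get("cited_lines", [])
    if not cited:
        return False
    if any(not 1 <= n <= len(code_lines) for n in cited):
        return False  # the model cited a line that does not exist
    names = {edge["source"].split(".")[-1], edge["target"].split(".")[-1]}
    return any(name in code_lines[n - 1] for n in cited for name in names)


def consistency_vote(runs: list, min_agreement: float = 1.0):
    """Split edges into stable and divergent sets across repeated runs.

    An edge is stable if the share of runs that produced it is at least
    `min_agreement`; everything else is flagged for manual confirmation.
    """
    counts = Counter()
    for run in runs:
        # A set ensures each edge is counted at most once per run.
        counts.update({(e["source"], e["target"]) for e in run})
    stable = {pair for pair, c in counts.items()
              if c / len(runs) >= min_agreement}
    divergent = set(counts) - stable
    return stable, divergent


code_lines = ["capital = row['reg_capital'] * 10000",
              "out['capital'] = capital"]
edge = {"source": "ods.company.reg_capital",
        "target": "dw.profile.capital", "cited_lines": [1]}
grounded = citation_is_grounded(edge, code_lines)

a = {"source": "s.a", "target": "t.a"}
b = {"source": "s.b", "target": "t.b"}
c = {"source": "s.c", "target": "t.c"}
stable, divergent = consistency_vote([[a, b], [a], [a, c]])
```

Lowering `min_agreement` trades fewer manual confirmations for a higher risk of accepting an edge that only some runs produced.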

Results and Impact

Field‑level lineage for code‑driven tasks moved from "invisible" to "partially visible," covering thousands of real‑time and tens of thousands of offline jobs. Benefits include rapid upstream impact analysis, metric root‑cause tracing, and automated sensitive‑data audit paths, turning post‑mortem debugging into proactive governance.
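Once field‑level edges are stored as a graph, impact analysis and root‑cause tracing both reduce to graph traversal. The following is a minimal in‑memory sketch using breadth‑first search; the source does not say which graph store or query language Qichacha uses, and the function names and sample columns here are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges, reverse=False):
    """Build an adjacency map from (source, target) column pairs.

    With reverse=True the edges point from target to source, which
    supports tracing a metric back to its inputs.
    """
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            source, target = target, source
        graph[source].add(target)
    return graph


def reachable_fields(graph, start):
    """Breadth-first search: every field reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Assumed sample lineage: ods -> dwd -> two downstream report fields.
edges = [
    ("ods.company.reg_capital", "dwd.company.capital"),
    ("dwd.company.capital", "ads.report.total_capital"),
    ("dwd.company.capital", "ads.risk.capital_band"),
    ("ods.company.name", "ads.report.company_name"),
]

# Impact analysis: what is affected if the upstream field changes?
impacted = reachable_fields(build_graph(edges), "ods.company.reg_capital")

# Root-cause tracing: which fields feed a suspicious metric?
origins = reachable_fields(build_graph(edges, reverse=True),
                           "ads.report.total_capital")
```

The same traversal serves both directions; only the edge orientation changes, which is why a single stored graph can back impact alerts, metric investigation, and audit paths.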

Future Outlook

Planned enhancements aim for smarter anomaly propagation inference, CI/CD integration for pre‑commit impact alerts, and natural‑language query interfaces powered by LLMs to let business users ask questions like "How is this metric calculated?".

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
