How Qichacha Uses Large Language Models for Field‑Level Data Lineage
This article details Qichacha's technical journey of applying large language models to resolve field‑level data lineage challenges in a complex, multi‑source data environment, describing the motivation, architecture, practical implementation, engineering trade‑offs, and measurable outcomes.
Background: Why Field Lineage Matters
The article likens data governance to a war map, with lineage as the battlefield diagram. Qichacha faces three typical scenarios (pre‑deployment change impact analysis, investigation of abnormal metrics, and sensitive‑data audits), and each one needs visible field‑level lineage to avoid costly manual tracing.
Why Traditional Tools Fall Short
Conventional lineage tools handle SQL well but struggle with UDFs, custom code, and semantic reasoning. They require manual annotation, produce uncertain results, and generate outputs unreadable by business users, making them unsuitable for Qichacha's heterogeneous, non‑standard data sources.
LLM‑Powered Opportunity
Large language models excel at natural‑language and code understanding, supporting multi‑language parsing (Java, Python, etc.) and reasoning over undocumented fields. They can infer lineage by reading comments, naming conventions, and surrounding code, effectively acting as an expert reviewer.
Core Architecture Overview
The solution consists of four layers:
Data collection & preprocessing: ingest task metadata, SQL scripts, code, and ETL logs from the data platform.
LLM parsing engine: feed collected artifacts to a prompt‑engineered LLM (or specialized Skill) that produces structured lineage output.
Post‑processing & graph construction: validate, deduplicate, align with the metadata dictionary, and store field‑level lineage for graph queries.
Verification & fallback: automatic rule checks, human spot‑checks, and a confidence‑score mechanism (MCP) to mitigate hallucinations.
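The parsing and post‑processing layers can be pictured with a minimal Python sketch. Everything named here is an assumption for illustration: the JSON shape returned by the model, the `FieldEdge` record, and the `layer.table.column` naming are hypothetical and are not taken from Qichacha's implementation. The sketch parses the model's structured output, drops edges whose fields are missing from the metadata dictionary, and deduplicates by keeping the highest‑confidence edge per source/target pair.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEdge:
    """One field-level lineage edge (hypothetical record shape)."""
    source: str        # assumed naming: "layer.table.column"
    target: str
    transform: str
    confidence: float


def normalize(name: str) -> str:
    # Align spelling with the metadata dictionary, assumed to hold lowercase names.
    return name.strip().lower()


def parse_llm_output(raw: str) -> list:
    """Turn the model's JSON array (assumed format) into FieldEdge records."""
    return [
        FieldEdge(
            source=normalize(item["source"]),
            target=normalize(item["target"]),
            transform=item.get("transform", ""),
            confidence=float(item.get("confidence", 0.0)),
        )
        for item in json.loads(raw)
    ]


def postprocess(edges, dictionary):
    """Drop edges with unknown fields, then keep the best edge per pair."""
    best = {}
    for edge in edges:
        if edge.source not in dictionary or edge.target not in dictionary:
            continue  # field is not in the metadata dictionary
        key = (edge.source, edge.target)
        if key not in best or edge.confidence > best[key].confidence:
            best[key] = edge
    return sorted(best.values(), key=lambda e: (e.source, e.target))


# Example run with made-up model output and a made-up dictionary.
raw = """[
  {"source": "ODS.Company.Name ", "target": "dwd.company.name",
   "transform": "trim", "confidence": 0.7},
  {"source": "ods.company.name", "target": "dwd.company.name",
   "transform": "trim", "confidence": 0.9},
  {"source": "ods.ghost.col", "target": "dwd.company.name",
   "transform": "copy", "confidence": 0.9}
]"""
dictionary = {"ods.company.name", "dwd.company.name"}
edges = postprocess(parse_llm_output(raw), dictionary)
```

The surviving edges could then be written to whatever graph store backs the lineage queries. The article does not name that store.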
Practical Implementation & Challenges
Challenge 1 – Cost & Efficiency: Batch parsing consumes a large number of tokens. The team mitigates this by prioritizing critical dimensions, trimming irrelevant code during preprocessing, using Flink UI‑assisted graph hints, routing requests through asynchronous queues, caching results, and re‑parsing only code that has changed.
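One common way to re‑parse only changed code is to key a cache on a hash of the trimmed source. The sketch below assumes that approach; the article does not say how Qichacha detects changes. `LineageCache` and `strip_noise` are hypothetical names, and the trimming rule (dropping blank lines and trailing whitespace) stands in for whatever preprocessing the team actually applies.

```python
import hashlib


def strip_noise(code: str) -> str:
    """Remove blank lines and trailing whitespace before hashing and parsing.

    This is an assumed trimming rule; comments are kept because they can
    carry lineage hints.
    """
    return "\n".join(line.rstrip() for line in code.splitlines() if line.strip())


class LineageCache:
    """Call the (expensive) parser only when a task's code has changed."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn   # stands in for the LLM call
        self._store = {}            # task_id -> (digest, result)
        self.calls = 0              # number of real parser invocations

    def get(self, task_id: str, code: str):
        trimmed = strip_noise(code)
        digest = hashlib.sha256(trimmed.encode("utf-8")).hexdigest()
        cached = self._store.get(task_id)
        if cached is not None and cached[0] == digest:
            return cached[1]        # unchanged code: reuse the stored lineage
        self.calls += 1
        result = self._parse_fn(trimmed)
        self._store[task_id] = (digest, result)
        return result


# Example with a stand-in parser that just counts lines.
cache = LineageCache(parse_fn=lambda code: len(code.splitlines()))
first = cache.get("job_1", "SELECT a\n\nFROM t   \n")
second = cache.get("job_1", "SELECT a\nFROM t\n")        # whitespace-only change
third = cache.get("job_1", "SELECT a, b\nFROM t\n")      # real change
```

Hashing the trimmed text rather than the raw file means whitespace‑only commits do not trigger a new model call.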
Challenge 2 – Accuracy & Evaluation: Model outputs vary from run to run, so the team needs a quantitative quality framework. A rule‑based validator intercepts obvious errors (e.g., missing fields, type mismatches), human reviewers audit core paths, and a confidence score combined with multi‑model A/B validation improves reliability.
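A rule‑based validator of the kind described above might look like the following sketch. The edge format, the schema mapping, and the set of transforms treated as type‑converting are all assumptions made for illustration, not details from the article.

```python
# Transforms assumed to change the data type legitimately (hypothetical list).
CONVERTING_TRANSFORMS = frozenset({"cast", "to_string", "parse_date"})


def validate_edge(edge: dict, schema: dict) -> list:
    """Return a list of rule violations for one lineage edge.

    `schema` maps a fully qualified field name to its declared type.
    An empty list means the edge passed every rule.
    """
    errors = []
    # Rule 1: both endpoints must exist in the metadata dictionary.
    for side in ("source", "target"):
        name = edge.get(side)
        if name not in schema:
            errors.append(f"unknown {side} field: {name}")
    if errors:
        return errors
    # Rule 2: types must match unless the transform converts them.
    src_type = schema[edge["source"]]
    tgt_type = schema[edge["target"]]
    if src_type != tgt_type and edge.get("transform") not in CONVERTING_TRANSFORMS:
        errors.append(
            f"type mismatch: {edge['source']} ({src_type}) -> "
            f"{edge['target']} ({tgt_type})"
        )
    return errors


schema = {"ods.a.x": "string", "dwd.b.y": "string", "dwd.b.z": "bigint"}
ok = validate_edge(
    {"source": "ods.a.x", "target": "dwd.b.y", "transform": "copy"}, schema)
mismatch = validate_edge(
    {"source": "ods.a.x", "target": "dwd.b.z", "transform": "copy"}, schema)
converted = validate_edge(
    {"source": "ods.a.x", "target": "dwd.b.z", "transform": "cast"}, schema)
missing = validate_edge(
    {"source": "ods.a.nope", "target": "dwd.b.y", "transform": "copy"}, schema)
```

Edges that return a non‑empty list would be intercepted before reaching the graph. Edges that pass can still go on to human review or confidence scoring.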
Challenge 3 – Hallucinations: Models may fabricate mappings or link fields that are semantically similar but unrelated. The workflow forces the model to cite specific code lines, repeats parsing to check consistency, and flags divergent results for manual confirmation.
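The two safeguards, citation checking and repeated parsing, can be sketched as below. Both functions are hypothetical. The citation check is a deliberately crude heuristic (the cited line must exist and mention the source column name), and the agreement threshold is an assumed parameter rather than a value from the article.

```python
from collections import Counter


def citation_holds(code: str, edge: dict) -> bool:
    """Check that the line the model cited exists and mentions the source column."""
    lines = code.splitlines()
    line_no = edge["line"]              # 1-based line number cited by the model
    if not 1 <= line_no <= len(lines):
        return False                    # model cited a line that does not exist
    column = edge["source"].rsplit(".", 1)[-1]
    return column.lower() in lines[line_no - 1].lower()


def consistency(runs, min_agreement=1.0):
    """Split edges into accepted and flagged sets across repeated parses.

    `runs` is a list of edge collections, one per parse of the same code.
    An edge is accepted when it appears in at least `min_agreement` of runs;
    everything else is flagged for manual confirmation.
    """
    if not runs:
        return set(), set()
    counts = Counter(edge for run in runs for edge in set(run))
    accepted = {e for e, c in counts.items() if c / len(runs) >= min_agreement}
    flagged = set(counts) - accepted
    return accepted, flagged


code = "INSERT INTO dwd.company\nSELECT name AS company_name\nFROM ods.company"
good = citation_holds(code, {"source": "ods.company.name", "line": 2})
wrong_line = citation_holds(code, {"source": "ods.company.name", "line": 3})
out_of_range = citation_holds(code, {"source": "ods.company.name", "line": 9})

runs = [
    [("a", "b"), ("c", "d")],
    [("a", "b")],
    [("a", "b"), ("e", "f")],
]
accepted, flagged = consistency(runs)
```

Lowering `min_agreement` trades reviewer workload against the risk of accepting an edge that only some runs produced.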
Results and Impact
Field‑level lineage for code‑driven tasks moved from "invisible" to "partially visible," covering thousands of real‑time and tens of thousands of offline jobs. Benefits include rapid upstream impact analysis, metric root‑cause tracing, and automated sensitive‑data audit paths, turning post‑mortem debugging into proactive governance.
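Once field‑level edges are stored, impact analysis and root‑cause tracing both reduce to graph traversal. The sketch below assumes an in‑memory adjacency map and hypothetical field names. A production system would more likely query a graph store, which the article does not name.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build downstream and upstream adjacency maps from (source, target) pairs."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first search: every field reachable from `start`, excluding itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


lineage = [
    ("ods.a", "dwd.b"),
    ("dwd.b", "ads.c"),
    ("ods.x", "ads.c"),
]
downstream, upstream = build_graph(lineage)
impacted = reachable("ods.a", downstream)   # what breaks if ods.a changes
origins = reachable("ads.c", upstream)      # where an abnormal ads.c value may come from
```

The same upstream walk, started from a field tagged as sensitive, would yield an audit path.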
Future Outlook
Planned enhancements aim for smarter anomaly propagation inference, CI/CD integration for pre‑commit impact alerts, and natural‑language query interfaces powered by LLMs to let business users ask questions like "How is this metric calculated?".
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
