Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big-data environments tracks how data moves and is transformed, from source tables through SQL processing to final storage. It supports management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification. Implementations range from simple regex extraction to more robust AST parsing with pruning optimizations, and tools such as Alibaba DataWorks and Apache Atlas build on the same principles.

DeWu Technology

Nov 30, 2022

In the era of big data, data sources are abundant and data grows explosively. Data lineage describes the relationships among data as it is generated, processed, fused, transferred, and eventually discarded, forming metadata that enables effective data management and application.

The article outlines common use cases of data lineage: segmenting business domains, improving scheduling performance, detecting anomalies, optimizing data warehouse pipelines (links), and verifying scheduling dependencies.
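As an illustration of the last use case, the sketch below compares table-level lineage with the dependencies declared in a scheduler. It is a minimal example under stated assumptions: the table names, the `EDGES` list, and the `DECLARED` mapping are all hypothetical, and a real platform would read both from its metadata store.

```python
from collections import defaultdict

# Hypothetical table-level lineage edges: (source_table, target_table).
EDGES = [
    ("ods.orders", "dwd.orders"),
    ("ods.users", "dwd.users"),
    ("dwd.orders", "dws.user_orders"),
    ("dwd.users", "dws.user_orders"),
]

# Hypothetical scheduler configuration: the job that builds
# dws.user_orders forgot to declare dwd.users as a dependency.
DECLARED = {
    "dwd.orders": {"ods.orders"},
    "dwd.users": {"ods.users"},
    "dws.user_orders": {"dwd.orders"},
}


def build_upstream(edges):
    """Map each target table to the set of tables it reads directly."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def missing_dependencies(upstream, declared):
    """Return lineage dependencies that the scheduler does not declare."""
    missing = {}
    for table, sources in upstream.items():
        gap = sources - declared.get(table, set())
        if gap:
            missing[table] = gap
    return missing


print(missing_dependencies(build_upstream(EDGES), DECLARED))
# {'dws.user_orders': {'dwd.users'}}
```

Any table reported here is read by a job that may start before that table has been refreshed.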

Two main implementation approaches are discussed. The first uses regular expressions to extract source and target tables from SQL statements, e.g.:

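import re  # Python's standard regular-expression module
# Capture the name after FROM/JOIN (sources) and after INSERT INTO|OVERWRITE TABLE (target).
source_table_regex = re.compile(r"(?:from|join)\s+(\S*)(?:\s+|;)", re.IGNORECASE)
target_table_regex = re.compile(r"insert\s+(?:into|overwrite)\s+table\s+(\S*)\s+", re.IGNORECASE)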

However, this method breaks when a keyword appears inside a comment or a string literal, because a regex cannot tell such text apart from real SQL. For example:

select * 
--from tableA
from tableB;

select * from tableA
where description = "from Excel";
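Running the source-table pattern over both statements makes the failure concrete. The snippet below reuses the regex quoted earlier unchanged; the variable names `commented` and `in_string` are introduced here only for illustration.

```python
import re

source_table_regex = re.compile(r"(?:from|join)\s+(\S*)(?:\s+|;)", re.IGNORECASE)

# First statement: the real source is tableB; tableA is commented out.
commented = "select * \n--from tableA\nfrom tableB;"

# Second statement: "from Excel" is part of a string literal, not a clause.
in_string = 'select * from tableA\nwhere description = "from Excel";'

print(source_table_regex.findall(commented))
# ['tableA', 'tableB']   (tableA is a false positive)

print(source_table_regex.findall(in_string))
# ['tableA', 'Excel"']   (Excel" is a false positive)
```

In both cases the pattern reports a table that the query never reads.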

The second, more robust approach parses the SQL into an Abstract Syntax Tree (AST) using tools like ANTLR. The article details Hive SQL’s parsing pipeline: lexical analysis, syntax analysis, AST generation, query block extraction, operator tree construction, logical and physical optimization, and final MapReduce job translation.
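To show why even the first stage of that pipeline helps, here is a small hand-written lexer in Python. It is not ANTLR and not Hive's grammar. It is only a sketch, assuming a simplified token set (`TOKEN_SPEC`) and hypothetical helper names (`tokenize`, `source_tables`). Because comments and string literals become their own tokens, keywords inside them are never mistaken for clauses.

```python
import re

# Token kinds are tried in order. COMMENT and STRING come first so that
# keywords inside them are swallowed before WORD can match them.
TOKEN_SPEC = [
    ("COMMENT", r"--[^\n]*|/\*.*?\*/"),
    ("STRING", r"'(?:[^'\\]|\\.)*'" + "|" + r'"(?:[^"\\]|\\.)*"'),
    ("WORD", r"[A-Za-z_][\w.]*|`[^`]+`"),
    ("SPACE", r"\s+"),
    ("OTHER", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(sql):
    """Yield (kind, text) pairs, dropping comments and whitespace."""
    for match in TOKEN_RE.finditer(sql):
        kind = match.lastgroup
        if kind not in ("COMMENT", "SPACE"):
            yield kind, match.group()


def source_tables(sql):
    """Collect the identifier that directly follows a FROM or JOIN token."""
    tokens = list(tokenize(sql))
    tables = []
    for (kind, text), (next_kind, next_text) in zip(tokens, tokens[1:]):
        if kind == "WORD" and text.lower() in ("from", "join") and next_kind == "WORD":
            tables.append(next_text)
    return tables


print(source_tables("select * \n--from tableA\nfrom tableB;"))
# ['tableB']
print(source_tables('select * from tableA\nwhere description = "from Excel";'))
# ['tableA']
```

A lexer alone still does not understand subqueries, aliases, or common table expressions. That is the job of the later syntax-analysis and AST stages, which a generated parser handles.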

Key optimization steps include pruning irrelevant nodes (e.g., ORDER BY, LIMIT) and simplifying WHERE/HAVING subqueries with equivalence transformations (e.g., replacing conditions with 1=1) to reduce traversal complexity.
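The pruning idea can be sketched on a toy tree. The `Node` class and the node kinds (`ORDER_BY`, `LIMIT`, `WHERE`, `TABLE`, and so on) are hypothetical stand-ins, not Hive's actual AST token names. The sketch also assumes that the collapsed conditions are not needed for the lineage being extracted, which depends on the lineage granularity and is not spelled out in the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy AST node; kind names are illustrative only."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


PRUNABLE = {"ORDER_BY", "LIMIT"}  # assumed not to affect lineage


def prune(node):
    """Return a copy of the tree with lineage-irrelevant parts removed."""
    kept = []
    for child in node.children:
        if child.kind in PRUNABLE:
            continue  # drop the whole subtree
        if child.kind in ("WHERE", "HAVING"):
            # Collapse the condition to a constant-true placeholder.
            kept.append(Node(child.kind, "1=1"))
            continue
        kept.append(prune(child))
    return Node(node.kind, node.text, kept)


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)


def tables(node):
    found = [node.text] if node.kind == "TABLE" else []
    for child in node.children:
        found.extend(tables(child))
    return found


# select a from tableA where a > 1 order by a limit 10
tree = Node("QUERY", children=[
    Node("SELECT", children=[Node("COLUMN", "a")]),
    Node("FROM", children=[Node("TABLE", "tableA")]),
    Node("WHERE", children=[
        Node("COMPARE", ">", [Node("COLUMN", "a"), Node("LITERAL", "1")]),
    ]),
    Node("ORDER_BY", children=[Node("COLUMN", "a")]),
    Node("LIMIT", "10"),
])

pruned = prune(tree)
print(count_nodes(tree), count_nodes(pruned))  # 12 6
print(tables(pruned))                          # ['tableA']
```

The traversal visits half as many nodes here, and the extracted source table is unchanged.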

Finally, the article emphasizes that many commercial and open‑source products (e.g., Alibaba DataWorks, ByteDance DataLeap, Apache Atlas) implement these principles, and understanding the underlying mechanisms helps users make better use of data lineage tools.
