Big Data 28 min read

Unlocking Data Lineage: SQL Bloodline for Discovery, Governance & Protection

This article explains how SQL lineage (bloodline) technology can be leveraged in offline data warehouses to enable precise data discovery, automated tag propagation, fine‑grained data governance, column‑level TTL management, and dynamic masking for data protection, illustrating implementation steps, strategies, and real‑world use cases.

DataFunSummit
DataFunSummit
DataFunSummit
Unlocking Data Lineage: SQL Bloodline for Discovery, Governance & Protection

Introduction

This article describes the concrete application of data lineage (referred to as "bloodline") in offline data‑warehouse scenarios from a low‑level perspective.

Problem Background

Enterprises face two major challenges in data construction: (1) identifying data transmission links across expanding offline warehouses, including business‑line identification, sensitive‑data boundary marking, and storage‑cost control; (2) protecting sensitive data under increasingly strict security‑compliance requirements, which involves fine‑grained permission control and data masking.

Bloodline‑Based Data Discovery & Protection Solution

We propose a solution that uses SQL bloodline capabilities for data discovery and protection.

For discovery, table‑level and column‑level lineage are used to build a global lineage graph of all Hive tables, propagate tags from business‑known nodes, and enable precise data flow tracking.

For protection, the same lineage ability supports row‑column‑level permission extraction and dynamic masking, ensuring minimal‑privilege access and compliance.

SQL Bloodline Basics

SQL bloodline tracks how data moves and transforms through SQL statements. For example, given three tables (table1, table2, table3) and a join‑insert query, the lineage graph shows that table3.c1 originates from table1.a2 and table3.c2 from table2.b2.

How to Extract SQL Bloodline

The extraction process consists of:

Parsing and optimizing the SQL to generate a tree‑shaped execution plan.

Using the execution plan to locate the outermost column indexes.

Recursively traversing from those indexes down to the TableScan nodes, which represent the source tables/columns.

Optimization steps such as column pruning and constant folding improve extraction accuracy.

Strategy‑Based Diversified Bloodline Extraction

Beyond basic table/column lineage, we provide SDK‑wrapped strategies for:

Complex‑type column‑key lineage (e.g., JSON, Map, Struct).

Weak‑reference lineage covering WHERE, GROUP BY, ORDER BY, etc.

UDF‑customized lineage to ignore columns used only in conditional expressions.

Calculation‑relationship tagging (direct vs. expression) for columns.

Tag Propagation in Data Discovery

The three‑step tag propagation system:

Extract all SQL tasks and build a global lineage graph.

Start from business‑known nodes (tables/columns) and propagate tags downstream by analyzing SQL semantics.

Use the propagated tags to identify data usage and overall flow.

This enables precise labeling of millions of Hive tables and columns for fine‑grained operation and control.

Data Governance Using Bloodline

Traditional governance relies on HDFS access logs, which cannot distinguish column‑level usage. By leveraging column‑level lineage, we can compute column‑specific TTL (time‑to‑live) values, reducing storage cost. The process collects the longest partition usage range of a column’s downstream dependencies and recommends a TTL accordingly.

Bloodline in Data Protection

We apply bloodline to extract fine‑grained permissions:

SQL Permission Extraction : Parse SQL, generate execution plan, and extract table, column, and row permissions based on lineage.

Column Permission Rules : Verify permissions for returned columns and for columns appearing in equality filters.

Row Permission Rules : Focus on filter conditions where a column equals a constant.

Using the execution plan, we locate TableScan nodes, retrieve involved tables/columns, and map them back to the original data sources via lineage.

Dynamic Data Masking

Two masking timings are supported:

Pre‑computation masking : Insert masking operators before TableScan when sensitive columns are read.

Post‑computation masking : Insert masking at the outermost operator after full computation, using lineage to decide which result columns need masking.

This approach minimizes impact on SQL semantics while ensuring sensitive data is protected.

Conclusion

Bloodline capabilities enable comprehensive data discovery, tag propagation, fine‑grained governance, and secure data protection in large‑scale offline warehouses, providing a solid foundation for efficient data management and compliance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data GovernanceSQL lineageDynamic Maskingdata discoveryoffline data warehouse
DataFunSummit
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.