Implementing Field Lineage in Spark SQL: A Technical Deep Dive
The article details how to add field-lineage tracking to Spark SQL by creating a custom SparkSessionExtensions class that injects a check-analysis rule and a custom parser; together they capture INSERT statements, analyze the resulting physical plan, and emit a JSON mapping of source-to-target fields for data governance.
This article discusses the implementation of field-lineage functionality in Spark SQL: tracking how individual fields are transformed as tables are processed. Field lineage is crucial for understanding where data originates, where it ends up, and how it is transformed along the way, which significantly aids data quality and governance.
The platform plans to migrate Hive tasks to Spark SQL while implementing field lineage functionality. Through extensive research, the team discovered that Spark supports extensibility, allowing users to extend SQL parsing, logical plan analysis, optimization, and physical plan formation without modifying Spark's source code.
The implementation focuses on injecting custom rules into SparkSessionExtensions, specifically utilizing Check Analysis Rules since they don't require modifying the logical plan tree. The solution involves creating a custom Spark extension class that injects field lineage check rules and a custom SQL parser.
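Concretely, an extension class is wired into a session through Spark's `spark.sql.extensions` configuration, so no Spark source changes are needed. A minimal sketch of that registration (the class name here is hypothetical, not from the article):

```shell
# Register a (hypothetical) lineage extension at session startup;
# the class implements Function1[SparkSessionExtensions, Unit] and calls
# injection methods such as injectCheckRule and injectParser.
spark-sql --conf spark.sql.extensions=com.example.lineage.FieldLineageExtension
```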
The custom parser detects INSERT statements and stores the SQL text in a thread-local variable for later processing. The field lineage check rule then launches a separate thread to parse the SQL, analyze it through Spark's components (parser, analyzer, optimizer, planner), and obtain the physical plan.
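That capture-then-process handoff can be modeled outside Spark with thread-local storage. The sketch below is illustrative only: `capture_insert` and `process_lineage` are hypothetical names standing in for the custom parser and the check rule, not the article's actual code.

```python
import json
import threading

# Thread-local slot standing in for the one the custom parser writes to.
_captured = threading.local()

def capture_insert(sql: str) -> None:
    """Mimic the custom parser: stash INSERT statements for later analysis."""
    if sql.lstrip().upper().startswith("INSERT"):
        _captured.sql = sql

def process_lineage(results: list) -> None:
    """Mimic the check rule: hand the captured SQL to a separate thread,
    which in the real implementation would re-run parser/analyzer/optimizer/
    planner to obtain the physical plan."""
    sql = getattr(_captured, "sql", None)
    if sql is None:
        return
    worker = threading.Thread(target=lambda: results.append({"sql": sql}))
    worker.start()
    worker.join()

results = []
capture_insert("INSERT INTO t_target SELECT id, name FROM t_source")
process_lineage(results)
print(json.dumps(results))
```

Keeping the analysis on a separate thread matters because a check rule runs inside query compilation; the real implementation offloads the expensive lineage work the same way.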
The physical plan is preferred over the logical plan because it's more accurate and has a shorter, more direct lineage chain. The implementation iterates through the physical plan tree, handling different node types like Join, ExpandExec, Aggregate, Explode, and GenerateExec with special considerations.
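A minimal model of that traversal, using mock node classes in place of Spark's physical operators (the class shape, `mappings` field, and merge logic are simplifications for illustration, not Spark's API):

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    """Mock physical-plan node: operator name, per-node field mappings
    (output field -> input fields it derives from), and child nodes."""
    name: str
    mappings: dict
    children: list = field(default_factory=list)

def collect_mappings(node: PlanNode, acc: dict) -> dict:
    """Walk the plan tree top-down, merging each node's field mappings.
    Special-cased operators (Join, ExpandExec, GenerateExec, ...) would get
    dedicated handling here; this sketch treats all nodes uniformly."""
    for out_field, in_fields in node.mappings.items():
        acc.setdefault(out_field, []).extend(in_fields)
    for child in node.children:
        collect_mappings(child, acc)
    return acc

scan = PlanNode("FileSourceScanExec", {"src.id": [], "src.amount": []})
agg = PlanNode("HashAggregateExec", {"total": ["src.amount"]}, [scan])
project = PlanNode("ProjectExec", {"tgt.total": ["total"]}, [agg])
lineage = collect_mappings(project, {})
print(lineage)
```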
Field lineage is categorized into two types: projection lineage (fields in the SELECT list) and predicate lineage (fields referenced in WHERE conditions). The implementation builds a point-to-point mapping from source-table fields to target-table fields, creating a tree-like structure that traces field transformations from the leaf nodes (the original tables) up to the root.
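Resolving that tree down to original table columns is a transitive walk from each target field to the leaves. The helper below, and the JSON shape it prints, are illustrative assumptions, not the article's actual output format:

```python
import json

# Per-node mapping: output field -> input fields it was derived from.
# An empty list marks an original (leaf) column in a source table.
mapping = {
    "tgt.total": ["total"],
    "total": ["src.amount"],
    "src.amount": [],
    "tgt.id": ["src.id"],
    "src.id": [],
}

def resolve(name: str) -> list:
    """Follow the mapping transitively until hitting original columns."""
    parents = mapping.get(name, [])
    if not parents:
        return [name]
    leaves = []
    for p in parents:
        leaves.extend(resolve(p))
    return leaves

# Final lineage: each target-table field mapped to its source columns.
lineage = {f: resolve(f) for f in mapping if f.startswith("tgt.")}
print(json.dumps(lineage, sort_keys=True))
```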
The article provides a detailed example showing how complex SQL queries with CTEs, joins, and aggregations are processed to generate field lineage information, including the resulting JSON structure that represents the field relationships and transformations.
The implementation demonstrates that Spark SQL's extensibility allows for sophisticated data lineage tracking without modifying core Spark components, providing valuable insights for data governance and quality assurance in big data processing pipelines.
vivo Internet Technology