Implementing Field Lineage in Spark SQL: A Technical Deep Dive
The article details how to add field-lineage tracking to Spark SQL by creating a custom SparkSessionExtensions class that injects a check-analysis rule and a custom parser; together they capture INSERT statements, analyze the resulting physical plan, and emit a JSON mapping of source-to-target fields for data governance.
This article discusses the implementation of field-lineage functionality in Spark SQL: tracking how individual fields are transformed as tables are processed. Field lineage is crucial for understanding where data originates, where it ends up, and how it is transformed along the way, which significantly aids data quality and governance.
The platform plans to migrate Hive tasks to Spark SQL while implementing field lineage functionality. Through extensive research, the team discovered that Spark supports extensibility, allowing users to extend SQL parsing, logical plan analysis, optimization, and physical plan formation without modifying Spark's source code.
The implementation focuses on injecting custom rules into SparkSessionExtensions, specifically utilizing Check Analysis Rules since they don't require modifying the logical plan tree. The solution involves creating a custom Spark extension class that injects field lineage check rules and a custom SQL parser.
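Concretely, an extension class is wired into a session through Spark's `spark.sql.extensions` configuration, so no Spark source changes are needed. A minimal sketch of that registration (the class name here is hypothetical, not from the article):

```shell
# Register a (hypothetical) lineage extension at session startup;
# the class implements Function1[SparkSessionExtensions, Unit] and calls
# injection methods such as injectCheckRule and injectParser.
spark-sql --conf spark.sql.extensions=com.example.lineage.FieldLineageExtension
```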
The custom parser detects INSERT statements and stores the SQL text in a thread-local variable for later processing. The field lineage check rule then launches a separate thread to parse the SQL, analyze it through Spark's components (parser, analyzer, optimizer, planner), and obtain the physical plan.
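That capture-then-process handoff can be modeled outside Spark with thread-local storage. The sketch below is illustrative only: `capture_insert` and `process_lineage` are hypothetical names standing in for the custom parser and the check rule, not the article's actual code.

```python
import json
import threading

# Thread-local slot standing in for the one the custom parser writes to.
_captured = threading.local()

def capture_insert(sql: str) -> None:
    """Mimic the custom parser: stash INSERT statements for later analysis."""
    if sql.lstrip().upper().startswith("INSERT"):
        _captured.sql = sql

def process_lineage(results: list) -> None:
    """Mimic the check rule: hand the captured SQL to a separate thread,
    which in the real implementation would re-run parser/analyzer/optimizer/
    planner to obtain the physical plan."""
    sql = getattr(_captured, "sql", None)
    if sql is None:
        return
    worker = threading.Thread(target=lambda: results.append({"sql": sql}))
    worker.start()
    worker.join()

results = []
capture_insert("INSERT INTO t_target SELECT id, name FROM t_source")
process_lineage(results)
print(json.dumps(results))
```

Keeping the analysis on a separate thread matters because a check rule runs inside query compilation; the real implementation offloads the expensive lineage work the same way.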
The physical plan is preferred over the logical plan because it's more accurate and has a shorter, more direct lineage chain. The implementation iterates through the physical plan tree, handling different node types like Join, ExpandExec, Aggregate, Explode, and GenerateExec with special considerations.
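A minimal model of that traversal, using mock node classes in place of Spark's physical operators (the class shape, `mappings` field, and merge logic are simplifications for illustration, not Spark's API):

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    """Mock physical-plan node: operator name, per-node field mappings
    (output field -> input fields it derives from), and child nodes."""
    name: str
    mappings: dict
    children: list = field(default_factory=list)

def collect_mappings(node: PlanNode, acc: dict) -> dict:
    """Walk the plan tree top-down, merging each node's field mappings.
    Special-cased operators (Join, ExpandExec, GenerateExec, ...) would get
    dedicated handling here; this sketch treats all nodes uniformly."""
    for out_field, in_fields in node.mappings.items():
        acc.setdefault(out_field, []).extend(in_fields)
    for child in node.children:
        collect_mappings(child, acc)
    return acc

scan = PlanNode("FileSourceScanExec", {"src.id": [], "src.amount": []})
agg = PlanNode("HashAggregateExec", {"total": ["src.amount"]}, [scan])
project = PlanNode("ProjectExec", {"tgt.total": ["total"]}, [agg])
lineage = collect_mappings(project, {})
print(lineage)
```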
Field lineage is categorized into two types: projection lineage (fields in the SELECT list) and predicate lineage (fields referenced in WHERE conditions). The implementation builds a point-to-point mapping from source-table fields to target-table fields, creating a tree-like structure that traces field transformations from the leaf nodes (the original tables) up to the root.
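Resolving that tree down to original table columns is a transitive walk from each target field to the leaves. The helper below, and the JSON shape it prints, are illustrative assumptions, not the article's actual output format:

```python
import json

# Per-node mapping: output field -> input fields it was derived from.
# An empty list marks an original (leaf) column in a source table.
mapping = {
    "tgt.total": ["total"],
    "total": ["src.amount"],
    "src.amount": [],
    "tgt.id": ["src.id"],
    "src.id": [],
}

def resolve(name: str) -> list:
    """Follow the mapping transitively until hitting original columns."""
    parents = mapping.get(name, [])
    if not parents:
        return [name]
    leaves = []
    for p in parents:
        leaves.extend(resolve(p))
    return leaves

# Final lineage: each target-table field mapped to its source columns.
lineage = {f: resolve(f) for f in mapping if f.startswith("tgt.")}
print(json.dumps(lineage, sort_keys=True))
```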
The article provides a detailed example showing how complex SQL queries with CTEs, joins, and aggregations are processed to generate field lineage information, including the resulting JSON structure that represents the field relationships and transformations.
The implementation demonstrates that Spark SQL's extensibility allows for sophisticated data lineage tracking without modifying core Spark components, providing valuable insights for data governance and quality assurance in big data processing pipelines.
vivo Internet Technology