
Why Build Your Own Data Lineage Engine? Lessons from Apache Atlas to Duo-Lineage

This article explains what data lineage is, why it is essential for data governance in large‑scale big‑data platforms, compares Apache Atlas with a custom solution, and details the technical choices, architecture, and performance optimizations behind the self‑built Duo‑Lineage system.

Fangduoduo Tech

1. What Is Data Lineage

Data lineage, also called data provenance, describes the source and derivation relationships of data. It answers how a piece of data was created, which tables and fields contributed to it, and the upstream (dependencies) and downstream (dependents) directions.

Upstream dependency shows which tables the current table directly or indirectly depends on. Downstream dependency shows which tables depend on the current table.
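The two directions can be modeled as a pair of adjacency maps built from the same set of lineage edges. A minimal sketch in Python (the table names are hypothetical examples, not from the real system):

```python
from collections import defaultdict

# Each edge says: the target table is built from the source table.
edges = [
    ("ods_orders", "dwd_orders"),        # (source, target)
    ("dwd_orders", "ads_sales_report"),
    ("dim_region", "ads_sales_report"),
]

upstream = defaultdict(set)    # table -> tables it directly depends on
downstream = defaultdict(set)  # table -> tables that directly depend on it
for src, dst in edges:
    upstream[dst].add(src)
    downstream[src].add(dst)

print(upstream["ads_sales_report"])  # direct upstream dependencies
print(downstream["ods_orders"])      # direct downstream dependents
```

Storing both directions doubles the memory but makes each direction a single map lookup, which matters later for query performance.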

Example: When a salesperson finds an error in last month’s performance data, a data warehouse developer traces the issue by locating the table and field behind the report, checking the script that produces it, then moving upstream to the tables and fields it depends on, repeating the process until the root cause is found.

Downstream example: Before decommissioning a table, you must check whether any other tables depend on it. Note that the absence of downstream lineage alone does not guarantee the table is unused.

2. Why Build Data Lineage

Data governance becomes necessary as a data platform grows; manual Excel or document tracking cannot keep up with hundreds of tables and changing personnel. Without governance, duplicate metrics, inconsistent data, and siloed development arise.

Lineage is a core component of data governance, enabling data maps, quality checks, and lifecycle management. It also supports metric derivation and automated anomaly detection by tracing abnormal metric values back to their source tables and fields.

3. Existing Solutions: Apache Atlas

1. Table‑level lineage

Hive’s own tooling, such as its execution hooks, can provide basic table‑level lineage.

2. Field‑level lineage

For finer granularity, Apache Atlas offers metadata management and lineage capabilities, supporting many big‑data components (Hive, Sqoop, Falcon, Storm, HBase, etc.) and storing data in HBase with Solr indexing and JanusGraph for graph representation.

Our first lineage implementation used Apache Atlas, but we encountered limited support for complex queries and performance issues due to its underlying index architecture.

4. Self‑Built Duo‑Lineage

1. Why Re‑invent the Wheel

Data governance requires lineage to be collected before script execution, but Atlas gathers lineage passively via Hive hooks, so lineage only becomes available after a script has already run. A proactive approach enables automatic validation of naming conventions, layer separation, and dependency discovery before execution, and feeds downstream systems such as release pipelines and intelligent schedulers.
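Pre-execution validation of naming conventions and layer separation might look like the following sketch. The layer prefixes (`ods`, `dwd`, `dws`, `ads`) and the rule that a table must not read from a higher layer are assumed conventions for illustration, not the real system's rules:

```python
import re

# Hypothetical warehouse-layer prefixes; actual conventions vary by team.
LAYER_ORDER = {"ods": 0, "dwd": 1, "dws": 2, "ads": 3}
NAME_RE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9_]+$")

def layer_of(table: str) -> int:
    """Map a table name to its warehouse layer, enforcing the naming rule."""
    m = NAME_RE.match(table)
    if not m:
        raise ValueError(f"table {table!r} violates the naming convention")
    return LAYER_ORDER[m.group(1)]

def validate(target: str, sources: list) -> list:
    """Return a list of violations, found before the script is ever run."""
    try:
        t = layer_of(target)
    except ValueError as e:
        return [str(e)]
    problems = []
    for s in sources:
        try:
            if layer_of(s) > t:
                problems.append(f"{target} reads from higher layer {s}")
        except ValueError as e:
            problems.append(str(e))
    return problems
```

Because lineage is extracted before execution, checks like these can gate a release pipeline instead of merely reporting problems after the fact.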

2. Technical Selection

We focus on two data sources: Sqoop scripts (converted to SQL) and various SQL dialects (Hive, Spark, Flink, etc.). Parsing SQL into an abstract syntax tree (AST) is the key step. After evaluating options—Alibaba Druid, Uber’s queryparser, Hive’s built‑in parser, and Antlr4—we chose Alibaba Druid for its simplicity and sufficient Hive support.

3. Implementation

The process parses each script into an AST, traverses it, and extracts lineage information. Sqoop scripts are first converted to SQL. Syntax the parser does not support is stripped when it does not affect lineage. Statements are categorized (CREATE TABLE, INSERT, SELECT, DROP TABLE); temporary tables are not recorded as lineage nodes themselves — the lineage that flows through them is attributed directly to their permanent upstream tables.

Parsing focuses on the FROM clause as the source, then WHERE, GROUP BY, and SORT BY to capture global field dependencies. The resulting lineage is stored in a dedicated database.
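The two core steps — pulling direct sources out of a statement and collapsing temporary tables — can be sketched as follows. The real system traverses a Druid AST; the regexes here are a deliberately naive stand-in, and all table names and the `tmp_` prefix convention are hypothetical:

```python
import re

def direct_sources(sql: str):
    """Toy stand-in for AST traversal: pull the target table from
    INSERT/CREATE and the source tables from FROM/JOIN clauses."""
    target_m = re.search(
        r"(?:insert\s+(?:into|overwrite)\s+(?:table\s+)?|create\s+table\s+)(\w+)",
        sql, re.I)
    target = target_m.group(1) if target_m else None
    sources = set(re.findall(r"(?:from|join)\s+(\w+)", sql, re.I))
    return target, sources

def resolve_temp(lineage: dict, is_temp) -> dict:
    """Collapse temporary tables: attribute the lineage flowing through
    them directly to the permanent tables that read from them."""
    def expand(tbl, seen=frozenset()):
        out = set()
        for src in lineage.get(tbl, set()):
            if is_temp(src) and src not in seen:
                out |= expand(src, seen | {src})  # pass lineage through
            else:
                out.add(src)
        return out
    return {t: expand(t) for t in lineage if not is_temp(t)}
```

A real parser must also handle subqueries, CTEs, aliases, and dialect quirks, which is exactly why an AST rather than regexes is the key step.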

System components include:

Hook plugins (e.g., Hive hook) to capture executing SQL.

lineage‑api: Dubbo client interface for external calls.

lineage‑server: Dubbo and RESTful services exposing lineage data.

Unit tests verify each script type to prevent regressions.
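Such per-script-type regression tests might be shaped like the sketch below. The `extract_sources` helper is a trivial hypothetical stand-in for the real Druid-based extractor, and the sample scripts are invented:

```python
import re
import unittest

def extract_sources(sql: str) -> set:
    # Toy stand-in for the real AST-based lineage extractor.
    return set(re.findall(r"(?:from|join)\s+(\w+)", sql, re.I))

class LineageRegressionTests(unittest.TestCase):
    # One case per supported script type, so any parser change that
    # breaks an existing script type fails the build.
    CASES = {
        "plain_select": ("SELECT id FROM ods_orders", {"ods_orders"}),
        "insert": ("INSERT INTO dwd_orders SELECT * FROM ods_orders",
                   {"ods_orders"}),
        "join": ("SELECT * FROM dwd_orders o JOIN dim_region r",
                 {"dwd_orders", "dim_region"}),
    }

    def test_all_script_types(self):
        for name, (sql, expected) in self.CASES.items():
            with self.subTest(script=name):
                self.assertEqual(extract_sources(sql), expected)
```

Run with `python -m unittest`; each new script type the engine learns to parse gets a corresponding fixed case.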

4. Query Performance

To achieve sub‑10 ms response times for complex graphs involving hundreds of tables, we replaced traditional Redis caching with Hazelcast distributed in‑memory caching, which synchronizes updates automatically and provides distributed locks.

At startup, the system loads direct upstream and downstream relationships into an in‑memory map; queries recursively traverse this map, handling cyclic dependencies to avoid infinite loops.
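The cycle-safe recursive traversal can be sketched as follows; a visited set ensures that cyclic dependencies (which do occur in messy real-world lineage) cannot cause infinite recursion. The map contents are hypothetical:

```python
def all_upstream(table, upstream, _seen=None):
    """Transitive upstream closure over the in-memory map, guarded by a
    visited set so cyclic dependencies cannot recurse forever."""
    seen = _seen if _seen is not None else set()
    result = set()
    for dep in upstream.get(table, set()):
        if dep in seen:
            continue  # already expanded (or part of a cycle) - skip
        seen.add(dep)
        result.add(dep)
        result |= all_upstream(dep, upstream, seen)
    return result

# Direct relationships loaded into memory at startup (sample data).
upstream = {
    "ads_report": {"dwd_orders"},
    "dwd_orders": {"ods_orders", "dwd_orders"},  # self-cycle for illustration
    "ods_orders": set(),
}
print(all_upstream("ads_report", upstream))
```

Because every hop is an in-memory set lookup rather than a database round trip, even graphs spanning hundreds of tables resolve in milliseconds.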

5. Future Outlook

Open‑source the project to help fill the gap in available data‑lineage tooling.

Leverage the AST to extract field‑level sub‑queries, allowing users to view the computation logic of a single field without the noise of other fields.

Tags: Big Data, data lineage, SQL parsing, distributed cache, data governance, Apache Atlas
Written by

Fangduoduo Tech

Sharing Fangduoduo's product and tech insights, delivering value, and giving back to the open community.
