
Comprehensive Guide to Data Lineage: Model Design, Optimization, and Use Cases at ByteDance

This article presents an in-depth overview of data lineage at ByteDance: the design of the presentation, abstraction, implementation, and storage layers; optimization techniques for real-time updates and queries; open export methods; practical use cases across the asset, development, governance, and security domains; and future directions.

Big Data Technology & Architecture

Data lineage is a fundamental capability that helps users discover, understand, and leverage data. This article focuses on the storage and export of data lineage, sharing model design, optimization strategies, and practical use cases from ByteDance, many of which are offered through Volcano Engine DataLeap.

1. Data Lineage Model – Challenges

As the business, the user base, and the data warehouse grow, metadata volume grows non-linearly. This puts pressure on the lineage system's scalability, performance, and timeliness, and on its ability to support business scenarios.

2. Data Lineage Model – Presentation Layer

ByteDance maintains various metadata types (Hive, ClickHouse, Kafka, ES, Redis) on a unified platform, which visualizes each asset together with its upstream and downstream relationships.

3. Data Lineage Model – Abstraction Layer

The abstraction layer models lineage with two kinds of nodes: asset nodes (drawn as circles) and task nodes (drawn as diamonds). Examples include a FlinkSQL job that consumes a Kafka topic and writes to a Hive table, schema propagation across topics and tables, and hierarchical task-asset connections.
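The asset/task split can be illustrated with a minimal sketch. The class names, field names, and the sample topic and table names below are hypothetical and are not taken from ByteDance's model; the sketch only shows how a task node connects input assets to output assets.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AssetNode:
    """A data asset such as a Kafka topic or a Hive table (a circle in the diagram)."""
    asset_type: str
    qualified_name: str


@dataclass
class TaskNode:
    """A task that reads some assets and writes others (a diamond in the diagram)."""
    name: str
    task_type: str
    inputs: list[AssetNode] = field(default_factory=list)
    outputs: list[AssetNode] = field(default_factory=list)

    def edges(self) -> list[tuple[AssetNode, AssetNode]]:
        """Asset-to-asset lineage implied by this task: every input feeds every output."""
        return [(src, dst) for src in self.inputs for dst in self.outputs]


# Hypothetical example: a FlinkSQL job that reads a Kafka topic and writes a Hive table.
topic = AssetNode("kafka_topic", "events.clicks")
table = AssetNode("hive_table", "dw.clicks_hourly")
job = TaskNode("clicks_to_hive", "flink_sql", inputs=[topic], outputs=[table])
```

Keeping the task as its own node, rather than only storing asset-to-asset edges, means the edge set can be recomputed from the task whenever its inputs or outputs change.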

4. Data Lineage Model – Implementation Layer

The implementation layer is built primarily on Apache Atlas. Its type system is extended with ByteDance-specific attributes and sub-task definitions so that task-related metadata can be stored.
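As a rough illustration of what extending the Atlas type system can look like, the sketch below builds a type-definition payload for a custom task type. The type name and attribute names are hypothetical, and the payload layout (`entityDefs`, `superTypes`, `attributeDefs`) reflects the Atlas v2 typedef JSON format as commonly documented; it should be checked against the Atlas version in use. The article does not describe ByteDance's actual type definitions.

```python
import json


def make_task_typedef(type_name: str, extra_attributes: dict[str, str]) -> dict:
    """Build a typedef payload for a task type that inherits from Atlas's built-in Process type.

    extra_attributes maps a (hypothetical) attribute name to an Atlas type name such as "string".
    """
    attribute_defs = [
        {
            "name": attr_name,
            "typeName": atlas_type,
            "isOptional": True,
            "cardinality": "SINGLE",
        }
        for attr_name, atlas_type in extra_attributes.items()
    ]
    return {
        "entityDefs": [
            {
                "name": type_name,
                # Process already carries `inputs` and `outputs`, which hold the lineage itself.
                "superTypes": ["Process"],
                "attributeDefs": attribute_defs,
            }
        ]
    }


# Hypothetical custom task type with two platform-specific attributes.
payload = make_task_typedef("example_flink_task", {"owner": "string", "schedule": "string"})
payload_json = json.dumps(payload, indent=2)
```

The resulting JSON would be submitted to the Atlas typedefs endpoint; the network call is left out so the sketch stays self-contained.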

5. Data Lineage Model – Storage Layer

The storage layer uses JanusGraph, the graph database native to Atlas, backed by HBase. When cost or performance requires it, an OLTP database such as MySQL is used instead.
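To show how lineage can live in a relational OLTP store instead of a graph database, here is a minimal sketch of an edge table with a recursive upstream query. It uses Python's built-in `sqlite3` as a stand-in for MySQL; the table name, columns, and sample rows are assumptions made for illustration, not ByteDance's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (upstream asset, downstream asset, task) edge.
conn.execute(
    "CREATE TABLE lineage_edge ("
    " src TEXT NOT NULL, dst TEXT NOT NULL, task TEXT NOT NULL,"
    " PRIMARY KEY (src, dst, task))"
)
# Upstream lookups filter on dst, so index it separately.
conn.execute("CREATE INDEX idx_edge_dst ON lineage_edge (dst)")
conn.executemany(
    "INSERT INTO lineage_edge VALUES (?, ?, ?)",
    [
        ("kafka.clicks", "hive.ods_clicks", "flink_ingest"),
        ("hive.ods_clicks", "hive.dws_clicks_daily", "daily_agg"),
        ("hive.dim_user", "hive.dws_clicks_daily", "daily_agg"),
    ],
)


def upstream(connection: sqlite3.Connection, asset: str) -> list[str]:
    """Return every direct and indirect upstream asset of `asset`, sorted by name."""
    rows = connection.execute(
        """
        WITH RECURSIVE up(name) AS (
            SELECT src FROM lineage_edge WHERE dst = ?
            UNION
            SELECT e.src FROM lineage_edge e JOIN up ON e.dst = up.name
        )
        SELECT name FROM up ORDER BY name
        """,
        (asset,),
    ).fetchall()
    return [row[0] for row in rows]
```

`UNION` (rather than `UNION ALL`) removes duplicates, which also keeps the recursion from looping if the edge table ever contains a cycle.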

Data Lineage Optimization

1. Real-time Update Optimization

Two approaches were evaluated: (a) engine-side hooks that push lineage after the DAG is constructed, which are independent but costly to integrate; and (b) notifications from the task platform via API or MQ, which are easier to extend. ByteDance adopted the latter, reducing update latency from days to minutes.
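The notification-based approach can be sketched as a consumer that applies each task change to the lineage store as it arrives. The message fields (`task`, `action`, `inputs`, `outputs`) and the in-memory store are hypothetical; the article does not specify the message format or the consumer logic.

```python
import json


class LineageStore:
    """In-memory stand-in for the lineage store, keyed by task name."""

    def __init__(self) -> None:
        self.edges_by_task: dict[str, set[tuple[str, str]]] = {}

    def apply_event(self, raw_message: str) -> tuple[set, set]:
        """Apply one task-change notification and return (added_edges, removed_edges)."""
        event = json.loads(raw_message)
        task = event["task"]
        if event["action"] == "delete":
            new_edges: set[tuple[str, str]] = set()
        else:
            # A created or updated task carries its full, current inputs and outputs.
            new_edges = {(s, d) for s in event["inputs"] for d in event["outputs"]}
        old_edges = self.edges_by_task.get(task, set())
        added, removed = new_edges - old_edges, old_edges - new_edges
        if new_edges:
            self.edges_by_task[task] = new_edges
        else:
            self.edges_by_task.pop(task, None)
        return added, removed


store = LineageStore()
```

Because each message replaces the task's whole edge set, replaying a message is harmless: the second application yields empty `added` and `removed` sets.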

2. Query Optimization

To speed up lineage queries that span many nodes, batch query interfaces were added to JanusGraph and asynchronous batch processing was introduced. This yielded noticeable performance gains for frequently queried assets.
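The idea behind batching can be sketched as a level-by-level traversal that looks up a whole frontier of nodes per request and issues the requests concurrently. `fetch_downstream_batch` is a hypothetical stand-in for a batch query interface, and `GRAPH` is toy data; neither reflects the actual JanusGraph extension described here.

```python
import asyncio

# Toy downstream adjacency used in place of a real graph store.
GRAPH = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": []}


async def fetch_downstream_batch(node_ids: list[str]) -> dict[str, list[str]]:
    """Hypothetical batch lookup: one round trip returns the children of many nodes."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {node: GRAPH.get(node, []) for node in node_ids}


async def downstream_closure(start: str, batch_size: int = 2) -> tuple[set[str], int]:
    """Return all downstream nodes of `start` and the number of batch requests issued."""
    seen = {start}
    frontier = [start]
    round_trips = 0
    while frontier:
        chunks = [frontier[i:i + batch_size] for i in range(0, len(frontier), batch_size)]
        # All chunks of one level are requested concurrently.
        results = await asyncio.gather(*(fetch_downstream_batch(c) for c in chunks))
        round_trips += len(chunks)
        next_frontier = []
        for result in results:
            for children in result.values():
                for child in children:
                    if child not in seen:
                        seen.add(child)
                        next_frontier.append(child)
        frontier = next_frontier
    return seen - {start}, round_trips
```

Calling `asyncio.run(downstream_closure("a"))` walks the toy graph. With one lookup per node the same traversal would need a request for every visited node, so the saving grows with the width of each level.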

3. Open Export

Lineage data can be exported to Excel, written to warehouse tables, served through APIs, or published as change-feed topics, so downstream consumers can choose the method that best fits their needs.
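Two of these export styles are easy to contrast in a small sketch: a full snapshot written as CSV (which spreadsheet tools can open) and a change feed that emits only the differences between two snapshots. The column names and the message shape are assumptions for illustration only.

```python
import csv
import io
import json


def export_csv(edges) -> str:
    """Full snapshot: one row per (upstream, downstream, task) edge."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["upstream", "downstream", "task"])
    writer.writerows(edges)
    return buffer.getvalue()


def change_feed(old_edges: set, new_edges: set) -> list[str]:
    """Incremental export: one JSON message per removed or added edge."""
    messages = []
    for op, edges in (("remove", old_edges - new_edges), ("add", new_edges - old_edges)):
        for upstream, downstream, task in sorted(edges):
            messages.append(
                json.dumps(
                    {"op": op, "upstream": upstream, "downstream": downstream, "task": task}
                )
            )
    return messages
```

A snapshot suits offline analysis, while a change feed suits consumers that need to react to lineage changes without re-reading everything.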

Data Lineage Use Cases

1. Asset Domain

Lineage supports asset hotness calculation (similar to PageRank) and helps users trace where data comes from during development or troubleshooting.
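The article says only that hotness is computed in a way similar to PageRank. The sketch below is one possible reading, not ByteDance's formula: each asset passes its score to the assets it reads from, so an asset with many (and important) consumers ends up hotter. The damping factor, iteration count, and sample asset names are assumptions.

```python
def hotness(consumers_of: dict[str, list[str]], damping: float = 0.85,
            iterations: int = 50) -> dict[str, float]:
    """PageRank-style score where each consumer 'votes' for the assets it reads from.

    consumers_of[asset] lists the assets that read directly from `asset`.
    """
    nodes = set(consumers_of) | {c for cs in consumers_of.values() for c in cs}
    sources_of: dict[str, list[str]] = {n: [] for n in nodes}
    for source, consumers in consumers_of.items():
        for consumer in consumers:
            sources_of[consumer].append(source)

    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assets with no upstream have nobody to vote for; spread their score evenly.
        dangling = sum(score[node] for node in nodes if not sources_of[node])
        new_score = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for consumer, sources in sources_of.items():
            for source in sources:
                new_score[source] += damping * score[consumer] / len(sources)
        score = new_score
    return score


# Hypothetical lineage: ods feeds two tables, one of which feeds a report.
scores = hotness({"ods": ["dws_a", "dws_b"], "dws_a": ["report"]})
```

In this toy graph `ods` comes out hottest because both `dws_a` and `dws_b` depend on it, and `dws_a` outranks `dws_b` because a report depends on it.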

2. Development Domain

Lineage enables impact analysis before a change and root-cause analysis after one, by traversing downstream and upstream lineage respectively.
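Both analyses reduce to a breadth-first traversal of the lineage graph in one direction or the other. The following is a minimal sketch with hypothetical asset names; it returns each reachable asset with its distance from the starting point, which is useful for ranking how directly something is affected.

```python
from collections import deque


def traverse(edges: list[tuple[str, str]], start: str,
             direction: str = "downstream") -> dict[str, int]:
    """Breadth-first traversal over (upstream, downstream) edges.

    Returns {asset: number of hops from start}, excluding the start itself.
    """
    adjacency: dict[str, list[str]] = {}
    for src, dst in edges:
        key, value = (src, dst) if direction == "downstream" else (dst, src)
        adjacency.setdefault(key, []).append(value)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:  # also protects against cycles
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    del depth[start]
    return depth


EDGES = [("ods", "dwd"), ("dwd", "dws"), ("dwd", "ads"), ("dim", "dws")]
impacted = traverse(EDGES, "ods", "downstream")   # impact analysis before changing ods
suspects = traverse(EDGES, "dws", "upstream")     # root-cause candidates for a problem in dws
```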

3. Governance Domain

Lineage supports link-status tracking for critical tasks and data-warehouse governance work such as removing redundant tables.

4. Security Domain

Lineage is used to enforce security-level rules along lineage edges (e.g., downstream assets must have higher security levels) and to automate security-tag propagation.
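A rule check and a simple propagation pass can be sketched as follows. The sketch reads "higher" as "at least as high as the upstream asset", which is an assumption; change the comparison if the rule is meant to be strict. Numeric levels, the default of 0 for untagged assets, and the asset names are also hypothetical.

```python
def find_violations(edges: list[tuple[str, str]], level: dict[str, int]) -> list[tuple[str, str]]:
    """Edges whose downstream asset has a lower security level than its upstream asset."""
    return [(src, dst) for src, dst in edges if level.get(dst, 0) < level.get(src, 0)]


def propagate_levels(edges: list[tuple[str, str]], level: dict[str, int]) -> dict[str, int]:
    """Raise each downstream asset to at least the level of everything upstream of it."""
    result = dict(level)
    changed = True
    while changed:  # terminates: levels only rise and are bounded by the maximum level
        changed = False
        for src, dst in edges:
            if result.get(dst, 0) < result.get(src, 0):
                result[dst] = result[src]
                changed = True
    return result


EDGES = [("a", "b"), ("b", "c")]
LEVELS = {"a": 3, "b": 1, "c": 2}
violations = find_violations(EDGES, LEVELS)
fixed_levels = propagate_levels(EDGES, LEVELS)
```

Here `b` violates the rule because it reads from the more sensitive `a`; propagation lifts `b`, and then `c`, to level 3.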

Future Outlook

1. Technical Trends

Technical trends include generalized lineage parsing (a standard SQL engine), non-intrusive collection for non-SQL jobs (e.g., JAR tasks), and temporal lineage that captures how lineage evolves over time.

2. Application Trends

Application trends include standardized lineage APIs, end-to-end lineage across the front-end, back-end, and reporting layers, and full-link lineage capabilities offered in the cloud.

Overall, the comprehensive data lineage framework at ByteDance reduces development cost for new lineage links, simplifies updates and deletions, and enables a wide range of business scenarios across assets, development, governance, and security.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
