How ByteDance Designs Scalable Data Lineage for Big Data Governance
This article explains ByteDance's data lineage architecture, covering data sources, processing pipelines, graph‑based modeling, key application scenarios, quality metrics such as accuracy, coverage and timeliness, and future directions for improving and standardizing lineage across its massive data platform.
ByteDance Data Lineage Overview
Data lineage describes the source, transformation, and destination of data across processing stages and is a foundational capability for leveraging data value within an organization.
Data Sources
ByteDance data originates from two kinds of sources: endpoint data (events reported by APP/Web SDKs and sent through LogService to MQ) and business data (operations from APP/Web/third‑party services that are stored in RDS and then streamed to MQ).
Processing Flow
Data in MQ is split, transformed, and routed.
The offline warehouse (Hive) consumes data via HiveSQL or Spark jobs and writes to downstream stores such as ClickHouse.
The real‑time warehouse (MQ) processes data with FlinkSQL or generic Flink jobs, which join the stream against side (dimension) tables before writing to various stores.
Typical data outputs include metric systems, reporting systems, and data services.
Application Scenarios
Lineage supports multiple scenarios such as data asset popularity ranking, asset context understanding, impact analysis for developers, root‑cause attribution, link‑state tracking, warehouse governance, and security compliance checks.
Data asset: calculate reference heat based on downstream lineage.
Data development: notify downstream owners when upstream tasks change.
Attribution analysis: trace issues to upstream tasks.
Governance: track core link status.
Security: ensure downstream assets do not have lower security levels than upstream.
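Several of these scenarios reduce to simple traversals over downstream lineage. The following is a minimal sketch, not ByteDance's implementation: the asset names, the `DOWNSTREAM` and `SECURITY_LEVEL` tables, and the convention that a higher number means a more restricted asset are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical table-level lineage: asset -> assets that directly consume it.
DOWNSTREAM = {
    "mq.app_events": ["hive.ods_events"],
    "hive.ods_events": ["hive.dwd_events", "clickhouse.events_rt"],
    "hive.dwd_events": ["clickhouse.report_daily"],
    "clickhouse.events_rt": [],
    "clickhouse.report_daily": [],
}

# Hypothetical security levels; assumption: a higher number is more restricted.
SECURITY_LEVEL = {
    "mq.app_events": 2,
    "hive.ods_events": 2,
    "hive.dwd_events": 2,
    "clickhouse.events_rt": 1,
    "clickhouse.report_daily": 2,
}


def all_downstream(asset, downstream=DOWNSTREAM):
    """Impact analysis: every asset reachable from `asset`, breadth-first."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def reference_heat(asset, downstream=DOWNSTREAM):
    """Asset popularity: number of distinct downstream assets."""
    return len(all_downstream(asset, downstream))


def security_violations(downstream=DOWNSTREAM, levels=SECURITY_LEVEL):
    """Edges where the downstream asset is less restricted than its upstream."""
    return [
        (up, down)
        for up, children in downstream.items()
        for down in children
        if levels[down] < levels[up]
    ]
```

Calling `all_downstream("hive.ods_events")` gives the assets whose owners would be notified when that table's producing task changes, and `security_violations()` lists the edges a compliance check would flag.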
Overall Design
The system consists of three parts: task ingestion, lineage parsing, and data export.
Task Ingestion
Provides two pipelines: a near‑real‑time pipeline, in which tasks publish their changes to MQ, and an offline pipeline, which periodically pulls full or incremental task information through an API.
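One way the two pipelines can complement each other is sketched below. The article does not say how they are reconciled, so this is an assumption: change events are applied as they arrive, and a periodic full pull repairs anything that was missed. The event fields (`op`, `task_id`, `version`, `definition`) and both function names are hypothetical.

```python
def apply_change_event(store, event):
    """Near-real-time path: apply one task change message to the task store."""
    task_id = event["task_id"]
    if event["op"] == "delete":
        store.pop(task_id, None)
        return
    current = store.get(task_id)
    # Ignore stale or duplicate messages (assumes versions increase per task).
    if current is None or event["version"] > current["version"]:
        store[task_id] = {
            "version": event["version"],
            "definition": event["definition"],
        }


def reconcile_with_full_pull(store, snapshot):
    """Offline path: align the store with a full pull; return repaired task ids."""
    repaired = []
    for task_id, task in snapshot.items():
        current = store.get(task_id)
        if current is None or current["version"] < task["version"]:
            store[task_id] = dict(task)
            repaired.append(task_id)
    # Tasks absent from the full pull are treated as deleted.
    for task_id in list(store):
        if task_id not in snapshot:
            del store[task_id]
            repaired.append(task_id)
    return sorted(repaired)
```

Each repaired task id would then be sent to lineage parsing again, so a lost MQ message delays an update rather than dropping it.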
Lineage Parsing
Defines a unified LineageInfo model. Different TaskTypes (SQL, DTS, generic) have custom parsers, with fallback strategies.
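The parser-per-TaskType idea with a fallback can be sketched as follows. Only the names LineageInfo and TaskType come from the article; the fields of `LineageInfo`, the task dictionary layout, the regex-based SQL parsing, and the fallback to user-declared inputs and outputs are assumptions chosen to keep the example short. A production SQL parser would work on a syntax tree rather than regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LineageInfo:
    """Unified result of parsing one task (field names are hypothetical)."""
    task_id: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def parse_sql(task):
    """Rough SQL parser: finds INSERT targets and FROM/JOIN sources."""
    sql = task["sql"]
    outputs = set(re.findall(
        r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", sql, re.I))
    inputs = set(re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.I))
    if not outputs:
        raise ValueError("no output table found")
    return LineageInfo(task["id"], inputs, outputs)


def parse_dts(task):
    """Data-transfer task: one configured source and one configured target."""
    return LineageInfo(task["id"], {task["source"]}, {task["target"]})


def parse_declared(task):
    """Fallback: trust the inputs and outputs declared on the task itself."""
    return LineageInfo(
        task["id"], set(task.get("inputs", [])), set(task.get("outputs", [])))


PARSERS = {"sql": parse_sql, "dts": parse_dts}


def parse_task(task):
    """Dispatch on task type; fall back to declared lineage on failure."""
    parser = PARSERS.get(task["type"], parse_declared)
    try:
        return parser(task)
    except (KeyError, ValueError):
        return parse_declared(task)
```

Because every parser returns the same `LineageInfo` shape, the export step does not need to know which task type produced a given record.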
Data Export
Exports LineageInfo to the Data Catalog via API, MQ incremental messages, or offline warehouse exports.
Data Model
Uses a graph with two node types (data nodes and task nodes) and two edge types (data‑to‑task consumption, task‑to‑data production). Column‑level lineage is modeled by adding redundant task nodes.
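A minimal in-memory sketch of this model is shown below. The class name, method names, and the `task_id#column` naming for the extra task nodes are hypothetical, and reading "redundant task nodes" as one additional task node per output column is an interpretation rather than something the article spells out.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph with data nodes and task nodes."""

    def __init__(self):
        self.node_kind = {}                # node id -> "data" or "task"
        self.out_edges = defaultdict(set)  # node id -> successor node ids

    def add_task(self, task_id, inputs, outputs):
        """Register a task with the data it consumes and produces."""
        self.node_kind[task_id] = "task"
        for data_id in inputs:
            self.node_kind[data_id] = "data"
            self.out_edges[data_id].add(task_id)   # data-to-task consumption
        for data_id in outputs:
            self.node_kind[data_id] = "data"
            self.out_edges[task_id].add(data_id)   # task-to-data production

    def add_column_mapping(self, task_id, in_columns, out_column):
        """Column-level lineage via an extra task node per output column."""
        self.add_task(f"{task_id}#{out_column}", in_columns, [out_column])

    def downstream_data(self, data_id):
        """Data nodes produced by any task that consumes `data_id`."""
        result = set()
        for task_id in self.out_edges.get(data_id, ()):
            result |= self.out_edges.get(task_id, set())
        return result
```

With this layout, table-level and column-level lineage share the same two edge types, so one traversal routine serves both.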
Metrics
Three key metrics evaluate lineage quality: accuracy (percentage of tasks whose upstream/downstream matches actual lineage), coverage (fraction of assets with at least one lineage link), and timeliness (end‑to‑end latency from task change to lineage store update).
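The three definitions translate directly into small functions. This is an illustrative sketch: the input shapes (a dictionary of task id to `(inputs, outputs)` pairs, a list of asset names, and a list of `(upstream, downstream)` edges) are assumptions, and how the verified "actual" lineage is obtained is outside its scope.

```python
from datetime import datetime


def accuracy(parsed, actual):
    """Share of verified tasks whose parsed lineage equals the actual lineage."""
    correct = sum(1 for task_id in actual if parsed.get(task_id) == actual[task_id])
    return correct / len(actual)


def coverage(assets, edges):
    """Share of assets that appear in at least one lineage edge."""
    linked = {asset for edge in edges for asset in edge}
    return len(linked & set(assets)) / len(assets)


def timeliness_seconds(changed_at, stored_at):
    """End-to-end latency from a task change to the lineage store update."""
    return (stored_at - changed_at).total_seconds()


if __name__ == "__main__":
    actual = {"t1": ({"a"}, {"b"}), "t2": ({"b"}, {"c"})}
    parsed = {"t1": ({"a"}, {"b"}), "t2": ({"b"}, {"d"})}
    print(accuracy(parsed, actual))                      # one of two tasks matches
    print(coverage(["a", "b", "c", "d"], [("a", "b")]))  # two of four assets linked
    print(timeliness_seconds(datetime(2024, 1, 1, 12, 0, 0),
                             datetime(2024, 1, 1, 12, 0, 42)))
```

Accuracy is computed only over tasks with verified lineage, so its denominator is the size of the verified sample rather than the total number of tasks.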
Future Directions
Continuously improve lineage accuracy through automated validation and manual correction.
Standardize lineage to support both data‑level and application‑level graphs.
Enhance ecosystem support, including generic SQL lineage engines and integration with open‑source or cloud products.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
