How ByteDance Designs Scalable Data Lineage for Big Data Governance

This article explains ByteDance's data lineage architecture, covering data sources, processing pipelines, graph‑based modeling, key application scenarios, quality metrics such as accuracy, coverage and timeliness, and future directions for improving and standardizing lineage across its massive data platform.

Volcano Engine Developer Services

Mar 15, 2022

ByteDance Data Lineage Overview

Data lineage describes the source, transformation, and destination of data across processing stages and is a foundational capability for leveraging data value within an organization.

Data Sources

ByteDance data originates from two kinds of sources: endpoint data, which APP/Web SDKs send through LogService to MQ, and business data, which is produced by operations in APP/Web/third‑party services, stored in RDS, and then streamed to MQ.

Processing Flow

Data in MQ is split, transformed, and routed.

Offline warehouse (Hive) consumes data via HiveSQL or Spark jobs and writes to downstream stores such as ClickHouse.

Real‑time warehouse (MQ) processes data with FlinkSQL or generic Flink jobs, performing side‑joins before writing to various stores.

Typical data outputs include metric systems, reporting systems, and data services.

Application Scenarios

Lineage supports multiple scenarios such as data asset popularity ranking, asset context understanding, impact analysis for developers, root‑cause attribution, link‑state tracking, warehouse governance, and security compliance checks.

Data asset: calculate an asset's reference heat (its popularity) from its downstream lineage.

Data development: notify downstream owners when upstream tasks change.

Attribution analysis: trace issues to upstream tasks.

Governance: track the status of core data links.

Security: ensure downstream assets do not have lower security levels than upstream.
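Two of the scenarios above, reference heat and the security check, reduce to simple graph walks. The following is a minimal sketch, not ByteDance's implementation: the asset names, the `DOWNSTREAM` mapping, the numeric security levels (higher meaning more restricted), and the choice to count all transitive downstream assets as "heat" are assumptions made for illustration.

```python
from collections import deque

# Hypothetical asset-level lineage: each asset maps to the assets derived from it.
DOWNSTREAM = {
    "mq.events": ["hive.dwd_events"],
    "hive.dwd_events": ["hive.dws_daily", "clickhouse.events_agg"],
    "hive.dws_daily": ["report.dau"],
    "clickhouse.events_agg": [],
    "report.dau": [],
}

# Hypothetical security levels; a higher number means more restricted.
SECURITY_LEVEL = {
    "mq.events": 3,
    "hive.dwd_events": 3,
    "hive.dws_daily": 2,
    "clickhouse.events_agg": 3,
    "report.dau": 2,
}


def downstream_assets(start):
    """Breadth-first walk returning every asset reachable downstream of start."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def reference_heat(asset):
    """Assumed definition of heat: the number of transitive downstream assets."""
    return len(downstream_assets(asset))


def security_violations():
    """Return (upstream, downstream) edges where the downstream level is lower."""
    return [
        (up, down)
        for up, children in DOWNSTREAM.items()
        for down in children
        if SECURITY_LEVEL[down] < SECURITY_LEVEL[up]
    ]
```

The same `downstream_assets` walk also serves impact analysis: the owners of the returned assets are the ones to notify when the starting asset's task changes.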

Overall Design

The system consists of three parts: task ingestion, lineage parsing, and data export.

Task Ingestion

Task ingestion provides two pipelines: a near‑real‑time pipeline, in which tasks publish their changes to MQ, and an offline pipeline, which periodically pulls full or incremental task information through an API.

Lineage Parsing

Lineage parsing defines a unified LineageInfo model. Each TaskType (SQL, DTS, generic) has its own parser, and fallback strategies apply when a parser cannot produce a result.
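The parser-per-TaskType idea with a fallback can be sketched as a small dispatch table. This is an illustration under stated assumptions, not the actual system: the fields of `LineageInfo`, the task dictionary keys (`sql`, `source`, `sink`, `declared_inputs`, `declared_outputs`), and the fallback of trusting owner-declared lineage are all hypothetical. The regex-based SQL parsing is a deliberate simplification; a production parser would work on the SQL syntax tree.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LineageInfo:
    """Hypothetical unified result shared by every parser."""
    task_id: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    parser: str = "unknown"


def parse_sql(task):
    """Naive regex extraction of source and target tables from one statement."""
    sql = task["sql"]
    outputs = re.findall(
        r"insert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", sql, re.I
    )
    inputs = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    if not outputs:
        raise ValueError("no output table found")
    return LineageInfo(task["id"], sorted(set(inputs)), outputs, "sql")


def parse_dts(task):
    """A data-transfer task is assumed to declare one source and one sink."""
    return LineageInfo(task["id"], [task["source"]], [task["sink"]], "dts")


def parse_declared(task):
    """Fallback: use whatever lineage the task owner declared, possibly none."""
    return LineageInfo(
        task["id"],
        list(task.get("declared_inputs", [])),
        list(task.get("declared_outputs", [])),
        "declared",
    )


PARSERS = {"sql": parse_sql, "dts": parse_dts}


def parse_task(task):
    """Dispatch on task type; fall back when no parser exists or parsing fails."""
    parser = PARSERS.get(task.get("type"))
    if parser is not None:
        try:
            return parser(task)
        except (KeyError, ValueError):
            pass  # fall through to the fallback strategy
    return parse_declared(task)
```

Recording which parser produced each result (the `parser` field) makes it possible to measure later how much lineage comes from the fallback path.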

Data Export

Exports LineageInfo to the Data Catalog via API, MQ incremental messages, or offline warehouse exports.

Data Model

The lineage data model is a graph with two node types (data nodes and task nodes) and two edge types (data‑to‑task edges for consumption and task‑to‑data edges for production). Column‑level lineage is modeled within the same graph by adding redundant task nodes.
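A minimal sketch of this bipartite model follows. The class and method names are hypothetical, and the treatment of column-level lineage is only one plausible reading of "redundant task nodes": the task is duplicated once per output column (here named `job_42#total`), so that column nodes connect through their own copy of the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # "data" or "task"
    name: str


class LineageGraph:
    """Directed graph whose edges always join a data node and a task node."""

    def __init__(self):
        self.edges = set()  # set of (source Node, destination Node)

    def add_consumption(self, data, task):
        """Data-to-task edge: the task reads the data."""
        self.edges.add((Node("data", data), Node("task", task)))

    def add_production(self, task, data):
        """Task-to-data edge: the task writes the data."""
        self.edges.add((Node("task", task), Node("data", data)))

    def upstream_data(self, data):
        """Names of the data nodes one task hop upstream of the given data node."""
        target = Node("data", data)
        tasks = {src for src, dst in self.edges if dst == target}
        return sorted(src.name for src, dst in self.edges if dst in tasks)


graph = LineageGraph()

# Table-level lineage: job_42 reads ods.orders and writes dw.daily.
graph.add_consumption("ods.orders", "job_42")
graph.add_production("job_42", "dw.daily")

# Column-level lineage (assumed encoding): a redundant copy of the task,
# scoped to one output column, links the contributing input columns.
graph.add_consumption("ods.orders.amount", "job_42#total")
graph.add_consumption("ods.orders.discount", "job_42#total")
graph.add_production("job_42#total", "dw.daily.total")
```

With this encoding, table-level and column-level queries use the same traversal: `graph.upstream_data("dw.daily")` returns the source table, and `graph.upstream_data("dw.daily.total")` returns the contributing columns.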

Metrics

Three key metrics evaluate lineage quality: accuracy (percentage of tasks whose upstream/downstream matches actual lineage), coverage (fraction of assets with at least one lineage link), and timeliness (end‑to‑end latency from task change to lineage store update).
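The three metrics can be written down directly from their definitions. The sketch below is illustrative only: the function names, the input shapes (dictionaries of task lineage, a list of asset-to-asset edges), and the use of a hand-verified sample as ground truth for accuracy are assumptions, since the article does not say how actual lineage is obtained.

```python
from datetime import datetime


def accuracy(parsed, actual):
    """Fraction of verified tasks whose parsed lineage equals the actual lineage.

    Both arguments map task id -> (frozenset of inputs, frozenset of outputs);
    `actual` is assumed to be a manually verified sample.
    """
    if not actual:
        return 0.0
    correct = sum(1 for task, truth in actual.items() if parsed.get(task) == truth)
    return correct / len(actual)


def coverage(assets, edges):
    """Fraction of assets that appear in at least one lineage edge."""
    assets = set(assets)
    if not assets:
        return 0.0
    linked = {asset for edge in edges for asset in edge}
    return len(assets & linked) / len(assets)


def timeliness_seconds(changed_at, stored_at):
    """End-to-end latency from a task change to the lineage store update."""
    return (stored_at - changed_at).total_seconds()
```

For example, if two of four verified tasks parse correctly, `accuracy` returns 0.5; if three of four assets appear in some edge, `coverage` returns 0.75.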

Future Directions

Continuously improve lineage accuracy through automated validation and manual correction.

Standardize lineage to support both data‑level and application‑level graphs.

Enhance ecosystem support, including generic SQL lineage engines and integration with open‑source or cloud products.
