
How Unified Metadata Lineage Transforms Big Data Governance and Security

This article introduces the comprehensive design and evolution of a unified metadata lineage platform for big data, covering background, data processing chain, lineage models, system architecture, quality metrics, application scenarios, and future plans to enhance data governance, quality, and security.

Bilibili Tech

Big Data Lineage Overview

As the company's business expands, the amount of data entering the big data platform increases, making the relationships between data sources, outputs, and usage increasingly complex. Building metadata lineage is the most effective way to clarify these relationships, showing data origins, flows, and change history, and supporting data governance, quality, and security.

Background & Goals

In the early stage of metadata lineage construction, lineage information was scattered across data synchronization, development, and application modules. Missing lineage from any module could cause unforeseen downstream impacts during data transformation, affect quality baselines, and hinder security monitoring. To address this, a unified metadata lineage platform was created to collect, model, and store lineage centrally, allowing users to view the full data processing chain from a global perspective.

Data Processing Full Chain

The data processing chain consists of three stages:

Data Ingestion

Event-tracking data from app, web, and server clients, reported as messages and parsed into the platform.

Business data from MySQL, TiDB, Boss, Taishan, etc., synchronized to Hive/ClickHouse.

In‑warehouse Processing

Offline processing via Spark SQL, Spark JAR, Presto SQL, Python scripts.

Real‑time processing via Flink SQL, Flink JAR on Kafka streams.

Data Application / Export

Processed data is used in metrics services, reporting platforms, AI training, etc.

Processed data can be exported back to business storage for further use.

This chain represents the core lineage of the big data platform, from ingestion to application.

Lineage Models

Two standard lineage models were developed:

Model 1: Represents data processing pipelines where source and target are entity nodes and builder is the processing logic edge, forming a complete processing lineage.

Model 2: Represents dependency between two entities, where the target (or its builder) depends on the source.

Both models are stored in a relational database for later querying and graph synchronization.
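As a concrete illustration of the two models and their relational storage, here is a minimal Python sketch. The class, table, and column names are illustrative assumptions, not the platform's actual schema; SQLite stands in for whatever relational database the platform uses.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class ProcessingLineage:   # Model 1: source -> builder -> target
    source: str            # upstream entity, e.g. "hive.ods.events"
    target: str            # downstream entity, e.g. "hive.dwd.events"
    builder: str           # processing logic edge, e.g. a Spark SQL task id

@dataclass
class DependencyLineage:   # Model 2: target (or its builder) depends on source
    source: str            # the entity being depended on
    target: str            # the dependent entity or builder

def init_store(conn: sqlite3.Connection) -> None:
    """Create one relational table per lineage model."""
    conn.execute("""CREATE TABLE IF NOT EXISTS processing_lineage
                    (source TEXT, target TEXT, builder TEXT)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS dependency_lineage
                    (source TEXT, target TEXT)""")

def save(conn: sqlite3.Connection, edge) -> None:
    """Persist a lineage edge into the table matching its model."""
    if isinstance(edge, ProcessingLineage):
        conn.execute("INSERT INTO processing_lineage VALUES (?, ?, ?)",
                     (edge.source, edge.target, edge.builder))
    else:
        conn.execute("INSERT INTO dependency_lineage VALUES (?, ?)",
                     (edge.source, edge.target))
```

Storing both models in plain relational tables keeps writes cheap and simple; graph-shaped queries are served later by synchronizing these rows into a graph store, as the article notes.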

Lineage Evolution

Since 2021, the platform has iteratively improved lineage coverage:

Before 2021 – chaotic period with fragmented table‑level lineage.

2021‑2023 – initial unified lineage system covering table‑level and field‑level lineage.

2023‑2024 – expansion to task‑level lineage, real‑time field lineage via Sqlscan, and a sample‑case library for error correction.

2024‑present – leveraging full‑linkage lineage for governance, cost reduction, and security, including automatic sensitive‑field tagging and operator‑level lineage.

As of 2025‑05‑21, the platform tracks 14.72 million lineage records across 24 types, with daily changes exceeding 1.4 million (10% of total).

Lineage System Architecture

Initial Architecture & Issues

Early on, lineage was reported by business teams via an SDK, sent to Kafka, and consumed asynchronously. Advantages: near-real-time updates and low implementation cost. Drawbacks: operational burden on each reporting team, missed reports, Kafka message loss, and queue blocking.

Architecture Evolution

Version V1 (2022) shifted to a pull‑based collection service with standardized interfaces, supporting incremental pulls and automatic back‑fill based on startOffset. It introduced priority queues (P0 < 20 s, P1 < 5 min, P2 < 1 h) and multi‑level storage.
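The SLA targets (P0 < 20 s, P1 < 5 min, P2 < 1 h) come from the article; the dispatch code below is only a sketch of how such a tiered pull queue might look, using Python's standard `heapq`. Class and method names are assumptions.

```python
import heapq
import itertools

# SLA targets per priority tier, as stated in the article.
SLA_SECONDS = {"P0": 20, "P1": 300, "P2": 3600}

class LineagePullQueue:
    """Tiered queue: P0 pulls dispatch before P1, P1 before P2."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a tier

    def submit(self, priority: str, task: str) -> None:
        rank = list(SLA_SECONDS).index(priority)  # P0 -> 0, P1 -> 1, P2 -> 2
        heapq.heappush(self._heap, (rank, next(self._counter), task))

    def next_task(self) -> str:
        """Pop the highest-priority (then oldest) pending pull task."""
        return heapq.heappop(self._heap)[2]
```

In the real service each tier would also carry a deadline derived from its SLA, so that monitoring can flag pulls that risk missing the 20 s / 5 min / 1 h targets.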

Version V2 added deep SQL AST parsing to extract field‑level, operator‑level, and UDF lineage, integrating HDFS audit logs for non‑SQL tasks and a hierarchical queue mechanism.

Operator‑Level Lineage Parsing

SQL is parsed into an AST; recursive traversal extracts operators such as FROM, JOIN, and WHERE, enabling fine-grained lineage. Complex SQL features (CTEs, temporary views, nested subqueries) are also handled.
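The platform's parser walks a full AST; the stdlib sketch below is deliberately much simpler, pulling only table-level lineage out of a plain INSERT ... SELECT with regexes to show the shape of the extraction. It would not survive CTEs or nested subqueries, which is exactly why real parsing needs an AST.

```python
import re

def table_lineage(sql: str):
    """Return (target_table, [source_tables]) for a simple INSERT...SELECT.

    A toy stand-in for AST-based parsing: the target comes from the
    INSERT clause, the sources from every FROM/JOIN clause.
    """
    target = re.search(r"INSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)",
                       sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return (target.group(1) if target else None, sources)
```

Feeding it `INSERT OVERWRITE TABLE dw.play SELECT ... FROM ods.play a JOIN ods.user b ...` yields the target `dw.play` and sources `ods.play` and `ods.user`, i.e. one Model 1 processing edge per source.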

Non‑SQL Task Lineage

Two approaches are used: manual lineage entry during task registration, and trace‑ID injection with NameNode log correlation to automatically infer read/write paths. This covers about 60% of non‑SQL task chains.
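The trace-ID approach above can be sketched as follows: a trace id injected into the task is later matched against HDFS NameNode audit log lines to infer which paths the task read and wrote. The log format and field names here are assumptions for illustration, not the platform's actual audit schema.

```python
def correlate(audit_lines, trace_marker):
    """Return {'read': [...], 'write': [...]} for lines carrying the marker.

    Assumes key=value audit lines; 'open' commands are reads, while
    'create'/'rename' commands are writes.
    """
    result = {"read": [], "write": []}
    for line in audit_lines:
        if trace_marker not in line:
            continue  # line belongs to some other task
        fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
        op, path = fields.get("cmd"), fields.get("src")
        if op == "open":
            result["read"].append(path)
        elif op in ("create", "rename"):
            result["write"].append(path)
    return result
```

From the inferred read and write paths, the corresponding Hive tables can be resolved and emitted as ordinary table-level lineage edges, which is how the remaining non-SQL tasks join the unified graph.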

Lineage Quality

Timeliness

SQL task lineage is refreshed every 30 seconds (seconds‑level), while non‑SQL task lineage is refreshed daily (day‑level). Application lineage can be customized via cron expressions.

Coverage

Table‑level coverage reaches 96.81% for data‑platform tasks and 68.2% for application entities. Field‑level coverage is 54.66% for platform tasks and 81.59% for applications.

Accuracy

Based on a sample‑case library, the current lineage accuracy is 92.3%.

Application Scenarios

Data Discovery & Impact Analysis – Users can query upstream/downstream lineage for any table or field, visualize full‑linkage impact, and quickly identify responsible owners.
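Downstream impact analysis reduces to a graph traversal over lineage edges. A minimal sketch, assuming table-level edges of the form (source, target); the table names are made up for the example.

```python
from collections import defaultdict, deque

def downstream_impact(edges, start):
    """BFS from a changed entity to every affected downstream entity."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:   # avoid revisiting shared downstreams
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Running the same traversal over reversed edges answers the upstream question ("where does this field come from?"); joining the result with ownership metadata is what lets the platform surface responsible owners directly.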

Data Quality – Baseline guarantees and quality checks are tied to lineage, ensuring high‑priority tasks receive resources and anomalies are blocked downstream.

Data Security – Field‑level and operator‑level lineage propagate sensitivity tags, enabling audit and enforcement of security policies for reports and tables.
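Tag propagation along field-level lineage can be sketched the same way: any field derived from a tagged source field inherits its sensitivity tags. The field names and the "PII" tag below are illustrative assumptions.

```python
from collections import defaultdict, deque

def propagate_tags(field_edges, seed_tags):
    """Spread each seed field's tags to all downstream fields.

    field_edges: iterable of (source_field, derived_field) pairs.
    seed_tags:   {field: set_of_tags} for manually or automatically
                 tagged source fields.
    """
    graph = defaultdict(list)
    for src, dst in field_edges:
        graph[src].append(dst)
    tags = {field: set(ts) for field, ts in seed_tags.items()}
    for field, ts in seed_tags.items():
        queue = deque([field])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if not ts <= tags.setdefault(nxt, set()):
                    tags[nxt] |= ts      # new tags arrived: keep walking
                    queue.append(nxt)
    return tags
```

Once tags reach the leaf entities (reports, exported tables), security policies such as masking or access audits can be enforced wherever a sensitive tag is present, without anyone tagging downstream fields by hand.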

Data Governance – Lineage helps locate high‑value low‑frequency data, identify idle assets, and assign ownership, reducing storage costs and improving governance efficiency.

Future Plans

Lineage Infrastructure Roadmap

Row‑level lineage for event‑level tracking, extending beyond current table/field/operator granularity.

Broader scenario coverage, including cross‑department AI training pipelines and service‑level data usage.

Lineage Application Roadmap

Warehouse model governance to eliminate redundant models and low‑efficiency pipelines.

AI model training chain governance to trace samples, features, and models, enabling cleanup of unused data.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
