
Douyin Group Data Asset Management Platform: Comprehensive Data Lineage Overview and Practices

This article presents a detailed overview of Douyin Group's Data Asset Management Platform, focusing on the evolution, architecture, modeling, metrics, and application scenarios of its large‑scale data lineage system, and outlines future directions for full‑coverage, fine‑grained lineage capabilities.

Big Data Technology & Architecture

The article introduces Douyin Group's one‑stop Data Asset Management Platform, emphasizing the shift from traditional metadata to a broader concept of data assets to better serve precise data discovery needs across complex business scenarios.

The platform supports diverse data sources, collects metadata into a unified lake, and enriches assets through active metadata, evaluation, and AI‑driven search, enabling portal, recommendation, and other product capabilities.

Four main topics are covered:

Overall introduction of Douyin Group's data lineage.

Architecture of the lineage system.

Application scenarios of lineage.

Future outlook.

1. Data Lineage Overview

The goal is to build full‑coverage, real‑time, accurate lineage that supports a wide range of downstream scenarios. Lineage is treated as the core of metadata and as a prerequisite for an efficient data platform.

Key motivations: end‑to‑end visibility of data pipelines, quality assurance, security, and cost reduction.

Lineage covers three categories: source/ingestion lineage, production chain lineage (real‑time and offline warehouses), and application‑side lineage, with both table‑level and field‑level granularity.

2. Lineage Model Abstraction

Two modeling approaches are discussed: a static node‑edge model (fast reads, slow updates) and a dynamic task‑centric model (fast updates, slower reads). A generalized model introduces three entity types—DataStore (e.g., Hive table), Column, and Process (task)—and six relationship types, balancing storage and query efficiency.
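As an illustration of the task‑centric model, the sketch below stores lineage per Process and derives entity‑to‑entity edges at read time. It assumes only the three entity types named above; the class `LineageGraph`, its method names, and the example table and task names are hypothetical, and the six relationship types are not enumerated because the article does not list them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    name: str  # e.g. a Hive table


@dataclass(frozen=True)
class Column:
    store: DataStore
    name: str


@dataclass(frozen=True)
class Process:
    name: str  # a task that reads and writes entities


class LineageGraph:
    """Hypothetical task-centric store: edges hang off the Process."""

    def __init__(self):
        self.inputs = defaultdict(set)   # Process -> entities it reads
        self.outputs = defaultdict(set)  # Process -> entities it writes

    def record(self, process, reads, writes):
        # Fast update: only this process's edges are replaced.
        self.inputs[process] = set(reads)
        self.outputs[process] = set(writes)

    def upstream(self, entity):
        # Slower read: scan processes to derive entity-to-entity edges.
        result = set()
        for process, written in self.outputs.items():
            if entity in written:
                result |= self.inputs.get(process, set())
        return result


# Usage with made-up names: one task, table-level and column-level lineage.
orders = DataStore("ods.orders")
daily = DataStore("dw.orders_daily")
job = Process("daily_orders_job")
graph = LineageGraph()
graph.record(
    job,
    reads=[orders, Column(orders, "id")],
    writes=[daily, Column(daily, "order_id")],
)
```

A static node‑edge model would instead store the derived entity‑to‑entity edges directly, which makes `upstream` a lookup but forces every task change to rewrite those edges.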

3. Lineage Quality Metrics

Three primary indicators form a "Lineage Quality Score": coverage (the share of tasks whose lineage is parsed), accuracy (the share of parsed lineage that is correct), and completeness (the share of actual relationships that the parsed lineage captures).
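One way to make these three indicators concrete is sketched below. The formulas are assumptions rather than the platform's definitions: coverage is computed over tasks, while accuracy and completeness are computed as precision and recall of parsed edges against a hand‑verified reference set. The equal weighting in `score` is also hypothetical.

```python
from dataclasses import dataclass


def ratio(numerator, denominator):
    # Guard against empty inputs instead of dividing by zero.
    return numerator / denominator if denominator else 0.0


@dataclass
class LineageQuality:
    coverage: float
    accuracy: float
    completeness: float

    def score(self, weights=(1 / 3, 1 / 3, 1 / 3)):
        # Assumed aggregation: a weighted average of the three indicators.
        w_cov, w_acc, w_comp = weights
        return (w_cov * self.coverage
                + w_acc * self.accuracy
                + w_comp * self.completeness)


def evaluate(all_tasks, parsed_tasks, parsed_edges, reference_edges):
    """All arguments are sets; edges are (source, target) tuples."""
    correct = parsed_edges & reference_edges
    return LineageQuality(
        coverage=ratio(len(parsed_tasks & all_tasks), len(all_tasks)),
        accuracy=ratio(len(correct), len(parsed_edges)),
        completeness=ratio(len(correct), len(reference_edges)),
    )
```

In practice the reference edges would have to come from sampling and manual verification, since the true lineage is not known in full.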

4. Lineage Ecosystem

The ecosystem includes data source ingestion, metadata collection, storage (graph databases like JanusGraph, Neo4j, NebulaGraph), and analysis services for both real‑time and offline use cases.

5. Unified Parsing Service

Parsing is critical; the solution combines ANTLR (lexing and grammar parsing) with Calcite (SQL optimization) to handle multiple dialects and complex scripts, converting parse trees into Calcite SqlNode/RelNode representations from which lineage is extracted.
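To show what the parsing service produces, the toy sketch below extracts table‑level lineage from a single INSERT statement. It is a deliberately simplified stand‑in that uses regular expressions, not the ANTLR and Calcite pipeline described above; the function name `table_lineage` and the example table names are hypothetical. It ignores CTEs, subquery aliases, comments, and string literals, which is exactly why a real grammar‑based parser is needed.

```python
import re

# Target of INSERT INTO / INSERT OVERWRITE [TABLE] <name>.
_TARGET = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Any identifier that directly follows FROM or JOIN.
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (target_table, sorted list of source tables)."""
    target = _TARGET.search(sql)
    if target is None:
        raise ValueError("no INSERT target found")
    sources = sorted({m.group(1) for m in _SOURCE.finditer(sql)})
    return target.group(1), sources


sql = """
INSERT OVERWRITE TABLE dw.orders_daily
SELECT o.id, u.name
FROM ods.orders o
JOIN ods.users u ON o.uid = u.id
"""
target, sources = table_lineage(sql)
# target is "dw.orders_daily"; sources is ["ods.orders", "ods.users"]
```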

6. Lineage Access Services

Production lineage: extracts table‑to‑table and column‑to‑column dependencies, including operator‑level lineage.

Cross‑region lineage: aggregates local lineage and propagates it across regions via a message bus, handling global tables via catalog lookup.

Application lineage: captures end‑to‑end relationships from low‑code platforms, custom services, HTTP/RPC calls, and trace logs, addressing challenges of log volume and accuracy.

7. Application Scenarios

Four major domains are highlighted:

Data development – impact assessment, field‑level debugging, rapid task testing, upstream change alerts, and model migration.

Data governance – low‑value asset identification, cost calculation, timeliness, accuracy, and security assurance.

Data assets – leveraging lineage for full‑chain efficiency.

Data security – detecting sensitive data propagation.
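Impact assessment and sensitive‑data propagation both reduce to a downstream traversal of the lineage graph. The sketch below is a minimal breadth‑first search over field‑level edges; the adjacency‑dict representation, the function name `downstream`, and all column names are assumptions made for illustration.

```python
from collections import deque


def downstream(edges, start):
    """Return every node reachable from `start`.

    `edges` maps a node to the nodes that are directly derived from it.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up field-level lineage: a phone number flowing into reports.
edges = {
    "ods.users.phone": ["dwd.user_profile.phone"],
    "dwd.user_profile.phone": ["ads.report.contact", "ads.export.phone"],
    "ods.users.id": ["dwd.user_profile.uid"],
}
affected = downstream(edges, "ods.users.phone")
```

For impact assessment, `affected` is the set of fields to check before changing the start column; for security, it is the set of fields that may need to inherit the start column's sensitivity label.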

8. Future Outlook

Plans include standardizing lineage capabilities, opening higher‑dimensional APIs for community contribution, achieving finer granularity (row‑level lineage), and further exploiting lineage for quality, efficiency, and security improvements.

The platform aims to evolve into a comprehensive data asset solution within the Volcano Engine VeDI ecosystem, extending its value both internally and to external B2B markets.

Tags: big data, metadata, data lineage, Data Asset Management
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
