Douyin Group Data Asset Management Platform: Comprehensive Data Lineage Overview and Practices
This article presents a detailed overview of Douyin Group's Data Asset Management Platform, focusing on the evolution, architecture, modeling, metrics, and application scenarios of its large‑scale data lineage system, and outlines future directions for full‑coverage, fine‑grained lineage capabilities.
It introduces Douyin Group's one‑stop Data Asset Management Platform and emphasizes the shift from traditional metadata to the broader concept of data assets, which better serves precise data discovery across complex business scenarios.
The platform supports diverse data sources, collects metadata into a unified lake, and enriches assets through active metadata, evaluation, and AI‑driven search, enabling portal, recommendation, and other product capabilities.
Four main topics are covered:
Overall introduction of Douyin Group's data lineage.
Architecture of the lineage system.
Application scenarios of lineage.
Future outlook.
1. Data Lineage Overview
The goal is to build full‑coverage, real‑time, accurate lineage that supports a wide range of scenarios. Lineage is treated as the core of metadata and as essential to an efficient data platform.
Key motivations: link visibility, quality assurance, security, and cost reduction.
Lineage covers three categories: source/ingestion lineage, production chain lineage (real‑time and offline warehouses), and application‑side lineage, with both table‑level and field‑level granularity.
2. Lineage Model Abstraction
Two modeling approaches are discussed: a static node‑edge model (fast reads, slow updates) and a dynamic task‑centric model (fast updates, slower reads). A generalized model introduces three entity types, DataStore (e.g., a Hive table), Column, and Process (a task), along with six relationship types, balancing storage and query efficiency.
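The generalized model can be sketched roughly as follows. This is a minimal illustration, not the platform's actual schema: the three entity types come from the text, but the relation names ("consumed_by", "produces"), the LineageGraph class, and the table and task names are assumptions made for the example, and only two of the six relationship types are represented.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    DATA_STORE = "DataStore"  # e.g. a Hive table
    COLUMN = "Column"
    PROCESS = "Process"       # a task


@dataclass(frozen=True)
class Entity:
    type: EntityType
    name: str


@dataclass
class LineageGraph:
    # Each edge is (source, relation, target); relation names are hypothetical.
    edges: set = field(default_factory=set)

    def add_process(self, process, inputs, outputs):
        """Record one task as input -> process -> output edges."""
        for src in inputs:
            self.edges.add((src, "consumed_by", process))
        for dst in outputs:
            self.edges.add((process, "produces", dst))

    def upstream(self, target):
        """Entities that feed `target` through any single process."""
        producers = {s for (s, r, t) in self.edges
                     if r == "produces" and t == target}
        return {s for (s, r, t) in self.edges
                if r == "consumed_by" and t in producers}


# Hypothetical tables and task, for illustration only.
ods = Entity(EntityType.DATA_STORE, "ods.orders")
dwd = Entity(EntityType.DATA_STORE, "dwd.orders")
task = Entity(EntityType.PROCESS, "etl_orders")

graph = LineageGraph()
graph.add_process(task, inputs=[ods], outputs=[dwd])
upstream_of_dwd = graph.upstream(dwd)
```

Routing every dependency through a Process node keeps updates cheap, since re-parsing a task only rewrites that task's edges, at the cost of a two-hop lookup when reading table-to-table lineage.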
3. Lineage Quality Metrics
Three primary indicators make up a "Lineage Quality Score": coverage (the share of tasks whose lineage is parsed), accuracy (whether the parsed lineage is correct), and completeness (whether all relationships are captured).
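One plausible way to compute the three indicators is shown below. The exact formulas are not given in the article, so the definitions here are assumptions: accuracy is taken as the fraction of parsed edges that are correct, and completeness as the fraction of true edges that were found. The function name and the sample data are hypothetical, and how the three values roll up into a single score is left open.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def lineage_quality(parsed_tasks, total_tasks, parsed_edges, true_edges):
    """Return the three indicators as values in [0, 1].

    parsed_edges: set of (source, target) pairs the parser produced.
    true_edges:   set of (source, target) pairs from a verified sample.
    """
    correct = parsed_edges & true_edges
    return {
        "coverage": ratio(parsed_tasks, total_tasks),
        "accuracy": ratio(len(correct), len(parsed_edges)),
        "completeness": ratio(len(correct), len(true_edges)),
    }


# Hypothetical sample: 8 of 10 tasks parsed; 3 of 4 parsed edges are
# correct; 3 of 5 true edges were found.
parsed = {("a", "b"), ("a", "c"), ("b", "d"), ("x", "y")}
truth = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")}
scores = lineage_quality(8, 10, parsed, truth)
```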
4. Lineage Ecosystem
The ecosystem includes data source ingestion, metadata collection, storage (graph databases such as JanusGraph, Neo4j, or NebulaGraph), and analysis services for both real‑time and offline use cases.
5. Unified Parsing Service
Parsing is critical. The solution combines ANTLR (lexical and grammar parsing) with Calcite (SQL optimization) to handle multiple dialects and complex scripts, converting parse trees into Calcite's SqlNode/RelNode representations for lineage extraction.
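To make the input and output of lineage extraction concrete, here is a deliberately simplified stand-in that uses regular expressions instead of ANTLR and Calcite. It is not how the platform parses SQL; it only handles a plain INSERT ... SELECT with FROM/JOIN clauses, and the function name and sample statement are hypothetical.

```python
import re

# Matches the target of INSERT INTO / INSERT OVERWRITE [TABLE] <name>.
_TARGET = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Matches table names that directly follow FROM or JOIN.
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return a set of (source_table, target_table) pairs for one statement."""
    targets = set(_TARGET.findall(sql))
    sources = set(_SOURCE.findall(sql)) - targets
    return {(src, dst) for src in sources for dst in targets}


sample_sql = """
INSERT OVERWRITE TABLE dwd.orders
SELECT o.id, u.name
FROM ods.orders o
JOIN ods.users u ON o.user_id = u.id
"""
pairs = table_lineage(sample_sql)
```

A pattern-based approach like this breaks down on CTEs, nested subqueries, dialect-specific syntax, and column-level lineage, which is the gap a real grammar and a relational representation are meant to close.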
6. Lineage Access Services
Production lineage: extracts table‑to‑table and column‑to‑column dependencies, including operator‑level lineage.
Cross‑region lineage: aggregates local lineage and propagates it across regions via a message bus, handling global tables via catalog lookup.
Application lineage: captures end‑to‑end relationships from low‑code platforms, custom services, HTTP/RPC calls, and trace logs, addressing challenges of log volume and accuracy.
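The cross‑region flow described above can be sketched as follows, with an in-memory queue standing in for the message bus. Everything named here is an assumption for illustration: the region codes, the table names, the GLOBAL_CATALOG mapping, and the message shape are hypothetical, and a real deployment would use an actual message bus and catalog service.

```python
from collections import deque

# Hypothetical catalog: maps (region, local table name) to a global table id.
GLOBAL_CATALOG = {
    ("cn", "dw.users"): "global.users",
    ("sg", "dw.users_replica"): "global.users",
}


def resolve(region, table):
    """Return the global id for a table, or a region-qualified local name."""
    return GLOBAL_CATALOG.get((region, table), f"{region}:{table}")


def publish(bus, region, edges):
    """Each region pushes its locally aggregated edges onto the bus."""
    bus.append({"region": region, "edges": list(edges)})


def merge(bus):
    """Drain the bus and build one cross-region edge set."""
    merged = set()
    while bus:
        msg = bus.popleft()
        for src, dst in msg["edges"]:
            merged.add((resolve(msg["region"], src),
                        resolve(msg["region"], dst)))
    return merged


bus = deque()
publish(bus, "cn", [("ods.users", "dw.users")])
publish(bus, "sg", [("dw.users_replica", "ads.user_report")])
merged = merge(bus)
```

In this example the two regional chains join through the shared global table id, so a consumer of the merged set can walk from the source table in one region to the report table in the other.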
7. Application Scenarios
Four major domains are highlighted:
Data development – impact assessment, field‑level debugging, rapid task testing, upstream change alerts, and model migration.
Data governance – low‑value asset identification, cost calculation, timeliness, accuracy, and security assurance.
Data assets – leveraging lineage for full‑chain efficiency.
Data security – detecting sensitive data propagation.
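Impact assessment and sensitive-data propagation both reduce to a downstream traversal of the lineage graph. The sketch below uses a breadth-first search over column-level edges; the function name and the column names are hypothetical, and a production system would run an equivalent query against its graph store rather than an in-memory list.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # drop the start node if a cycle led back to it
    return seen


# Hypothetical column-level lineage edges.
column_edges = [
    ("ods.users.phone", "dwd.users.phone"),
    ("dwd.users.phone", "ads.report.contact"),
    ("ods.users.id", "dwd.users.id"),
]
affected = downstream(column_edges, "ods.users.phone")
```

For a schema change, the returned set is the list of columns to review or alert on; for security, it is the set of columns that inherit the sensitivity label of the starting column.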
8. Future Outlook
Plans include standardizing lineage capabilities, opening higher‑level APIs for community contribution, achieving finer granularity (row‑level lineage), and further exploiting lineage for quality, efficiency, and security improvements.
The platform aims to evolve into a comprehensive data asset solution within the Volcano Engine VeDI ecosystem, extending its value both internally and to external B2B markets.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
