Metadata Infrastructure and Governance in Bilibili Data Platform
Bilibili's data platform consolidates scattered metadata into a unified URN-based model stored across TiDB, Elasticsearch, and HugeGraph. It supports batch-pull and embedded collection, flexible SQL-like queries, and comprehensive lineage mapping, and it powers the data-map, lineage-map, and impact-analysis tools. Planned work includes expanded quality assurance and a self-service metadata dictionary.
Shen Wangyang, a senior development engineer at Bilibili, is responsible for the data platform's metadata, data operation, and data management. The team focuses on metadata collection, lineage, data maps, modeling tools, and governance tools.
Background
Metadata is data that describes the data platform itself, such as scheduling task information, offline Hive tables, real-time topics, field definitions, storage details, quality metrics, and hotness indicators. In the platform's early stage, this metadata was scattered across various subsystems (e.g., HiveMetaStore, scheduling DBs), and there was little demand for unified collection and management.
As the platform grew, the volume of tables and tasks increased, leading to higher data management and storage costs. New scenarios such as model governance, impact analysis, and duplicate construction emerged, requiring a unified metadata service for data discovery and governance.
Goals
The aim is to unify metadata through a single model, collection method, storage format, and query interface, thereby reducing custom development, improving flexibility, and lowering maintenance overhead.
System Overview
The architecture consists of metadata collection, unified URN‑based model, storage (TiDB for entities, Elasticsearch for search, HugeGraph for graph traversal), and query services.
Unified Metadata Model
The model satisfies three requirements: unified identification of resources, description of all resource types, and description of relationships among resources. It adopts a URN scheme: urn:datacenter:<resource_type>:<unique_id>. Sixteen resource types are defined. The most important is the table resource, whose unique ID has three segments (source.database.table); a field is identified by a four-segment ID that appends the field name.
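The URN scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the class name Urn, the helper names, and the resource type strings "table" and "field" are assumptions made for the example.

```python
from dataclasses import dataclass

URN_PREFIX = "urn:datacenter"


@dataclass(frozen=True)
class Urn:
    resource_type: str
    unique_id: str

    def __str__(self) -> str:
        return f"{URN_PREFIX}:{self.resource_type}:{self.unique_id}"

    @classmethod
    def parse(cls, text: str) -> "Urn":
        # Split into at most four parts so the unique ID may itself contain ':'.
        parts = text.split(":", 3)
        if (
            len(parts) != 4
            or parts[:2] != ["urn", "datacenter"]
            or not parts[2]
            or not parts[3]
        ):
            raise ValueError(f"not a datacenter URN: {text!r}")
        return cls(parts[2], parts[3])


def table_urn(source: str, database: str, table: str) -> Urn:
    # Three-segment ID: source.database.table ("table" type name is assumed).
    return Urn("table", f"{source}.{database}.{table}")


def field_urn(source: str, database: str, table: str, field: str) -> Urn:
    # Four-segment ID: the table ID plus the field name ("field" is assumed).
    return Urn("field", f"{source}.{database}.{table}.{field}")
```

With these helpers, str(table_urn("hive", "ods", "orders")) produces urn:datacenter:table:hive.ods.orders, and Urn.parse turns that string back into the same value.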
Entity‑Relationship Model
An entity-relationship diagram (shown in the original document) illustrates three concepts: entities, aspects (which keep attributes contributed by different systems separate), and a builderURN on each relationship. The builderURN records which task built the relationship, so lineage built by a task can be managed along with that task's lifecycle.
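One way to picture entities, aspects, and builderURN is the in-memory sketch below. It is an assumption-laden illustration: the class names, the method names, and the fixed relationship type "lineage" are hypothetical, and the real storage is TiDB rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    urn: str
    # Aspect name -> attributes contributed by one source system.
    aspects: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    source_urn: str
    target_urn: str
    rel_type: str
    builder_urn: str  # URN of the task that built this edge


class MetadataStore:
    """Hypothetical in-memory stand-in for the entity/relationship store."""

    def __init__(self) -> None:
        self.entities = {}
        self.relationships = set()

    def upsert_aspect(self, urn: str, aspect: str, attrs: dict) -> None:
        # Each system overwrites only its own aspect, never another system's.
        entity = self.entities.setdefault(urn, Entity(urn))
        entity.aspects[aspect] = dict(attrs)

    def replace_relationships(self, builder_urn: str, edges) -> None:
        # Drop every edge previously built by this task, then add the new
        # ones, so edges the task no longer produces disappear with it.
        kept = {r for r in self.relationships if r.builder_urn != builder_urn}
        new = {Relationship(s, t, "lineage", builder_urn) for s, t in edges}
        self.relationships = kept | new
```

Calling replace_relationships with an empty edge list removes all lineage a task built, which is one simple way to retire lineage when the task is deleted.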
Metadata Collection
Three collection approaches are evaluated:
Batch pull (controlled, monitorable)
Batch push (simpler but less controllable)
Embedded reporting (real‑time, no storage constraints)
The team prefers batch pull for critical data and embedded reporting for non‑core data.
Business logic is maintained by the data source owners, ensuring a single conversion path to the unified model.
Quality assurance includes batch-level checks and global fallback checks, with automated detection, localization, and remediation of issues.
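The source does not spell out what the two kinds of checks contain. The sketch below is only one plausible reading: a batch-level check that validates each pulled batch before it is written, and a global check that reconciles the full set of source URNs against what is stored. The function names, the required-field rule, and the 50% drop threshold are all assumptions.

```python
def check_batch(records, required_fields, previous_count, max_drop_ratio=0.5):
    """Return a list of problems found in one pulled batch (assumed rules)."""
    problems = []
    for index, record in enumerate(records):
        missing = [name for name in required_fields if not record.get(name)]
        if missing:
            problems.append(f"record {index} missing {missing}")
    # A sharp drop in batch size often signals a failed or partial pull.
    if previous_count and len(records) < previous_count * (1 - max_drop_ratio):
        problems.append(
            f"batch size dropped from {previous_count} to {len(records)}"
        )
    return problems


def reconcile(source_urns, stored_urns):
    """Global fallback check: compare everything in the source with the store."""
    source, stored = set(source_urns), set(stored_urns)
    return {
        "missing_in_store": sorted(source - stored),
        "stale_in_store": sorted(stored - source),
    }
```

The reconcile output names the exact URNs that differ, which is the kind of information an automated remediation step would need to re-pull or delete records.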
Metadata Storage
TiDB stores entity and relationship data, Elasticsearch provides full‑text search and ranking, and HugeGraph enables deep graph traversal.
Metadata Query
Two generic query interfaces are provided: entity query and relationship query. A custom SQL parser translates user-friendly SQL-like conditions into engine-specific DSLs. Two example request bodies:
{"page":1,"size":20,"where":"entity_type = 1 and sec_type = 3 and properties.tabName like '%r_ai.ods.recindexing.archive.test%'"}
{"page":1,"size":500,"where":"entity_type = 7","extraProperties":{"t1":"*:$.pgUrn.text_pageName","t2":"7:$.pgUrn.text_userName","t3":"7:$.pgUrn","t4":"*:$.pgUrn.bizCtime","t5":"*:$.dsUrn.sql","t6":"guanyuanCard:$.dsUrn.datasetStatus"}}
These queries support multi‑level association retrieval in a single request.
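To make the translation step concrete, here is a deliberately small Python sketch that turns a SQL-like where string into an Elasticsearch query body. It is not Bilibili's parser. It assumes only `=` and `like` conditions joined by `and`, standard SQL LIKE semantics (`%` for any run of characters, `_` for one character), and 1-indexed pages; the function names are hypothetical.

```python
import re

# One condition: <field> (= | like) <'string' | integer>
_CONDITION = re.compile(
    r"^\s*([\w.]+)\s*(=|like)\s*('[^']*'|-?\d+)\s*$", re.IGNORECASE
)


def _literal(token):
    # Quoted tokens are strings; everything else matched as an integer.
    return token[1:-1] if token.startswith("'") else int(token)


def where_to_es(where):
    """Translate 'a = 1 and b like '%x%'' into an Elasticsearch bool query."""
    filters = []
    # Limitation: a string literal containing ' and ' would be split wrongly.
    for part in re.split(r"\s+and\s+", where, flags=re.IGNORECASE):
        match = _CONDITION.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        field, op, raw = match.groups()
        value = _literal(raw)
        if op == "=":
            filters.append({"term": {field: value}})
        else:
            # SQL LIKE wildcards map to Elasticsearch wildcard syntax.
            # Limitation: literal '*' or '?' in the pattern is not escaped.
            pattern = str(value).replace("%", "*").replace("_", "?")
            filters.append({"wildcard": {field: {"value": pattern}}})
    return {"query": {"bool": {"filter": filters}}}


def build_search_body(request):
    """Add paging to the translated query (page assumed to start at 1)."""
    body = where_to_es(request["where"])
    body["from"] = (request["page"] - 1) * request["size"]
    body["size"] = request["size"]
    return body
```

A production parser would tokenize properly and support `or`, parentheses, and other operators; the point here is only that one user-facing condition syntax can be mapped onto an engine's native DSL.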
Lineage Construction
Lineage is a key focus and is judged on three dimensions: coverage, granularity, and accuracy. Coverage spans offline, real-time, and ingestion pipelines. Granularity ranges from table level to field level; of three implementation options for field-level lineage, the team adopts post-execution dynamic parsing. Row-level lineage is rarely needed.
Applications
Metadata powers several products:
Data Map (search, classification, hotness recommendation)
Lineage Map (visual exploration of data lineage)
Impact Analysis (upstream/downstream impact detection, leveraging field‑level lineage and graph traversal)
These applications handle high query volumes, e.g., about 25,000 page views (PV) for generic queries and about 4,000 PV for data map searches.
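The downstream side of impact analysis can be sketched as a breadth-first traversal over lineage edges. In the platform this traversal runs in HugeGraph; the plain-Python version below only illustrates the idea, and the function name and the (upstream, downstream) edge-list format are assumptions.

```python
from collections import deque


def downstream_impact(edges, start, max_depth=None):
    """Return {node: depth} for every node reachable downstream of start.

    edges is an iterable of (upstream, downstream) pairs, for example
    field-level lineage edges keyed by URN.
    """
    adjacency = {}
    for upstream, downstream in edges:
        adjacency.setdefault(upstream, []).append(downstream)

    depths = {}
    seen = {start}  # guards against cycles in the lineage graph
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                depths[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return depths
```

Because breadth-first search visits nodes in order of distance, each depth is the shortest number of hops from the changed field, which helps rank affected assets by how directly they depend on it. Reversing each edge pair gives the upstream direction.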
Future Plans
Expand metadata quality assurance to more scenarios.
Build a comprehensive metadata dictionary for self‑service queries.
Establish data operation mechanisms to link supply‑side cost/production metrics with consumption‑side usage and impact.
Scale data governance using the existing metadata foundation.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.