Metadata Infrastructure and Governance in Bilibili Data Platform

Bilibili's data platform consolidates scattered metadata into a unified URN-based model stored across TiDB, Elasticsearch, and HugeGraph. It supports batch-pull and embedded collection, flexible SQL-like queries, and comprehensive lineage mapping, and it powers data-map, lineage-map, and impact-analysis tools. Planned work includes expanded quality assurance and self-service dictionaries.

Bilibili Tech

May 24, 2022

Shen Wangyang, a senior development engineer at Bilibili, is responsible for the data platform's metadata, data operation, and data management. The team focuses on metadata collection, lineage, data maps, modeling tools, and governance tools.

Background

Metadata is data derived from the data platform itself, such as scheduling task information, offline Hive tables, real-time topics, field definitions, storage details, quality metrics, and hotness indicators. In the platform's early stage, this metadata was scattered across subsystems (e.g., HiveMetaStore, scheduling DBs), and there was little demand for unified collection and management.

As the platform grew, the number of tables and tasks increased, driving up data management and storage costs. New scenarios such as model governance, impact analysis, and detection of duplicate construction emerged, and they required a unified metadata service for data discovery and governance.

Goals

The aim is to unify metadata through a single model, collection method, storage format, and query interface, thereby reducing custom development, improving flexibility, and lowering maintenance overhead.

System Overview

The architecture consists of four parts: metadata collection, a unified URN-based model, storage (TiDB for entities, Elasticsearch for search, HugeGraph for graph traversal), and query services.

Unified Metadata Model

The model satisfies three requirements: unified identification of resources, description of all resource types, and description of relationships among resources. It adopts a URN scheme: urn:datacenter:<resource_type>:<unique_id>. Sixteen resource types are defined. The most important is the table resource, whose unique ID has three segments (source.database.table); fields use a four-segment ID.
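As a rough illustration of the URN scheme, the following sketch builds and parses identifiers of the form urn:datacenter:<resource_type>:<unique_id>. The class and function names, the resource type string "table", and the sample values are assumptions made for the example; the article does not show the platform's actual code or type names.

```python
from dataclasses import dataclass

URN_PREFIX = "urn:datacenter"


@dataclass(frozen=True)
class Urn:
    resource_type: str  # e.g. "table" (illustrative type name)
    unique_id: str      # e.g. "hive.ods.user_events" (source.database.table)

    def __str__(self) -> str:
        return f"{URN_PREFIX}:{self.resource_type}:{self.unique_id}"

    @classmethod
    def parse(cls, text: str) -> "Urn":
        # Split at most three times so a unique_id containing ':' stays intact.
        parts = text.split(":", 3)
        if len(parts) != 4 or parts[0] != "urn" or parts[1] != "datacenter":
            raise ValueError(f"not a datacenter URN: {text!r}")
        resource_type, unique_id = parts[2], parts[3]
        if not resource_type or not unique_id:
            raise ValueError(f"empty URN component: {text!r}")
        return cls(resource_type, unique_id)


def table_urn(source: str, database: str, table: str) -> Urn:
    # Three-segment table ID as described in the text: source.database.table
    return Urn("table", f"{source}.{database}.{table}")


if __name__ == "__main__":
    urn = table_urn("hive", "ods", "user_events")  # hypothetical table
    print(urn)
    print(Urn.parse(str(urn)) == urn)
```

Keeping the resource type inside the identifier means a consumer can route a URN to the right handler without a lookup, which is one plausible reason for choosing this kind of scheme.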

Entity‑Relationship Model

An entity-relationship diagram (shown in the original document) illustrates the model. Entities carry aspects, which keep attributes contributed by different systems separate. Relationships carry a builderURN, which enables lifecycle management of lineage built by tasks.
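One way to read the aspect and builderURN ideas is sketched below. It assumes that an aspect is a named bag of attributes owned by one system, and that builderURN names the resource (for example a task) that produced a relationship, so that all edges from one builder can be replaced or removed together. All class names, method names, and sample URNs are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    urn: str
    # One attribute dict per contributing system, keyed by aspect name.
    aspects: dict[str, dict] = field(default_factory=dict)

    def put_aspect(self, name: str, attrs: dict) -> None:
        # Replacing one aspect leaves attributes owned by other systems untouched.
        self.aspects[name] = dict(attrs)


@dataclass(frozen=True)
class Relationship:
    src_urn: str
    dst_urn: str
    rel_type: str
    builder_urn: str  # assumed meaning: the resource that produced this edge


class RelationshipStore:
    def __init__(self) -> None:
        self._edges: set[Relationship] = set()

    def replace_built_by(self, builder_urn: str, edges: list[Relationship]) -> None:
        # Drop every edge previously produced by this builder, then add the new set.
        self._edges = {e for e in self._edges if e.builder_urn != builder_urn}
        self._edges.update(edges)

    def remove_builder(self, builder_urn: str) -> None:
        # When the builder (e.g. a task) is deleted, its lineage goes with it.
        self.replace_built_by(builder_urn, [])

    def edges(self) -> set[Relationship]:
        return set(self._edges)


if __name__ == "__main__":
    task = "urn:datacenter:task:42"  # hypothetical task URN
    store = RelationshipStore()
    store.replace_built_by(task, [Relationship("t.a", "t.b", "lineage", task)])
    store.replace_built_by(task, [Relationship("t.a", "t.c", "lineage", task)])
    print(store.edges())
```

Under this reading, a task that changes its SQL simply re-registers its edges, and stale lineage from the previous version disappears without a separate cleanup job.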

Metadata Collection

Three collection approaches are evaluated:

Batch pull (controlled, monitorable)

Batch push (simpler but less controllable)

Embedded reporting (real‑time, no storage constraints)

The team prefers batch pull for critical data and embedded reporting for non‑core data.

The business logic for converting source metadata is maintained by the data source owners, which ensures a single conversion path into the unified model.

Quality assurance includes batch-level checks and global fallback checks, with automated detection, localization, and remediation of issues.
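The article does not say what the batch-level and global fallback checks actually verify. The sketch below is therefore only one plausible interpretation: a batch check that rejects a pulled batch with missing identifiers or a sudden drop in record count, and a global check that compares the full set of URNs at the source against what is stored. The function names, the "urn" record key, and the 50% threshold are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    reasons: list[str]


def check_batch(records: list[dict], previous_count: int,
                max_drop_ratio: float = 0.5) -> CheckResult:
    """Batch-level check run on each pulled batch before it is written."""
    reasons: list[str] = []
    missing = [r for r in records if not r.get("urn")]
    if missing:
        reasons.append(f"{len(missing)} record(s) without a urn")
    if previous_count > 0:
        drop = (previous_count - len(records)) / previous_count
        if drop > max_drop_ratio:
            reasons.append(f"record count dropped by {drop:.0%}")
    return CheckResult(ok=not reasons, reasons=reasons)


def global_fallback_diff(source_urns: set[str],
                         stored_urns: set[str]) -> dict[str, set[str]]:
    """Global check: full comparison that catches what batch checks missed."""
    return {
        "missing_in_store": source_urns - stored_urns,
        "stale_in_store": stored_urns - source_urns,
    }


if __name__ == "__main__":
    print(check_batch([{"urn": "a"}], previous_count=10))
    print(global_fallback_diff({"a", "b"}, {"b", "c"}))
```

The two outputs map naturally onto the remediation step the text mentions: a failed batch can be retried or held back, while the global diff lists exactly which entities to backfill or delete.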

Metadata Storage

TiDB stores entity and relationship data, Elasticsearch provides full‑text search and ranking, and HugeGraph enables deep graph traversal.

Metadata Query

Two generic query interfaces are provided: entity query and relationship query. A custom SQL parser translates user‑friendly SQL‑like conditions into engine‑specific DSLs.

{"page":1,"size":20,"where":"entity_type = 1 and sec_type = 3 and properties.tabName like '%r_ai.ods.recindexing.archive.test%'"}

{"page":1,"size":500,"where":"entity_type = 7","extraProperties":{"t1":"*:$.pgUrn.text_pageName","t2":"7:$.pgUrn.text_userName","t3":"7:$.pgUrn","t4":"*:$.pgUrn.bizCtime","t5":"*:$.dsUrn.sql","t6":"guanyuanCard:$.dsUrn.datasetStatus"}}

These queries support multi‑level association retrieval in a single request.
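To make the translation step concrete, here is a minimal sketch that parses a where string of the kind shown above (conditions joined by and, using = and like) and emits an Elasticsearch-style bool query. The article does not show the real parser or its target DSLs, so the grammar, the function names, and the choice of term and wildcard clauses are assumptions. For simplicity only % is translated to a wildcard; _ is treated as a literal character, as are any * or ? already in the pattern.

```python
import re

# One condition: <field> (= | like) <quoted string or number>
_COND = re.compile(
    r"""\s*(?P<field>[A-Za-z_][\w.]*)\s*
        (?P<op>=|like)\s*
        (?P<value>'[^']*'|-?\d+(?:\.\d+)?)\s*""",
    re.IGNORECASE | re.VERBOSE,
)
_AND = re.compile(r"and\b", re.IGNORECASE)


def _literal(token: str):
    if token.startswith("'"):
        return token[1:-1]
    return float(token) if "." in token else int(token)


def parse_where(where: str) -> list:
    """Return a list of (field, op, value) tuples for 'a = 1 and b like ...'."""
    conditions, pos = [], 0
    while True:
        m = _COND.match(where, pos)
        if m is None:
            raise ValueError(f"cannot parse condition at offset {pos}")
        conditions.append((m["field"], m["op"].lower(), _literal(m["value"])))
        pos = m.end()
        if pos == len(where):
            return conditions
        joiner = _AND.match(where, pos)
        if joiner is None:
            raise ValueError(f"expected 'and' at offset {pos}")
        pos = joiner.end()


def to_es_query(where: str) -> dict:
    must = []
    for field_name, op, value in parse_where(where):
        if op == "=":
            must.append({"term": {field_name: value}})
        else:  # like
            if not isinstance(value, str):
                raise ValueError("like requires a quoted string pattern")
            must.append({"wildcard": {field_name: value.replace("%", "*")}})
    return {"bool": {"must": must}}


if __name__ == "__main__":
    print(to_es_query("entity_type = 1 and properties.tabName like '%archive.test%'"))
```

Separating parse_where from to_es_query mirrors the idea in the text: the same parsed conditions could feed a second emitter for another engine without changing what callers send.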

Lineage Construction

Lineage is a key focus, assessed along three dimensions: coverage, granularity, and accuracy. Coverage spans offline, real-time, and ingestion pipelines. Granularity ranges from table level to field level (with three implementation options; the team adopts post-execution dynamic parsing). Row-level lineage is rare.

Applications

Metadata powers several products:

Data Map (search, classification, hotness recommendation)

Lineage Map (visual exploration of data lineage)

Impact Analysis (upstream/downstream impact detection, leveraging field‑level lineage and graph traversal)

These applications handle high query volumes (e.g., about 25,000 PV for generic queries and 4,000 PV for data map searches).
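Downstream impact detection of the kind described for Impact Analysis amounts to a graph traversal over lineage edges. The sketch below does a breadth-first walk over an in-memory edge list and reports each affected node with its distance from the changed resource. In the platform this traversal would presumably run in HugeGraph; the function name, the edge representation, and the sample node names here are illustrative assumptions.

```python
from __future__ import annotations

from collections import defaultdict, deque
from typing import Iterable, Optional


def downstream_impact(edges: Iterable[tuple[str, str]], start: str,
                      max_depth: Optional[int] = None) -> dict[str, int]:
    """Return {affected_node: hops_from_start} for everything downstream of start."""
    children: dict[str, list[str]] = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    depth_of: dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in children.get(node, ()):
            if nxt not in seen:  # guards against cycles in the lineage graph
                seen.add(nxt)
                depth_of[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return depth_of


if __name__ == "__main__":
    lineage = [("ods.a", "dwd.b"), ("dwd.b", "ads.c"), ("ods.a", "ads.c")]
    print(downstream_impact(lineage, "ods.a"))
```

Upstream analysis is the same walk with the edge direction reversed, and running it on field-level edges instead of table-level ones narrows the result to consumers of the specific columns being changed.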

Future Plans

Expand metadata quality assurance to more scenarios.

Build a comprehensive metadata dictionary for self‑service queries.

Establish data operation mechanisms to link supply‑side cost/production metrics with consumption‑side usage and impact.

Scale data governance using the existing metadata foundation.
