
Design and Implementation of a Big Data Metadata Warehouse at Bilibili

This article presents Bilibili's big‑data metadata warehouse, covering its background, technology selection between data‑lake and data‑warehouse solutions, the architecture built on Prometheus, StarRocks, Flink and Routine Load, performance comparisons, diagnostic system design, and future development plans.


The presentation introduces the background of a big‑data metadata warehouse (元仓, literally "meta warehouse") at Bilibili, explaining the need for runtime metrics and job‑level statistics to monitor and govern large‑scale data processing components such as YARN, Presto, and Spark.

It describes the technical selection process, evaluating data‑lake technologies (Iceberg, Hudi, Delta Lake) and data‑warehouse options (StarRocks, ClickHouse), ultimately choosing StarRocks for its SQL compatibility, performance, and lower operational overhead.

The monitoring architecture is built on Prometheus with three layers: data source exposure via exporters, data collection and storage via HTTP pull into a time‑series database, and a data‑service layer that queries metrics using PromQL, sets alert thresholds, and integrates with Kafka for downstream processing.
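The data‑source layer described above boils down to exposing metrics in the Prometheus text exposition format, which the server then scrapes over HTTP pull. The sketch below renders that format with nothing but the standard library; the metric name `yarn_pending_applications` and its labels are illustrative assumptions, not Bilibili's actual metric schema.

```python
# Sketch of the exporter side of the data-source layer: metrics are
# rendered in the Prometheus text exposition format, which Prometheus
# collects via HTTP pull. Metric names and labels here are illustrative.

def render_exposition(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge metric with labeled samples as exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"

body = render_exposition(
    "yarn_pending_applications",  # hypothetical metric name
    "Applications waiting in the YARN queue",
    {'queue="etl"': 12, 'queue="adhoc"': 3},
)
print(body)
```

A real exporter would serve this text at a `/metrics` endpoint (e.g. with a Prometheus client library); from there, PromQL queries and alert thresholds in the data‑service layer operate on the scraped samples.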

Data ingestion into the metadata warehouse uses both Routine Load (StarRocks’ native Kafka consumer) for simple streams and Flink for complex ETL, with a Stream Load approach that batches data in memory and writes asynchronously to reduce GC pressure.
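The Stream Load pattern above can be sketched as a small in‑memory batcher: rows accumulate until a batch fills, then the whole batch is handed to a background thread, so per‑row writes (and the allocation churn that drives GC pressure) are amortized into one request. This is a minimal sketch, not the production implementation; `sink` stands in for the actual HTTP call to StarRocks' Stream Load endpoint, and the batch size is an arbitrary example value.

```python
import threading
from typing import Callable, List

class StreamLoadBuffer:
    """Batch rows in memory and flush them asynchronously, mirroring the
    Stream Load approach described above. `sink` is a stand-in for the
    real HTTP request to StarRocks; batch_size is illustrative."""

    def __init__(self, sink: Callable[[List[str]], None], batch_size: int = 4096):
        self._sink = sink
        self._batch_size = batch_size
        self._rows: List[str] = []
        self._lock = threading.Lock()

    def add(self, row: str) -> None:
        flush_batch = None
        with self._lock:
            self._rows.append(row)
            if len(self._rows) >= self._batch_size:
                # Swap out the full batch under the lock; write it outside.
                flush_batch, self._rows = self._rows, []
        if flush_batch is not None:
            # Flush on a background thread so the caller never blocks
            # on the network round trip to StarRocks.
            threading.Thread(target=self._sink, args=(flush_batch,)).start()
```

A production version would also flush on a timer and on shutdown, and retry failed loads; those concerns are omitted here for brevity.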

A diagnostic system, inspired by OPPO’s Compass project, consumes Kafka events, applies rules such as large table scans, global sort anomalies, and data skew detection, and provides actionable suggestions through a data‑intelligence service.
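The rule layer of such a diagnostic system can be sketched as a list of predicates run against each job event consumed from Kafka, each emitting an actionable suggestion when it fires. The thresholds, field names, and suggestion text below are illustrative assumptions, not the actual rules of Compass or Bilibili's service.

```python
import statistics

# Each rule inspects one job event (a dict deserialized from a Kafka
# message) and returns a suggestion string, or None if it does not fire.
# Field names and thresholds are illustrative.

def large_table_scan(event: dict):
    if event.get("scanned_rows", 0) > 1_000_000_000:
        return "Large table scan: add partition filters or narrow the scanned range."
    return None

def data_skew(event: dict):
    durations = event.get("task_durations_s", [])
    # Flag a task that runs far longer than the typical (median) task.
    if durations and max(durations) > 10 * statistics.median(durations):
        return "Data skew: one task runs 10x the median; consider salting the hot key."
    return None

RULES = [large_table_scan, data_skew]

def diagnose(event: dict) -> list:
    """Run every rule against one job event and collect the suggestions."""
    return [s for rule in RULES if (s := rule(event)) is not None]
```

The data‑intelligence service would attach these suggestions to the offending job so owners see concrete remediation steps rather than raw metrics.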

Future work includes expanding StarRocks usage to BI and DQC, integrating with Ranger for unified permission management, extending the metadata scope to additional components (HDFS, Kyuubi), enabling lake‑warehouse convergence, and broadening diagnostic coverage to Presto and Flink jobs.

Tags: Monitoring, Big Data, Flink, StarRocks, Diagnostics, metadata warehouse
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
