Design and Implementation of a Big Data Metadata Warehouse at Bilibili
This article presents Bilibili's big-data metadata warehouse: its background, the technology selection between data-lake and data-warehouse options, an architecture built on Prometheus, StarRocks, Flink, and Routine Load, performance comparisons, the design of its diagnostic system, and future development plans.
The presentation introduces the background of a big‑data metadata warehouse (元仓) at Bilibili, explaining the need for runtime metrics and job‑level statistics to monitor and govern large‑scale data processing components such as Yarn, Presto, and Spark.
It describes the technical selection process, evaluating data‑lake technologies (Iceberg, Hudi, Delta Lake) and data‑warehouse options (StarRocks, ClickHouse), ultimately choosing StarRocks for its SQL compatibility, performance, and lower operational overhead.
The monitoring architecture is built on Prometheus in three layers: exporters that expose metrics at each data source, a collection-and-storage layer in which Prometheus pulls those metrics over HTTP into its time-series database, and a data-service layer that queries metrics with PromQL, evaluates alert thresholds, and integrates with Kafka for downstream processing.
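The data-service layer described above can be sketched as a small client of Prometheus' HTTP API. This is a minimal, hedged example: the server address is hypothetical, and `over_threshold` is an illustrative helper, not part of the system described in the talk; only the `/api/v1/query` endpoint and the instant-query result shape are standard Prometheus behavior.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # hypothetical server address


def instant_query(promql: str) -> list:
    """Run an instant query against Prometheus' HTTP API (/api/v1/query)."""
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]


def over_threshold(result: list, threshold: float) -> list:
    """Pick out series whose latest sample exceeds an alert threshold.

    `result` follows the Prometheus instant-query shape:
    [{"metric": {...}, "value": [<unix_ts>, "<value-as-string>"]}, ...]
    """
    return [series for series in result
            if float(series["value"][1]) > threshold]
```

A breaching series found this way would then be turned into an alert event and, per the architecture above, handed to Kafka for downstream consumers.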
Data ingestion into the metadata warehouse uses both Routine Load (StarRocks' native Kafka consumer) for simple streams and Flink for complex ETL; the Flink path writes to StarRocks via Stream Load, batching rows in memory and flushing asynchronously to reduce GC pressure.
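The batching side of that Stream Load approach can be sketched as below. The endpoint path (`PUT /api/{db}/{table}/_stream_load` on the FE) and the `format=json` / `strip_outer_array=true` headers are real Stream Load conventions; the class name, port, and batch size are assumptions, and a production sink would also need load labels, redirect handling, and retry logic that this sketch omits.

```python
import json
import urllib.request


class StreamLoadBatcher:
    """Accumulate rows in memory and ship each full batch in one Stream Load
    request, so StarRocks sees a few large writes instead of many tiny ones
    (fewer load transactions, less GC and compaction churn)."""

    def __init__(self, fe_host: str, db: str, table: str,
                 auth_header: str, batch_size: int = 5000):
        # Stream Load endpoint exposed by the StarRocks FE (default HTTP port 8030)
        self.url = f"http://{fe_host}:8030/api/{db}/{table}/_stream_load"
        self.auth_header = auth_header      # e.g. a Basic-auth credential string
        self.batch_size = batch_size
        self.rows = []

    def add(self, row: dict) -> bool:
        """Buffer one row; return True once the batch is full (caller flushes)."""
        self.rows.append(row)
        return len(self.rows) >= self.batch_size

    def payload(self) -> bytes:
        """Encode the buffer as a JSON array, the shape Stream Load accepts
        with headers format=json and strip_outer_array=true."""
        return json.dumps(self.rows).encode("utf-8")

    def flush(self) -> None:
        """Send the current batch in a single PUT and clear the buffer."""
        if not self.rows:
            return
        req = urllib.request.Request(self.url, data=self.payload(), method="PUT")
        req.add_header("Authorization", self.auth_header)
        req.add_header("format", "json")
        req.add_header("strip_outer_array", "true")
        urllib.request.urlopen(req)         # raises on HTTP errors
        self.rows.clear()
```

In the asynchronous variant described in the talk, `flush()` would run on a background thread or timer so the ingest path never blocks on the HTTP write.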
A diagnostic system, inspired by OPPO’s Compass project, consumes Kafka events, applies rules such as large table scans, global sort anomalies, and data skew detection, and provides actionable suggestions through a data‑intelligence service.
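The rule evaluation at the heart of such a diagnostic system might look like the sketch below. The three rules mirror the ones named above (large table scan, global sort anomaly, data skew), but every threshold, field name, and suggestion string is hypothetical, not taken from Compass or Bilibili's service.

```python
# Hypothetical thresholds, chosen only for illustration.
LARGE_SCAN_ROWS = 1_000_000_000   # flag scans above ~1B rows
SKEW_RATIO = 10.0                 # slowest task vs. average task duration


def diagnose(job_stats: dict) -> list:
    """Apply simple rules to one job's runtime stats; return suggestions."""
    findings = []
    if job_stats.get("scanned_rows", 0) > LARGE_SCAN_ROWS:
        findings.append("large table scan: add partition/column filters "
                        "or narrow the scanned date range")
    durations = job_stats.get("task_durations_s", [])
    if durations:
        avg = sum(durations) / len(durations)
        if avg > 0 and max(durations) / avg > SKEW_RATIO:
            findings.append("data skew: one task dominates its stage; "
                            "consider salting or repartitioning the hot key")
    if job_stats.get("uses_global_sort") and \
            job_stats.get("output_rows", 0) > LARGE_SCAN_ROWS:
        findings.append("global sort anomaly: sorting a huge result in one "
                        "stage; consider a partitioned or local sort")
    return findings
```

In the pipeline described above, each Kafka event carrying a finished job's stats would be run through `diagnose`, and the resulting suggestions surfaced to users via the data-intelligence service.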
Future work includes expanding StarRocks usage to BI and DQC, integrating with Ranger for unified permission management, extending the metadata scope to additional components (HDFS, Kyuubi), enabling lake‑warehouse convergence, and broadening diagnostic coverage to Presto and Flink jobs.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.