How Bilibili Built a Scalable Data Quality Platform for Billions of Events
This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.
01 Background
Data quality is a crucial prerequisite for effective big‑data applications. To incubate deeper, more competitive applications on big data and support Bilibili’s rapid growth, the data platform must deliver real‑time, accurate, trustworthy data.
The platform’s reliability depends on stable iteration of the entire data processing chain, including model quality, reliable scheduling, efficient resource allocation, and cooperative execution engines, all built on stable PaaS/IaaS services.
In short, trustworthy data quality is a core competitive edge of a big‑data platform, requiring coordinated effort across model teams, quality platforms, scheduling, compute engines, and storage/search services.
02 Goals
Rapidly detect and fix data quality issues.
With nearly 100 million daily active users relying on our data, we process over 300 k tasks daily, handling ~40 PB of data and 6 × 10¹³ events. To keep users from acting on incorrect data, we have iteratively pursued three sub‑goals:
Process – Define SOPs for high‑risk points so operators can follow exact steps.
Automation – Industrialize manual operations to reduce human error.
Intelligence – Apply smart methods to handle large‑scale, diverse scenarios.
03 Theory and Practice
Data quality issues involve many stakeholders, so we first concretize and categorize them.
Completeness – Are all rows present?
Uniqueness – Any duplicate records?
Timeliness – Is data available when needed?
Accuracy – Does data follow expected logic?
Consistency – Do different sources agree?
These categories guide communication and rapid implementation of quality goals.
3.1 Data Quality Theory
Typical user demand: “Alert me when today’s row count deviates more than 30 %.” We abstract such demands into a three‑stage workflow:
3.1.1 Data Quality Model
The architecture includes a workflow split into three stages:
Recording
Collect quality features (e.g., table row count) from heterogeneous sources such as Hive, MySQL, Kafka, Hudi, OLAP.
Features must be quantifiable and expressible in a SQL‑like language. Over 150 typical features are maintained in a standard library.
Supported data sources include:
Hive
MySQL
Iceberg
ClickHouse
Kafka
Hudi
For sources that cannot be directly connected, a push gateway allows clients to push quality metrics.
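The push path above can be sketched as follows. This is a minimal in‑memory stand‑in: the real gateway endpoint, the `push_metric` name, and the payload schema are all assumptions, not Bilibili's actual interface.

```python
import json
from datetime import date

# In-memory stand-in for the push gateway; a real client would POST
# the JSON payload to the gateway's HTTP endpoint instead.
GATEWAY_BUFFER = []

def push_metric(source, table, feature, value, dt=None):
    """Record one quality feature (e.g., a table's row count) for a date."""
    payload = {
        "source": source,            # e.g., "hive", "mysql", "kafka"
        "table": table,
        "feature": feature,          # one of the ~150 library features
        "value": value,
        "dt": (dt or date.today()).isoformat(),
    }
    GATEWAY_BUFFER.append(json.dumps(payload))
    return payload

# A source that cannot be scanned directly pushes its own row count.
push_metric("mysql", "dal.orders", "row_count", 1_204_553)
```

The gateway decouples metric collection from metric storage: any client that can emit a small JSON payload participates in quality checking, even when the platform cannot connect to the source directly.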
Checking
Evaluate whether a metric meets alert conditions (e.g., M<100). A rich DSL defines metrics such as max, min, average, and arithmetic expressions.
Custom DSLs support week‑over‑week, day‑over‑day, and multi‑day averages.
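A day‑over‑day check like the one users ask for ("alert when today's row count deviates more than 30 %") can be sketched as below; the function names and the 0.30 default are illustrative, not the platform's actual DSL.

```python
def day_over_day_deviation(today, yesterday):
    """Relative change of a metric versus the previous day's value."""
    if yesterday == 0:
        raise ValueError("no baseline for day-over-day comparison")
    return abs(today - yesterday) / yesterday

def check(metric_today, metric_yesterday, threshold=0.30):
    """Return True when the alert condition is met (deviation > threshold)."""
    return day_over_day_deviation(metric_today, metric_yesterday) > threshold

check(65, 100)  # a 35% drop exceeds the 30% threshold, so the check fires
```

Week‑over‑week and multi‑day‑average variants follow the same shape, only the baseline changes.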
Alerting
When a condition is met, send an exactly‑once alert to the responsible party. “Exactly once” avoids both alert storms and missed alerts, and requires root‑cause analysis to provide actionable context.
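One common way to get "exactly once" delivery is deduplication on an idempotency key, sketched below under that assumption; the key shape and function names are hypothetical.

```python
# Alerts already delivered, keyed by (rule, partition). Persisting this
# set (rather than keeping it in memory) is what survives retries.
_sent = set()

def send_alert_once(rule_id, partition, notify):
    """Deliver at most one alert per (rule, partition): duplicates are
    suppressed (no alert storms), while a first delivery always goes
    out (no missed alerts)."""
    key = (rule_id, partition)
    if key in _sent:
        return False             # duplicate suppressed
    _sent.add(key)
    notify(f"quality check {rule_id} failed for partition {partition}")
    return True
```

Retried check runs then re‑invoke `send_alert_once` safely: only the first call for a given key reaches the responsible party.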
Root Cause Analysis
We use graph‑based analysis (due to limited training data for ML). The process:
Construct a task‑level dependency graph.
Enrich it with metric, log, and event nodes to form an event‑causality graph.
Apply a PageRank‑like algorithm to rank probable root causes.
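The ranking step can be sketched with a toy power‑iteration PageRank. The edge direction (effect → suspected cause) and the example event names are assumptions chosen so that mass accumulates on likely root causes; the production algorithm is only described as "PageRank‑like".

```python
def pagerank(edges, damping=0.85, iters=50):
    """Toy power-iteration PageRank over a directed edge list."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if not out[src]:
                # dangling node: spread its mass evenly over all nodes
                for n in nodes:
                    nxt[n] += damping * rank[src] / len(nodes)
            else:
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
        rank = nxt
    return rank

# Effect -> cause edges: two downstream latency events trace back to one
# upstream job failure, which traces back to an engine release event.
edges = [
    ("job_b.latency", "job_a.failed"),
    ("job_c.latency", "job_a.failed"),
    ("job_a.failed", "spark_3.1.release"),
]
scores = pagerank(edges)
top = max(scores, key=scores.get)  # the release event ranks highest
```

Because every symptom ultimately points at the release event, it collects the most rank and surfaces first in the root‑cause list.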
Edges in the event‑causality graph are declared in a small DSL, for example:

attach job(id=123) with modelrelease_event('update dimension...');
connect job(upstream).latency_event with job(downstream).latency_event;
connect job(upstream).completeness_event with job(downstream).consistency_event;
disconnect job(sparkversion=2.1).failed_event with spark(v=2.1).release_event;
connect job(sparkversion=3.1).failed_event with spark(v=3.1).release_event;

3.2 Data Quality Practices
3.2.1 Scheduling
Goal: trigger quality checks at the optimal moment. Early checks cause false alerts; late checks increase remediation cost. Two strategies:
Event‑driven: launch checks immediately after upstream job completion; if quality fails, downstream jobs are blocked.
Time‑window: used for near‑real‑time streaming, requiring high‑availability, low‑latency scheduling.
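The event‑driven strategy can be sketched as a completion handler that gates downstream launches on check results; the callback names here are hypothetical, not the scheduler's real API.

```python
def on_job_success(job, run_checks, launch_downstream, block_downstream):
    """Event-driven strategy: checks fire the moment the upstream job
    finishes; downstream jobs launch only if every check passes,
    otherwise they are blocked to contain the bad data."""
    results = run_checks(job)            # {check_name: passed?}
    if all(results.values()):
        launch_downstream(job)
        return "launched"
    failed = [name for name, ok in results.items() if not ok]
    block_downstream(job, failed)
    return "blocked"
```

Triggering at job completion avoids both failure modes from above: checks never run early against partial data, and they never wait so long that bad data propagates before being caught.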
3.2.2 Heterogeneous Sources
We integrate diverse sources (HDFS, MySQL, OLAP) via open‑source connectors and custom SPI connectors, also handling DAL logical‑physical table mapping and read‑rate throttling.
3.2.3 Rule Coverage
With >30 k daily tasks, manual rule definition is infeasible. We provide 150+ built‑in rules and allow users to define custom rules for rapid iteration.
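Rule templates make this scale: a table is covered by instantiating library templates rather than hand‑writing each check. The two templates and stats fields below are made up for illustration; the real library holds 150+ templates.

```python
# Hypothetical built-in rule templates, each a predicate over table stats.
TEMPLATES = {
    "not_empty": lambda stats: stats["row_count"] > 0,
    "no_dups":   lambda stats: stats["distinct_keys"] == stats["row_count"],
}

def apply_builtin_rules(table_stats, rules=("not_empty", "no_dups")):
    """Evaluate a set of library templates against one table's stats.
    Users can register custom templates into TEMPLATES the same way."""
    return {name: TEMPLATES[name](table_stats) for name in rules}

result = apply_builtin_rules({"row_count": 10, "distinct_keys": 9})
# "no_dups" fails: 9 distinct keys across 10 rows implies a duplicate
```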
3.2.4 Intelligent Monitoring
Smart monitoring automatically determines anomalies without preset thresholds, decomposing time‑series into trend, seasonality, holidays, and noise.
Model insights include decreasing complaint trends, spikes around specific dates, and weekly patterns.
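A stripped‑down version of this decomposition, handling only the weekly seasonal component plus noise (the production model also fits trend and holidays), might look like:

```python
from statistics import mean, pstdev

def weekly_anomalies(series, z=3.0):
    """Split a daily series into a weekly seasonal component plus
    residual noise, then flag points whose residual exceeds z standard
    deviations -- no preset absolute threshold needed."""
    # Seasonal baseline: the mean of each weekday position (phase 0..6).
    seasonal = [mean(series[i::7]) for i in range(7)]
    residuals = [v - seasonal[i % 7] for i, v in enumerate(series)]
    sigma = pstdev(residuals) or 1.0
    return [i for i, r in enumerate(residuals) if abs(r) > z * sigma]

# Four flat weeks of a weekly pattern, with one injected spike on day 17.
series = [100, 120, 115, 130, 125, 90, 80] * 4
series[17] = 400
weekly_anomalies(series)  # flags only the spike, not the weekend dips
```

The point of the decomposition is exactly this: the regular Saturday/Sunday dips are absorbed by the seasonal component and never alert, while the genuine spike stands out in the residual.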
3.2.5 Alert Practices
Alerts are delivered via multiple channels (phone, SMS, email, WeChat) and pushed to mobile devices for rapid response, with on‑call teams notified through a duty system.
3.2.6 End‑to‑End Monitoring
We provide a full‑link view of the data processing pipeline, highlighting critical paths, progress, timeliness predictions (95 % accuracy), and current alerts.
04 Lessons Learned
Process first, tooling second, intelligence last for unattended operation.
Ensure true end‑to‑end link connectivity.
Leverage algorithms to auto‑learn data features and automate coverage.
Consider hidden long‑term maintenance costs when choosing solutions.
Adopt a “fix now” mindset to resolve current issues promptly.
05 Future Plans
We aim to enhance intelligent rule generation, achieve real‑time quality assurance for streaming‑batch convergence, and develop automated fault‑repair platforms to cover the entire incident lifecycle.