How Bilibili Builds a Scalable, Automated, and Intelligent Data Quality Platform
This article explains how Bilibili’s data quality team designs a process‑driven, automated, and AI‑enhanced platform that monitors billions of records daily, defines quality metrics such as completeness and consistency, integrates heterogeneous data sources, and provides root‑cause analysis and real‑time alerting to ensure trustworthy data for its massive user base.
Background
Data quality is a prerequisite for reliable big‑data applications. Bilibili runs more than 300,000 tasks per day on roughly 40 PB of data and consumes 6 trillion events per day, so it needs a platform that delivers real‑time, accurate, and trustworthy data.
Goals
The team aims to quickly discover and fix data quality issues through three sub‑goals: Process‑driven SOPs for repeatable operations, Automation to replace manual steps, and Intelligence to apply AI‑based anomaly detection.
Data Quality Theory
Quality issues are categorized into five dimensions: completeness, uniqueness, timeliness, accuracy, and consistency. These dimensions guide communication among stakeholders and help prioritize fixes.
Data Quality Model & Workflow
The platform abstracts a quality issue as a three‑stage workflow:
Recording: Capture quality features (e.g., row counts) from heterogeneous sources such as Hive, MySQL, Kafka, Iceberg, ClickHouse, and Hudi.
Checking: Evaluate captured features against thresholds using a rich DSL (e.g., max(value) < 100 or custom ratio checks).
Alerting: Emit an exactly‑once notification (phone, SMS, email, WeChat) when a check fails, so that no alert is duplicated or missed.
Key diagrams illustrate the overall architecture and the three‑stage workflow.
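The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the names `Check`, `QualityPipeline`, `record`, and `run_checks` are hypothetical, and the in-memory set of sent keys only shows the idea of deduplicating alerts per check and batch. A real exactly‑once guarantee would also need durable state and acknowledged delivery.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Check:
    """A named predicate over recorded feature values (hypothetical structure)."""
    name: str
    feature: str
    predicate: Callable[[List[float]], bool]


@dataclass
class QualityPipeline:
    checks: List[Check]
    features: Dict[str, List[float]] = field(default_factory=dict)
    sent_keys: Set[Tuple[str, str]] = field(default_factory=set)
    outbox: List[str] = field(default_factory=list)

    def record(self, feature: str, value: float) -> None:
        """Recording: store a quality feature such as a row count."""
        self.features.setdefault(feature, []).append(value)

    def run_checks(self, batch_id: str) -> List[str]:
        """Checking: evaluate every check and return the names of those that failed."""
        failed = []
        for check in self.checks:
            values = self.features.get(check.feature, [])
            if values and not check.predicate(values):
                failed.append(check.name)
                self._alert(check.name, batch_id)
        return failed

    def _alert(self, check_name: str, batch_id: str) -> None:
        """Alerting: send at most one message per (check, batch) key."""
        key = (check_name, batch_id)
        if key in self.sent_keys:
            return  # a retry of the same batch must not produce a duplicate alert
        self.sent_keys.add(key)
        self.outbox.append(f"[{batch_id}] check failed: {check_name}")


# The article's example rule, max(value) < 100, expressed as a predicate.
pipeline = QualityPipeline(
    checks=[Check("value_below_100", "value", lambda v: max(v) < 100)]
)
pipeline.record("value", 42)
pipeline.record("value", 120)
pipeline.run_checks("2024-01-01")
pipeline.run_checks("2024-01-01")  # retried run: the alert is not sent again
```

Keying alerts on the check name and batch identifier is one simple way to make retries idempotent; the choice of key is an assumption here.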
Root‑Cause Analysis
After an alert, the platform builds a task‑level dependency graph, enriches it with metrics, logs, and events, and constructs a causal graph. A PageRank‑like algorithm ranks possible root causes. The process is illustrated with a graph diagram.
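The article does not specify the ranking algorithm beyond calling it PageRank‑like, so the following is only a sketch of one plausible variant: a random walk with restart that starts at the alerting task and moves upstream in proportion to each upstream task's anomaly score. The function name `rank_root_causes`, the `anomaly_score` input, and the rule that the walk stays in place where no upstream task looks anomalous are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_root_causes(
    upstream: Dict[str, List[str]],
    anomaly_score: Dict[str, float],
    alerted: str,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Rank tasks other than `alerted` by how much walk mass they accumulate."""
    nodes = sorted(set(upstream) | {u for ups in upstream.values() for u in ups})
    rank = {n: 1.0 if n == alerted else 0.0 for n in nodes}

    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        nxt[alerted] += 1.0 - damping  # restart at the task that raised the alert
        for n in nodes:
            ups = upstream.get(n, [])
            weights = [anomaly_score.get(u, 0.0) for u in ups]
            total = sum(weights)
            if total == 0:
                # No suspicious upstream: the walk stays here, so mass accumulates
                # on tasks that look like the origin of the problem.
                nxt[n] += damping * rank[n]
                continue
            for u, w in zip(ups, weights):
                nxt[u] += damping * rank[n] * w / total
        rank = nxt

    candidates = [n for n in nodes if n != alerted]
    return sorted(candidates, key=lambda n: rank[n], reverse=True)


# Hypothetical dependency graph: each task maps to the tasks it reads from.
upstream = {
    "report": ["join"],
    "join": ["orders", "users"],
    "orders": ["ingest_orders"],
}
anomaly_score = {"join": 0.9, "orders": 0.8, "users": 0.1, "ingest_orders": 0.9}
ranked = rank_root_causes(upstream, anomaly_score, alerted="report")
```

In this toy graph the walk concentrates on `ingest_orders`, the anomalous task with no anomalous upstream of its own.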
Scheduling Strategies
Two scheduling approaches are used:
Event‑driven: Trigger quality checks immediately after upstream jobs finish; if a check fails, downstream pipelines are blocked.
Time‑window‑driven: Apply to streaming data with frequent, low‑latency checks; the scheduler is designed for high availability and elasticity.
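Both strategies can be illustrated with a short Python sketch. The class `EventDrivenGate` and the function `check_windows` are hypothetical names; the real scheduler's interfaces are not described in the article, and high availability and elasticity are outside the scope of this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


class EventDrivenGate:
    """Run checks when a job finishes and block its downstream jobs on failure."""

    def __init__(self, downstream: Dict[str, List[str]]):
        self.downstream = downstream
        self.checks: Dict[str, List[Callable[[], bool]]] = defaultdict(list)
        self.blocked: Set[str] = set()

    def add_check(self, job: str, check: Callable[[], bool]) -> None:
        self.checks[job].append(check)

    def on_job_finished(self, job: str) -> bool:
        passed = all(check() for check in self.checks[job])
        if not passed:
            self._block_downstream(job)
        return passed

    def _block_downstream(self, job: str) -> None:
        # Block every job reachable from the failed one, not just direct children.
        stack = list(self.downstream.get(job, []))
        while stack:
            current = stack.pop()
            if current not in self.blocked:
                self.blocked.add(current)
                stack.extend(self.downstream.get(current, []))


def check_windows(
    events: Iterable[Tuple[float, str]], window_seconds: int, min_count: int
) -> List[int]:
    """Return start times of tumbling windows holding fewer than min_count events."""
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _payload in events:
        counts[int(timestamp // window_seconds) * window_seconds] += 1
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Walk every window between the first and last event so empty ones are caught.
    return [
        start
        for start in range(first, last + 1, window_seconds)
        if counts.get(start, 0) < min_count
    ]
```

The tumbling‑window check is a stand‑in for whatever streaming checks the platform runs; the window size and the minimum count are illustrative parameters.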
Heterogeneous Source Integration
The platform connects to various data sources via open‑source connectors and SPI‑based custom connectors, handling both batch (HDFS, MySQL) and analytical (OLAP) layers.
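SPI is a Java mechanism for discovering implementations of an interface at runtime. As a rough Python analogue, and only as an assumption about how such a connector layer might be shaped, a registry can map a source type to a connector class. The names `Connector`, `register`, `connect`, and `InMemoryConnector` are hypothetical, and the in‑memory source stands in for real systems such as Hive or MySQL.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class Connector(ABC):
    """Common interface every data-source connector implements."""

    @abstractmethod
    def row_count(self, table: str) -> int:
        """Return the number of rows in `table`."""


_REGISTRY: Dict[str, Type[Connector]] = {}


def register(source_type: str) -> Callable[[Type[Connector]], Type[Connector]]:
    """Class decorator that makes a connector discoverable by source type."""

    def decorator(cls: Type[Connector]) -> Type[Connector]:
        _REGISTRY[source_type] = cls
        return cls

    return decorator


@register("memory")
class InMemoryConnector(Connector):
    """Toy connector over Python lists, used here instead of a real database."""

    def __init__(self, tables: Dict[str, List[dict]]):
        self.tables = tables

    def row_count(self, table: str) -> int:
        return len(self.tables[table])


def connect(source_type: str, **config) -> Connector:
    """Look up the connector class for a source type and instantiate it."""
    try:
        cls = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"no connector registered for {source_type!r}") from None
    return cls(**config)


source = connect("memory", tables={"orders": [{"id": 1}, {"id": 2}]})
orders_rows = source.row_count("orders")
```

Because the recording stage only depends on the `Connector` interface, adding a new source means registering one more class rather than changing the checks.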
Rule Coverage and Automation
More than 150 built‑in rules cover common quality dimensions; users can also define custom rules to address evolving requirements.
Intelligent Monitoring
A time‑series‑based AI model decomposes signals into trend, seasonality, holidays, and noise, automatically setting dynamic thresholds and reducing false‑positive/negative alerts.
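To show why a dynamic threshold differs from a fixed one, here is a deliberately simplified sketch. It models only seasonality and noise: the expected value is the median of past points at the same position in the cycle, and the band width comes from the median absolute deviation of the residuals. Trend and holiday effects, which the article's model includes, are omitted, and the function names and the multiplier `k` are assumptions.

```python
from statistics import median
from typing import List, Tuple


def dynamic_bounds(
    history: List[float], period: int, k: float = 3.0
) -> Tuple[float, float]:
    """Bounds for the next point of a series that repeats every `period` steps."""
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")

    # Seasonal baseline: median of all past points at each position in the cycle.
    baselines = [median(history[p::period]) for p in range(period)]

    # Noise estimate: median absolute deviation of the residuals, scaled by
    # 1.4826 so it approximates a standard deviation under Gaussian noise.
    residuals = [abs(v - baselines[i % period]) for i, v in enumerate(history)]
    spread = 1.4826 * median(residuals)

    expected = baselines[len(history) % period]  # position of the next point
    return expected - k * spread, expected + k * spread


def is_anomalous(
    history: List[float], value: float, period: int, k: float = 3.0
) -> bool:
    low, high = dynamic_bounds(history, period, k)
    return not (low <= value <= high)


# Three weeks of a daily metric: about 100 on weekdays, about 200 on weekends.
history = [
    100, 102, 98, 101, 99, 200, 205,
    101, 99, 100, 102, 98, 198, 202,
    99, 101, 100, 98, 102, 201, 199,
]
monday_spike = is_anomalous(history, 200, period=7)   # weekend level on a Monday
monday_normal = is_anomalous(history, 101, period=7)
```

A value of 200 is ordinary on a weekend but is flagged on a Monday, which a single static threshold could not express. If the residuals are all zero the band collapses to a point, so a production version would need a minimum band width.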
Alerting Practices
Alerts are delivered via multiple channels (phone, SMS, email, WeChat) and pushed to mobile devices for rapid response; contextual information is enriched to aid troubleshooting.
Full‑Link Monitoring
A global view shows critical paths, processing stages, and predicted completion times (95% accuracy for timeliness). Operators can see which tasks are blocked, the impact scope, and ongoing alerts.
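One way to derive predicted completion times and a critical path is to walk the task graph in topological order, as sketched below. The article does not say how the platform produces its predictions, so this assumes each task has an estimated duration (for example, from historical runs) and that a task starts once all of its upstream tasks have finished. Task names and durations are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def predict_finish_times(
    upstream: Dict[str, List[str]], duration: Dict[str, float], start: float = 0.0
) -> Dict[str, float]:
    """Predicted finish time per task; a task starts when all upstream tasks end."""
    finish: Dict[str, float] = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(upstream).static_order():
        ready = max((finish[u] for u in upstream.get(task, [])), default=start)
        finish[task] = ready + duration[task]
    return finish


def critical_path(
    upstream: Dict[str, List[str]], finish: Dict[str, float], target: str
) -> List[str]:
    """Follow the latest-finishing upstream task back from `target`."""
    path = [target]
    while upstream.get(path[-1]):
        path.append(max(upstream[path[-1]], key=finish.__getitem__))
    return path[::-1]


# Hypothetical pipeline; durations are in minutes.
upstream = {
    "ads_report": ["dwd_orders", "dwd_users"],
    "dwd_orders": ["ods_orders"],
    "dwd_users": ["ods_users"],
}
duration = {
    "ods_orders": 30, "ods_users": 10,
    "dwd_orders": 20, "dwd_users": 15,
    "ads_report": 5,
}
finish = predict_finish_times(upstream, duration)
path = critical_path(upstream, finish, "ads_report")
```

Here the report is predicted to finish at minute 55, and the critical path runs through the orders branch, so a delay in `ods_orders` directly delays the report while `ods_users` has slack.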
Lessons Learned
Process first, tool second, intelligence last to achieve near‑zero‑touch operations.
Consider hidden long‑term maintenance costs when choosing solutions.
Address problems as they arise rather than postponing.
Future Plans
Continue advancing intelligence by automatically classifying tables and columns and applying appropriate rules, enhance real‑time quality checks for streaming workloads, and develop automated fault‑repair capabilities.
The statements below illustrate a DSL for attaching events to jobs and for connecting or disconnecting event pairs, as in the causal graph described under Root‑Cause Analysis:
attach job(id=123) with modelrelease_event('update dimension...');
connect job(upstream).latency_event with job(downstream).latency_event;
connect job(upstream).completeness_event with job(downstream).consistency_event;
disconnect job(sparkversion=2.1).failed_event with spark(v=2.1).release_event;
connect job(sparkversion=3.1).failed_event with spark(v=3.1).release_event;
ITPUB
