
How Bilibili Builds a Scalable, Automated, and Intelligent Data Quality Platform

This article explains how Bilibili’s data quality team designs a process‑driven, automated, and AI‑enhanced platform that monitors billions of records daily, defines quality metrics such as completeness and consistency, integrates heterogeneous data sources, and provides root‑cause analysis and real‑time alerting to ensure trustworthy data for its massive user base.

ITPUB

Nov 5, 2022


Background

Data quality is a prerequisite for reliable big‑data applications. Bilibili runs more than 300,000 tasks a day over roughly 40 PB of data and consumes 6 trillion events per day, so it needs a platform that keeps data timely, accurate, and trustworthy.

Goals

The team aims to quickly discover and fix data quality issues through three sub‑goals: Process‑driven SOPs for repeatable operations, Automation to replace manual steps, and Intelligence to apply AI‑based anomaly detection.

Data Quality Theory

Quality issues are categorized into five dimensions: completeness, uniqueness, timeliness, accuracy, and consistency. These dimensions guide communication among stakeholders and help prioritize fixes.

Data Quality Model & Workflow

The platform abstracts a quality issue as a three‑stage workflow:

Recording: Capture quality features (e.g., row counts) from heterogeneous sources such as Hive, MySQL, Kafka, Iceberg, ClickHouse, and Hudi.

Checking: Evaluate the captured features against thresholds using a rich DSL (e.g., max(value) < 100, or custom ratio checks).

Alerting: Emit an exactly‑once notification (phone, SMS, email, WeChat) when a check fails, so that no alert is duplicated or missed.

Key diagrams illustrate the overall architecture and the three‑stage workflow.
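The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's code: Check, Alerter, and run_quality_workflow are hypothetical names, the feature set is reduced to a row count and a maximum, and the in-memory de-duplication only suppresses repeats within one process (a real exactly‑once guarantee would need durable state).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Check:
    """A threshold rule over recorded features (hypothetical shape)."""
    name: str
    predicate: Callable[[Dict[str, float]], bool]  # True means the check passes


@dataclass
class Alerter:
    """Sends at most one alert per (table, check, partition) key."""
    sent: Set[Tuple[str, str, str]] = field(default_factory=set)
    outbox: List[str] = field(default_factory=list)

    def alert(self, table: str, check: str, partition: str) -> bool:
        key = (table, check, partition)
        if key in self.sent:
            return False  # duplicate: this failure was already reported
        self.sent.add(key)
        self.outbox.append(f"{table}/{partition}: check '{check}' failed")
        return True


def run_quality_workflow(table: str, partition: str, rows: List[dict],
                         checks: List[Check], alerter: Alerter) -> List[str]:
    # Stage 1, recording: capture quality features from the source rows.
    values = [row["value"] for row in rows]
    features = {
        "row_count": float(len(rows)),
        "max_value": float(max(values)) if values else 0.0,
    }
    # Stage 2, checking: evaluate every rule against the recorded features.
    failed = [c.name for c in checks if not c.predicate(features)]
    # Stage 3, alerting: notify once per failed check.
    for name in failed:
        alerter.alert(table, name, partition)
    return failed
```

Keeping the recorded features separate from the checks means the same captured row count can feed several rules without querying the source again.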

Root‑Cause Analysis

After an alert, the platform builds a task‑level dependency graph, enriches it with metrics, logs, and events, and constructs a causal graph. A PageRank‑like algorithm ranks possible root causes. The process is illustrated with a graph diagram.
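The ranking step can be illustrated with a plain power‑iteration PageRank over a small causal graph. The article does not describe the exact algorithm, so this is a sketch under assumptions: edges point from a symptomatic node to the upstream nodes suspected of causing it, so score accumulates on likely root causes, and the function name and damping value are illustrative.

```python
from typing import Dict, List


def rank_root_causes(causes: Dict[str, List[str]],
                     damping: float = 0.85,
                     iterations: int = 50) -> List[str]:
    """Rank nodes by PageRank score.

    `causes` maps each node to the upstream nodes that may have caused
    its anomaly, so score flows from symptoms toward root causes.
    """
    nodes = sorted(set(causes) | {u for ups in causes.values() for u in ups})
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes with no suspected upstream cause spread their score uniformly.
        dangling = sum(score[v] for v in nodes if not causes.get(v))
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            ups = causes.get(v, [])
            for u in ups:
                # Each node splits its score evenly among its suspected causes.
                nxt[u] += damping * score[v] / len(ups)
        score = nxt
    return sorted(nodes, key=lambda v: score[v], reverse=True)


ranked = rank_root_causes({
    "report": ["agg"],
    "dashboard": ["agg"],
    "agg": ["ingest"],
    "ingest": [],
})
```

In this toy graph the shared upstream job "ingest" ranks first and "agg" second, because both failing consumers point back through them.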

Scheduling Strategies

Two scheduling approaches are used:

Event‑driven: Quality checks are triggered as soon as an upstream job finishes; if a check fails, the downstream pipelines are blocked.

Time‑window‑driven: Used for streaming data, which needs frequent, low‑latency checks; the scheduler for these checks is designed for high availability and elasticity.
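The event‑driven strategy amounts to gating downstream jobs on the outcome of the checks. The sketch below assumes a hypothetical on_job_finished callback and simple dictionaries for checks and dependencies; none of these names come from the platform.

```python
from typing import Callable, Dict, List, Tuple

NamedCheck = Tuple[str, Callable[[], bool]]  # (check name, function returning True on pass)


def on_job_finished(job: str,
                    checks: Dict[str, List[NamedCheck]],
                    downstream: Dict[str, List[str]],
                    trigger: Callable[[str], None]) -> dict:
    """Run the finished job's checks; release downstream jobs only if all pass."""
    failed = [name for name, check in checks.get(job, []) if not check()]
    children = downstream.get(job, [])
    if failed:
        # A failed check blocks every direct downstream job.
        return {"job": job, "failed": failed, "blocked": list(children)}
    for child in children:
        trigger(child)
    return {"job": job, "failed": [], "blocked": []}


released: List[str] = []
checks = {
    "ingest": [("row_count_positive", lambda: True)],
    "agg": [("no_nulls", lambda: False)],
}
downstream = {"ingest": ["agg"], "agg": ["report"]}

first = on_job_finished("ingest", checks, downstream, released.append)
second = on_job_finished("agg", checks, downstream, released.append)
```

Here "agg" is released after "ingest" passes, while "report" stays blocked because the "no_nulls" check on "agg" fails.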

Heterogeneous Source Integration

The platform connects to various data sources via open‑source connectors and SPI‑based custom connectors, handling both batch (HDFS, MySQL) and analytical (OLAP) layers.
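SPI is a Java mechanism, but the idea of pluggable connectors behind one interface can be shown with a Python analogue. Everything here is an assumption made for illustration: the SourceConnector interface, the register decorator, and the in‑memory connector are hypothetical and only show how a custom source might be added without changing the checking code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class SourceConnector(ABC):
    """Common interface that every data-source connector implements."""

    @abstractmethod
    def row_count(self, table: str) -> int:
        """Return the number of rows in `table`."""


_REGISTRY: Dict[str, Type[SourceConnector]] = {}


def register(kind: str) -> Callable[[Type[SourceConnector]], Type[SourceConnector]]:
    """Class decorator that makes a connector discoverable by name."""
    def decorator(cls: Type[SourceConnector]) -> Type[SourceConnector]:
        _REGISTRY[kind] = cls
        return cls
    return decorator


@register("memory")
class InMemoryConnector(SourceConnector):
    """Toy connector standing in for a real Hive or MySQL connector."""

    def __init__(self, tables: Dict[str, List[dict]]) -> None:
        self.tables = tables

    def row_count(self, table: str) -> int:
        return len(self.tables[table])


def connect(kind: str, **kwargs) -> SourceConnector:
    """Look up a registered connector by name and instantiate it."""
    if kind not in _REGISTRY:
        raise ValueError(f"no connector registered for source type {kind!r}")
    return _REGISTRY[kind](**kwargs)


conn = connect("memory", tables={"orders": [{"id": 1}, {"id": 2}, {"id": 3}]})
```

The recording stage would then call conn.row_count("orders") without knowing which storage engine sits behind the connector.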

Rule Coverage and Automation

More than 150 built‑in rules cover common quality dimensions; users can also define custom rules to address evolving requirements.

Intelligent Monitoring

A time‑series‑based AI model decomposes signals into trend, seasonality, holidays, and noise, automatically setting dynamic thresholds and reducing false‑positive/negative alerts.
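A heavily simplified version of the idea is shown below: estimate a seasonal baseline, measure how far history strays from it, and derive the threshold from that spread. This is only a sketch under assumptions. It models seasonality alone (no trend or holiday component), it is not the model the team uses, and dynamic_bounds and is_anomalous are hypothetical names.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def dynamic_bounds(history: List[float], period: int,
                   k: float = 3.0) -> Tuple[float, float]:
    """Bounds for the next point: seasonal mean for its phase, plus or minus k residual std-devs."""
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    # Seasonal component: average of all past points at the same phase.
    seasonal = [mean(history[p::period]) for p in range(period)]
    # Noise component: what remains after removing the seasonal baseline.
    residuals = [x - seasonal[i % period] for i, x in enumerate(history)]
    spread = pstdev(residuals)
    expected = seasonal[len(history) % period]  # phase of the point being judged
    return expected - k * spread, expected + k * spread


def is_anomalous(value: float, history: List[float], period: int,
                 k: float = 3.0) -> bool:
    low, high = dynamic_bounds(history, period, k)
    return value < low or value > high


# Three periods of a metric that cycles through roughly 100, 200, 300.
history = [100, 200, 300, 102, 198, 301, 98, 202, 299]
```

With this history the next point falls on the "about 100" phase, so a value of 101 is accepted while 300 is flagged, even though 300 is perfectly normal two steps later. A single static threshold wide enough to cover the whole cycle would miss that case.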

Alerting Practices

Alerts are delivered via multiple channels (phone, SMS, email, WeChat) and pushed to mobile devices for rapid response; contextual information is enriched to aid troubleshooting.

Full‑Link Monitoring

A global view shows critical paths, processing stages, and predicted completion times; the timeliness predictions are reported as 95% accurate. Operators can see which tasks are blocked, the scope of the impact, and the alerts currently open.
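The article does not say how completion times are predicted. One simple way to reason about it, assuming each task has an expected duration and starts when its slowest upstream finishes, is a longest‑path pass over the dependency DAG. The function names and the sample durations below are illustrative only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def predict_finish_times(upstreams: Dict[str, List[str]],
                         durations: Dict[str, float],
                         start: float = 0.0) -> Dict[str, float]:
    """Predicted finish time of each task, given expected durations."""
    finish: Dict[str, float] = {}
    # static_order() yields every task after all of its upstream tasks.
    for task in TopologicalSorter(upstreams).static_order():
        ready = max((finish[u] for u in upstreams.get(task, ())), default=start)
        finish[task] = ready + durations[task]
    return finish


def critical_path(upstreams: Dict[str, List[str]],
                  durations: Dict[str, float],
                  target: str) -> List[str]:
    """Chain of tasks that determines when `target` finishes."""
    finish = predict_finish_times(upstreams, durations)
    path = [target]
    while upstreams.get(path[-1]):
        # Follow the upstream task that finishes last.
        path.append(max(upstreams[path[-1]], key=lambda u: finish[u]))
    return list(reversed(path))


upstreams = {"report": ["agg_a", "agg_b"], "agg_a": ["ingest"], "agg_b": ["ingest"]}
durations = {"ingest": 10.0, "agg_a": 5.0, "agg_b": 20.0, "report": 3.0}
finish = predict_finish_times(upstreams, durations)
path = critical_path(upstreams, durations, "report")
```

In this example "report" is predicted to finish at 33.0 and the critical path runs through "agg_b", so a delay in "agg_a" of up to 15 time units would not move the final delivery time.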

Lessons Learned

Process first, tool second, intelligence last to achieve near‑zero‑touch operations.

Consider hidden long‑term maintenance costs when choosing solutions.

Address problems as they arise rather than postponing.

Future Plans

The team plans to keep advancing intelligence by automatically classifying tables and columns and applying the appropriate rules, to strengthen real‑time quality checks for streaming workloads, and to develop automated fault‑repair capabilities.

Example causal‑graph statements, which attach events to jobs and connect or disconnect pairs of events (presumably the DSL behind the Root‑Cause Analysis section):

attach job(id=123) with modelrelease_event('update dimension...');
connect job(upstream).latency_event with job(downstream).latency_event;
connect job(upstream).completeness_event with job(downstream).consistency_event;
disconnect job(sparkversion=2.1).failed_event with spark(v=2.1).release_event;
connect job(sparkversion=3.1).failed_event with spark(v=3.1).release_event;

