How Bilibili Built a Scalable Data Quality Platform for Billions of Events
This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.
01 Background
Data quality is a crucial prerequisite for effective big‑data applications. To incubate deeper, more competitive applications on big data and support Bilibili’s rapid growth, the data platform must deliver real‑time, accurate, trustworthy data.
The platform’s reliability depends on stable iteration of the entire data processing chain, including model quality, reliable scheduling, efficient resource allocation, and cooperative execution engines, all built on stable PaaS/IaaS services.
In short, trustworthy data quality is a core competitive edge of a big‑data platform, requiring coordinated effort across model teams, quality platforms, scheduling, compute engines, and storage/search services.
02 Goals
Rapidly detect and fix data quality issues.
With nearly 100 million daily active users relying on our data, we process over 300 k tasks daily, handling ~40 PB of data and 6 × 10¹³ events. To keep users from acting on incorrect data, we have iteratively pursued three sub‑goals:
Process – Define SOPs for high‑risk points so operators can follow exact steps.
Automation – Industrialize manual operations to reduce human error.
Intelligence – Apply smart methods to handle large‑scale, diverse scenarios.
03 Theory and Practice
Data quality issues involve many stakeholders, so we first concretize and categorize them.
Completeness – Are all rows present?
Uniqueness – Any duplicate records?
Timeliness – Is data available when needed?
Accuracy – Does data follow expected logic?
Consistency – Do different sources agree?
These categories guide communication and rapid implementation of quality goals.
3.1 Data Quality Theory
Typical user demand: “Alert me when today’s row count deviates more than 30 %.” We abstract such demands into a three‑stage workflow:
3.1.1 Data Quality Model
The architecture includes a workflow split into three stages:
Recording
Collect quality features (e.g., table row count) from heterogeneous sources such as Hive, MySQL, Kafka, Hudi, OLAP.
Features must be quantifiable and expressible in a SQL‑like language. Over 150 typical features are maintained in a standard library.
Supported data sources include:
Hive
MySQL
Iceberg
ClickHouse
Kafka
Hudi
For sources that cannot be directly connected, a push gateway allows clients to push quality metrics.
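The push path above can be sketched as follows. This is a minimal in‑memory stand‑in: the real gateway endpoint, the `push_metric` name, and the payload schema are all assumptions, not Bilibili's actual interface.

```python
import json
from datetime import date

# In-memory stand-in for the push gateway; a real client would POST
# the JSON payload to the gateway's HTTP endpoint instead.
GATEWAY_BUFFER = []

def push_metric(source, table, feature, value, dt=None):
    """Record one quality feature (e.g., a table's row count) for a date."""
    payload = {
        "source": source,            # e.g., "hive", "mysql", "kafka"
        "table": table,
        "feature": feature,          # one of the ~150 library features
        "value": value,
        "dt": (dt or date.today()).isoformat(),
    }
    GATEWAY_BUFFER.append(json.dumps(payload))
    return payload

# A source that cannot be scanned directly pushes its own row count.
push_metric("mysql", "dal.orders", "row_count", 1_204_553)
```

The gateway decouples metric collection from metric storage: any client that can emit a small JSON payload participates in quality checking, even when the platform cannot connect to the source directly.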
Checking
Evaluate whether a metric meets alert conditions (e.g., M<100). A rich DSL defines metrics such as max, min, average, and arithmetic expressions.
Custom DSLs support week‑over‑week, day‑over‑day, and multi‑day averages.
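A day‑over‑day check like the one users ask for ("alert when today's row count deviates more than 30 %") can be sketched as below; the function names and the 0.30 default are illustrative, not the platform's actual DSL.

```python
def day_over_day_deviation(today, yesterday):
    """Relative change of a metric versus the previous day's value."""
    if yesterday == 0:
        raise ValueError("no baseline for day-over-day comparison")
    return abs(today - yesterday) / yesterday

def check(metric_today, metric_yesterday, threshold=0.30):
    """Return True when the alert condition is met (deviation > threshold)."""
    return day_over_day_deviation(metric_today, metric_yesterday) > threshold

check(65, 100)  # a 35% drop exceeds the 30% threshold, so the check fires
```

Week‑over‑week and multi‑day‑average variants follow the same shape, only the baseline changes.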
Alerting
When a condition is met, send an exactly‑once alert to the responsible party. “Exactly once” avoids both alert storms and missed alerts, and requires root‑cause analysis to provide actionable context.
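One common way to get "exactly once" delivery is deduplication on an idempotency key, sketched below under that assumption; the key shape and function names are hypothetical.

```python
# Alerts already delivered, keyed by (rule, partition). Persisting this
# set (rather than keeping it in memory) is what survives retries.
_sent = set()

def send_alert_once(rule_id, partition, notify):
    """Deliver at most one alert per (rule, partition): duplicates are
    suppressed (no alert storms), while a first delivery always goes
    out (no missed alerts)."""
    key = (rule_id, partition)
    if key in _sent:
        return False             # duplicate suppressed
    _sent.add(key)
    notify(f"quality check {rule_id} failed for partition {partition}")
    return True
```

Retried check runs then re‑invoke `send_alert_once` safely: only the first call for a given key reaches the responsible party.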
Root Cause Analysis
We use graph‑based analysis (due to limited training data for ML). The process:
Construct a task‑level dependency graph.
Enrich it with metric, log, and event nodes to form an event‑causality graph.
Apply a PageRank‑like algorithm to rank probable root causes.
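The ranking step can be sketched with a toy power‑iteration PageRank. The edge direction (effect → suspected cause) and the example event names are assumptions chosen so that mass accumulates on likely root causes; the production algorithm is only described as "PageRank‑like".

```python
def pagerank(edges, damping=0.85, iters=50):
    """Toy power-iteration PageRank over a directed edge list."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if not out[src]:
                # dangling node: spread its mass evenly over all nodes
                for n in nodes:
                    nxt[n] += damping * rank[src] / len(nodes)
            else:
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
        rank = nxt
    return rank

# Effect -> cause edges: two downstream latency events trace back to one
# upstream job failure, which traces back to an engine release event.
edges = [
    ("job_b.latency", "job_a.failed"),
    ("job_c.latency", "job_a.failed"),
    ("job_a.failed", "spark_3.1.release"),
]
scores = pagerank(edges)
top = max(scores, key=scores.get)  # the release event ranks highest
```

Because every symptom ultimately points at the release event, it collects the most rank and surfaces first in the root‑cause list.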
Edges in the event‑causality graph are declared in a small DSL, for example:

attach job(id=123) with modelrelease_event('update dimension...');
connect job(upstream).latency_event with job(downstream).latency_event;
connect job(upstream).completeness_event with job(downstream).consistency_event;
disconnect job(sparkversion=2.1).failed_event with spark(v=2.1).release_event;
connect job(sparkversion=3.1).failed_event with spark(v=3.1).release_event;

3.2 Data Quality Practices
3.2.1 Scheduling
Goal: trigger quality checks at the optimal moment. Early checks cause false alerts; late checks increase remediation cost. Two strategies:
Event‑driven: launch checks immediately after upstream job completion; if quality fails, downstream jobs are blocked.
Time‑window: used for near‑real‑time streaming, requiring high‑availability, low‑latency scheduling.
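The event‑driven strategy can be sketched as a completion handler that gates downstream launches on check results; the callback names here are hypothetical, not the scheduler's real API.

```python
def on_job_success(job, run_checks, launch_downstream, block_downstream):
    """Event-driven strategy: checks fire the moment the upstream job
    finishes; downstream jobs launch only if every check passes,
    otherwise they are blocked to contain the bad data."""
    results = run_checks(job)            # {check_name: passed?}
    if all(results.values()):
        launch_downstream(job)
        return "launched"
    failed = [name for name, ok in results.items() if not ok]
    block_downstream(job, failed)
    return "blocked"
```

Triggering at job completion avoids both failure modes from above: checks never run early against partial data, and they never wait so long that bad data propagates before being caught.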
3.2.2 Heterogeneous Sources
We integrate diverse sources (HDFS, MySQL, OLAP) via open‑source connectors and custom SPI connectors, also handling DAL logical‑physical table mapping and read‑rate throttling.
3.2.3 Rule Coverage
With >30 k daily tasks, manual rule definition is infeasible. We provide 150+ built‑in rules and allow users to define custom rules for rapid iteration.
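Rule templates make this scale: a table is covered by instantiating library templates rather than hand‑writing each check. The two templates and stats fields below are made up for illustration; the real library holds 150+ templates.

```python
# Hypothetical built-in rule templates, each a predicate over table stats.
TEMPLATES = {
    "not_empty": lambda stats: stats["row_count"] > 0,
    "no_dups":   lambda stats: stats["distinct_keys"] == stats["row_count"],
}

def apply_builtin_rules(table_stats, rules=("not_empty", "no_dups")):
    """Evaluate a set of library templates against one table's stats.
    Users can register custom templates into TEMPLATES the same way."""
    return {name: TEMPLATES[name](table_stats) for name in rules}

result = apply_builtin_rules({"row_count": 10, "distinct_keys": 9})
# "no_dups" fails: 9 distinct keys across 10 rows implies a duplicate
```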
3.2.4 Intelligent Monitoring
Smart monitoring automatically determines anomalies without preset thresholds, decomposing time‑series into trend, seasonality, holidays, and noise.
Model insights include decreasing complaint trends, spikes around specific dates, and weekly patterns.
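A stripped‑down version of this decomposition, handling only the weekly seasonal component plus noise (the production model also fits trend and holidays), might look like:

```python
from statistics import mean, pstdev

def weekly_anomalies(series, z=3.0):
    """Split a daily series into a weekly seasonal component plus
    residual noise, then flag points whose residual exceeds z standard
    deviations -- no preset absolute threshold needed."""
    # Seasonal baseline: the mean of each weekday position (phase 0..6).
    seasonal = [mean(series[i::7]) for i in range(7)]
    residuals = [v - seasonal[i % 7] for i, v in enumerate(series)]
    sigma = pstdev(residuals) or 1.0
    return [i for i, r in enumerate(residuals) if abs(r) > z * sigma]

# Four flat weeks of a weekly pattern, with one injected spike on day 17.
series = [100, 120, 115, 130, 125, 90, 80] * 4
series[17] = 400
weekly_anomalies(series)  # flags only the spike, not the weekend dips
```

The point of the decomposition is exactly this: the regular Saturday/Sunday dips are absorbed by the seasonal component and never alert, while the genuine spike stands out in the residual.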
3.2.5 Alert Practices
Alerts are delivered via multiple channels (phone, SMS, email, WeChat) and pushed to mobile devices for rapid response, with on‑call teams notified through a duty system.
3.2.6 End‑to‑End Monitoring
We provide a full‑link view of the data processing pipeline, highlighting critical paths, progress, timeliness predictions (95 % accuracy), and current alerts.
04 Lessons Learned
Process first, tooling second, intelligence last for unattended operation.
Ensure true end‑to‑end link connectivity.
Leverage algorithms to auto‑learn data features and automate coverage.
Consider hidden long‑term maintenance costs when choosing solutions.
Adopt a “fix now” mindset to resolve current issues promptly.
05 Future Plans
We aim to enhance intelligent rule generation, achieve real‑time quality assurance for streaming‑batch convergence, and develop automated fault‑repair platforms to cover the entire incident lifecycle.