Design and Implementation of a Data Quality Platform for Large-Scale Data Processing

Bilibili built a scalable data‑quality platform that records metrics from heterogeneous sources, checks them against rules written in a rich DSL, and delivers each alert exactly once with root‑cause context. Event‑driven and time‑window scheduling, automated workflows, and intelligent monitoring keep data real‑time, accurate, and trustworthy at petabyte scale.

Bilibili Tech

Nov 1, 2022

Background

Data quality is a fundamental prerequisite for the effectiveness of big‑data applications. Bilibili’s data platform must provide real‑time, accurate, and trustworthy data to support its rapidly growing services, which serve hundreds of millions of daily active users and process tens of petabytes of data each day.

The platform’s reliability depends on stable data models, robust scheduling, efficient resource allocation, and seamless collaboration among data model teams, quality platforms, scheduling teams, compute engines, and storage/search services.

Goals

The team aims to quickly discover and fix data quality issues through three sub‑goals: process‑driven operations, automation, and intelligence.

Process‑driven – Define standard operating procedures (SOPs) for high‑risk scenarios to guide operators and monitoring.

Automation – Industrialize manual operations using automated workflows.

Intelligence – Apply intelligent methods to handle large‑scale, diverse data and reduce reliance on manual experience.

Data Quality Theory

The quality dimensions are defined as:

Completeness – whether all expected rows are present.

Uniqueness – detection of duplicate records at full‑load or incremental granularity.

Timeliness – freshness of data for both real‑time and batch use cases.

Accuracy – logical correctness of values (e.g., well‑formed ID numbers, ages consistent with related fields).

Consistency – cross‑source agreement (e.g., sum(A.b where A.c='Shanghai') = sum(B.b)).
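
As a rough illustration of how three of these dimensions can be expressed as queries, the sketch below runs completeness, uniqueness, and consistency checks against an in‑memory SQLite database. The tables A and B mirror the consistency example above; the sample rows, the EXPECTED_ROWS threshold, and the scalar helper are assumptions made for the example, not part of the platform.

```python
import sqlite3

# In-memory stand-in for two sources; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, b REAL, c TEXT);
    CREATE TABLE B (id INTEGER, b REAL);
    INSERT INTO A VALUES (1, 10.0, 'Shanghai'), (2, 5.0, 'Shanghai'),
                         (2, 5.0, 'Shanghai'), (3, 7.0, 'Beijing');
    INSERT INTO B VALUES (1, 10.0), (2, 10.0);
""")

def scalar(sql):
    """Run a query that returns a single value (hypothetical helper)."""
    return conn.execute(sql).fetchone()[0]

EXPECTED_ROWS = 4  # assumed expectation for this sample load

# Completeness: are all expected rows present?
completeness_ok = scalar("SELECT COUNT(*) FROM A") >= EXPECTED_ROWS

# Uniqueness: how many rows repeat an id that already appeared?
duplicate_ids = scalar("SELECT COUNT(*) - COUNT(DISTINCT id) FROM A")

# Consistency: sum(A.b where A.c='Shanghai') = sum(B.b)
sum_a = scalar("SELECT SUM(b) FROM A WHERE c = 'Shanghai'")
sum_b = scalar("SELECT SUM(b) FROM B")
consistent = abs(sum_a - sum_b) < 1e-9

print(completeness_ok, duplicate_ids, consistent)
```

Here the duplicate row with id 2 is counted by the uniqueness check while the consistency check still passes, which shows why the dimensions are tracked separately.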

Data Quality Model

The workflow is split into three stages:

Recording – Capture quality features (e.g., table row counts) from heterogeneous sources such as Hive, MySQL, Kafka, Hudi, Iceberg, ClickHouse, etc. Features must be quantifiable and expressible in a SQL‑like language.

Checking – Evaluate whether a metric satisfies alert conditions (e.g., M<100). A rich DSL supports operators like max, min, avg, arithmetic, and composite expressions (e.g., ratio of two metrics < 30%).

Alerting – Deliver alerts exactly once to the responsible parties, avoiding duplicate or missed notifications. The system integrates root‑cause analysis to provide actionable context.
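
The checking and alerting stages can be sketched as follows. This is a minimal illustration, not the platform's DSL: rules are plain Python callables, the metric names are invented, and deduplication uses an in‑memory set keyed by (rule, partition). A production system would need durable state to approach exactly‑once delivery across restarts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Metrics = Dict[str, float]

@dataclass(frozen=True)
class Rule:
    name: str
    violated: Callable[[Metrics], bool]  # True when the alert condition holds

class Alerter:
    """Sends at most one alert per (rule, partition) pair (in-memory only)."""

    def __init__(self) -> None:
        self._sent: Set[Tuple[str, str]] = set()
        self.outbox: List[str] = []  # stands in for a real notification channel

    def notify(self, rule: Rule, partition: str) -> bool:
        key = (rule.name, partition)
        if key in self._sent:
            return False  # already delivered, suppress the duplicate
        self._sent.add(key)
        self.outbox.append(f"[{partition}] rule '{rule.name}' violated")
        return True

def check(rules: List[Rule], metrics: Metrics, partition: str,
          alerter: Alerter) -> List[str]:
    """Evaluate every rule and return the names of rules that newly alerted."""
    fired = []
    for rule in rules:
        if rule.violated(metrics) and alerter.notify(rule, partition):
            fired.append(rule.name)
    return fired

rules = [
    # Single-metric condition, in the spirit of M < 100.
    Rule("row_count_below_100", lambda m: m["row_count"] < 100),
    # Composite condition: ratio of two metrics < 30%.
    Rule("day_over_day_ratio_below_30pct",
         lambda m: m["row_count"] / m["yesterday_row_count"] < 0.30),
]

alerter = Alerter()
metrics = {"row_count": 80, "yesterday_row_count": 1000}
first_run = check(rules, metrics, "2022-11-01", alerter)
second_run = check(rules, metrics, "2022-11-01", alerter)  # a retry of the same check
print(first_run, second_run, len(alerter.outbox))
```

The retry evaluates the same violations but produces no new notifications, which is the behavior the alerting stage aims for.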

Supported Data Sources

The platform currently connects to Hive, MySQL, Iceberg, ClickHouse, Kafka, and Hudi, using open‑source connectors and service provider interface (SPI) extensions to add further sources. When direct connectivity is impossible, a push‑gateway allows clients to push quality metrics.

Root Cause Analysis Pipeline

1. Build a task‑level dependency graph from job traces.

2. Enrich the graph with metrics, log events, and deployment records to form an event‑causality graph.

3. Apply a weighted PageRank‑like algorithm to rank potential root causes.

The event model includes fields such as type, event_time, domain, owner, resource_uri, and properties. Events are attached to jobs, and causal links between events are added or removed, with statements such as:

attach job(id=123) with modelrelease_event('update dimension...');
connect job(upstream).latency_event with job(downstream).latency_event;
connect job(upstream).completeness_event with job(downstream).consistency_event;
disconnect job(sparkversion=2.1).failed_event with spark(v=2.1).release_event;
connect job(sparkversion=3.1).failed_event with spark(v=3.1).release_event;
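
The ranking step can be illustrated with a generic weighted PageRank over a small causality graph. This is a sketch under stated assumptions rather than the algorithm Bilibili uses: edges point from an observed effect to a candidate cause, the event labels and edge weights are invented, and nodes without outgoing edges spread their rank uniformly.

```python
from collections import defaultdict

def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a weighted directed graph.

    edges: (effect, cause, weight) triples, so rank flows from each
    effect toward the events that may have caused it.
    """
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    out_weight = defaultdict(float)
    for src, _, w in edges:
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared by all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        nxt = {n: base for n in nodes}
        for src, dst, w in edges:
            nxt[dst] += damping * rank[src] * w / out_weight[src]
        rank = nxt
    return rank

# Hypothetical event-causality graph; weights express how strongly an
# effect is believed to depend on each candidate cause.
edges = [
    ("report_job.latency_event", "etl_job.latency_event", 1.0),
    ("etl_job.latency_event", "spark.release_event", 0.8),
    ("etl_job.latency_event", "etl_job.modelrelease_event", 0.2),
]

rank = weighted_pagerank(edges)
top_cause = max(rank, key=rank.get)
for node in sorted(rank, key=rank.get, reverse=True):
    print(f"{rank[node]:.3f}  {node}")
```

With these weights the Spark release event accumulates the most rank, because it sits at the end of the heaviest causal chain.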

Scheduling Strategies

Two strategies are employed:

Event‑driven – Trigger quality checks immediately after a data production job finishes, with optional circuit‑breakers to halt downstream processing if quality fails.

Time‑window – Used for near‑real‑time streaming data, requiring a highly available, low‑latency scheduler with elastic resource allocation.

The scheduler isolates different request classes into separate queues and can degrade gracefully under load.
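
A minimal sketch of the event‑driven strategy with a circuit breaker is shown below. The class names, the blocking flag, and the job names are assumptions made for the example; queue isolation and degradation under load are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class QualityCheck:
    name: str
    passes: Callable[[Dict[str, float]], bool]
    blocking: bool = False  # a failed blocking check trips the circuit breaker

@dataclass
class EventDrivenScheduler:
    checks: Dict[str, List[QualityCheck]] = field(default_factory=dict)
    downstream: Dict[str, List[str]] = field(default_factory=dict)
    triggered: List[str] = field(default_factory=list)
    halted: List[str] = field(default_factory=list)

    def on_job_finished(self, job: str, metrics: Dict[str, float]) -> List[str]:
        """Run the job's checks right after it finishes; return failed check names."""
        failed = [c for c in self.checks.get(job, []) if not c.passes(metrics)]
        if any(c.blocking for c in failed):
            # Circuit breaker: do not let downstream jobs consume bad data.
            self.halted.extend(self.downstream.get(job, []))
        else:
            self.triggered.extend(self.downstream.get(job, []))
        return [c.name for c in failed]

scheduler = EventDrivenScheduler(
    checks={"ods_orders": [
        QualityCheck("non_empty", lambda m: m["row_count"] > 0, blocking=True),
        QualityCheck("low_null_ratio", lambda m: m["null_ratio"] < 0.05),
    ]},
    downstream={"ods_orders": ["dwd_orders"]},
)

# A non-blocking failure is reported, but downstream still runs.
soft_failures = scheduler.on_job_finished(
    "ods_orders", {"row_count": 1000, "null_ratio": 0.10})
# A blocking failure halts downstream processing.
hard_failures = scheduler.on_job_finished(
    "ods_orders", {"row_count": 0, "null_ratio": 0.0})
print(soft_failures, hard_failures, scheduler.triggered, scheduler.halted)
```

Marking only some checks as blocking keeps the breaker optional, so a minor quality issue raises an alert without stalling the pipeline.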

Heterogeneous Source Handling

Connectors and SPI extensions enable access to diverse catalogs (e.g., MySQL, OLAP). Integration with traditional data access layers (DALs) and read‑rate throttling are also considered.
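
The article does not say how read‑rate throttling is implemented. One common approach is a token bucket, sketched here under that assumption; the class, its parameters, and the injected clock are all hypothetical.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` reads, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, n=1.0):
        """Take n tokens if available; return False when the caller should back off."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Demonstration with a fake clock: 2 reads per second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third read exceeds the burst
fake_now[0] = 0.5                                 # half a second later, one token is back
after_wait = [bucket.try_acquire() for _ in range(2)]
print(burst, after_wait)
```

A quality check that scans a production MySQL instance could call try_acquire before each batch read and sleep or reschedule when it returns False.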

Rule Coverage

More than 150 built‑in rules cover common quality checks. Users can define custom rules to meet evolving requirements.

Intelligent Monitoring

A time‑series‑based intelligent monitoring system automatically determines anomaly thresholds without manual tuning, decomposing signals into trend, seasonality, holidays, and noise.

Examples show decreasing complaint trends, spikes around specific dates, and weekly patterns, demonstrating the system’s ability to surface actionable insights.
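
To show why learned thresholds help, the sketch below derives a per‑weekday band from history instead of using one static limit. It is a deliberate simplification and not the platform's model: it handles only weekly seasonality and noise, ignores trend and holidays, and uses invented sample data.

```python
from statistics import mean, pstdev

def seasonal_band(history, period=7, k=3.0):
    """Learn a (lower, upper) bound for each phase of the seasonal cycle."""
    # Seasonal baseline: average of all observations at the same phase.
    baseline = [mean(history[p::period]) for p in range(period)]
    # Noise: spread of what remains after removing the baseline.
    residuals = [x - baseline[i % period] for i, x in enumerate(history)]
    sigma = pstdev(residuals)
    return [(b - k * sigma, b + k * sigma) for b in baseline]

def is_anomaly(value, index, band, period=7):
    """Flag a value that falls outside the band for its position in the cycle."""
    lower, upper = band[index % period]
    return not (lower <= value <= upper)

# Four weeks of a made-up daily metric with a weekend peak.
weeks = [
    [100, 102, 98, 101, 99, 150, 160],
    [101, 101, 99, 100, 100, 152, 158],
    [99, 103, 97, 102, 98, 148, 162],
    [100, 102, 98, 101, 99, 150, 160],
]
history = [x for week in weeks for x in week]
band = seasonal_band(history)

# The same value is normal on a Saturday but anomalous on a Monday.
monday_spike = is_anomaly(150, 28, band)     # index 28 is phase 0
saturday_normal = is_anomaly(150, 33, band)  # index 33 is phase 5
print(monday_spike, saturday_normal)
```

A single fixed threshold could not separate these two cases, which is the motivation for decomposing the signal before deciding what counts as abnormal.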

Alerting Practices

Alerts are delivered via multiple channels (phone, SMS, email, WeChat) and pushed to mobile devices for rapid response. Integration with on‑call rotation systems ensures timely notification.

Full‑Link Monitoring

A global view of the entire data processing pipeline highlights critical paths, predicts completion times, and visualizes active alerts, enabling operators to assess impact and prioritize remediation.

Experience and Lessons

Process first, tooling second, intelligence last; end‑to‑end link tracing, algorithmic feature learning, and cost‑aware design are essential for sustainable operations.

Future Plans

Continue advancing intelligent rule generation, real‑time quality solutions, and automated fault‑repair platforms.
