
How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Data Thinking Notes

Nov 2, 2023


Introduction

This article describes how Bilibili built and operates its data quality assurance system. It focuses on data warehouse and modeling methodology, explains the work of the Bilibili data warehouse platform team during warehouse building and modeling, and presents the results achieved in quality assurance.

Background and Goals

Architecture

Case Study

Future Outlook

Q&A

Background and Goals

This section first introduces the background and objectives of Bilibili's data quality assurance work.

The history of Bilibili's data construction can be divided into four stages:

Database stage : Early startup phase; focus on test case design, data correctness verification, and database monitoring/tuning.

Data warehouse stage : Business growth drives OLAP needs; emphasis on data completeness, accuracy, consistency, and timeliness.

Data platform stage : Massive data volume leads to adoption of Hadoop and open‑source components; focus on architecture quality, link availability, and diverse processing pipelines.

Mid‑platform stage : Serves the previous stages while addressing diverse business needs and data intelligence; inherits the quality measures of the earlier stages and extends them.

Understanding these stages is essential for continuous data quality improvement.

The current Bilibili data architecture consists of four layers from bottom to top: data sources, data platform, data mid‑platform, and data applications.

Data sources include account systems, event‑tracking systems, CRM, and third‑party services, feeding continuous data into the warehouse via the platform with both batch and real‑time capabilities.

In some scenarios, a centralized data hub is built, then split into thematic domains such as user, transaction, content, and community.

Data applications are divided into PC and mobile dashboards, focusing on growth, operation, and content metrics. The data pipeline has become complex, requiring multi‑layer, multi‑component, and cross‑team collaboration for quality assurance.

Feedback from partner teams highlights common problems, confirming rising demand for data quality assurance as business expands.

Dashboard pages sometimes fail to display data and give users no visibility into why, which hurts the user experience.

Analysts may see zero values for certain metrics, questioning data reliability.

Developers receive frequent night‑time alarm calls, disrupting normal work.

From these observations, three core pain points are abstracted:

Data consumers need timely, accurate, and trustworthy data; they expect rapid recovery when incidents occur.

Data producers must prioritize critical data based on user demand and provide differentiated assurance for long‑tail data.

Pipeline owners require clear assurance requirements and response plans for extreme cases.

The overall goal is to continuously improve data quality, reduce incident correction cost, lower data‑use risk, and increase business service satisfaction.

Architecture

Based on the goals, a quality data warehouse is built as the foundation, establishing three core capabilities: a complete quality assurance system, digital‑driven continuous optimization, and efficient incident handling.

1. Quality Data Warehouse Construction

Relevant assurance services are introduced, unifying data warehouse construction and leveraging the mid‑platform to quickly build warehouse architecture.

After the warehouse is built, quality data is used to describe issues, support decision‑making, and enable daily detection and analysis so that problems can be eliminated pre‑emptively.

The architecture has four layers: data sources, warehouse construction, analysis project construction, and final applications. Data source layer includes alarm services, baseline services, DQC services, lineage data, event management, and on‑call systems, all ingested into the warehouse and layered.

Three layers within the warehouse – detail, summary, and high‑level summary – abstract quality metrics such as exception lists and alarms, providing dashboards like quality operation, real‑time assurance, and alarm attribution.
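The layering described above can be illustrated with a minimal Python sketch. It assumes a detail layer of individual alarm events, a summary layer of daily counts per source, and a high-level summary of cause shares for an attribution dashboard. The `AlarmEvent` fields, the source names, and the cause categories are hypothetical and are not taken from Bilibili's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmEvent:
    """One detail-layer row: a single alarm raised by a source service."""
    day: str     # e.g. "2023-11-01"
    source: str  # hypothetical source name, e.g. "dqc" or "baseline"
    task: str
    cause: str   # hypothetical attributed root-cause category


def summarize_by_day(events):
    """Summary layer: number of alarms per (day, source)."""
    return dict(Counter((e.day, e.source) for e in events))


def attribute_causes(events):
    """High-level summary: share of alarms per cause category."""
    total = len(events)
    if total == 0:
        return {}
    counts = Counter(e.cause for e in events)
    return {cause: n / total for cause, n in counts.items()}
```

In a real warehouse these aggregations would more likely be SQL jobs over layered tables; the sketch only shows how detail rows roll up into the metrics a dashboard would read.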

2. Complete Quality Assurance System

The first core capability ensures that data meets user requirements. All parties share responsibility for quality, supported by monitoring standards, rule libraries, and sustainable improvement plans.

It is broken into three parts:

Monitoring system : Data asset grading triggers checkpoint validation; quality score mechanisms measure effectiveness; rules cover completeness, consistency, validity, timeliness, and extend to tracking, integration, processing, assembly, export, and API services. An incident attribution knowledge base links alarms to root causes.

Cross‑team collaborative assurance : Coordination with upstream and downstream teams to define SLA standards and create cross‑team mechanisms.

Daily operation : Night‑time on‑call processes include urgent follow‑up, root‑cause localization, data recovery, and impact notification. When alarm monitoring fires, the team takes immediate mitigation, notifies stakeholders, hands over to the responsible teams, and archives the incident after recovery.
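The on‑call flow in the daily operation item can be sketched as a simple linear state machine. This is only an illustration under the assumption that each incident passes through the steps in a fixed order; the `Stage` names and the `Incident` class are hypothetical and do not describe Bilibili's on‑call tooling.

```python
from enum import Enum


class Stage(Enum):
    ALARMED = 1      # alarm monitoring fired
    MITIGATED = 2    # immediate mitigation taken
    NOTIFIED = 3     # stakeholders informed of the impact
    HANDED_OVER = 4  # passed to the responsible team
    RECOVERED = 5    # data recovered
    ARCHIVED = 6     # incident archived for later review


# Each stage may only advance to the one that follows it.
_STAGES = list(Stage)
NEXT_STAGE = dict(zip(_STAGES, _STAGES[1:]))


class Incident:
    def __init__(self, task):
        self.task = task
        self.stage = Stage.ALARMED
        self.history = [Stage.ALARMED]

    def advance(self):
        """Move to the next stage; archived incidents cannot advance."""
        nxt = NEXT_STAGE.get(self.stage)
        if nxt is None:
            raise ValueError("incident is already archived")
        self.stage = nxt
        self.history.append(nxt)
        return nxt
```

Recording the stage history makes it possible to measure, for example, how long incidents wait between notification and handover, which is the kind of figure a review stage would need.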

3. Digital‑Driven Continuous Optimization

The second capability builds measurement indicators, analyzes current state, discovers problems from data, proposes solutions, and continuously tracks optimization impact, using digital metrics to drive quality assurance.

4. Efficient Incident Handling

The third capability focuses on rapid response, minimizing impact, and achieving quick recovery to keep users unaware of issues.

From a data development perspective, tools such as mechanical risk diagnosis, alarm optimization, fault recovery, and rule configuration are provided. From the service side, one‑click recovery links, hierarchical full‑link assurance, and unified on‑call mechanisms are built.

Case Study

A practical Bilibili case is presented.

Typical data development flow includes task deployment, daily batch monitoring, and possible alarm triggering. When problems occur, response and data recovery are performed, and issues are archived.

Development stage: Over 5,000 core tasks exist, but monitoring coverage is below 50%; approval rules and release processes are incomplete.

On‑call stage: SOP for on‑call is imperfect, response efficiency is low, night‑time fault sync is lacking, and night‑time on‑call rate reaches ~50% (3‑4 nights per week), leading to high human cost.

Review stage: Many issues are not caused by warehouse operations; recurring assurance problems are identified.

Key problems summarized:

Long data links with many components make assurance difficult.

Lack of clear metrics to evaluate assurance effectiveness.

Insufficient standardized mechanisms; coordination across data links is needed.

Four major assurance goals are defined:

Accurately identify core scenarios and support digital measurement.

Ensure data meets four basic principles (completeness, accuracy, consistency, and timeliness) while also satisfying customized user needs.

Guarantee quality throughout the entire data lifecycle (pre‑, during, post‑processing).

Summarize methodology and advance tool capabilities for prevention, response, processing, recovery, and review.

Based on the goals, three core capabilities are further detailed:

Quality data warehouse construction and layered data ingestion.

Monitoring system with checkpoint validation, quality scoring, and rule libraries.

Cross‑team collaboration and daily operation processes.

Metrics such as quality score (100 points) are broken into dimensions like completeness, consistency, accuracy, alarm response, coverage, job stability, timeliness, and link assurance rate.
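As a rough illustration of how a 100‑point score might be composed from these dimensions, the following Python sketch computes a weighted sum of per‑dimension pass rates. The equal default weights are an assumption made for the example; the article does not state how Bilibili weights each dimension.

```python
DIMENSIONS = (
    "completeness",
    "consistency",
    "accuracy",
    "alarm_response",
    "coverage",
    "job_stability",
    "timeliness",
    "link_assurance_rate",
)


def quality_score(rates, weights=None):
    """Return a 0-100 score from per-dimension rates in [0, 1].

    `weights` maps each dimension to its share of the 100 points.
    Equal weights are an assumption, not the documented scheme.
    """
    if weights is None:
        weights = {d: 100 / len(DIMENSIONS) for d in DIMENSIONS}
    if set(weights) != set(DIMENSIONS):
        raise ValueError("weights must cover exactly the known dimensions")
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    score = 0.0
    for dim in DIMENSIONS:
        rate = rates[dim]
        if not 0 <= rate <= 1:
            raise ValueError(f"rate for {dim} must be between 0 and 1")
        score += weights[dim] * rate
    return round(score, 2)
```

Keeping the weights explicit lets a team shift points toward the dimensions that matter most for a given asset grade without changing the scoring code.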

Continuous improvement results: incident count dropped 50% over three quarters, capture rate near 100%, night‑time on‑call days reduced 55%, baseline breakage decreased, night‑time incidents fell 59%, and night‑time duration decreased 86%.

Future Outlook

Future work will focus on two directions:

Continuously expand assurance coverage, enrich strategies, and drive data‑driven improvements while maintaining control over existing assets.

Combine theory with practice to iterate tool capabilities and improve communication mechanisms.

In the long term, the aim is to move from manual operations to information‑based operations, and eventually to intelligent quality assurance that is fully described, measurable, and easy to operate.

Q&A

Q: Data quality scores consist of eight rules, but each table has different rules, counts, and importance. How to align all tables?

A: Bilibili uses five grading levels, focusing on online data such as BOSS dashboards and company‑level analysis products. For each level, specific standards are set; high‑priority scenarios receive comprehensive rules (uniqueness, row count, upstream tracking consistency, timeliness), while lower levels receive basic rules.
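The idea of attaching different rule sets to different grades can be sketched as follows. The grade numbering (1 as the highest priority), the cut‑off between comprehensive and basic rules, and the contents of the basic set are all assumptions for illustration; only the four rule names in the comprehensive set come from the answer above.

```python
COMPREHENSIVE = ("uniqueness", "row_count", "upstream_consistency", "timeliness")
BASIC = ("row_count", "timeliness")  # assumed contents of the "basic" set


def rules_for_grade(grade):
    """Pick a rule set for a grade from 1 (highest) to 5 (lowest)."""
    if grade not in range(1, 6):
        raise ValueError("grade must be between 1 and 5")
    # Assumption: the top two grades get the comprehensive set.
    return COMPREHENSIVE if grade <= 2 else BASIC


def run_checks(grade, rows, key, min_rows, upstream_count,
               finished_hour, deadline_hour):
    """Run the rules selected for this grade and return pass/fail per rule."""
    checks = {
        "uniqueness": lambda: len({r[key] for r in rows}) == len(rows),
        "row_count": lambda: len(rows) >= min_rows,
        "upstream_consistency": lambda: len(rows) == upstream_count,
        "timeliness": lambda: finished_hour <= deadline_hour,
    }
    return {name: checks[name]() for name in rules_for_grade(grade)}
```

For example, calling `run_checks` for a grade‑1 table whose key column contains a duplicate would report `uniqueness` as failed, while the same table at grade 5 would not be checked for uniqueness at all.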

Q: A real‑time task runs on a Flink platform. How to evaluate overall data quality across different platforms?

A: The approach synchronizes raw data from the e‑commerce platform into the quality warehouse, standardizes formats, and reuses the same pipeline to generate quality scores and to‑do items, enabling unified assessment.
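One way to picture the format standardization step is a set of small adapters that convert platform‑specific check results into one common record, so that a single scoring pipeline can consume them. The raw record shapes and field names below are hypothetical; the article does not describe the actual formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityRecord:
    """Common shape that every platform's results are converted into."""
    platform: str
    task: str
    rule: str
    passed: bool


def from_batch(raw):
    # Hypothetical batch shape: {"job": ..., "check": ..., "status": "OK" | "FAIL"}
    return QualityRecord("batch", raw["job"], raw["check"], raw["status"] == "OK")


def from_streaming(raw):
    # Hypothetical streaming shape: {"job_name": ..., "rule_id": ..., "violations": int}
    return QualityRecord(
        "streaming", raw["job_name"], raw["rule_id"], raw["violations"] == 0
    )


def pass_rate(records):
    """Share of passed checks across all platforms, or None if there are none."""
    if not records:
        return None
    return sum(r.passed for r in records) / len(records)
```

Once every platform emits the same record type, downstream steps such as scoring and to‑do generation do not need to know where a result came from.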

Q: What are the most common problems during post‑mortems? Any acceleration tools?

A: Early stages suffer from alarm explosion; the focus is on reducing noise while keeping effective rules. A knowledge base for incidents is being built to distribute alarms to relevant teams, easing on‑call pressure and speeding up root‑cause analysis.
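A minimal sketch of the two ideas in this answer, noise reduction and knowledge‑base routing, might look like the following. It assumes alarms are dictionaries with `task`, `rule`, and `minute` fields, that repeats of the same task and rule within a window count as noise, and that the knowledge base is a simple mapping from rule to owning team. All of these names, the 30‑minute window, and the default team are hypothetical.

```python
def suppress_duplicates(alarms, window_minutes=30):
    """Keep the first alarm per (task, rule); drop repeats inside the window.

    The window is measured from the last alarm that was kept for that key.
    """
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["minute"]):
        key = (alarm["task"], alarm["rule"])
        previous = last_kept.get(key)
        if previous is None or alarm["minute"] - previous >= window_minutes:
            kept.append(alarm)
            last_kept[key] = alarm["minute"]
    return kept


def route(alarm, knowledge_base, default_team="data-warehouse-oncall"):
    """Look up the owning team for an alarm's rule in the knowledge base."""
    return knowledge_base.get(alarm["rule"], default_team)
```

Suppression keeps the rules themselves intact and only trims repeated notifications, which matches the stated goal of reducing noise while keeping effective rules.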

Thank you for reading.
