
Design and Optimization of Bilibili's Real-Time Data Quality Monitoring Platform

This article details the background, architecture, challenges, and iterative improvements of Bilibili's real-time data quality monitoring platform, covering offline and streaming DQC, resource-efficient Flink designs, InfluxDB proxy integration, CQ table handling, operational safeguards, and future engineering plans.

Big Data Technology & Architecture

Jun 21, 2023


Background – Data quality is a prerequisite for reliable big‑data applications at Bilibili; the platform must provide real‑time, accurate data trusted by all business units.

Quality Platform Components – The DQC consists of data collection, inspection, and alerting, similar to traditional monitoring systems but tailored for big‑data workloads.

First‑Version Real‑Time DQC – The first version was built on Flink: each rule ran as its own Flink job and wrote its results to MySQL. This led to low resource utilization, high network bandwidth consumption (the same topic was read once per rule), and instability caused by frequent restarts.

Design Goals for the New Solution

Start once without restarts to avoid resource contention.

Allow one task to serve multiple low‑traffic topics.

Perform multiple rule checks per consumption to reduce duplicate reads.
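The third goal, checking several rules per consumption, can be illustrated with a minimal sketch. It is not the platform's Flink code: the `Rule` type, the rule IDs, and the sample records are hypothetical and only show how each record can be read once and tested against every rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    rule_id: str                     # hypothetical identifier
    check: Callable[[Record], bool]  # returns True when the record passes


def evaluate(records: Iterable[Record], rules: List[Rule]) -> Dict[str, int]:
    """Count failures per rule while reading each record exactly once."""
    failures = {rule.rule_id: 0 for rule in rules}
    for record in records:           # single pass over the consumed data
        for rule in rules:           # every rule sees the same record
            if not rule.check(record):
                failures[rule.rule_id] += 1
    return failures


# Hypothetical rules and records, for illustration only.
rules = [
    Rule("mid_not_null", lambda r: r.get("mid") is not None),
    Rule("platform_known", lambda r: r.get("platform") in {"ios", "android", "web"}),
]
records = [
    {"mid": 1, "platform": "ios"},
    {"mid": None, "platform": "web"},
    {"mid": 3, "platform": "tv"},
]
result = evaluate(records, rules)
print(result)
```

With one job per rule, the same three records would be consumed twice; here they are consumed once regardless of how many rules are attached.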

New Architecture – The new design classifies topics into three categories (large, medium, small) and handles them separately:

Small/medium topics: data is fully ingested into an InfluxDB full‑table, then aggregated via Continuous Query (CQ) into a CQ table.

Large topics: processed directly in Flink, aggregating results into the CQ table.

Dynamic topic and rule management is achieved through configuration‑center updates without restarting Flink jobs.
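One way to picture restart-free rule management is a registry that a running task consults for every record and that is swapped in place when the configuration center pushes a new version. The sketch below is an assumption about how such a component could look; `RuleRegistry`, its methods, and the version numbering are hypothetical and are not taken from the platform.

```python
import threading
from typing import Dict, List


class RuleRegistry:
    """Holds the active topic-to-rules mapping and swaps it on config updates."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._rules_by_topic: Dict[str, List[str]] = {}

    def apply_update(self, version: int, rules_by_topic: Dict[str, List[str]]) -> bool:
        """Install a newer snapshot; ignore stale or duplicate versions."""
        with self._lock:
            if version <= self._version:
                return False
            self._version = version
            # Copy the lists so later changes by the caller cannot leak in.
            self._rules_by_topic = {t: list(r) for t, r in rules_by_topic.items()}
            return True

    def rules_for(self, topic: str) -> List[str]:
        """Return the rule IDs currently attached to a topic."""
        with self._lock:
            return list(self._rules_by_topic.get(topic, []))


registry = RuleRegistry()
registry.apply_update(1, {"topic_a": ["r1"]})
before = registry.rules_for("topic_b")   # topic_b is not monitored yet
registry.apply_update(2, {"topic_a": ["r1", "r2"], "topic_b": ["r3"]})
after = registry.rules_for("topic_b")    # picked up without a restart
stale = registry.apply_update(1, {})     # an out-of-order push is rejected
```

The version check matters because configuration pushes can arrive out of order; without it, a delayed older snapshot could silently remove rules.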

InfluxDB Proxy – A proxy service compresses and batches writes, performs dual‑write for consistency, retries failures, and selects optimal backend nodes for reads. Optimizations reduced request size and eliminated redundant tags, cutting network I/O by over 90%.
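The batching and compression step can be sketched as follows. This is an illustration under stated assumptions rather than the proxy's code: the measurement name `dqc_full`, the batch size of 1000, and the use of gzip are all hypothetical choices, and the HTTP write, dual-write, and retry logic are left out.

```python
import gzip
from typing import Iterable, Iterator, List


def batch(lines: Iterable[str], max_lines: int) -> Iterator[List[str]]:
    """Group line-protocol strings into batches of at most max_lines."""
    buf: List[str] = []
    for line in lines:
        buf.append(line)
        if len(buf) >= max_lines:
            yield buf
            buf = []
    if buf:                      # flush the final partial batch
        yield buf


def encode_batch(lines: List[str]) -> bytes:
    """Join one batch with newlines and gzip-compress it into one request body."""
    return gzip.compress("\n".join(lines).encode("utf-8"))


# Hypothetical points; a real proxy would receive these from Flink sinks.
points = [f"dqc_full,platform=ios record_num={i}i" for i in range(2500)]
payloads = [encode_batch(b) for b in batch(points, max_lines=1000)]

raw_size = sum(len(p) + 1 for p in points)        # bytes if sent uncompressed
compressed_size = sum(len(p) for p in payloads)   # bytes actually sent
print(len(payloads), raw_size, compressed_size)
```

Batching reduces the number of requests, and compression works well here because consecutive points repeat the same measurement and tag text.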

Data Model

The full table stores time, subtask, sinknum, tags, record_num, and field values.

Tags are chosen for low cardinality (e.g., platform), while high‑cardinality fields (e.g., mid) remain as fields.

The CQ table holds time, rule_id, and the aggregated value, with a 14‑day TTL.
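To make the tag/field split concrete, the sketch below encodes one full-table point in InfluxDB line protocol. The measurement name `dqc_full` and the choice to store subtask and sinknum as tags are assumptions made for illustration; the source lists these columns but does not say how each is stored. InfluxDB treats points with the same measurement, tag set, and timestamp as the same point, so distinguishing columns of this kind are typically needed to keep parallel writers from overwriting each other.

```python
from typing import Dict, Union

FieldValue = Union[int, float, str, bool]


def _escape_key(value: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys/values and field keys."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value: FieldValue) -> str:
    """Render a field value using line-protocol type conventions."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"               # trailing 'i' marks an integer field
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, FieldValue], ts_ns: int) -> str:
    """Build one line-protocol line: measurement[,tags] fields timestamp."""
    tag_part = ",".join(f"{_escape_key(k)}={_escape_key(v)}"
                        for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape_key(k)}={_format_field(v)}"
                          for k, v in fields.items())
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {ts_ns}"


line = to_line(
    "dqc_full",                                            # hypothetical measurement
    {"platform": "ios", "subtask": "3", "sinknum": "1"},   # low-cardinality tags
    {"record_num": 1, "mid": 12345},                       # high-cardinality mid stays a field
    1687305600000000000,
)
print(line)
```

Keeping `mid` as a field avoids creating one series per user, which is the usual reason high-cardinality values are kept out of tags.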

Optimizations for Data Inflation – The flattened output format was changed to <groupKey, RuleIdList, Data>, so rules that share a group key share one output record, and a map‑reduce aggregation pipeline was added. Together these significantly lowered the data‑inflation factor.
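The effect of the <groupKey, RuleIdList, Data> format can be illustrated with a small comparison. This is one possible reading of the optimization, not the platform's implementation: the rule definitions, the `group_by` attribute, and the use of a record count as Data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rules: r1 and r2 group by the same fields, r3 by a finer key.
rules = [
    {"rule_id": "r1", "group_by": ("platform",)},
    {"rule_id": "r2", "group_by": ("platform",)},
    {"rule_id": "r3", "group_by": ("platform", "version")},
]
records = [
    {"platform": "ios", "version": "v1"},
    {"platform": "ios", "version": "v1"},
    {"platform": "android", "version": "v1"},
    {"platform": "ios", "version": "v2"},
]


def naive(records: List[dict], rules: List[dict]) -> List[Tuple]:
    """Emit one output row per (record, rule) pair: inflation = number of rules."""
    return [(rule["rule_id"], tuple(rec[f] for f in rule["group_by"]), 1)
            for rec in records for rule in rules]


def flattened(records: List[dict], rules: List[dict]) -> List[Tuple]:
    """Emit <groupKey, RuleIdList, Data> rows after a map-side pre-aggregation."""
    # Rules with identical group-by fields share one output row.
    rules_by_fields: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for rule in rules:
        rules_by_fields[rule["group_by"]].append(rule["rule_id"])

    # Map side: combine counts locally before anything is shuffled or written.
    partial: Dict[Tuple, int] = defaultdict(int)
    for rec in records:
        for fields, rule_ids in rules_by_fields.items():
            group_key = tuple(rec[f] for f in fields)
            partial[(group_key, tuple(rule_ids))] += 1

    return [(gk, list(rids), count) for (gk, rids), count in partial.items()]


naive_out = naive(records, rules)
flat_out = flattened(records, rules)
print(len(naive_out), len(flat_out))
```

In this toy input, four records and three rules produce twelve rows in the naive form and five in the flattened form; the saving grows with the number of rules that share a group key and with how many records fall into each group.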

Operational Safeguards – Monitoring covers Flink back‑pressure, InfluxDB cluster health, and sequence count, and the platform includes handling for traffic spikes, crashes, and duplicate consumption. Automatic alerts trigger resource reallocation for high‑priority (P0) topics.

Future Work – Focus on further engineering automation, refined task‑topic placement, horizontal scaling of InfluxDB, and minimizing impact of quality monitoring on normal production workloads.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
