
How to Build a Real‑Time Data Quality Monitoring System with Flink

This article outlines a comprehensive approach to monitoring and ensuring the accuracy and timeliness of real‑time data streams, detailing background challenges, solution design, implementation steps using Flink and automated testing, alert handling procedures, and future improvement plans.

Youzan Coder

1. Monitoring Background

Real‑time data must be accurate and timely for merchants to adjust strategies, improve conversion, and enhance user experience. While pre‑deployment testing catches many issues, production can still suffer from component upgrades, resource constraints, or hardware failures, requiring an online monitoring system that detects problems within minutes.

2. Monitoring Solution Overview

At Youzan, most real‑time metrics are calculated by Flink and stored in Druid or TiDB. Three monitoring strategies are proposed:

Real‑time task input‑output verification: use Flink SQL local debugging to validate task logic before release.

Upstream‑downstream data comparison: compare raw Kafka data (ingested from binlogs) with the aggregated result tables, covering the full data flow.

Yesterday's real‑time vs. yesterday's offline comparison: after both real‑time and offline data are fully persisted, compare them via HTTP/Dubbo interfaces to verify consistency.

Each strategy addresses different dimensions of data quality (accuracy vs. timeliness) and has its own advantages, limitations, and trigger conditions.

3. Implementation

All three strategies are implemented on an automated interface testing platform, where assertions on API responses generate alerts on failure.

3.1 Flink Local Debugging

The platform supports three data‑validation methods: manual entry, file upload, and random Kafka reads. It can handle tasks with multiple sources and sinks. The workflow is:

Obtain the Kafka topic from the Flink job’s source configuration.

Retrieve a sample message from the topic via the platform’s Kafka manager.

Run the local debug with the sample message, capture the task’s output, and create an assertion case in the testing platform.

When the task changes, the platform can automatically re‑run the case.
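The workflow above can be sketched as follows. Everything here is a hypothetical stand-in for the testing platform's internals: `sample_kafka_message`, `run_flink_local_debug`, the topic name, and the message fields are illustrative, not Youzan's actual APIs.

```python
def sample_kafka_message(topic: str) -> dict:
    """Stand-in for the platform's Kafka manager: fetch one raw message."""
    return {"order_id": "1001", "shop_id": "s-42", "amount": 35.0}

def run_flink_local_debug(message: dict) -> dict:
    """Stand-in for Flink SQL local debugging: apply the task's logic."""
    return {"shop_id": message["shop_id"], "total_amount": message["amount"]}

def build_assertion_case(topic: str) -> dict:
    """Capture input and output so the platform can re-run the case later."""
    sample = sample_kafka_message(topic)
    output = run_flink_local_debug(sample)
    return {"input": sample, "expected": output}

case = build_assertion_case("order_binlog")
# When the task changes, the platform replays case["input"] and asserts
# that the new output still equals case["expected"].
```

The key design point is that the assertion case freezes both sides: the sampled input and the output observed at case-creation time, so any later logic change that alters the output fails the replay.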

3.2 Upstream‑Downstream Data Comparison

Raw binlog data is sent to Kafka, stored, and then aggregated according to metric definitions. The aggregated results are compared with the developer‑produced result tables. The comparison runs every 10 seconds for ten cycles; if all ten results differ, an alert is triggered, detecting accuracy or timeliness issues within roughly 100 seconds.
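The ten-cycle comparison can be expressed as a small retry loop. This is a minimal sketch under assumptions: `fetch_upstream` and `fetch_downstream` are hypothetical callables returning the metric aggregated from raw Kafka data and the value in the developer-produced result table, respectively.

```python
import time

def compare_with_retries(fetch_upstream, fetch_downstream,
                         cycles: int = 10, interval_s: float = 10.0) -> bool:
    """Return True (fire an alert) only if all `cycles` comparisons disagree."""
    for i in range(cycles):
        if fetch_upstream() == fetch_downstream():
            return False          # values converged; no alert
        if i < cycles - 1:
            time.sleep(interval_s)
    return True                   # mismatched for ~100 s; raise the alert
```

Requiring all ten comparisons to fail before alerting tolerates transient lag (a slow checkpoint or a brief backlog) while still surfacing a genuine accuracy or timeliness problem within roughly 100 seconds.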

3.3 Yesterday's Real‑Time vs. Yesterday's Offline Comparison

Real‑time metrics are stored in Druid, offline metrics in Kylin. Both use HyperLogLog for approximate distinct counts. The comparison checks that values fall within an acceptable delta range, illustrated by a sample screenshot.
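A delta-range check might look like the sketch below. The relative tolerance fits HyperLogLog counts, whose error is itself proportional (typically around 1-2% at common precision settings); the 2% default here is an assumption for illustration, not Youzan's actual threshold.

```python
def within_delta(realtime: float, offline: float, tolerance: float = 0.02) -> bool:
    """True if the real-time value is within `tolerance` (relative) of offline."""
    if offline == 0:
        return realtime == 0       # avoid dividing by zero on empty metrics
    return abs(realtime - offline) / abs(offline) <= tolerance
```

Because both stores use approximate distinct counts, an exact-equality assertion would flap constantly; the delta range is what makes this comparison usable as an alert condition.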

4. Alert Handling

When an alert fires, the troubleshooting workflow includes:

Inspecting and analyzing detailed data.

Replaying data to recover the correct state.

For data‑delay alerts, Kafka backlog is examined. Detailed data is persisted in a database and displayed via API calls, reducing the time needed to fetch raw records. Data‑replay scripts and documentation enable developers to restore real‑time data within about half an hour.
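For the Kafka backlog check, consumer lag is simply the log-end offset minus the committed consumer offset, per partition. The dict shapes below are illustrative; in practice these numbers come from the Kafka admin client or `kafka-consumer-groups.sh`.

```python
def consumer_backlog(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log-end offset minus committed consumer offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

lag = consumer_backlog({0: 1500, 1: 1480}, {0: 1500, 1: 1200})
total = sum(lag.values())   # messages the consumer group is behind overall
```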

5. Effectiveness and Future Outlook

Currently, the upstream‑downstream comparison covers roughly 30% of core metrics (store‑level and product‑level). Coverage for traffic and add‑to‑cart metrics is pending. Many real‑time jobs still run as Flink JARs, limiting the applicability of local debugging; future work will migrate these to Flink SQL.

The online monitoring system fills a critical gap in real‑time data observability, providing early detection of quality issues, supporting fault‑drill exercises, and enabling rapid remediation.

Finally, the Youzan data testing team is hiring, with AI testing initiatives on the horizon.

Tags: Flink, stream processing, data quality, alerting, real-time monitoring
Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.