Online Monitoring Practices for Offline and Real-Time Data at Youzan
Youzan's Data Report Center monitors its offline batch and real-time data pipelines with accuracy and timeliness rules: cross-table checks, upstream-downstream comparisons, and scheduled alert jobs that catch anomalies early. Since the first half of 2021 the system has generated over 25 alerts, and a unified data-quality dashboard is planned.
Youzan Data Report Center provides multi-dimensional, multi-channel, and multi-period data to help merchants operate their stores more scientifically. The article shares Youzan's online monitoring practices for both offline (batch) and real-time (stream) data.
Monitoring Background & Problems Solved
Because merchant operations rely heavily on timely and accurate data, any issues such as missing tables, component failures, or data anomalies can severely impact business decisions. Online monitoring is therefore essential for early detection and interception of problems. The monitoring covers offline data generation (typically completed before 7 am) and real‑time data pipelines.
1. Youzan Data Flow Diagram
The data originates from transaction, product, customer, and client‑side logs, then passes through both offline (HiveSQL) and real‑time (Flink) processing layers before being stored in downstream stores such as Druid and TiDB.
2. Offline Data (Batch) Monitoring Details
Offline data refers to statistics for yesterday and earlier periods, with dimensions such as day, week, month, last 7 days, last 30 days, quarter, and custom ranges. Monitoring focuses on accuracy and timeliness.
2.1 Accuracy Rules
Cross‑table comparison: ensure the same metric across different source tables yields consistent values.
Intra‑table logical checks: e.g., payment count ≤ order count.
Self‑checks: enforce enumeration, uniqueness, and non‑null constraints.
Table‑level checks: row‑count and size comparisons against historical trends.
Application‑level checks: verify that API responses remain unchanged when definitions are stable.
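As an illustration of the intra-table logical check above (payment count ≤ order count), the following is a minimal sketch; the class and method names are hypothetical, not Youzan's actual implementation.

```java
// Minimal sketch of an intra-table logical check: for each shop-day row,
// the payment count must never exceed the order count.
// Class and method names are illustrative only.
public class IntraTableCheck {

    /** Returns true when the row satisfies the invariant paymentCnt <= orderCnt. */
    public static boolean rowIsConsistent(long paymentCnt, long orderCnt) {
        return paymentCnt >= 0 && orderCnt >= 0 && paymentCnt <= orderCnt;
    }

    public static void main(String[] args) {
        System.out.println(rowIsConsistent(90, 100));   // consistent row -> true
        System.out.println(rowIsConsistent(110, 100));  // violates the invariant -> false
    }
}
```

In a real pipeline such a predicate would run row by row over the aggregated result table, with violating rows collected into an alert message.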
2.2 Timeliness Rules
Start scheduling time: when a job enters the queue.
Execution duration: time from job start to finish, affected by priority, engine, and SQL efficiency.
Deadline: maximum allowed execution time after scheduling.
Rule validation time: time taken by alert rules (up to 8 minutes) before downstream scheduling proceeds.
These factors guide the definition of monitoring thresholds and alert strategies (e.g., P3+ priority jobs, deadline alerts, and API value > 0 checks).
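The deadline rule above can be sketched as a simple comparison of a job's finish time against its deadline; the names below are illustrative assumptions, not Youzan's scheduler API.

```java
import java.time.LocalTime;

// Illustrative sketch of a deadline-based timeliness check (not Youzan's code):
// a priority job that has not finished by its deadline should trigger an alert.
public class DeadlineCheck {

    /** True if the job missed its deadline (null finish time means still running). */
    public static boolean missedDeadline(LocalTime finishedAt, LocalTime deadline) {
        return finishedAt == null || finishedAt.isAfter(deadline);
    }

    public static void main(String[] args) {
        LocalTime deadline = LocalTime.of(7, 0);  // offline data due before 7 am
        System.out.println(missedDeadline(LocalTime.of(6, 45), deadline)); // false
        System.out.println(missedDeadline(LocalTime.of(7, 20), deadline)); // true
    }
}
```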
2.3 Implementation Example
The following Java code demonstrates a scheduled task that runs every 20 minutes and compares Druid-aggregated payment counts with detailed transaction counts; if none of up to 500 checks (spaced 2 seconds apart) passes and the counts still differ, an alert is raised.
@Override
@Scheduled(cron = "0 0/20 * * * ?")
public void teamOrderCheck() {
    tidbCheck();
    druidCheck();
}

public void druidCheck() {
    boolean druidAlert = true;
    try {
        // Poll up to 500 times, 2 seconds apart; stop early once a check passes.
        for (int i = 0; i < 500; i++) {
            boolean res = checkOnceTeamOrder();
            if (res) {
                druidAlert = false;
                break;
            }
            Thread.sleep(2000);
        }
        String druidPayCnt = getDruidPayCnt();
        String detailPayCnt = getDetailPayCnt();
        if (druidAlert && !druidPayCnt.equals(detailPayCnt)) {
            // "All 500 checks failed."
            log.warn("500次检测均不通过.");
            // "Real-time transaction data anomaly alert: Druid payment order
            //  count: %s, transaction-detail payment order count: %s"
            String content = "实时交易数据异常预警:druid 统计支付订单数:%s, 交易明细支付订单数:%s";
            alertBiz.commonAlert(String.format(content, druidPayCnt, detailPayCnt));
        }
    } catch (Exception e) {
        log.warn("team order check error.", e);
    }
}

3. Real‑Time Data (Stream) Monitoring Details
Real‑time data is calculated from the start of the current day up to the latest update. It covers shop and product dimensions, processing transactions, traffic, marketing, and product data via Flink, with results stored in Druid and TiDB.
3.1 Accuracy Rules
Upstream‑downstream comparison: compare raw logs stored in TiDB with aggregated results.
Yesterday’s real‑time vs. offline comparison: after real‑time data is fully persisted, compare it with the corresponding offline batch results.
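The yesterday's-real-time vs. offline comparison above amounts to diffing two metric maps once both sides are fully persisted. The following is a hedged sketch under that assumption; the metric names and class are illustrative, not Youzan's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch: compare yesterday's fully persisted real-time totals with the
// offline batch results metric by metric; names are illustrative only.
public class RealtimeOfflineDiff {

    /** Returns the metrics whose real-time and offline values disagree. */
    public static List<String> inconsistentMetrics(Map<String, Long> realtime,
                                                   Map<String, Long> offline) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, Long> e : realtime.entrySet()) {
            Long offlineVal = offline.get(e.getKey());
            if (!e.getValue().equals(offlineVal)) {
                diffs.add(e.getKey());
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Map<String, Long> rt  = Map.of("payCnt", 120L, "orderCnt", 130L);
        Map<String, Long> off = Map.of("payCnt", 120L, "orderCnt", 128L);
        System.out.println(inconsistentMetrics(rt, off)); // [orderCnt]
    }
}
```

Any non-empty result would feed the same alerting path as the accuracy checks in section 2.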
3.2 Timeliness Rules
Real‑time timeliness is reflected by stable or slowly changing metrics. Delays are mainly caused by Kafka backlog and Flink configuration/resource issues, which are handled by the operations team.
3.3 Implementation Example
The upstream‑downstream check runs every 20 minutes, performing up to 500 comparisons before triggering an alert.
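The retry-before-alert pattern described here (and used in the listing in 2.3) can be generalized as below. This is an illustrative utility, not Youzan's actual helper class.

```java
import java.util.function.BooleanSupplier;

// Illustrative generalization of the retry pattern used in the checks above:
// poll a condition up to maxAttempts times, sleeping between tries, and report
// whether an alert should fire.
public class RetryCheck {

    /** Returns true (alert) only if the check never passed within maxAttempts. */
    public static boolean shouldAlert(BooleanSupplier check, int maxAttempts,
                                      long sleepMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (check.getAsBoolean()) {
                return false; // check passed, no alert needed
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return true; // interrupted before a pass: surface an alert
            }
        }
        return true; // all attempts failed
    }

    public static void main(String[] args) {
        // A check that passes on the third attempt: no alert should fire.
        int[] calls = {0};
        boolean alert = shouldAlert(() -> ++calls[0] >= 3, 5, 1L);
        System.out.println(alert); // false
    }
}
```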
4. Monitoring Outcomes
Since the first half of 2021, the online monitoring system has generated over 25 alerts, including 18 latency-related issues and one critical fault, allowing the team to detect and resolve data problems before they affected merchants' business decisions.
5. Future Plans
Assess the impact scope of each alert to improve incident response and regression testing.
Build a unified data‑quality monitoring dashboard that aggregates metrics from multiple platforms, involving indicator design, real‑time jobs, and front‑back‑end development.
The article concludes with an invitation to join the Youzan Business Technology Testing team.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.