Data Quality Issues, Causes, and Practices in Big Data Platforms
This article explains the harm and root causes of data quality problems (integrity, latency, accuracy, and consistency issues), then outlines systematic prevention methods, baseline monitoring, and concrete practices on the NetEase YouShu platform, illustrated with real incidents, a code snippet, and tag-monitoring strategies.
The author, a big‑data specialist, introduces the article in three parts: the dangers and origins of data quality problems, methods to ensure data quality, and NetEase YouShu’s practical implementations.
1. Harm and Causes of Data Quality Issues
Data quality problems fall along four dimensions: integrity, latency, accuracy, and consistency. A real incident is cited for each: traffic logs lost in 2018 (a P2 incident), a key task that ran late and caused a 4-hour delay (P3), a mislabeled user tag that caused a loss of more than 300k (P1), and a Flink task that hung while writing to ES, leaving coupon data inconsistent.
These issues can lead to financial loss, delayed business analysis, incorrect algorithm results, and loss of trust in the data.
2. Root Causes
Two main categories are identified:
System instability – e.g., server configuration changes, log‑DB sync failures, offline/real‑time task slowdowns or crashes, Kafka overload.
Program bugs – e.g., backend data bugs, data‑development bugs, mismatched logic between real‑time and batch jobs, uncommunicated upstream data changes.
3. How to Ensure Data Quality
Adopt a standardized development workflow using NetEase YouShu’s Data Test Center (development → testing → monitoring → release). Steps include an initial inspection of the data's shape, before-and-after comparisons keyed on the primary key, code-review approval for core assets, and post-release monitoring of uniqueness, row counts, enumeration values, and metric values.
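The primary-key based before-and-after comparison can be sketched in a few lines of Python. This is a minimal illustration, not the Data Test Center's implementation: the function name compare_by_key, the user_id key, and the sample rows are all hypothetical, and the rows are plain dictionaries instead of real table scans.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compare_by_key(before: Iterable[Row], after: Iterable[Row], key: str) -> Dict[str, List[Any]]:
    """Diff two versions of a table, matching rows on a primary-key column."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present in only one of the two versions.
    missing = sorted(k for k in before_by_key if k not in after_by_key)
    added = sorted(k for k in after_by_key if k not in before_by_key)

    # Keys present in both versions whose column values differ.
    changed = []
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        cols = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if cols:
            changed.append((k, cols))
    return {"missing": missing, "added": added, "changed": changed}


# Hypothetical sample data: the same table before and after a code change.
before = [
    {"user_id": 1, "tag": "new_user", "orders": 0},
    {"user_id": 2, "tag": "vip", "orders": 12},
    {"user_id": 3, "tag": "churn_risk", "orders": 1},
]
after = [
    {"user_id": 1, "tag": "new_user", "orders": 0},
    {"user_id": 2, "tag": "regular", "orders": 12},
    {"user_id": 4, "tag": "new_user", "orders": 0},
]

report = compare_by_key(before, after, key="user_id")
print(report)
```

An empty report means the change left the data untouched. Anything listed under missing, added, or changed is a difference the developer should be able to explain before release.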
Implement baseline mechanisms via the Task Operations Center, setting 2:30 am, 4:30 am, and 7:30 am baselines for core tables to guarantee timely data delivery. Use pre‑warning alerts for daytime issues and escalation phone calls for baseline breaches.
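The baseline idea can be illustrated with a small Python sketch. It assumes a task is late once the baseline time passes without it finishing, and that a pre-warning fires when the estimated finish time falls after the baseline. The function name baseline_status, the status labels, and the dates are hypothetical and are not the Task Operations Center's actual API.

```python
from datetime import datetime, time
from typing import Optional

# Alert channels per status, following the article: SMS/email for
# pre-warnings, phone calls for baseline breaches.
ALERT_CHANNELS = {"ok": [], "pre_warning": ["sms", "email"], "breach": ["phone"]}


def baseline_status(
    baseline: time,
    now: datetime,
    finished_at: Optional[datetime] = None,
    estimated_finish: Optional[datetime] = None,
) -> str:
    """Classify a task against its baseline for the day of `now`."""
    deadline = datetime.combine(now.date(), baseline)
    if finished_at is not None:
        # Task is done: it either met the baseline or broke it.
        return "ok" if finished_at <= deadline else "breach"
    if now > deadline:
        # Still running after the baseline has passed.
        return "breach"
    if estimated_finish is not None and estimated_finish > deadline:
        # Still running, and projected to finish too late.
        return "pre_warning"
    return "ok"


baseline = time(2, 30)      # the 2:30 am baseline from the article
day = datetime(2024, 1, 1)  # hypothetical date

done_early = baseline_status(
    baseline, day.replace(hour=3), finished_at=day.replace(hour=2, minute=10)
)
running_slow = baseline_status(
    baseline, day.replace(hour=2), estimated_finish=day.replace(hour=2, minute=50)
)
overdue = baseline_status(baseline, day.replace(hour=2, minute=45))

for status in (done_early, running_slow, overdue):
    print(status, ALERT_CHANNELS[status])
```

Separating the pre-warning from the breach keeps phone calls for cases where the delivery commitment has actually been missed.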
Deploy the Data Quality Center for integrity, accuracy, and consistency monitoring across logs, core DWD/DIM tables, DWS metrics, and cross‑storage comparisons (MySQL, ES, Kudu).
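A cross-storage comparison boils down to checking that every store holds the same value for the same key. The Python sketch below assumes the records have already been fetched from each store into dictionaries. The function name inconsistent_keys and the coupon data are hypothetical, and no real MySQL, ES, or Kudu client is used.

```python
from typing import Any, Dict, Hashable, Set


def inconsistent_keys(
    stores: Dict[str, Dict[Hashable, Any]], reference: str
) -> Dict[str, Set[Hashable]]:
    """For each non-reference store, return the keys that disagree with the reference."""
    ref = stores[reference]
    result: Dict[str, Set[Hashable]] = {}
    for name, data in stores.items():
        if name == reference:
            continue
        all_keys = ref.keys() | data.keys()
        # A key is inconsistent if it is absent on either side or the values differ.
        result[name] = {
            k for k in all_keys
            if k not in ref or k not in data or ref[k] != data[k]
        }
    return result


# Hypothetical coupon status per coupon id, as read from each store.
stores = {
    "mysql": {"c1": "used", "c2": "unused", "c3": "unused"},
    "es": {"c1": "used", "c2": "used"},
    "kudu": {"c1": "used", "c2": "unused", "c3": "unused"},
}

diff = inconsistent_keys(stores, reference="mysql")
print(diff)
```

Here MySQL is treated as the source of truth purely for illustration. Which store serves as the reference is a design choice that depends on where the data is first written.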
4. Practical Practices
Development Overhaul : Re‑engineered 10+ backend tables, split one table into four, performed 20+ data model comparisons, and helped the main site fix 8+ issues, ensuring data consistency after migration.
Timeliness Assurance : Established a rotating on-call schedule (3-day shifts) covering the three baselines; used both pre-warning alerts (SMS, email) and baseline-breach phone calls so that task failures are handled promptly.
Integrity, Accuracy, Consistency Monitoring : Monitored the IP count in logs, compared log data against DB data, and set up comparisons between real-time and batch results (Lambda architecture). Implemented uniqueness, row-count, volatility, and metric checks for user tags stored in Hive, Kudu, ES, and HBase.
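A volatility check compares today's value of a measure, such as a table's row count, with the previous day's and raises an alert when the relative change exceeds a threshold. The Python sketch below illustrates the idea. The 20% threshold, the function names, and the daily counts are assumptions made for the example and are not values from the article.

```python
from typing import List, Sequence, Tuple


def volatility(previous: float, current: float) -> float:
    """Relative day-over-day change, as a fraction of the previous value."""
    if previous == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - previous) / previous


def flag_volatile_days(daily_counts: Sequence[Tuple[str, int]], threshold: float) -> List[str]:
    """Return the dates whose count moved more than `threshold` versus the prior day."""
    flagged = []
    for (_, prev_count), (date, count) in zip(daily_counts, daily_counts[1:]):
        if volatility(prev_count, count) > threshold:
            flagged.append(date)
    return flagged


# Hypothetical daily row counts for a tag table.
daily_counts = [
    ("2024-01-01", 1000),
    ("2024-01-02", 1040),
    ("2024-01-03", 1010),
    ("2024-01-04", 620),  # sudden drop, e.g. an upstream job lost data
]

flagged = flag_volatile_days(daily_counts, threshold=0.2)
print(flagged)
```

A plain day-over-day comparison would also flag the recovery on the day after a drop. Comparing against a trailing average is a common alternative when that noise matters.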
Code example for welfare type determination:
    case when a.topic = "getCartCoupon" then 1
         when a.topic = "getCoupon" and a.sub_welfare_type not in (6) then 2
         when a.topic = "pushSms" then 3
         when a.topic = "getCoupon" and a.sub_welfare_type in (6) then 6
         when a.topic = "openCard" then 6
         else 0 end as welfare_type,
Tag‑Related Monitoring : Conducted uniqueness, row‑count, volatility, and metric monitoring for both detailed and aggregated tag data, and performed cross‑storage consistency checks.
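Uniqueness and enumeration checks are the simplest of these monitors. The Python sketch below assumes rows are available as dictionaries. The allowed set {0, 1, 2, 3, 6} is read off the welfare_type case expression above, while the record_id column, the helper names, and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]

# Every value the welfare_type case expression can produce.
ALLOWED_WELFARE_TYPES = {0, 1, 2, 3, 6}


def duplicate_keys(rows: Iterable[Row], key: str) -> List[Any]:
    """Uniqueness check: return key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def invalid_enum_rows(rows: Iterable[Row], column: str, allowed: Set[Any]) -> List[Row]:
    """Enumeration check: return rows whose column value is outside the allowed set."""
    return [row for row in rows if row[column] not in allowed]


# Hypothetical output rows from the job that computes welfare_type.
rows = [
    {"record_id": "r1", "welfare_type": 1},
    {"record_id": "r2", "welfare_type": 6},
    {"record_id": "r2", "welfare_type": 6},  # duplicate primary key
    {"record_id": "r3", "welfare_type": 4},  # value the case expression cannot emit
]

dupes = duplicate_keys(rows, key="record_id")
bad_rows = invalid_enum_rows(rows, column="welfare_type", allowed=ALLOWED_WELFARE_TYPES)
print(dupes, bad_rows)
```

Deriving the allowed set from the producing logic means an unexpected value points to either a code change or corrupted data, and both are worth an alert.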
5. Summary
The article concludes with two key takeaways: a standardized real‑time tag development process and a comprehensive data quality monitoring methodology, both illustrated with diagrams. The author hopes the shared experience aids readers in improving their own data quality practices.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
