Intelligent and Automated Data Quality Management in Big Data Systems
This article explores the challenges of data quality in mature big‑data environments and presents intelligent, automated approaches—including assertions, automatic detection, rule recommendation, link checking, and collaborative mechanisms—to embed quality checks throughout the data pipeline, improving efficiency and reliability.
With the internet industry now in a mature big‑data application era, the focus has shifted from data usage to data governance, especially data quality, which measures how well data meets business needs.
The talk is organized into three parts: industry trends, basic concepts and detection methods for data quality, and intelligent data quality solutions.
Industry Trends – Modern big‑data systems (e.g., Volcano Engine’s cloud data platform) have become easier to use, but as data volume grows, governance problems such as data quality become prominent. Gartner’s 2022 data‑management maturity curve highlights Augmented Data Quality, DataOps, and Data Observability as key focus areas.
Data Quality Basics – Classic quality dimensions can be expressed as assertions, analogous to assertions in programming. Typical metrics include freshness, volume, correctness, completeness, uniqueness, and integrity. The typical workflow consists of data profiling, rule configuration (often via a DSL), and routine monitoring, covering both one‑off unit‑test‑style checks and continuous production checks.
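To make the assertion framing concrete, here is a minimal sketch in Python. The function names, table shape, and thresholds are illustrative assumptions, not the DSL described in the talk; each quality dimension simply reduces to a boolean check.

```python
# Hypothetical sketch: quality dimensions as boolean assertions over a table
# represented as a list of dict rows (names and thresholds are illustrative).

def assert_volume(row_count, min_rows=1):
    """Volume: the table must contain at least the expected number of rows."""
    return row_count >= min_rows

def assert_completeness(rows, field):
    """Completeness: a required field must not be null in any row."""
    return all(row.get(field) is not None for row in rows)

def assert_uniqueness(rows, key):
    """Uniqueness: a key field must not contain duplicate values."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
checks = {
    "volume": assert_volume(len(rows)),
    "completeness": assert_completeness(rows, "name"),
    "uniqueness": assert_uniqueness(rows, "id"),
}
print(checks)  # every dimension collapses to a pass/fail result
```

A rule DSL then only needs to name the table, field, and assertion type; the monitoring system evaluates these predicates on a schedule and alerts on any failure.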
In practice, two rules dominate: table‑row‑count and primary‑key‑duplication checks, which together account for over 80% of configured rules, mainly because they are easy to set up. Overall rule penetration nonetheless remains low, due to high configuration cost, the deep domain knowledge required, and the tendency to add rules only reactively after incidents.
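The two dominant rules are simple enough to express as single SQL statements. A minimal sketch using Python's built‑in SQLite (the `orders` table and `order_id` key are illustrative, not from the talk):

```python
import sqlite3

# Demo table with a deliberately duplicated primary key (order_id = 2).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.9), (2, 5.0), (2, 5.0)])

# Rule 1: table-row count must be positive (catches empty or failed loads).
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Rule 2: primary-key duplication — distinct keys must equal total rows.
dup_count = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]

print(row_count, dup_count)  # 3 rows, 1 duplicated key
```

Because each rule is one aggregate query parameterized only by a table and a key column, they are cheap to configure, which helps explain why they crowd out the harder, more domain‑specific checks.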
Intelligent Data Quality – Four pillars are proposed to improve rule penetration:
Automatic detection: algorithms automatically flag anomalies such as zero‑row tables, sudden volume spikes, or hourly null‑rate spikes without manual rule definition.
Rule recommendation: intelligent suggestions based on thresholds, fields, or tables reduce configuration effort across five data scenarios (external data entry, data‑pipeline development, data application, model features, and business usage).
Link checking: using full‑lineage information to diagnose upstream causes of downstream quality issues, enabling root‑cause analysis.
Collaboration mechanisms: establishing quality expectation agreements between data developers and data consumers to bridge perception gaps and allow consumers to add targeted rules.
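The automatic‑detection pillar can be sketched with a simple statistical baseline: compare today's row count against recent history and flag large deviations without any hand‑written threshold. The 3‑sigma cutoff and sample data are assumptions for illustration; production systems would use more robust seasonal models.

```python
import statistics

def detect_volume_anomaly(history, today, sigma=3.0):
    """Flag today's row count if it deviates strongly from recent history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > sigma

history = [1000, 1020, 990, 1010, 1005, 995, 1015]  # daily row counts
print(detect_volume_anomaly(history, 1008))  # ordinary day: not flagged
print(detect_volume_anomaly(history, 0))     # zero-row table: flagged
print(detect_volume_anomaly(history, 5000))  # sudden volume spike: flagged
```

The same pattern applies to other monitored signals, such as hourly null rates, by swapping in the relevant time series.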
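Link checking can likewise be sketched as a graph walk: starting from a table whose check failed, traverse the lineage graph upstream and report which ancestors also failed, yielding candidate root causes. The table names and lineage map below are invented for illustration.

```python
from collections import deque

# Table-level lineage: each table maps to its direct upstream tables.
lineage = {
    "report_daily": ["orders_clean", "users_clean"],
    "orders_clean": ["orders_raw"],
    "users_clean": ["users_raw"],
}
# Tables whose quality checks failed in this run.
check_failed = {"report_daily", "orders_clean", "orders_raw"}

def upstream_root_causes(table):
    """Breadth-first walk upstream, collecting ancestors that also failed."""
    causes, queue, seen = [], deque(lineage.get(table, [])), set()
    while queue:
        t = queue.popleft()
        if t in seen:
            continue
        seen.add(t)
        if t in check_failed:
            causes.append(t)
        queue.extend(lineage.get(t, []))
    return causes

print(upstream_root_causes("report_daily"))  # ['orders_clean', 'orders_raw']
```

Here the failure in `report_daily` traces back through `orders_clean` to `orders_raw`, pointing the on‑call engineer at the furthest upstream failure first.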
These methods are integrated into Volcano Engine’s DataLeap platform, which provides end‑to‑end data quality capabilities such as autonomous profiling, strong rule enforcement, and data validation.
Overall, combining foundational assertions, automation, system integration, and open collaboration forms a comprehensive data‑quality assurance framework for large‑scale data environments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.