Intelligent and Automated Data Quality Management in Big Data Systems
This article explores the challenges of data quality in mature big‑data environments and presents intelligent, automated approaches—including assertions, automatic detection, rule recommendation, link checking, and collaborative mechanisms—to embed quality checks throughout the data pipeline, improving efficiency and reliability.
With the internet industry now in a mature big‑data application era, the focus has shifted from data usage to data governance, especially data quality, which measures how well data meets business needs.
The talk is organized into three parts: industry trends, basic concepts and detection methods for data quality, and intelligent data quality solutions.
Industry Trends – Modern big‑data systems (e.g., Volcano Engine’s cloud data platform) have become easier to use, but as data volume grows, governance problems such as data quality become prominent. Gartner’s 2022 data‑management maturity curve highlights Augmented Data Quality, DataOps, and Data Observability as key focus areas.
Data Quality Basics – Classic quality dimensions can be expressed as assertions, analogous to assertions in programming. Typical metrics include freshness, volume, correctness, completeness, uniqueness, and integrity. The typical workflow consists of data profiling, rule configuration (often via a DSL), and routine monitoring, covering both one‑off unit‑test‑style checks and continuous production checks.
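To make the assertion framing concrete, here is a minimal sketch in Python. The function names, table shape, and thresholds are illustrative assumptions, not the DSL described in the talk; each quality dimension simply reduces to a boolean check.

```python
# Hypothetical sketch: quality dimensions as boolean assertions over a table
# represented as a list of dict rows (names and thresholds are illustrative).

def assert_volume(row_count, min_rows=1):
    """Volume: the table must contain at least the expected number of rows."""
    return row_count >= min_rows

def assert_completeness(rows, field):
    """Completeness: a required field must not be null in any row."""
    return all(row.get(field) is not None for row in rows)

def assert_uniqueness(rows, key):
    """Uniqueness: a key field must not contain duplicate values."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
checks = {
    "volume": assert_volume(len(rows)),
    "completeness": assert_completeness(rows, "name"),
    "uniqueness": assert_uniqueness(rows, "id"),
}
print(checks)  # every dimension collapses to a pass/fail result
```

A rule DSL then only needs to name the table, field, and assertion type; the monitoring system evaluates these predicates on a schedule and alerts on any failure.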
In practice, two rules dominate: table‑row‑count and primary‑key‑duplication checks, which together account for over 80% of configured rules, mainly because they are easy to set up. Overall rule penetration nonetheless remains low, due to high configuration cost, the deep domain knowledge required, and the tendency to add rules only reactively after incidents.
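The two dominant rules are simple enough to express as single SQL statements. A minimal sketch using Python's built‑in SQLite (the `orders` table and `order_id` key are illustrative, not from the talk):

```python
import sqlite3

# Demo table with a deliberately duplicated primary key (order_id = 2).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.9), (2, 5.0), (2, 5.0)])

# Rule 1: table-row count must be positive (catches empty or failed loads).
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Rule 2: primary-key duplication — distinct keys must equal total rows.
dup_count = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]

print(row_count, dup_count)  # 3 rows, 1 duplicated key
```

Because each rule is one aggregate query parameterized only by a table and a key column, they are cheap to configure, which helps explain why they crowd out the harder, more domain‑specific checks.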
Intelligent Data Quality – Four pillars are proposed to improve rule penetration:
Automatic detection: algorithms automatically flag anomalies such as zero‑row tables, sudden volume spikes, or hourly null‑rate spikes without manual rule definition.
Rule recommendation: intelligent suggestions based on thresholds, fields, or tables reduce configuration effort across five data scenarios (external data entry, data‑pipeline development, data application, model features, and business usage).
Link checking: using full‑lineage information to diagnose upstream causes of downstream quality issues, enabling root‑cause analysis.
Collaboration mechanisms: establishing quality expectation agreements between data developers and data consumers to bridge perception gaps and allow consumers to add targeted rules.
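The automatic‑detection pillar can be sketched with a simple statistical baseline: compare today's row count against recent history and flag large deviations without any hand‑written threshold. The 3‑sigma cutoff and sample data are assumptions for illustration; production systems would use more robust seasonal models.

```python
import statistics

def detect_volume_anomaly(history, today, sigma=3.0):
    """Flag today's row count if it deviates strongly from recent history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > sigma

history = [1000, 1020, 990, 1010, 1005, 995, 1015]  # daily row counts
print(detect_volume_anomaly(history, 1008))  # ordinary day: not flagged
print(detect_volume_anomaly(history, 0))     # zero-row table: flagged
print(detect_volume_anomaly(history, 5000))  # sudden volume spike: flagged
```

The same pattern applies to other monitored signals, such as hourly null rates, by swapping in the relevant time series.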
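Link checking can likewise be sketched as a graph walk: starting from a table whose check failed, traverse the lineage graph upstream and report which ancestors also failed, yielding candidate root causes. The table names and lineage map below are invented for illustration.

```python
from collections import deque

# Table-level lineage: each table maps to its direct upstream tables.
lineage = {
    "report_daily": ["orders_clean", "users_clean"],
    "orders_clean": ["orders_raw"],
    "users_clean": ["users_raw"],
}
# Tables whose quality checks failed in this run.
check_failed = {"report_daily", "orders_clean", "orders_raw"}

def upstream_root_causes(table):
    """Breadth-first walk upstream, collecting ancestors that also failed."""
    causes, queue, seen = [], deque(lineage.get(table, [])), set()
    while queue:
        t = queue.popleft()
        if t in seen:
            continue
        seen.add(t)
        if t in check_failed:
            causes.append(t)
        queue.extend(lineage.get(t, []))
    return causes

print(upstream_root_causes("report_daily"))  # ['orders_clean', 'orders_raw']
```

Here the failure in `report_daily` traces back through `orders_clean` to `orders_raw`, pointing the on‑call engineer at the furthest upstream failure first.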
These methods are integrated into Volcano Engine’s DataLeap platform, which provides end‑to‑end data quality capabilities such as autonomous profiling, strong rule enforcement, and data validation.
Overall, combining foundational assertions, automation, system integration, and open collaboration forms a comprehensive data‑quality assurance framework for large‑scale data environments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.