
Data Quality Governance: From Compliance to Reasonableness – Tools, Techniques, and Key Technologies

This article explores how data collection quality can be elevated from mere compliance to true reasonableness by introducing a comprehensive quality‑review toolchain, visual comparison methods, intelligent rule‑based detection, and self‑diagnosis utilities, while also detailing the underlying sample‑library and three‑layer architecture that power these capabilities.

DataFunSummit

Data collection, in the broad sense, digitizes the real world and records it; the depth, breadth, and accuracy of this collection determine the upper limit of data applications, so data governance must start at the source.

The biggest challenges in data‑collection governance today are quality and efficiency. This article focuses on quality: it first proposes a shift from "compliance" to "reasonableness", then presents a quality‑review tool suite and the key technologies behind it.

1. Seeing Data Quality

1.1 A Case Study: login_type Parameter

Assume a data scientist needs to use the terminal‑behavior log field login_type (login type) without knowing its semantics. By visualizing a stacked percentage trend chart of its enumeration values, the following observations are made:

The parameter is enumerated (e.g., WeChat, QQ, phone) and its meaning can be inferred.

In the past two weeks the distribution is stable, showing no major fluctuations.

Android and iOS show similar distributions, but letter casing differs (Android reports uppercase values, iOS lowercase).

From this we conclude that login_type is well‑filled, semantically clear, and stable, though the case inconsistency should be corrected.

All these insights are obtained in under two minutes thanks to the high‑density visual chart.
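The stacked‑percentage view behind this case can be derived from raw logs in a few lines. The sketch below assumes each record is a dict with `date` and `login_type` keys (the field layout and sample values are made up for illustration):

```python
from collections import Counter, defaultdict

def daily_enum_distribution(records):
    """Group records by date, then compute the percentage share of each
    login_type value -- the data behind a stacked percentage trend chart."""
    by_date = defaultdict(Counter)
    for rec in records:
        by_date[rec["date"]][rec["login_type"]] += 1
    result = {}
    for date, counts in by_date.items():
        total = sum(counts.values())
        result[date] = {v: round(100.0 * c / total, 1) for v, c in counts.items()}
    return result

# Illustrative records only; real logs would carry many more fields.
logs = [
    {"date": "2024-05-01", "login_type": "wechat"},
    {"date": "2024-05-01", "login_type": "wechat"},
    {"date": "2024-05-01", "login_type": "qq"},
    {"date": "2024-05-02", "login_type": "phone"},
]
```

Plotting one such dict per day as stacked bars gives exactly the kind of chart from which the stability and casing observations above can be read off at a glance.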

2. Data Quality: Compliance vs. Reasonableness

Traditional data‑quality evaluation uses five dimensions: completeness, validity, accuracy, consistency, and timeliness. This article re‑frames them into two cognitive dimensions:

Compliance

Conforms to the specifications set by the designer (e.g., required fields, value ranges, enumeration limits).

Compliance checks are absolute – either right or wrong.

Reasonableness

Is logical and self‑consistent from multiple perspectives.

Accuracy and consistency are judged through data analysis because we cannot directly compare data to the real world.

Reasonableness lacks a unified standard and often requires domain experts to spot anomalies.

Data must be both compliant (minimum requirement) and reasonable (sufficient for trust).
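To make the distinction concrete, a compliance check is an absolute pass/fail against the designer's spec, as in this minimal sketch (the enumeration and required‑field lists are hypothetical; reasonableness, by contrast, can only be judged over many records):

```python
ALLOWED_LOGIN_TYPES = {"wechat", "qq", "phone"}   # hypothetical spec
REQUIRED_FIELDS = {"device_id", "login_type"}     # hypothetical spec

def compliance_errors(record):
    """Absolute right-or-wrong checks against the designer's specification."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    lt = record.get("login_type")
    if lt is not None and lt.lower() not in ALLOWED_LOGIN_TYPES:
        errors.append(f"login_type out of enumeration: {lt!r}")
    return errors
```

A record either passes or fails; there is no threshold to tune, which is what makes compliance the easy half of the problem.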

3. Quality‑Review Tool System

The system consists of three parts: Quality Review, Intelligent Judgment, and Self‑Diagnosis.

3.1 Quality Review – Let Humans See Data Quality

Key idea: emphasize relativity. Every reasonable analysis is a comparison:

Internal comparison – distribution of enumeration values.

Date comparison – day‑over‑day, week‑over‑week trends.

Dual‑endpoint comparison – Android vs. iOS.

All visualizations are built for quick comparison; hundreds of charts are generated daily covering parameters, events, pages, elements, and backend reports.
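The "who to compare with whom" idea can be made mechanical by enumerating comparison pairs for a given day. This sketch assumes metric series are keyed by `(platform, date)`; the pair set mirrors the three comparison types above:

```python
from datetime import date, timedelta

def comparison_pairs(target_day, platforms=("android", "ios")):
    """Enumerate (baseline, candidate) series keys a reviewer would compare:
    day-over-day and week-over-week per platform, plus Android vs. iOS."""
    pairs = []
    for p in platforms:
        pairs.append(((p, target_day - timedelta(days=1)), (p, target_day)))  # day-over-day
        pairs.append(((p, target_day - timedelta(days=7)), (p, target_day)))  # week-over-week
    pairs.append(((platforms[0], target_day), (platforms[1], target_day)))    # dual-endpoint
    return pairs
```

Each pair can then be fed either to a human‑facing chart or to a machine‑side judgment operator.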

3.2 Intelligent Judgment – Let Machines Detect Problems

Machine‑driven detection converts human comparison ideas into code. The workflow includes:

Judgment ideas (who to compare with whom).

Judgment operators (distance, difference, compliance).

Configuration, thresholds, and pre‑conditions to control false positives.

Examples of operators:

Manhattan distance (range [0,2]) for distribution differences.

Difference correction for ratio metrics.

Compliance operators for strict rule checks.
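The Manhattan (L1) distance operator is simple to state: sum the absolute per‑value differences between two probability distributions. The result is 0 for identical distributions and 2 for completely disjoint ones, matching the [0,2] range above. A minimal sketch, with made‑up distributions:

```python
def manhattan_distance(dist_a, dist_b):
    """L1 distance between two probability distributions given as
    {value: share} dicts. 0 = identical, 2 = disjoint supports."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

# Illustrative Android-vs-iOS login_type distributions.
android = {"wechat": 0.60, "qq": 0.30, "phone": 0.10}
ios     = {"wechat": 0.55, "qq": 0.30, "phone": 0.15}
```

A threshold on this distance (with pre‑conditions such as minimum sample size) then decides whether the difference is worth alerting on.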

The current rule engine uses a Golang‑based DSL (gengine) handling 600k+ daily judgments; future versions may adopt Python DSL with Jupyter support.

3.3 Self‑Diagnosis – Let Humans Quickly Diagnose Issues

Tools include:

Comparison tool – any two metrics can be contrasted side by side.

Dimension drill‑down – 3D visual drill‑down on any parameter.

Sample extraction – provide a few raw records for quick reference.

SQL templates – auto‑generated queries for non‑technical users.

User‑behavior diagnosis – Gantt‑style timeline with linked statistics for deep case analysis.
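The SQL‑template idea amounts to rendering a parameterized query so a non‑technical user never writes SQL by hand. The table and column names below (`behavior_log`, `dt`, `device_id`, `platform`) are illustrative assumptions, not the actual schema:

```python
def enum_distribution_sql(table, param, day, platform):
    """Render a ready-to-run query that computes the PV/UV breakdown of an
    enumerated parameter for one day and one platform."""
    return (
        f"SELECT {param}, COUNT(*) AS pv, COUNT(DISTINCT device_id) AS uv\n"
        f"FROM {table}\n"
        f"WHERE dt = '{day}' AND platform = '{platform}'\n"
        f"GROUP BY {param}\n"
        f"ORDER BY pv DESC"
    )
```

In a real deployment the parameters would come from a form in the diagnosis UI, and the rendered SQL would be escaped or bound rather than interpolated.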

4. Key Technical Foundations

4.1 Sample Library

Implemented by sampling 1% of devices at the gateway and persisting the stream. Confidence is evaluated as a weighted mix of PV and UV differences; most business scenarios achieve >99% confidence, providing a cheap (≈1% of warehouse resources) yet reliable basis for rapid insight.
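The source does not specify the sampling mechanism, but a common way to get a stable 1% device sample at a gateway is to hash the device id, so that every event from a sampled device is persisted and the sample is consistent across days. A sketch under that assumption:

```python
import hashlib

def in_sample(device_id, rate=0.01):
    """Deterministic per-device sampling: hash the id into 10,000 buckets
    and keep the first rate*10000 of them. The same device always lands in
    the same bucket, so its full event stream is either kept or dropped."""
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % 10000
    return bucket < rate * 10000
```

Because the decision is a pure function of the id, the sample library stays consistent over time, which is what makes day‑over‑day comparisons on the sample meaningful.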

4.2 Three‑Layer Separation Architecture

Traditional monitoring rules are split into three layers:

Computation layer – homogenizes heterogeneous data into unified quality metrics.

Judgment layer – human‑driven review or machine‑driven detection.

Alert layer – human‑friendly notifications to producers and consumers.

This design emphasizes abstraction, lightweight implementation, openness, and whole‑domain applicability.
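The three layers can be sketched as three independent plain functions, so each can be swapped out without touching the others. The metric (`fill_rate`), field names, and threshold below are illustrative assumptions:

```python
def compute_metrics(raw_rows):
    """Computation layer: homogenize heterogeneous rows into unified metrics."""
    total = len(raw_rows)
    filled = sum(1 for r in raw_rows if r.get("login_type"))
    return {"fill_rate": filled / total if total else 0.0}

def judge(metrics, threshold=0.95):
    """Judgment layer: machine-driven detection; a human review step could
    plug in here instead without changing the other layers."""
    if metrics["fill_rate"] < threshold:
        return [f"fill_rate {metrics['fill_rate']:.2f} below threshold {threshold}"]
    return []

def alert(findings, notify=print):
    """Alert layer: human-friendly notification to producers and consumers."""
    for finding in findings:
        notify(f"[data-quality] {finding}")

rows = [{"login_type": "wechat"}, {}, {"login_type": "qq"}]
alert(judge(compute_metrics(rows)))
```

Keeping the layer boundaries this narrow (dicts in, lists of findings out) is what the "abstraction, lightweight implementation, openness" goals translate to in code.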

4.3 Decision Engine & Metric Storage

The first‑generation engine uses a Golang DSL (gengine) with online editing, debugging, and management. Future plans consider a Python‑based DSL supporting third‑party libraries and notebooks.

Metrics are stored in Tencent Cloud CTSDB (ElasticSearch‑based) for its schema‑free design and operational stability.

5. Summary and Outlook

The presented quality‑review concepts and tools have been deployed across multiple Tencent businesses and are gaining importance. Upcoming work will address efficiency (cost‑optimized reporting, columnar packaging, configuration distribution), reporting models (standardized schemas, parameter cascading, auto‑collection), testing tools (visual debugging, real‑time validation, test libraries), and process efficiency (lightweight workflows, clear hand‑offs, intelligent dashboards).

6. Q&A

Q1: How effective is the sample library? Does it mainly concern confidence?

A1: The sample library’s confidence generally exceeds 99%; despite probabilistic sampling, it has not caused false alarms in practice. Specific business scenarios may require tailored confidence assessments.

Q2: How do you improve detection precision in the precision‑recall trade‑off?

A2: Precision is improved by adjusting thresholds, refining operators, and expanding rule expressiveness. Balancing generalization against precision is essential; domain‑knowledge‑driven rule tuning often yields the best results.

Tags: Big Data, data quality, data governance, quality inspection, visual analytics
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
