Data Collection Quality Review: From Compliance to Reasonableness and Toolchain Overview
This article explores data collection governance by distinguishing data quality compliance from reasonableness. It introduces a quality review tool suite covering visual inspection, intelligent judgment, and self-diagnosis, and describes its architecture, key techniques, and practical case studies for keeping data metrics reliable.
Data collection, broadly defined as digitizing and recording the real world, determines the upper limit of data applications; therefore governance must start at the source. This article focuses on data quality challenges, presenting a perspective that elevates quality from mere compliance to reasonableness and describing a suite of quality‑review tools.
1. Seeing Data Quality
A case study uses the public parameter login_type (login type) to illustrate how visual charts can reveal data quality. The stacked percentage chart shows that the parameter is an enumeration with stable distribution over the past two weeks, but a case mismatch exists between Android (uppercase) and iOS (lowercase).
The parameter is well‑filled and semantically clear, yet the case inconsistency needs attention.
Thanks to the chart's high information density, going from no prior knowledge of the parameter to a decision takes less than two minutes.
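The case mismatch that the chart reveals can also be caught programmatically. The following is a minimal sketch, not the tool's actual implementation: the function name find_case_mismatches and the sample values "WX" and "QQ" are hypothetical, chosen only to illustrate the Android (uppercase) versus iOS (lowercase) situation described above.

```python
from collections import defaultdict


def find_case_mismatches(values_by_platform):
    """Find enum values that differ only by letter case.

    values_by_platform maps a platform name to the set of values it reports.
    Returns {lowercased_value: {platform: [spellings]}} for every value that
    appears with more than one spelling.
    """
    spellings = defaultdict(lambda: defaultdict(set))
    for platform, values in values_by_platform.items():
        for value in values:
            spellings[value.lower()][platform].add(value)

    mismatches = {}
    for key, per_platform in spellings.items():
        distinct = {v for vals in per_platform.values() for v in vals}
        if len(distinct) > 1:
            mismatches[key] = {p: sorted(v) for p, v in per_platform.items()}
    return mismatches


# Hypothetical login_type values, for illustration only.
observed = {
    "android": {"WX", "QQ"},
    "ios": {"wx", "qq"},
}
result = find_case_mismatches(observed)
```

Normalizing to lowercase before grouping is one possible design choice; it keeps the original spellings so that the report can show which platform needs to change.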
2. Data Quality: Compliance vs. Reasonableness
Traditional data‑quality dimensions (completeness, validity, accuracy, consistency, timeliness) are regrouped into two higher‑level concepts:
Compliance: conforms to the specifications set by designers (e.g., required fields, value ranges). Compliance checks are absolute.
Reasonableness: conforms to common sense and is self-consistent. It requires analysis (e.g., distribution, ratios) and cannot be fully automated.
Data must be both compliant and reasonable; only then can we trust its quality.
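The difference between the two concepts can be sketched in code. This is an illustrative example under stated assumptions: the allowed value set, the 5-point tolerance, and all function names are hypothetical and do not come from the original tooling. The compliance check is absolute (a record passes or fails on its own), while the reasonableness check is relative (a metric is compared against a baseline).

```python
# Hypothetical specification: the values a designer allows for login_type.
ALLOWED_LOGIN_TYPES = {"wx", "qq", "guest"}


def is_compliant(record):
    """Absolute check: the field is present and its value is in the allowed set."""
    return record.get("login_type") in ALLOWED_LOGIN_TYPES


def fill_rate(records, field):
    """Share of records in which the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def is_reasonable(today_rate, baseline_rate, tolerance=0.05):
    """Relative check: today's rate stays within a tolerance of the baseline."""
    return abs(today_rate - baseline_rate) <= tolerance


records = [{"login_type": "wx"}, {"login_type": "qq"}, {"login_type": ""}, {}]
compliant_flags = [is_compliant(r) for r in records]
today = fill_rate(records, "login_type")
```

A record-level check cannot tell whether a fill rate that dropped from 98% to 50% is a problem; that judgment needs the baseline, which is why reasonableness calls for analysis.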
3. Quality Review Tool System
The system consists of three parts: Quality Review, Intelligent Judgment, and Self‑Diagnosis.
3.1 Quality Review
Key idea: judge quality relatively, by comparing each metric against several references (internal, date, dual-end, version, related, external). Visual charts (hundreds of them) enable quick insight into parameter distribution, trends, and cross-platform differences.
3.2 Intelligent Judgment
Transforms human comparison ideas into program code. Typical judgment rules include:
Mainstream period-over-period comparison (today vs. the average of the past N days).
Gray-release vs. mainstream comparison.
Gray-release dual-end comparison (Android vs. iOS).
Gray-release vs. previous gray-release comparison (TBD).
Operators such as Manhattan distance, difference correction, and compliance checks are applied to quantify distribution differences and ratio deviations.
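As a rough illustration of the first rule combined with the Manhattan distance operator, the sketch below compares today's value distribution with the average distribution of the past N days. The function names and the 0.1 threshold are assumptions made for this example; the article does not specify how the production rules are parameterized.

```python
def normalize(counts):
    """Turn raw counts per value into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


def average_distribution(daily_counts):
    """Average the normalized distributions of several days."""
    dists = [normalize(c) for c in daily_counts]
    keys = set().union(*dists)
    return {k: sum(d.get(k, 0.0) for d in dists) / len(dists) for k in keys}


def manhattan_distance(p, q):
    """Sum of absolute share differences over every value seen in p or q."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def judge(today_counts, past_counts, threshold=0.1):
    """Return (distance, flagged); flagged is True when the threshold is exceeded."""
    distance = manhattan_distance(
        normalize(today_counts), average_distribution(past_counts)
    )
    return distance, distance > threshold


# Hypothetical counts: the past two days are balanced, today is skewed.
past = [{"a": 50, "b": 50}, {"a": 50, "b": 50}]
distance, flagged = judge({"a": 80, "b": 20}, past)
```

For two distributions that each sum to 1, the distance ranges from 0 (identical) to 2 (no overlap), which makes a fixed threshold easy to reason about.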
3.3 Self‑Diagnosis
Provides interactive tools for users to compare any two quality metrics, drill down dimensions, extract sample data, and generate SQL templates for ad‑hoc analysis. A user‑behavior diagnosis view (Gantt‑style) helps trace individual sessions when deeper investigation is required.
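SQL template generation for ad-hoc analysis might look like the sketch below. It assumes a hypothetical table layout with a dt partition column; the template text, the function name build_drilldown_sql, and the validation rules are illustrative and are not taken from the original tool.

```python
import re
from string import Template

# Hypothetical template: value distribution of one parameter by one dimension.
SQL_TEMPLATE = Template(
    "SELECT $dimension, $param, COUNT(*) AS pv\n"
    "FROM $table\n"
    "WHERE dt = '$date'\n"
    "GROUP BY $dimension, $param\n"
    "ORDER BY pv DESC"
)


def build_drilldown_sql(table, param, dimension, date):
    """Fill the template after validating every piece that is spliced in."""
    for name in (table, param, dimension):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"invalid identifier: {name!r}")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        raise ValueError(f"invalid date: {date!r}")
    return SQL_TEMPLATE.substitute(
        table=table, param=param, dimension=dimension, date=date
    )


sql = build_drilldown_sql("events", "login_type", "platform", "2024-01-01")
```

Because identifiers cannot be passed as bound query parameters, the sketch validates them against a strict pattern before substitution instead of trusting user input.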
4. Key Technical Analysis
4.1 Sample Library
Samples are collected by routing 1 % of device traffic to a side‑stream and persisting it. Confidence is evaluated as a weighted mix of PV and UV differences; typical confidence exceeds 99 %.
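One way to route a stable 1% of devices and score the resulting sample is sketched below. Both parts are assumptions: the article does not say how devices are selected, so hashing the device ID is a common technique swapped in for illustration, and the confidence formula (one minus a weighted mix of relative PV and UV differences, with equal weights) is a hypothetical reading of the description above.

```python
import hashlib


def in_sample(device_id, percent=1):
    """Deterministically place a device in one of 100 buckets by hashing its ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def confidence(sample_pv, sample_uv, full_pv, full_uv, rate=0.01, pv_weight=0.5):
    """Score how well the scaled-up sample matches the full traffic.

    Assumed formula: 1 - (weighted mix of relative PV and UV differences).
    """
    pv_diff = abs(sample_pv / rate - full_pv) / full_pv
    uv_diff = abs(sample_uv / rate - full_uv) / full_uv
    return 1.0 - (pv_weight * pv_diff + (1 - pv_weight) * uv_diff)


# Hypothetical device IDs: roughly 1% should land in the sample.
sampled = sum(in_sample(f"device-{i}") for i in range(100_000))
score = confidence(sample_pv=990, sample_uv=100, full_pv=100_000, full_uv=10_000)
```

Hashing keeps a device in or out of the sample consistently across days, so all of a sampled device's events are retained rather than a random 1% of events.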
4.2 Three‑Layer Architecture
Traditional monitoring rules are split into three layers:
Computation Layer: homogenises heterogeneous data into unified quality metrics.
Decision Layer: extracts problems from metrics (human review or machine judgment).
Alert Layer: delivers human-readable notifications to producers and consumers, tracking receipt and resolution rates.
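The separation of the three layers can be illustrated with a minimal sketch in which each layer is a function that only consumes the previous layer's output. All names (Metric, Problem, compute_metrics, decide, format_alerts) and the fill-rate floor are hypothetical; the real system is considerably richer.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    value: float


@dataclass
class Problem:
    metric: Metric
    reason: str


def compute_metrics(records, field):
    """Computation layer: reduce raw records to a unified quality metric."""
    total = len(records)
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    rate = filled / total if total else 0.0
    return [Metric(name=f"{field}.fill_rate", value=rate)]


def decide(metrics, min_values):
    """Decision layer: extract problems by comparing metrics with expected floors."""
    problems = []
    for m in metrics:
        floor = min_values.get(m.name)
        if floor is not None and m.value < floor:
            reason = f"{m.value:.2%} is below the expected {floor:.2%}"
            problems.append(Problem(metric=m, reason=reason))
    return problems


def format_alerts(problems):
    """Alert layer: render problems as human-readable messages."""
    return [f"[data quality] {p.metric.name}: {p.reason}" for p in problems]


# Hypothetical records: two of four carry a non-empty login_type.
records = [{"login_type": "wx"}, {"login_type": ""}, {}, {"login_type": "qq"}]
metrics = compute_metrics(records, "login_type")
alerts = format_alerts(decide(metrics, {"login_type.fill_rate": 0.9}))
```

Because the decision layer sees only metrics, a rule written once applies to any data source that the computation layer can reduce to the same metric shape.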
4.3 Decision Engine & Metric Storage
The first-generation engine uses a Golang-based DSL (gengine) to execute more than 600,000 decisions per day. Future versions may adopt Python for its richer libraries. Metrics are stored in Tencent Cloud CTSDB (Elasticsearch-based) for schema-free, high-performance time-series queries.
5. Summary & Outlook
The presented quality‑review methodology and tools have been deployed across multiple Tencent businesses and are gaining importance. Future work will address efficiency (cost‑tiered reporting, columnar packing), reporting models (standardised, cascaded parameters, auto‑collection), testing tools (visual debugging, real‑time validation), and workflow optimisation (light processes, intelligent dashboards).
Q&A
Q1: How effective is the sample library? A1: Confidence typically exceeds 99 %; no false‑positive cases have been observed so far.
Q2: How to improve precision in recall‑oriented scenarios? A2: Adjust thresholds, refine algorithms, and balance generalisation with accuracy; no universal solution exists.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
