Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform
This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development. The platform defines six quality dimensions, draws on open‑source solutions such as Apache Griffin and Qualitis, and implements rule definition, rule execution, alerting, and workflow interruption on a Spark‑based architecture.
1. Background
In data development, quality issues are often discovered passively, leading to costly re‑runs and resource waste. Typical scenarios include missing handling of new enumeration values after business changes and mismatched record counts after data synchronization.
Data quality is a universal challenge; companies need timely alerts or pipeline blocking to minimize downstream impact. Traditional approaches rely on custom jobs and scripts, which are hard to maintain and lack unified standards.
2. Architecture Design
2.1 Industry Analysis
Apache Griffin (eBay) – an open‑source data quality service built on Hadoop and Spark. Its architecture follows a Define → Measure → Analyze flow, with rule implementation currently limited to the Accuracy dimension. Griffin’s scheduling uses an internal scheduler and Apache Livy to submit Spark jobs, which does not integrate well with upstream‑triggered workflows.
Qualitis (WeBank) – a Spring‑Boot based platform that relies on Linkis for computation, offering quality model building, execution, and task management. It extends Griffin with features such as abnormal data, log, and resource management.
DataWorks (Alibaba Cloud) – a one‑stop big‑data platform that includes data quality solutions and integrates tightly with its scheduling system, enabling pipeline interruption.
2.2 X‑Select DQC Architecture
The DQC platform consists of four main components:
DQC‑Service: manages quality rules, validation, and result display.
DQC‑DS: connects to the Metadata Center to obtain catalog connection information.
DQC‑Scheduler: registers and schedules rule tasks on the SOL scheduler.
DQC‑Executor: receives parameters from the service, packages them into Spark jobs, executes the rules, and writes back results.
3. Core Module Implementation
3.1 Rule Definition
Rules are expressed in SQL, which is familiar to data engineers. Three hierarchical levels are defined:
Monitoring Object – each table is a monitoring object.
Rule Group – manages scheduling frequency for a set of rules.
Rule – the finest granularity, e.g., row count, null count, duplicate count.
Six quality dimensions (Accuracy, Completeness, Timeliness, Uniqueness, Validity, Consistency) are abstracted into rule templates. The platform provides 22 built‑in templates covering table‑level, column‑level, cross‑table, and cross‑source checks.
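As a rough illustration of how a rule template can be instantiated into executable SQL, here is a minimal sketch. The dict-based template store, the `null_count` template name, and the `instantiate` helper are illustrative assumptions, not the platform’s actual API:

```python
# Hypothetical rule template for a Completeness check; the platform's real
# 22 built-in templates are not published, so names here are illustrative.
NULL_COUNT_TEMPLATE = {
    "dimension": "Completeness",
    "name": "null_count",
    "sql": "SELECT COUNT(*) FROM {catalog}.{database}.{table} WHERE {column} IS NULL",
}

def instantiate(template: dict, **params: str) -> str:
    """Fill a template's placeholders to produce an executable rule SQL."""
    return template["sql"].format(**params)

rule_sql = instantiate(
    NULL_COUNT_TEMPLATE,
    catalog="hive", database="dw", table="orders", column="user_id",
)
print(rule_sql)  # SELECT COUNT(*) FROM hive.dw.orders WHERE user_id IS NULL
```

Expressing templates as parameterized SQL keeps them familiar to data engineers while letting one template cover many monitoring objects.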
3.2 Rule Execution
DQC uses Spark as the execution engine due to its support for YARN resource isolation, dynamic scaling, and multi‑source queries. A catalog layer (catalog.database.table) enables federated queries across data sources. Execution follows three phases:
Init: parse parameters, build the execution context, and start Spark.
Run: extract data via DataConnector, register TempViews, replace placeholders in the DQC SQL, and run it as SparkSQL.
Stop: clean up the context.
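The three phases above can be sketched as a plain-Python lifecycle. In the real executor the `run_sql` callback would be a SparkSession running SparkSQL; here it is stubbed so the flow stays self-contained, and the class and parameter names are assumptions:

```python
class RuleExecutor:
    """Hedged sketch of the Init/Run/Stop phases of rule execution."""

    def __init__(self, params, run_sql):
        # Init: parse parameters and build the execution context
        # (in practice this is where Spark would be started).
        self.params = params
        self.run_sql = run_sql
        self.result = None

    def run(self):
        # Run: replace placeholders in the DQC SQL, then execute it.
        sql = self.params["sql"].format(**self.params["bindings"])
        self.result = self.run_sql(sql)
        return self.result

    def stop(self):
        # Stop: clean up the context (spark.stop() in practice).
        self.params = None

executor = RuleExecutor(
    {"sql": "SELECT COUNT(*) FROM {table}", "bindings": {"table": "dw.orders"}},
    run_sql=lambda sql: 42,  # stand-in for spark.sql(sql) + collect
)
print(executor.run())  # 42
executor.stop()
```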
Rules can be triggered by timed schedules or workflow dependencies. Alerts are sent via weak (WeChat) and strong (phone) notifications. If a rule fails, the DQC‑Executor can interrupt downstream tasks by acting as a data‑quality operator within the workflow.
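The gate decision made by such a data-quality operator can be sketched as follows. The `PASS`/`ALERT`/`BLOCK` outcomes and the mapping from rule severity to interruption are illustrative assumptions about how a scheduler might consume the result:

```python
def quality_gate(rule_passed: bool, severity: str) -> str:
    """Decide what a failed rule does to the workflow (hypothetical outcomes)."""
    if rule_passed:
        return "PASS"   # downstream tasks proceed normally
    if severity == "strong":
        return "BLOCK"  # interrupt downstream tasks; escalate via phone
    return "ALERT"      # weak rule: notify (e.g. WeChat) but do not block

print(quality_gate(False, "strong"))  # BLOCK
print(quality_gate(False, "weak"))    # ALERT
```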
3.3 Result Evaluation
Two comparison modes are supported: fixed‑value and fluctuation (daily/weekly/monthly). Fixed‑value works for metrics like null count, while fluctuation is suitable for row counts that naturally vary. The platform provides both modes to reduce false positives.
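A minimal sketch of the two comparison modes, assuming a simple threshold API; the function names and the relative-tolerance formulation are illustrative, not the platform’s actual implementation:

```python
def fixed_value_check(metric: float, expected: float) -> bool:
    """Fixed-value mode: the metric must equal the expected value exactly."""
    return metric == expected

def fluctuation_check(metric: float, baseline: float, tolerance: float) -> bool:
    """Fluctuation mode: the metric may deviate from a baseline
    (e.g. yesterday's / last week's value) by at most `tolerance`."""
    if baseline == 0:
        return metric == 0
    return abs(metric - baseline) / baseline <= tolerance

print(fixed_value_check(0, 0))              # True: no nulls expected, none found
print(fluctuation_check(1050, 1000, 0.10))  # True: 5% growth, within the 10% band
```

Using a relative band for naturally varying metrics such as row counts is what keeps the fluctuation mode from firing false positives on normal day-to-day growth.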
Alerting includes weak notifications to all configured users and strong escalations via phone calls until acknowledgment.
4. Summary & Outlook
Currently DQC runs over 1,200 rules daily, monitoring more than 300 tables with 4,000+ executions per day, catching critical issues promptly and improving data availability. Future work includes establishing a more comprehensive SLA mechanism, expanding customizable rule templates, and supporting real‑time validation.
References
1. https://github.com/WeBankFinTech/Qualitis
2. https://github.com/apache/griffin
3. https://help.aliyun.com/document_detail/73660.html
Xingsheng Youxuan Technology Community
