Designing a Scalable Data Quality Center for Offline Big‑Data Pipelines
This article describes the design and implementation of a platform‑wide Data Quality Center for offline big‑data pipelines, covering research of existing solutions, design goals, system architecture based on DolphinScheduler, rule definition language, binding and execution mechanisms, and future enhancements such as lineage monitoring and real‑time checks.
Background
In daily data‑development work, a completed task may still produce unreliable results because of upstream data anomalies or bugs in processing logic. Detecting such issues can take a long time, especially for offline workloads, and may lead to incorrect business decisions. This is a typical big‑data quality problem.
Existing Open‑Source and Cloud Solutions
Apache Griffin
Apache Griffin is an eBay‑open‑source data‑quality service built on Hadoop and Spark. Its workflow follows three steps: Define (quality rule definition), Measure (task execution on Spark), and Analyze (result visualization). Griffin implements the six industry‑standard dimensions (Accuracy, Completeness, Timeliness, Uniqueness, Validity, Consistency) but the open‑source project only provides implementations for Accuracy rules. Tasks are scheduled by an internal scheduler and submitted to Spark via Apache Livy, which makes real‑time enforcement difficult.
Architecture Diagram
Qualitis
Qualitis is an open‑source data‑quality management system from WeBank. It relies on Linkis for task dispatch and Spark for execution, and integrates with DataSphereStudio, allowing quality checks to be embedded directly in workflow DAGs, thus improving timeliness.
Alibaba Cloud DataWorks Data Quality
DataWorks provides a one‑stop big‑data platform that includes a data‑quality module. Its implementation depends on other Alibaba Cloud components, offering useful product‑level references for custom design.
Design Goals
Support offline data‑quality management initially.
Provide a generic rule description language and management UI.
Schedule quality tasks with the company’s unified scheduler, enabling strong enforcement (blocking downstream tasks on failure) while keeping scheduler intrusion minimal.
Visualize quality results.
System Design
Scheduling Platform Background
The offline scheduling platform is built on Apache DolphinScheduler (DS), a distributed visual DAG scheduler that supports Shell, Python, Spark, Flink, etc.
Architecture Overview
DS consists of a Master node that listens and schedules tasks and Worker nodes that execute them. Each task must have a cron expression; Quartz generates the DAG command, and only one Master obtains execution rights.
Overall Platform Architecture
DQC Web UI – front‑end for rule creation and management.
DQC (Go) – back‑end service handling entities such as rules, templates, tasks, and results.
DS (Data Quality part) – integrates DQC tasks into the DS scheduling flow with modest modifications.
DQC SDK (JAR) – encapsulates rule parsing, query construction, and execution logic; invoked by DS as a shell‑style task to keep intrusion low.
Rule Specification
Scalar‑Oriented Rule Model
To keep the platform developer‑friendly, rules are expressed as three scalar elements, similar to unit testing:
Actual Value – a scalar extracted from the task’s output Hive table via a user‑provided SQL query.
Expected Value – the scalar that the Actual Value should match.
Assert – a comparison operator (greater than, equal, less than) applicable to numeric scalars.
Example – “field‑null count” rule:
Actual Value = SELECT COUNT(*) FROM result_table WHERE field IS NULL, Expected Value = 0, Assert = greater than.
Rule Management
Rule Templates
Templates abstract reusable components: the SQL definition (with placeholders), the comparison operator, parameter definitions, and metadata. The “null‑field count” template contains a placeholder for the field name and a “greater than” operator.
Rule Entities
Rule entities are concrete instances of a template, specifying the Expected Value, the exact operator, and concrete parameter values. Multiple entities can be created from the same template, e.g., a uniqueness check for user_id in a specific table.
Rule Binding to DS Tasks
Two binding approaches were considered:
Store the full rule JSON in the DAG metadata – requires synchronization on rule updates.
Store a list of Rule IDs in the DAG; the task fetches rule details at execution time – chosen for minimal DS intrusion.
Rule Execution
Strong vs. Weak Rules
Strong rules run synchronously with the bound task; a failure marks the task as failed and blocks downstream execution.
Weak rules run asynchronously; failures generate alerts but do not affect the original task’s status.
DQC Task & DQC SDK Implementation
A DQC Task extends DS’s AbstractTask and implements the handle method. Execution steps:
Extract the list of Rule IDs bound to the Job Task.
Fetch rule details for each ID.
Construct the full SQL query by injecting parameters into the template.
Execute the query (via Spark or Presto).
Apply the Assert operator to compare Actual and Expected values.
Two query engines were evaluated:
Spark – highly configurable but requires more setup.
Presto – no extra configuration, simpler to develop, and performance is acceptable for offline workloads.
Presto was selected. When multiple rules target the same table, their SQLs are aggregated to reduce I/O, and different SQLs are executed in parallel threads.
Execution Results
Each rule’s result is displayed in the DQC Web UI. Task‑level aggregation is not yet implemented. Failed rules trigger alerts to designated recipients.
Practical Considerations
Key open questions:
Which rules are essential for a given task to ensure data quality?
How to prioritize rule coverage when it is impossible to add rules for every table and column?
These decisions require domain knowledge; the platform provides the tooling, while developers must enforce data‑quality standards within their teams.
Future Roadmap
Build lineage‑based, end‑to‑end data‑quality monitoring to detect downstream impacts of weak rule violations.
Define quantitative metrics to measure overall data‑quality health.
Support real‑time data‑quality checks for streaming workloads.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
