
How DataWorks Turns Data Quality Rules into Code with Data Contracts

This article explains how DataWorks integrates data quality specifications directly into the SQL development workflow using Data Contracts, addressing governance lag, versioning gaps, and trust issues while providing a unified, version‑controlled, and automated quality assurance process for offline data pipelines.

Alibaba Cloud Developer

Introduction

For developers, the core challenge of data quality in offline data development is not merely configuring rules. It is ensuring that quality rules can be built into the development and delivery process cheaply and reliably, just like code.

Problems of Separate Development and Governance

Governance lag: quality rules are configured after data goes live, delaying problem detection.

Unsynchronised iteration: when SQL logic changes, rules still encode outdated assumptions, causing false positives or missed alerts.

Missing version management: rules are detached from code review, diff and rollback, making them hard to trace.

Rising trust cost: downstream consumers repeatedly confirm data constraints, increasing communication overhead.

DataWorks Solution: Data Contracts for "Code as Quality"

DataWorks introduces the Data Contracts concept, embedding quality rules as YAML Spec files within the development workflow. This achieves a one‑stop development‑governance model where rules share the same lifecycle as SQL nodes.

Development as governance: write the quality Spec directly in the IDE alongside the SQL, so rules share the code's lifecycle.

Engineering management: Spec supports version control, code review, diff comparison, and automatic deployment with the release pipeline.

Closed‑loop execution: the rule becomes a deliverable of the node and is automatically executed during scheduling, ensuring pre‑emptive quality assurance.
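One way to picture rules sharing a lifecycle with the SQL node is a single deliverable whose version fingerprint covers both parts. The sketch below is illustrative only: `NodeBundle`, its fields, and the spec text are hypothetical and do not reflect the actual DataWorks Spec format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBundle:
    """Hypothetical deliverable: a SQL node plus its quality Spec text."""
    sql: str
    spec_yaml: str

    def fingerprint(self) -> str:
        # Hash both parts together so a change to either yields a new version.
        digest = hashlib.sha256()
        for part in (self.sql, self.spec_yaml):
            data = part.encode("utf-8")
            # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
        return digest.hexdigest()


# Illustrative content only; the spec keys below are invented for this sketch.
v1 = NodeBundle(
    sql="SELECT id, status FROM dws_d_dqc_suggesion_demo WHERE ds = '20240101'",
    spec_yaml="checks:\n  - column: id\n    rule: not_null\n",
)
# Same SQL, one extra rule: the bundle as a whole counts as a new version.
v2 = NodeBundle(
    sql=v1.sql,
    spec_yaml=v1.spec_yaml + "  - column: status\n    rule: not_null\n",
)
```

Because the fingerprint changes when only the Spec changes, a rule edit becomes as visible to review and rollback as a SQL edit.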

Current Dilemma

Typical engineering pipelines separate SQL development from data quality configuration, leading to:

Rules added only after data is online, causing delayed issue discovery.

SQL changes not reflected in rules, resulting in false‑positive or missed alerts.

Quality governance reduced to post‑incident remediation.

Rules not part of code review, making them hard to audit, track, or roll back.
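The drift between SQL and rules described above can be made concrete with a small check that compares the columns a rule set refers to against the columns the SQL currently produces. This is a minimal sketch under stated assumptions: the `Rule` record and `find_stale_rules` helper are hypothetical names, and the output columns are supplied by hand rather than parsed from SQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record: which column it checks and what it asserts."""
    column: str
    check: str


def find_stale_rules(rules, output_columns):
    """Return rules that reference columns the SQL no longer produces."""
    known = set(output_columns)
    return [rule for rule in rules if rule.column not in known]


rules = [Rule("user_id", "not_null"), Rule("amount", "non_negative")]
# Suppose a SQL change renamed `amount` to `quantity` but nobody updated the rules.
columns_after_change = ["id", "user_id", "quantity"]
stale = find_stale_rules(rules, columns_after_change)
```

When rules and SQL are reviewed together, a check of this kind can run at submit time instead of surfacing later as a false positive or a missed alert.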

Integrated Workflow in DataWorks

The end‑to‑end process follows the stages: develop → test → submit → schedule → iterate. An example table creation is shown below:

CREATE TABLE IF NOT EXISTS dws_d_dqc_suggesion_demo (
  `id` BIGINT COMMENT 'primary key',
  `user_id` STRING COMMENT 'user ID',
  `item_id` STRING COMMENT 'item ID',
  `shop_id` STRING COMMENT 'shop ID',
  `name` STRING COMMENT 'user name',
  `family_name` STRING COMMENT 'family name',
  `birth_time` DATETIME COMMENT 'birthday, stored as a date type',
  `order_url` STRING COMMENT 'order URL, a web page address',
  `create_time` DATETIME COMMENT 'order time, stored as a date type',
  `order_time` STRING COMMENT 'order time',
  `user_ip` STRING COMMENT 'IP address of the ordering client',
  `user_mac` STRING COMMENT 'MAC address of the ordering client',
  `user_agent` STRING COMMENT 'client identifier at order time',
  `email` STRING COMMENT 'email address of the user account',
  `phone_number` STRING COMMENT 'contact number of the user',
  `amount` STRING COMMENT 'purchase quantity',
  `unit_price` DECIMAL(38,18) COMMENT 'unit price',
  `client_token` STRING COMMENT 'end-to-end unique identifier generated at order time, prevents duplicate orders on failed retries',
  `status` STRING COMMENT 'order status: Ready - ready, WaitingPayed - awaiting payment, Payed - paid and awaiting shipment, Canceled - canceled, Shipped - shipped, WaitingCollecting - delivered but not collected, Delivered - received, Confirmed - confirmed'
) PARTITIONED BY (ds STRING COMMENT 'date partition, format yyyymmdd') LIFECYCLE 365;
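To give a feel for the kind of constraints such a table invites, the sketch below runs a few checks against a trimmed stand-in for it, using Python's built-in `sqlite3` module in place of MaxCompute. The choice of checks is an assumption made for illustration (the status list is taken from the column comment; the others are guesses at reasonable rules), and none of this is DataWorks Spec syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed stand-in for the demo table; SQLite types replace MaxCompute types.
conn.execute("CREATE TABLE demo (id INTEGER, email TEXT, status TEXT, ds TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "Payed", "20240101"),
        (2, "b@example.com", "Shipped", "20240101"),
        # Deliberately bad row: duplicate id, missing email, unknown status.
        (2, None, "Refunded", "20240101"),
    ],
)

# Each check is a query that counts violations; zero means the check passes.
CHECKS = {
    "id_not_null": "SELECT COUNT(*) FROM demo WHERE id IS NULL",
    "id_unique": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM demo GROUP BY id HAVING COUNT(*) > 1)"
    ),
    "email_not_null": "SELECT COUNT(*) FROM demo WHERE email IS NULL",
    "status_in_enum": (
        "SELECT COUNT(*) FROM demo WHERE status NOT IN ("
        "'Ready','WaitingPayed','Payed','Canceled','Shipped',"
        "'WaitingCollecting','Delivered','Confirmed')"
    ),
}

violations = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
```

Expressing each rule as a violation count keeps the pass condition uniform: a rule passes when its count is zero.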

Configure Rules in the IDE

After completing SQL development, click the "Quality Test" button in the editor toolbar to open the quality‑test panel and define a data‑quality Spec. The panel allows direct editing of the YAML Spec, which the DataWorks Agent can also generate from natural‑language prompts.

Quality test panel

Development and Testing

Once the Spec is defined, developers can run tests directly in the IDE to validate that the rules behave as expected.

Test results in IDE
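Validating that a rule "behaves as expected" usually means confirming two things: it passes on data known to be good and it fires on data known to be bad. The following is a minimal sketch of that idea with a hypothetical null-rate rule; the function names, the threshold, and the sample rows are all assumptions made for illustration.

```python
def null_rate(rows, column):
    """Fraction of rows whose value for `column` is None (0.0 for no rows)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_null_rate(rows, column, max_rate):
    """Hypothetical rule: pass when the null rate stays within `max_rate`."""
    return null_rate(rows, column) <= max_rate


# Fixture the rule should accept: no missing phone numbers.
good = [{"phone_number": "555-0100"}, {"phone_number": "555-0101"}]
# Fixture the rule should reject: half the phone numbers are missing.
bad = [{"phone_number": None}, {"phone_number": "555-0101"}]

passes_on_good = check_null_rate(good, "phone_number", max_rate=0.1)
fails_on_bad = not check_null_rate(bad, "phone_number", max_rate=0.1)
```

A rule that has only ever been run on clean data has not really been tested, so the failing fixture matters as much as the passing one.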

Submit and Release

After successful testing, the submission process packages the Spec together with the SQL node, pushes it through the version‑control system, and deploys it to the production scheduler.

Release workflow
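Because the Spec travels through version control as text, a rule change can be reviewed like any other diff. The sketch below uses Python's standard `difflib` module on two revisions of a made-up spec; the YAML keys and the `spec@v1` / `spec@v2` labels are illustrative assumptions, not DataWorks syntax.

```python
import difflib

# Two revisions of a made-up spec; the keys are illustrative only.
old_spec = """checks:
  - column: unit_price
    rule: min
    value: 0
"""
new_spec = """checks:
  - column: unit_price
    rule: min
    value: 0.01
"""

# Compare line by line; lineterm="" avoids doubled newlines when printing.
diff_lines = list(
    difflib.unified_diff(
        old_spec.splitlines(),
        new_spec.splitlines(),
        fromfile="spec@v1",
        tofile="spec@v2",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

A reviewer sees the tightened threshold as a one-line change next to the SQL change that motivated it, and rolling back the release restores both together.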

View Execution Results

During production runs, the quality rules are automatically executed. Users can view scan logs and results in the operations console.

Execution logs
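The shape of a per-run quality scan can be sketched as a function that turns individual rule outcomes into log lines and an overall decision. Everything here is hypothetical: `CheckResult`, the `blocking` flag, and the log format are assumptions for illustration, and the article does not say whether failed rules halt downstream tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical outcome of one rule in a scheduled run."""
    name: str
    passed: bool
    blocking: bool  # assumption: some rules halt downstream tasks, others only warn


def summarize(results):
    """Return (proceed, log_lines) for one scheduled run's quality scan."""
    log_lines = []
    proceed = True
    for result in results:
        if result.passed:
            level = "PASS"
        elif result.blocking:
            level = "FAIL"
            proceed = False  # any failed blocking rule stops the run
        else:
            level = "WARN"
        log_lines.append(f"[{level}] {result.name}")
    return proceed, log_lines


proceed, log_lines = summarize([
    CheckResult("id_not_null", passed=True, blocking=True),
    CheckResult("email_format", passed=False, blocking=False),
    CheckResult("ds_format", passed=False, blocking=True),
])
```

Separating warnings from blocking failures is one common design choice; it lets teams add exploratory rules without risking the schedule.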

Future Work

Multi‑engine coverage: only MaxCompute SQL is supported today; support for EMR, Hologres, and StarRocks is in progress.

Lowering Spec barrier: continue improving AI‑assisted generation, syntax highlighting, and bulk editing to reduce the learning curve.

Deeper IDE integration: proactive detection of missing quality configs, automated fix suggestions, and tighter convergence of governance into the development workflow.

The ultimate goal is to make "quality evolving with delivery" the default experience for offline data development.
