How DataWorks Turns Data Quality Rules into Code with Data Contracts
This article explains how DataWorks integrates data quality specifications directly into the SQL development workflow using Data Contracts, addressing governance lag, versioning gaps, and trust issues while providing a unified, version‑controlled, and automated quality assurance process for offline data pipelines.
Introduction
For developers, the core challenge of data quality in offline data development is not merely configuring rules. It is making quality rules part of the development and delivery process, cheaply and reliably, just like code.
Problems of Separate Development and Governance
Governance lag: quality rules are configured after data goes live, delaying problem detection.
Unsynchronised iteration: when SQL logic changes, rules stay based on outdated assumptions, causing false positives or missed alerts.
Missing version management: rules are detached from code review, diff and rollback, making them hard to trace.
Rising trust cost: downstream consumers repeatedly confirm data constraints, increasing communication overhead.
DataWorks Solution: Data Contracts for "Code as Quality"
DataWorks introduces the Data Contracts concept, embedding quality rules as YAML Spec files within the development workflow. This achieves a one‑stop development‑governance model where rules share the same lifecycle as SQL nodes.
Development as governance: write the quality Spec directly in the IDE alongside the SQL, giving rules the same lifecycle as code.
Engineering management: Spec supports version control, code review, diff comparison, and automatic deployment with the release pipeline.
Closed‑loop execution: the rule becomes a deliverable of the node and is automatically executed during scheduling, ensuring pre‑emptive quality assurance.
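To illustrate why keeping rules in a file next to the SQL matters, here is a minimal Python sketch of diffing two revisions of a rule set, the way a code review would surface them. The rule names and fields (`check`, `min`, and so on) are hypothetical and do not reflect the actual DataWorks Spec schema, which this article does not show.

```python
from typing import Any

Rule = dict[str, Any]

def diff_rules(old: dict[str, Rule], new: dict[str, Rule]) -> dict[str, list[str]]:
    """Compare two rule sets keyed by rule name, as a reviewer would in a diff."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Hypothetical rule sets for two revisions of the same SQL node.
rules_v1 = {
    "user_id_not_null": {"column": "user_id", "check": "not_null"},
    "row_count_min": {"check": "row_count", "min": 1},
}
rules_v2 = {
    "user_id_not_null": {"column": "user_id", "check": "not_null"},
    "row_count_min": {"check": "row_count", "min": 100},
    "status_in_enum": {"column": "status", "check": "in_set"},
}

print(diff_rules(rules_v1, rules_v2))
```

Because the rules are plain data stored with the code, a tightened threshold or a newly added check shows up in review exactly like a changed line of SQL.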
Current Dilemma
Typical engineering pipelines separate SQL development from data quality configuration, leading to:
Rules added only after data is online, causing delayed issue discovery.
SQL changes not reflected in rules, resulting in false positives or missed alerts.
Quality governance reduced to post‑incident remediation.
Rules not part of code review, making them hard to audit, track, or roll back.
Integrated Workflow in DataWorks
The end‑to‑end process follows the stages: develop → test → submit → schedule → iterate. An example table creation is shown below:
CREATE TABLE IF NOT EXISTS dws_d_dqc_suggesion_demo (
    `id` BIGINT COMMENT 'Primary key',
    `user_id` STRING COMMENT 'User ID',
    `item_id` STRING COMMENT 'Item ID',
    `shop_id` STRING COMMENT 'Shop ID',
    `name` STRING COMMENT 'User name',
    `family_name` STRING COMMENT 'Family name',
    `birth_time` DATETIME COMMENT 'Birthday, as a datetime',
    `order_url` STRING COMMENT 'Order URL, a web page address',
    `create_time` DATETIME COMMENT 'Order time, as a datetime',
    `order_time` STRING COMMENT 'Order time',
    `user_ip` STRING COMMENT 'IP address of the ordering client',
    `user_mac` STRING COMMENT 'MAC address of the ordering client',
    `user_agent` STRING COMMENT 'Client identifier (user agent) at order time',
    `email` STRING COMMENT 'Email address of the user account',
    `phone_number` STRING COMMENT 'User contact number',
    `amount` STRING COMMENT 'Purchase quantity',
    `unit_price` DECIMAL(38,18) COMMENT 'Unit price',
    `client_token` STRING COMMENT 'End-to-end unique identifier generated at order time, used to avoid duplicate orders when a failed request is retried',
    `status` STRING COMMENT 'Order status: Ready - ready, WaitingPayed - awaiting payment, Payed - paid and awaiting shipment, Canceled - canceled, Shipped - shipped, WaitingCollecting - delivered but not collected, Delivered - received, Confirmed - confirmed'
) PARTITIONED BY (ds STRING COMMENT 'Date partition, format yyyymmdd') LIFECYCLE 365;
2.2.1 Configure Rules in the IDE
After completing SQL development, click the "Quality Test" button in the editor toolbar to open the quality‑test panel and define a data‑quality Spec. The panel allows direct editing of the YAML Spec, and the Spec can also be generated automatically by the DataWorks Agent from natural‑language prompts.
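As an illustration of what such rules might assert for the demo table, the following Python sketch expresses three checks as plain functions and counts violations over sample rows. It deliberately does not use the DataWorks Spec syntax, which this article does not show; the rule names and sample rows are made up for illustration. Only the column names, the status values, and the partition format come from the table definition.

```python
import re

# Status values listed in the COMMENT of the `status` column.
VALID_STATUS = {
    "Ready", "WaitingPayed", "Payed", "Canceled",
    "Shipped", "WaitingCollecting", "Delivered", "Confirmed",
}
DS_PATTERN = re.compile(r"\d{8}")  # partition format yyyymmdd

# Hypothetical rules: each maps a row (dict) to True when the row passes.
RULES = {
    "user_id_not_null": lambda r: r.get("user_id") is not None,
    "status_in_enum": lambda r: r.get("status") in VALID_STATUS,
    "ds_is_yyyymmdd": lambda r: DS_PATTERN.fullmatch(r.get("ds") or "") is not None,
}

def count_violations(rows):
    """Return {rule_name: number of rows that fail the rule}."""
    return {name: sum(1 for r in rows if not check(r)) for name, check in RULES.items()}

# Made-up sample rows: one clean, one with a null user_id,
# one with an unknown status and a badly formatted partition value.
sample_rows = [
    {"user_id": "u1", "status": "Payed", "ds": "20240101"},
    {"user_id": None, "status": "Payed", "ds": "20240101"},
    {"user_id": "u2", "status": "Refunded", "ds": "2024-01-01"},
]

print(count_violations(sample_rows))
```

Running checks like these against a small fixture before submission is one way to confirm that a rule fires on the bad rows it is meant to catch and stays quiet on clean ones.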
2.2.2 Development and Testing
Once the Spec is defined, developers can run tests directly in the IDE to validate that the rules behave as expected.
2.2.3 Submit and Release
After successful testing, the submission process packages the Spec together with the SQL node, pushes it through the version‑control system, and deploys it to the production scheduler.
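One way to picture packaging the Spec together with the SQL node is a single artifact whose version changes whenever either part changes. The sketch below is an assumption about the general idea, not the DataWorks release mechanism; `NodeBundle`, its fields, and the placeholder spec strings are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeBundle:
    """Hypothetical deliverable: the SQL and its quality spec travel together."""
    sql: str
    spec: str

    @property
    def version(self) -> str:
        # Hash both parts so a change to either one yields a new version.
        h = hashlib.sha256()
        h.update(self.sql.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
        h.update(self.spec.encode("utf-8"))
        return h.hexdigest()[:12]

# Placeholder contents; the spec text is not real DataWorks Spec syntax.
a = NodeBundle(sql="SELECT 1", spec="rules: []")
b = NodeBundle(sql="SELECT 1", spec="rules: [row_count_min]")

# A spec-only change produces a different bundle version.
print(a.version, b.version)
```

Treating the pair as one unit means a rollback restores the SQL and the rules that were written for it at the same time.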
2.2.4 View Execution Results
During production runs, the quality rules are automatically executed. Users can view scan logs and results in the operations console.
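The closed loop described here, where rules run as part of the scheduled node, can be sketched as a function that produces the node's output and then evaluates its checks, failing the run when a blocking check fails. This is a minimal illustration only: the `blocking` flag, `QualityGateError`, and `run_node` are hypothetical names, and the article does not state how DataWorks handles a failed rule.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    passed: Callable[[list[dict]], bool]
    blocking: bool  # hypothetical: whether a failure should stop downstream nodes

class QualityGateError(Exception):
    """Raised when a blocking check fails."""

def run_node(produce: Callable[[], list[dict]], checks: list[Check]) -> list[str]:
    """Run the node, then its checks. Returns names of failed non-blocking
    checks; raises QualityGateError on the first failed blocking check."""
    rows = produce()
    warnings = []
    for c in checks:
        if c.passed(rows):
            continue
        if c.blocking:
            raise QualityGateError(f"blocking check failed: {c.name}")
        warnings.append(c.name)
    return warnings

checks = [
    Check("row_count_min", lambda rows: len(rows) >= 1, blocking=True),
    Check("no_null_user_id",
          lambda rows: all(r["user_id"] is not None for r in rows),
          blocking=False),
]

# One row with a null user_id: the run succeeds but reports a warning.
print(run_node(lambda: [{"user_id": None}], checks))
```

Separating blocking from non-blocking checks is a design choice in this sketch: an empty output stops the run, while a softer issue is recorded for later review in the scan results.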
Future Work
Multi‑engine coverage: MaxCompute SQL is supported today; support for EMR, Hologres, and StarRocks is in progress.
Lowering the Spec barrier: continue improving AI‑assisted generation, syntax highlighting, and bulk editing to reduce the learning curve.
Deeper IDE integration: proactive detection of missing quality configs, automated fix suggestions, and tighter convergence of governance into the development workflow.
The ultimate goal is to make "quality evolving with delivery" the default experience for offline data development.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
