Data-CI: A SQL-Based Data Unit Testing Framework for ETL
The article introduces data-ci, a SQL‑driven unit testing framework that lets engineers write, organize, and automate data validation tests for ETL pipelines, providing assertions, failure callbacks, coverage reporting, and CI integration to improve data quality and reliability.
This article presents data-ci , a lightweight framework for writing unit tests directly against tables produced by ETL jobs using SQL. It explains the motivation behind data unit testing, outlines the three‑step validation process (selecting data, checking results, and investigating failures), and describes the shortcomings of ad‑hoc manual checks.
Data‑ci test cases are Python files prefixed with test_(daily|hourly)_ and consist of five parts: description , target_columns , prepare_data_sql , assert_data_ok , and assert_fail_callback_sql . The prepare_data_sql function returns the SQL that extracts the data to be validated, while assert_data_ok contains custom assertion logic written in Python. If an assertion fails, assert_fail_callback_sql runs to retrieve sample erroneous rows.
# -*- coding:utf-8 -*-
# @Time : 2017-05-22 11:58:34
# @Version : 1.0
# @Author : haitao.yao
description = u"测试 user_id 都能跟 db.users JOIN 上"
target_columns = ['user_id']
def prepare_data_sql(date_string):
return """
SELECT count(*) AS wrong_user_count
FROM etl.active_login_users a
LEFT OUTER JOIN db.users b ON a.user_id = b.id
WHERE a.data_date = ${date_string}
AND b.user_id IS NULL
"""
def assert_data_ok(date_string, result_data):
for line in result_data:
if line.wrong_user_count > 0:
raise Exception('active_login_users.user_id JOIN db.users 失败, 说明有 user_id 不合法 , line: %s' % line)
def assert_fail_callback_sql(date_string):
return """
SELECT * AS wrong_user_count
FROM etl.active_login_users a
LEFT OUTER JOIN db.users b ON a.user_id = b.id
WHERE a.data_date = ${date_string}
AND b.user_id IS NULL
LIMIT 20
"""The framework organizes test cases under a clear directory hierarchy ( tests/{catalog}/{database}.db/{table_name}/ ) so that the tested table is obvious. data‑ci consists of three modules: a CLI for argument parsing, a Parser that extracts test metadata, and an SQLExecutor that abstracts the underlying database engine (e.g., Presto, Hive, Redshift).
Command‑line utilities such as generate (to scaffold new test cases) and test‑run (to execute the three validation steps) enable integration with CI pipelines, allowing automated daily runs and alerting on failures.
After execution, data‑ci produces a coverage report that quantifies how many columns of each table are covered by tests, helping engineers assess data reliability and quickly locate problematic rows.
By storing test cases in version control, applying code review, and running them continuously, data‑ci improves data quality, reduces manual debugging effort, and provides measurable coverage for ETL data assets.
Liulishuo Tech Team
Help everyone become a global citizen!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.