Big Data 9 min read

Data-CI: A SQL-Based Data Unit Testing Framework for ETL

The article introduces data-ci, a SQL‑driven unit testing framework that lets engineers write, organize, and automate data validation tests for ETL pipelines, providing assertions, failure callbacks, coverage reporting, and CI integration to improve data quality and reliability.

Liulishuo Tech Team

Oct 22, 2017

Data-CI: A SQL-Based Data Unit Testing Framework for ETL

This article presents data-ci , a lightweight framework for writing unit tests directly against tables produced by ETL jobs using SQL. It explains the motivation behind data unit testing, outlines the three‑step validation process (selecting data, checking results, and investigating failures), and describes the shortcomings of ad‑hoc manual checks.

Data‑ci test cases are Python files prefixed with test_(daily|hourly)_ and consist of five parts: description, target_columns, prepare_data_sql, assert_data_ok, and assert_fail_callback_sql. The prepare_data_sql function returns the SQL that extracts the data to be validated, while assert_data_ok contains custom assertion logic written in Python. If an assertion fails, assert_fail_callback_sql runs to retrieve sample erroneous rows.

# -*- coding:utf-8 -*-
# @Time     : 2017-05-22 11:58:34
# @Version  : 1.0
# @Author   : haitao.yao

description = u"测试 user_id 都能跟 db.users JOIN 上"

target_columns = ['user_id']

def prepare_data_sql(date_string):
    return """
SELECT count(*) AS wrong_user_count 
FROM etl.active_login_users a 
LEFT OUTER JOIN db.users b ON a.user_id = b.id 
WHERE a.data_date = ${date_string}
AND b.user_id IS NULL
"""

def assert_data_ok(date_string, result_data):
    for line in result_data:
        if line.wrong_user_count > 0:
            raise Exception('active_login_users.user_id JOIN db.users 失败, 说明有 user_id 不合法 , line: %s' % line)

def assert_fail_callback_sql(date_string):
    return """
SELECT * AS wrong_user_count 
FROM etl.active_login_users a 
LEFT OUTER JOIN db.users b ON a.user_id = b.id 
WHERE a.data_date = ${date_string} 
AND b.user_id IS NULL 
LIMIT 20
"""

The framework organizes test cases under a clear directory hierarchy ( tests/{catalog}/{database}.db/{table_name}/) so that the tested table is obvious. data‑ci consists of three modules: a CLI for argument parsing, a Parser that extracts test metadata, and an SQLExecutor that abstracts the underlying database engine (e.g., Presto, Hive, Redshift).

Command‑line utilities such as generate (to scaffold new test cases) and test‑run (to execute the three validation steps) enable integration with CI pipelines, allowing automated daily runs and alerting on failures.

After execution, data‑ci produces a coverage report that quantifies how many columns of each table are covered by tests, helping engineers assess data reliability and quickly locate problematic rows.

By storing test cases in version control, applying code review, and running them continuously, data‑ci improves data quality, reduces manual debugging effort, and provides measurable coverage for ETL data assets.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data SQL Data Quality Unit Testing ETL Data Testing

Written by

Liulishuo Tech Team

Help everyone become a global citizen!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.