Building a CI/CD Pipeline with GitHub Actions for Testing and Deploying Apache Airflow DAGs to Amazon MWAA
This guide explains how to create a robust GitHub Actions CI/CD workflow that automatically tests Apache Airflow DAGs using pytest, flake8, and Black, then securely deploys them to Amazon Managed Workflows for Apache Airflow (MWAA), with optional Git hooks and a fork-and-pull collaboration model.
Introduction
In this article we will learn how to build an effective CI/CD workflow with GitHub Actions for our Apache Airflow DAGs. Using DevOps concepts of continuous integration and continuous delivery, we will automatically test and deploy Airflow DAGs to Amazon Managed Workflows for Apache Airflow (Amazon MWAA) on AWS.
Technologies
Apache Airflow
According to the documentation, Apache Airflow is an open‑source platform for programmatically authoring, scheduling, and monitoring workflows. With Airflow you create workflows as directed acyclic graphs (DAGs) written in Python.
Amazon Managed Workflows for Apache Airflow (MWAA)
Amazon MWAA is a highly available, secure, fully managed service for orchestrating Apache Airflow workflows. MWAA automatically scales execution capacity and integrates with AWS security services for fast, secure data access.
GitHub Actions
GitHub Actions makes CI/CD automation easy. It allows you to build, test, and deploy code directly from GitHub, triggered by events such as pushes, issue creation, or releases, and you can leverage community‑maintained actions.
Glossary
DataOps
DataOps is an automated, process‑oriented approach that data teams use to improve data analysis quality and shorten cycle time. It applies agile methods across the entire data lifecycle, from preparation to reporting.
DevOps
DevOps combines software development (Dev) and IT operations (Ops) practices to shorten system development lifecycles and enable continuous delivery of high‑quality software.
DevOps is a set of practices aimed at shortening the time between committing a change and that change being in production while ensuring high quality. – Wikipedia
Fast Failure
A fast‑failure system reports any condition that may indicate a fault immediately, allowing errors to be discovered early in the SDLC.
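To make the idea concrete, here is a small Python sketch (the name validate_dag_config and its rules are hypothetical, not from the demo repository): a helper that validates DAG settings at definition time, so a bad value fails on import rather than partway through a run.

```python
def validate_dag_config(config):
    """Fail fast: reject a bad configuration at DAG-definition time,
    so the error surfaces on import instead of mid-run."""
    required = {"schedule", "owner", "retries"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"Missing required DAG settings: {sorted(missing)}")
    if config["retries"] < 0:
        raise ValueError("retries must be non-negative")
    return config

# A valid configuration passes through unchanged ...
good = validate_dag_config({"schedule": "@daily", "owner": "data-team", "retries": 2})

# ... while a bad one raises immediately, long before any task executes.
try:
    validate_dag_config({"schedule": "@daily"})
except ValueError as err:
    failure_message = str(err)
```

The same principle drives every stage of the pipeline described below: the earlier a fault is reported, the cheaper it is to fix.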
Source Code
All source code for this demo, including GitHub Actions, pytest unit tests, and Git hooks, is open‑source on GitHub.
Architecture
The diagram below shows the architecture used in a recent blog post and video demo, where Apache Airflow programmatically loads data from Amazon Redshift and uploads it to an Amazon S3‑based data lake.
We will review how earlier DAGs were developed, tested, and deployed to MWAA using increasingly effective CI/CD workflows. The demonstrated workflow can also be applied to other Airflow resources such as SQL scripts, configuration files, Python requirements, and plugins.
Workflows
No DevOps
This minimal viable workflow loads a DAG directly into Amazon MWAA without applying CI/CD principles. Changes are made locally, copied to an S3 bucket, and automatically synced to MWAA. The changes are also (ideally) pushed back to a central Git repository.
The workflow has two major problems: the DAG can become out of sync between S3 and GitHub, and there is no fast‑failure DevOps concept, so errors may only be discovered after the DAG is imported into MWAA.
GitHub Actions
Compared with the previous workflow, a major improvement is using GitHub Actions to test and deploy code after it is pushed to GitHub. Although the code is still pushed directly to the main branch, the chance of a faulty DAG reaching MWAA is greatly reduced.
GitHub Actions also eliminate human error that could cause the DAG not to sync to S3, and they remove the need for Airflow developers to have direct access to the S3 bucket, improving security.
Test Types
The first GitHub Action, test_dags.yml, triggers on pushes to the dags directory and on pull-request events targeting main. It runs a series of tests: a Python dependency check, code style, code quality, DAG import errors, and unit tests. These tests catch problems before the second Action syncs the DAGs to S3.
name: Test DAGs

on:
  push:
    paths:
      - 'dags/**'
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements/requirements.txt
          pip check
      - name: Lint with Flake8
        run: |
          pip install flake8
          flake8 --ignore E501 dags --benchmark -v
      - name: Confirm Black code compliance (psf/black)
        run: |
          pip install pytest-black
          pytest dags --black -v
      - name: Test with Pytest
        run: |
          pip install pytest
          cd tests || exit
          pytest tests.py -v

Python Dependencies
This test installs the modules listed in requirements.txt and checks for missing or conflicting packages.
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements/requirements.txt
    pip check

It is essential to develop DAGs with the same Python version and module versions as the target Airflow environment. You can retrieve the Python version and installed modules inside Airflow with:
python3 --version; python3 -m pip list

Airflow's latest stable version is 2.2.2 (released 2021-11-15). At the time of writing, Amazon MWAA runs version 2.0.2 (released 2021-04-19) with Python 3.7.10.
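To apply that advice programmatically, one sketch (the helper find_version_mismatches is hypothetical, and the version numbers below are illustrative, not authoritative for MWAA) compares locally pinned requirements against the versions an environment reports:

```python
def find_version_mismatches(required, installed):
    """Compare pinned requirements against an environment's installed
    versions; report anything missing or at a different version."""
    mismatches = {}
    for package, version in required.items():
        actual = installed.get(package)
        if actual != version:
            mismatches[package] = {"required": version, "installed": actual}
    return mismatches

# Pins used for local development, versus what the Airflow worker reports.
pinned = {"apache-airflow": "2.0.2", "boto3": "1.17.54"}
reported = {"apache-airflow": "2.0.2", "boto3": "1.18.0"}

drift = find_version_mismatches(pinned, reported)
```

Catching version drift like this before deployment avoids the confusing situation where a DAG passes every local test but fails on import in MWAA.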
Flake8
Flake8 is a modular source‑code checker that enforces style consistency according to PEP 8. In this demo we ignore rule E501 (line length) to keep the example concise.
- name: Lint with Flake8
  run: |
    pip install flake8
    flake8 --ignore E501 dags --benchmark -v

Black
Black is an uncompromising code formatter that makes all Python code look the same, speeding up code review. The repository uses a pre‑commit Git hook to run Black before committing.
- name: Confirm Black code compliance (psf/black)
  run: |
    pip install pytest-black
    pytest dags --black -v

pytest
pytest is a mature, full‑featured testing framework for Python. The test_dags.yml action runs the tests.py file, which contains several unit tests that verify DAG import, naming conventions, tags, owners, retry limits, and more.
import os
import sys

import pytest
from airflow.models import DagBag

sys.path.append(os.path.join(os.path.dirname(__file__), "../dags"))
sys.path.append(os.path.join(os.path.dirname(__file__), "../dags/utilities"))

os.environ["AIRFLOW_VAR_DATA_LAKE_BUCKET"] = "test_bucket"
# ... other environment variables ...


@pytest.fixture(params=["../dags/"])
def dag_bag(request):
    return DagBag(dag_folder=request.param, include_examples=False)


def test_no_import_errors(dag_bag):
    assert not dag_bag.import_errors

# ... additional tests omitted for brevity ...

Fork & Pull
Two collaborative development models are recommended:
Shared-repository model: using feature branches that are reviewed and merged into main.
Fork‑and‑pull model: fork the repository, make changes, open a pull request, and merge after approval and successful tests.
The fork-and-pull model greatly reduces the chance that bad code is merged before all tests have passed.
Sync DAGs to S3
The second GitHub Action sync_dags.yml runs after test_dags.yml completes successfully (or after a pull request is merged) and syncs the dags folder to an S3 bucket.
name: Sync DAGs

on:
  workflow_run:
    workflows:
      - 'Test DAGs'
    types:
      - completed
  pull_request:
    types:
      - closed

jobs:
  deploy:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: actions/checkout@master
      - uses: jakejarvis/s3-sync-action@master
        env:
          AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
          SOURCE_DIR: 'dags'
          DEST_DIR: 'dags'

The action requires three encrypted GitHub secrets, the S3 bucket name and the AWS access key pair, which must be created in the repository settings.
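Under the hood, a sync of this kind boils down to comparing local files against remote objects. A simplified, S3-free sketch of that diff logic (the function plan_sync and the checksum values are illustrative, not how the action is actually implemented):

```python
def plan_sync(local_files, remote_files):
    """Given {path: checksum} maps for both sides, return which files
    to upload (new or changed) and which remote keys to delete."""
    upload = [path for path, checksum in local_files.items()
              if remote_files.get(path) != checksum]
    delete = [path for path in remote_files if path not in local_files]
    return sorted(upload), sorted(delete)

# The local dags folder versus the current contents of the S3 prefix.
local = {"dags/etl.py": "abc123", "dags/report.py": "def456"}
remote = {"dags/etl.py": "abc123", "dags/old_dag.py": "000999"}

uploads, deletions = plan_sync(local, remote)
```

Delegating this comparison to a maintained action, rather than hand-running copy commands, is exactly what removes the S3/GitHub drift problem described in the "No DevOps" workflow above.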
Local Testing and Git Hooks
To further improve the CI/CD pipeline, use Git hooks to run tests locally before pushing code. A pre‑push hook can execute the same test suite used in the GitHub Action, preventing bad code from ever reaching the remote repository.
#!/bin/sh

# Do nothing if there are no commits to push
if [ -z "$(git log @{u}..)" ]; then
    exit 0
fi

sh ./run_tests_locally.sh

Make the hook executable:
chmod 755 .git/hooks/pre-push

The run_tests_locally.sh script runs flake8, Black, and pytest locally:
#!/bin/sh

echo "Starting Flake8 test..."
flake8 --ignore E501 dags --benchmark || exit 1

echo "Starting Black test..."
python3 -m pytest --cache-clear
python3 -m pytest dags/ --black -v || exit 1

echo "Starting Pytest tests..."
cd tests || exit
python3 -m pytest tests.py -v || exit 1

echo "All tests completed successfully! 🥳"

References
Testing Airflow DAGs (documentation)
Testing Airflow code (YouTube video)
GitHub: Building and Testing Python (documentation)
Manning: Chapter 9 – Data Pipelines with Apache Airflow
DevOps Cloud Academy
Exploring industry DevOps practices and technical expertise.