
Understanding Apache Airflow DAGs, Operators, and Scheduling

This article explains Apache Airflow's core concepts: DAG definitions, scheduling intervals, task dependencies, and the main operator types (BashOperator, PythonOperator, branch operators, sensors, and custom operators), with code examples and configuration details for building robust data pipelines.


The article begins by describing the challenges of complex data processing logic, script interdependencies, and operational monitoring, and introduces Apache Airflow as a solution, referencing the official documentation and a book on Data Pipelines with Airflow.

It then reviews the previously covered Airflow installation and moves on to additional Airflow features, focusing on DAGs (Directed Acyclic Graphs), which are stored as Python files in the DAG_FOLDER directory. A DAG defines a workflow pipeline composed of multiple script tasks.

The key elements of a DAG are outlined:

What each script task does.

When the workflow should start.

The execution order of the scripts.

Task definitions use operators, which can be Bash commands, Python functions, or database operations. Operators are Python objects provided by Airflow and can be extended.

Scheduling is controlled by the schedule_interval parameter, which works together with start_date and end_date to define when a DAG runs. The article shows a daily schedule example with an accompanying diagram.

Dependencies between tasks are expressed with >> or <<. The scheduler continuously checks task states and resource availability, moving ready tasks to a queued state.

Trigger rules determine when downstream tasks start, with all_success being the default, and alternatives such as one_success, one_failed, none_failed, and none_skipped available.

Concurrency limits the number of simultaneously running tasks (default 16 task slots).

DAGs can be defined using the with context manager or the @dag decorator. Example using with:

import datetime

import pendulum
from airflow import DAG

with DAG(
    dag_id='example_bash_operator',
    schedule_interval='0 0 * * *',
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=60),
    tags=['example', 'example2'],
    params={"example_key": "example_value"},
) as dag:
    ...

Configuration parameters for a DAG are listed, including depends_on_past , email settings, retry policies, queue selection, priority weight, SLA, execution timeout, callbacks, trigger rules, and tags.

DAG runs record execution metadata; run IDs are prefixed with scheduled__, backfill__, or manual__. DAGs can be triggered via the CLI or REST API, optionally passing a conf dictionary.
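As a sketch, manual triggering with a conf dictionary might look like this (the host, credentials, and DAG name are illustrative; both commands assume a running Airflow 2.x instance):

```shell
# Trigger a manual run via the CLI, passing a conf dict
airflow dags trigger example_bash_operator \
  --conf '{"example_key": "example_value"}'

# Equivalent call against the stable REST API (basic auth assumed)
curl -X POST "http://localhost:8080/api/v1/dags/example_bash_operator/dagRuns" \
  -H "Content-Type: application/json" \
  --user "admin:admin" \
  -d '{"conf": {"example_key": "example_value"}}'
```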

Edge labels and sub‑DAGs/TaskGroups improve UI readability and reusability.

The operator section explains that operators support Jinja templating for common variables (e.g., {{ execution_date }}) and describes several methods for passing data between tasks:

Using XComs (key‑value store in the Airflow metadata DB). Example:

model_id = context["task_instance"].xcom_pull(task_ids="train_model", key="model_id")

Using op_kwargs to pass templated arguments.

Storing/retrieving data from external databases.

TaskFlow API (the @task decorator) for Python functions that exchange data via .output or **kwargs.

Various built‑in operators are described:

DummyOperator – placeholder or merge node.

BashOperator – runs shell commands or scripts.

Branch operators (e.g., BranchDateTimeOperator, BranchDayOfWeekOperator, BranchPythonOperator) – conditional branching based on time or logic.

PythonOperator – the most common operator; runs a Python callable with op_args/op_kwargs. Example:

from urllib import request

from airflow.operators.python import PythonOperator

def _get_data(output_path, **context):
    year, month, day, hour, *_ = context["execution_date"].timetuple()
    url = (
        f"https://dumps.wikimedia.org/other/pageviews/{year}/{year}-{month:02d}/"
        f"pageviews-{year}{month:02d}{day:02d}-{hour:02d}0000.gz"
    )
    request.urlretrieve(url, output_path)

get_data = PythonOperator(
    task_id="get_data",
    python_callable=_get_data,
    op_args=["/tmp/wikipageviews.gz"],
    dag=dag,
)

from pathlib import Path

import pandas as pd

def _calculate_stats(input_path, output_path):
    """Calculates event statistics."""
    Path(output_path).parent.mkdir(exist_ok=True)
    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    stats.to_csv(output_path, index=False)

calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    op_kwargs={"input_path": "/data/events.json", "output_path": "/data/stats.csv"},
    dag=dag,
)

TaskFlow example using the @task decorator:

from pprint import pprint

from airflow.decorators import task

@task(task_id="print_the_context")
def print_context(ds=None, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

Other operators covered include:

ExternalTaskSensor – waits for a task in another DAG to succeed before proceeding.

LatestOnlyOperator – ensures downstream tasks run only for the most recent DAG run.

Sensors – monitor external conditions (e.g., file existence) and can be made deferrable to reduce resource usage.

Custom operators and hooks – extend Airflow by inheriting from BaseOperator or BaseHook , allowing integration with external systems such as PostgreSQL or S3.

Tags: DAGs, scheduling, data pipelines, operators, Apache Airflow

Written by Big Data Technology Architecture