A Comprehensive Introduction to Apache Airflow: Architecture, Installation, and Usage
This article provides an in‑depth overview of Apache Airflow, covering its core concepts, advantages, architecture components, installation steps, example ETL DAG code, common command‑line tools, and practical tips for leveraging Airflow in data engineering workflows.
Apache Airflow is an open‑source platform for orchestrating, scheduling, and monitoring workflows. It was originally developed at Airbnb and is now a top‑level Apache Software Foundation project, having graduated from the Apache Incubator in 2019. It represents workflows as Directed Acyclic Graphs (DAGs) and provides a rich CLI and web UI for managing tasks.
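To make the DAG idea concrete, here is a minimal sketch that does not use Airflow at all. It relies only on Python's standard `graphlib` module (Python 3.9+), and the task names and the `execution_order` helper are assumptions chosen for illustration. It shows how a set of task dependencies determines a valid run order, and that a cyclic graph is rejected.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid run order for the tasks.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why a workflow graph must be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Airflow's scheduler does considerably more than this (schedules, retries, state tracking), but the ordering constraint it enforces is the same one.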
Key Advantages of Airflow
Flexible and easy to use: written in Python, allowing any task to be expressed as code.
Powerful: over 15 built‑in operators (shell, Python, MySQL, Hive, etc.) and extensible with custom operators.
Elegant: Jinja templating for parameterization and a human‑readable UI.
Highly extensible: multiple executors (LocalExecutor, CeleryExecutor, etc.) let a deployment scale from a single machine to a distributed pool of workers.
Rich command‑line tools for testing, deployment, and maintenance.
Airflow is free and can replace many ad‑hoc scripts (cron, ETL, monitoring) by centralizing them, automatically emailing logs on failures, and supporting distributed execution.
Airflow Architecture and Components
The architecture consists of the following core components:
Metadata Database: stores the state of DAG runs and task instances, along with settings such as connections and variables.
Scheduler: parses DAG files, creates DAG runs, and decides which tasks to execute.
Executor: dispatches tasks to workers; examples include LocalExecutor and CeleryExecutor.
Workers: the processes that actually run task logic.
Key concepts include Scheduler, DAG, DAGRun, TaskInstance, Executor, LocalTaskJob, and TaskRunner.
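The interplay between these components can be sketched with a toy loop in plain Python. This is not Airflow's implementation: the in-memory dictionary stands in for the metadata database, the state names are simplified, and `runnable_tasks` and `run_dag` are hypothetical helpers introduced only for illustration.

```python
def runnable_tasks(upstream, metadata):
    """Scheduler step: pick tasks that have not run yet and whose
    upstream tasks have all succeeded."""
    return [
        task_id
        for task_id, parents in upstream.items()
        if metadata[task_id] == "none"
        and all(metadata[parent] == "success" for parent in parents)
    ]


def run_dag(upstream, callables):
    """Repeat scheduler -> executor -> worker until nothing is runnable."""
    # Stand-in for the metadata database: task_id -> state.
    metadata = {task_id: "none" for task_id in upstream}
    history = []
    while True:
        ready = runnable_tasks(upstream, metadata)
        if not ready:
            break
        for task_id in ready:            # executor: dispatch one at a time
            try:
                callables[task_id]()     # worker: run the task logic
                metadata[task_id] = "success"
            except Exception:
                metadata[task_id] = "failed"
            history.append(task_id)
    return metadata, history


# Hypothetical three-task workflow with no-op task bodies.
upstream = {"extract": [], "transform": ["extract"], "load": ["transform"]}
callables = {name: (lambda: None) for name in upstream}

metadata, history = run_dag(upstream, callables)
print(history)
print(metadata)
```

If a task raises, its state becomes "failed" and its downstream tasks are never selected, which loosely mirrors how a failure blocks the rest of a DAG run.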
Installation and First Steps
Airflow requires a Python environment. Installation can be done via pip:
# Set AIRFLOW_HOME if needed
export AIRFLOW_HOME=~/airflow
# Install Airflow
pip install apache-airflow
# Initialize the metadata database (Airflow 1.10.x syntax; Airflow 2.x uses `airflow db init`)
airflow initdb
# Start the web server (default port 8080)
airflow webserver -p 8080
# Start the scheduler
airflow scheduler
By default, Airflow uses SQLite for its metadata database; you can switch to MySQL by editing airflow.cfg:
# Example configuration snippet
executor = LocalExecutor
sql_alchemy_conn = mysql://root:xxxxxx@localhost:3306/airflow
After starting the UI you can view DAGs, switch to tree view, Gantt chart, etc.
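Before restarting Airflow with a new backend, it can help to sanity-check the two settings from the configuration snippet. The sketch below uses only the standard library. It assumes both keys live in a `[core]` section, as in Airflow 1.10.x; the sample text, `read_core_settings`, and `describe_connection` are hypothetical names for illustration, and the password is a placeholder.

```python
import configparser
from urllib.parse import urlsplit

# Hypothetical airflow.cfg fragment; the credentials are placeholders.
SAMPLE_CFG = """
[core]
executor = LocalExecutor
sql_alchemy_conn = mysql://root:xxxxxx@localhost:3306/airflow
"""


def read_core_settings(cfg_text):
    """Read the executor and connection string from the [core] section."""
    # interpolation=None so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    core = parser["core"]
    return {
        "executor": core.get("executor"),
        "sql_alchemy_conn": core.get("sql_alchemy_conn"),
    }


def describe_connection(conn_url):
    """Summarize a connection URL without exposing the password."""
    parts = urlsplit(conn_url)
    return {
        "backend": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


settings = read_core_settings(SAMPLE_CFG)
print(settings["executor"])
print(describe_connection(settings["sql_alchemy_conn"]))
```

To check a real installation, the same function could be fed the contents of the airflow.cfg file under AIRFLOW_HOME instead of the sample string.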
Hello Airflow! – A Simple ETL Example
The following Python script defines a basic ETL DAG with three tasks: extract, transform, and load. Save it as tutorial.py in the DAGs folder (default ~/airflow/dags) and run it.
"""
### ETL DAG Tutorial Documentation
This ETL DAG is compatible with Airflow 1.10.x (specifically tested with 1.10.12) and is referenced
as part of the documentation that goes along with the Airflow Functional DAG tutorial located
[here](https://airflow.apache.org/tutorial_decorated_flows.html)
"""
# [START tutorial]
# [START import_module]
import json

# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG

# Operators; we need this to operate!
# Note: this import path is the Airflow 2.x one. On Airflow 1.10.x, import from
# airflow.operators.python_operator and pass provide_context=True to each PythonOperator.
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
# [END import_module]

# [START default_args]
default_args = {
    'owner': 'airflow',
}
# [END default_args]

# [START instantiate_dag]
with DAG(
    'tutorial_etl_dag',
    default_args=default_args,
    description='ETL DAG tutorial',
    schedule_interval=None,  # no schedule; the DAG runs only when triggered
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    # [END instantiate_dag]
    dag.doc_md = __doc__

    def extract(**kwargs):
        # Push hard-coded order data to XCom for the next task
        ti = kwargs['ti']
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        ti.xcom_push('order_data', data_string)

    def transform(**kwargs):
        # Pull the order data, sum the values, and push the total to XCom
        ti = kwargs['ti']
        extract_data_string = ti.xcom_pull(task_ids='extract', key='order_data')
        order_data = json.loads(extract_data_string)
        total_order_value = sum(order_data.values())
        ti.xcom_push('total_order_value', json.dumps({"total_order_value": total_order_value}))

    def load(**kwargs):
        # Pull the total and print it (a stand-in for writing to a target system)
        ti = kwargs['ti']
        total_value_string = ti.xcom_pull(task_ids='transform', key='total_order_value')
        total_order_value = json.loads(total_value_string)
        print(total_order_value)

    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Declare the execution order: extract, then transform, then load
    extract_task >> transform_task >> load_task
# [END main_flow]
# [END tutorial]
Once the file is in place, check that Airflow has picked up the DAG (Airflow 1.10.x command syntax):
# List active DAGs
airflow list_dags
# List tasks in the tutorial DAG
airflow list_tasks tutorial_etl_dag
# Show task hierarchy
airflow list_tasks tutorial_etl_dag --tree
Common Airflow commands include backfill, list_tasks, clear, pause, trigger_dag, webserver, and scheduler.
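The extract, transform, and load callables from the ETL example can also be exercised without a running Airflow instance. The sketch below is one possible approach, not an Airflow testing API: `StubTaskInstance` and `run_pipeline` are hypothetical helpers that imitate only the `xcom_push` and `xcom_pull` calls the example uses, and the three callables are copied from the DAG file so the block is self-contained.

```python
import json


class StubTaskInstance:
    """Hypothetical stand-in for a task instance; XComs live in a shared dict."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def extract(**kwargs):
    ti = kwargs['ti']
    data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
    ti.xcom_push('order_data', data_string)


def transform(**kwargs):
    ti = kwargs['ti']
    extract_data_string = ti.xcom_pull(task_ids='extract', key='order_data')
    order_data = json.loads(extract_data_string)
    total_order_value = sum(order_data.values())
    ti.xcom_push('total_order_value', json.dumps({"total_order_value": total_order_value}))


def load(**kwargs):
    ti = kwargs['ti']
    total_value_string = ti.xcom_pull(task_ids='transform', key='total_order_value')
    print(json.loads(total_value_string))


def run_pipeline():
    """Call the three callables in DAG order with a stubbed 'ti' and
    return the resulting XCom store for inspection."""
    store = {}
    for task_id, func in (("extract", extract), ("transform", transform), ("load", load)):
        func(ti=StubTaskInstance(task_id, store))
    return store


xcoms = run_pipeline()
```

Keeping the task logic in plain functions like this makes it straightforward to unit-test the data handling separately from scheduling concerns.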
Overall, Airflow offers a gentle learning curve and strong extensibility, and it is widely adopted by companies such as Adobe, Airbnb, Google, Lyft, and Alibaba.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
