Integrating Apache Airflow with ByteHouse: A Step‑by‑Step Guide
This guide explains how to integrate Apache Airflow with ByteHouse, highlighting scalability, automated workflow management, and simple deployment, and provides a step‑by‑step tutorial—including prerequisites, installation, configuration, DAG creation, and execution commands—to build a robust data pipeline for analytics and machine learning.
Apache Airflow combined with ByteHouse offers a powerful, efficient solution for managing and executing data workflows, delivering scalable, reliable pipelines, automated orchestration, and easy deployment of a cloud‑native data warehouse.
Main advantages:
1. Scalable and reliable data pipelines: Airflow defines and orchestrates complex workflows, while ByteHouse stores and processes large volumes of data efficiently.
2. Automated workflow management: Airflow's UI visualizes DAGs and simplifies scheduling, and the integration automates ETL processes.
3. Simple deployment and management: Both tools support local or cloud deployment, with ByteHouse offering a fully managed cloud‑native service.
Customer scenario: A fictional analytics company, “Data Insight Ltd.,” uses Airflow to orchestrate data pipelines that load e‑commerce data from AWS S3 into ByteHouse for analysis, reporting, dashboards, and machine‑learning models.
The pipeline triggers on a schedule or S3 events, retrieves files securely, transforms data, and loads it into ByteHouse, where SQL‑like queries, visual dashboards, and predictive models are built.
Summary of the scenario: By integrating Airflow with ByteHouse, the company achieves automated, end‑to‑end data loading from S3, leverages ByteHouse’s analytical and ML capabilities, and drives data‑driven decision making.
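To make the scenario concrete, the transform stage of such a pipeline can be sketched in plain Python. This is an illustrative stub only: the `transform` function name and the sample columns are hypothetical, and the S3 retrieval and ByteHouse load steps are out of scope here.

```python
import csv
import io

def transform(raw_csv: str) -> str:
    """Illustrative transform step: normalize headers and drop empty rows
    before the data is loaded into ByteHouse."""
    reader = csv.reader(io.StringIO(raw_csv))
    # Keep only rows that contain at least one non-blank cell.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out)
    # Normalize header names to lowercase for consistent column mapping.
    writer.writerow([col.strip().lower() for col in header])
    writer.writerows(body)
    return out.getvalue()

# Hypothetical sample resembling the e-commerce CSV pulled from S3.
raw = "Radio,MCC\nUMTS,262\n,\n"
print(transform(raw))
```

In the real pipeline this logic would run inside an Airflow task between the S3 download and the ByteHouse insert.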
Quick start – Prerequisites: Install pip in your environment and the ByteHouse CLI, then log in to your ByteHouse account.
Installation of Apache Airflow:
# Set AIRFLOW_HOME (optional)
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.1.3
PYTHON_VERSION="$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
If pip fails, try pip3 install instead. After installation, run airflow info to review the environment details.
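The constraint URL built by the shell snippet above can also be derived in Python, which makes the major.minor extraction explicit. A minimal sketch, mirroring the same URL scheme:

```python
import sys

AIRFLOW_VERSION = "2.1.3"

# Same major.minor extraction as the `cut` pipeline in the shell snippet.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{python_version}.txt"
)
print(constraint_url)
```

Pinning against this constraints file keeps transitive dependencies at versions the Airflow release was tested with.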
Airflow initialization:
# Initialize the metadata database
airflow db init
# Create an admin user
airflow users create \
--username admin \
--firstname admin \
--lastname admin \
--role Admin \
--email admin@example.com
# Start the web server (default port 8080)
airflow webserver --port 8080
Access the UI at http://localhost:8080/ with the credentials created above. To actually execute DAGs, also start the scheduler in a separate terminal with airflow scheduler.
YAML / configuration adjustments:
# Example configuration (default SQLite, can use MySQL)
sql_alchemy_conn = mysql+pymysql://airflow:<password>@<host>:3306/airflow
sql_alchemy_pool_enabled = False
dags_folder = /home/admin/airflow/dags
Create a DAG to interact with ByteHouse:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
with DAG(
    'test_bytehouse',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    tags=['example'],
) as dag:
    tImport = BashOperator(
        task_id='ch_import',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml "INSERT INTO korver.cell_towers_1 FORMAT csv INFILE \'/opt/bytehousecli/data.csv\'"',
    )
    tSelect = BashOperator(
        task_id='ch_select',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "SELECT * FROM korver.cell_towers_1 LIMIT 10 INTO OUTFILE \'/opt/bytehousecli/dataout.csv\' FORMAT csv"',
    )
    tImport >> tSelect
Run python test_bytehouse.py to check the file for syntax errors; once it sits in the dags_folder, refresh the Airflow UI to see the DAG listed.
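Embedding SQL with file paths inside a bash_command string is easy to get wrong because of nested quoting. One way to sidestep this is to build the command with shlex.quote; a small sketch, reusing the table and paths from the DAG above:

```python
import shlex

# The SQL statement, with the CSV path as a normal SQL string literal.
sql = "INSERT INTO korver.cell_towers_1 FORMAT csv INFILE '/opt/bytehousecli/data.csv'"

# shlex.quote escapes the whole statement so the shell passes it to
# bytehouse-cli as a single argument, regardless of embedded quotes.
bash_command = (
    "$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml "
    + shlex.quote(sql)
)
print(bash_command)
```

The resulting string can be passed directly as the BashOperator's bash_command, and the same pattern applies to the SELECT ... INTO OUTFILE task.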
Execute and inspect the DAG:
# List tasks in the DAG
airflow tasks list test_bytehouse
# Show task hierarchy
airflow tasks list test_bytehouse --tree
After execution, verify query history and data load results in the ByteHouse console.
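Beyond the ByteHouse console, the exported file itself can be sanity-checked. A minimal sketch: the path mirrors the DAG above, and the sample rows here are synthetic stand-ins, since the real export depends on your data.

```python
import csv
import io

def row_count(csv_text: str) -> int:
    """Count non-empty data rows in a ByteHouse CSV export."""
    return sum(1 for row in csv.reader(io.StringIO(csv_text)) if row)

# Synthetic stand-in for the contents of /opt/bytehousecli/dataout.csv.
sample = "UMTS,262\nGSM,310\n"
print(row_count(sample))
```

With the real file, read it with open("/opt/bytehousecli/dataout.csv") and confirm the count matches the LIMIT 10 in the select task.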
About ByteHouse: ByteHouse is ByteDance’s cloud‑native data warehouse built on ClickHouse, offering separated storage and compute, multi‑tenant management, and high scalability and performance. It powers thousands of internal clusters and many external customers.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.