How to Seamlessly Integrate ByteHouse Cloud Data Warehouse with Apache Airflow
This guide explains how to combine ByteHouse's cloud‑native data warehouse with Apache Airflow to build scalable, automated, and easy‑to‑manage data pipelines, covering business scenarios, data flow, and step‑by‑step installation and DAG creation.
Scalable, Reliable Data Pipelines
Apache Airflow provides a powerful platform for designing and orchestrating data workflows, while ByteHouse’s cloud‑native data warehouse efficiently stores and processes large volumes of data, ensuring scalability and reliability.
Automated Workflow Management
Airflow’s Python-defined DAGs, visualized in its web UI, simplify creating and scheduling ETL processes. Integrated with ByteHouse, Airflow automates extraction, transformation, and loading, reducing manual effort and boosting data-management efficiency.
Simple Deployment and Management
Both Airflow and ByteHouse are designed for straightforward deployment. Airflow can run on‑premise or in the cloud, and ByteHouse offers a fully managed cloud‑native solution, making infrastructure setup and maintenance seamless.
Business Scenario
Data Insight Ltd., an analytics firm in e‑commerce, uses Apache Airflow as its pipeline orchestrator and ByteHouse as its warehouse to leverage powerful analytics and machine‑learning capabilities.
Data Flow
The company loads large customer and transaction datasets from AWS S3 into ByteHouse via Airflow‑triggered pipelines, either on a schedule or when new files appear. Airflow retrieves files from S3, authenticates securely, transforms the data, and loads it into ByteHouse using predefined schemas. Once loaded, analysts run SQL‑like queries, generate reports, and build interactive dashboards and predictive models.
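The trigger-on-new-files pattern described above can be sketched in plain Python, independent of Airflow (a minimal illustration; the key names and the `seen` set are hypothetical, standing in for a real S3 bucket listing and a record of already-loaded files):

```python
def new_keys(current_listing, seen):
    """Return S3 object keys that have not yet been loaded into ByteHouse.

    current_listing: iterable of object keys from the bucket
    seen: set of keys already processed by earlier pipeline runs
    """
    return sorted(k for k in current_listing if k not in seen)

# Example: two files already loaded, one new file appears in the bucket
seen = {"exports/2024-01-01.csv", "exports/2024-01-02.csv"}
listing = ["exports/2024-01-01.csv", "exports/2024-01-02.csv", "exports/2024-01-03.csv"]
print(new_keys(listing, seen))  # ['exports/2024-01-03.csv']
```

A scheduled DAG would run a check like this each interval and only proceed to the load step when the result is non-empty.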
Summary
By integrating Apache Airflow with ByteHouse, Data Insight Ltd. achieves a smooth, automated workflow from S3 to the warehouse, fully exploiting ByteHouse’s analytics, machine‑learning, and dashboard features to drive data‑driven decisions.
Step 1: Prerequisites
Install pip in your virtual/local environment and install the ByteHouse CLI, then log in to your ByteHouse account. Example for macOS using Homebrew:
<code>brew install bytehouse-cli</code>

Step 2: Install Apache Airflow
Install Airflow via pip:
<code># Airflow requires a directory, ~/airflow is default
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.1.3
PYTHON_VERSION="$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"</code>

After installation, run <code>airflow info</code> for details about the environment.
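The shell pipeline above derives the constraint URL from the running Python's "major.minor" version; the same logic can be reproduced in Python, which makes the version handling explicit (a sketch, using the same version numbers as the command above):

```python
import sys

AIRFLOW_VERSION = "2.1.3"
# Same "major.minor" string the shell pipeline extracts from `python --version`
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{python_version}.txt"
)
print(constraint_url)
```

Pinning against this constraint file keeps Airflow's transitive dependencies at versions the release was tested with.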
Step 3: Initialize Airflow
Initialize the metadata database and create an admin user, then start the web server (default port 8080):
<code># Initialize database
airflow db init
airflow users create \
--username admin \
--firstname admin \
--lastname admin \
--role Admin \
--email admin@example.com
# Start web server
airflow webserver --port 8080</code>

Access the UI at http://localhost:8080/ and log in with the credentials created above.
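Once the web server is running, Airflow exposes a /health endpoint; a small check like the following can confirm it is reachable before moving on (a sketch using only the standard library; the host and port match the command above):

```python
import urllib.request
import urllib.error

def webserver_up(url="http://localhost:8080/health", timeout=3):
    """Return True if the Airflow webserver answers on its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: server not reachable
        return False

print(webserver_up())
```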
Step 4: Configuration
Navigate to the Airflow folder (<code>cd ~/airflow</code>) and edit <code>airflow.cfg</code> to connect to a database (SQLite by default, MySQL optional):
<code># Default is SQLite; can also use MySQL
sql_alchemy_conn = mysql+pymysql://airflow:airflow@127.0.0.1:3306/airflow
# sql_alchemy_pool_enabled = False
# dags_folder = /home/admin/airflow/dags</code>

Step 5: Create a DAG
Create a <code>dags</code> directory and a file <code>test_bytehouse.py</code> with the following content:
<code>from datetime import timedelta
from textwrap import dedent
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'test_bytehouse',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    tags=['example'],
) as dag:
    tImport = BashOperator(
        task_id='ch_import',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "INSERT INTO korver.cell_towers_1 FORMAT csv INFILE \'/opt/bytehousecli/data.csv\'"',
    )
    tSelect = BashOperator(
        task_id='ch_select',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "select * from korver.cell_towers_1 limit 10 into outfile \'/opt/bytehousecli/dataout.csv\' format csv"',
    )
    # Load the CSV into ByteHouse first, then run the query against it
    tImport >> tSelect
</code>

Run <code>python test_bytehouse.py</code> to check that the file parses without errors; Airflow then picks the DAG up from the <code>dags</code> folder. Refresh the Airflow UI to see it.
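The <code>>></code> operator only records dependencies; Airflow's scheduler then runs tasks in an order that respects them, so ch_select runs after ch_import (the load must precede the query). Conceptually this is a topological sort, which can be sketched in plain Python (an illustration only, not Airflow's actual scheduler code):

```python
def topo_order(deps):
    """Return task ids so that each task comes after all of its upstream deps.

    deps maps task_id -> list of upstream task_ids it depends on.
    """
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for upstream in deps[task]:
            visit(upstream)  # ensure upstream tasks are placed first
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# ch_import >> ch_select means ch_select depends on ch_import
deps = {"ch_import": [], "ch_select": ["ch_import"]}
print(topo_order(deps))  # ['ch_import', 'ch_select']
```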
Step 6: Execute the DAG
List tasks and view the hierarchy:
<code># List tasks
airflow tasks list test_bytehouse
# Show the task hierarchy
airflow tasks list test_bytehouse --tree
# Trigger a run of the DAG
airflow dags trigger test_bytehouse</code>

After execution, verify the query history and data-load results in the ByteHouse console.
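Beyond the ByteHouse console, the CSV exported by the ch_select task can be spot-checked locally; a small sketch with the standard csv module (the sample rows below are made up, standing in for the contents of the exported file):

```python
import csv
import io

def row_count(csv_text):
    """Count data rows in a CSV export (no header line, matching FORMAT csv)."""
    return sum(1 for _ in csv.reader(io.StringIO(csv_text)))

# Hypothetical sample resembling exported cell_towers rows
sample = "262,GSM,123\n262,UMTS,456\n"
print(row_count(sample))  # 2
```

Since the DAG's select is limited to 10 rows, a count between 1 and 10 suggests the export succeeded.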
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.