How to Seamlessly Integrate ByteHouse Cloud Data Warehouse with Apache Airflow
This guide explains how to combine ByteHouse's cloud‑native data warehouse with Apache Airflow to build scalable, automated, and easy‑to‑manage data pipelines, covering business scenarios, data flow, and step‑by‑step installation and DAG creation.
Scalable, Reliable Data Pipelines
Apache Airflow provides a powerful platform for designing and orchestrating data workflows, while ByteHouse’s cloud‑native data warehouse efficiently stores and processes large volumes of data, ensuring scalability and reliability.
Automated Workflow Management
Airflow’s Python-defined DAGs, visualized in its web UI, simplify creating and scheduling ETL processes. Integrated with ByteHouse, Airflow automates extraction, transformation, and loading, reducing manual effort and boosting data-management efficiency.
Simple Deployment and Management
Both Airflow and ByteHouse are designed for straightforward deployment. Airflow can run on‑premise or in the cloud, and ByteHouse offers a fully managed cloud‑native solution, making infrastructure setup and maintenance seamless.
Business Scenario
Data Insight Ltd., an analytics firm in e‑commerce, uses Apache Airflow as its pipeline orchestrator and ByteHouse as its warehouse to leverage powerful analytics and machine‑learning capabilities.
Data Flow
The company loads large customer and transaction datasets from AWS S3 into ByteHouse via Airflow‑triggered pipelines, either on a schedule or when new files appear. Airflow retrieves files from S3, authenticates securely, transforms the data, and loads it into ByteHouse using predefined schemas. Once loaded, analysts run SQL‑like queries, generate reports, and build interactive dashboards and predictive models.
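The trigger-on-new-files pattern described above can be sketched in plain Python, independent of Airflow (a minimal illustration; the key names and the `seen` set are hypothetical, standing in for a real S3 bucket listing and a record of already-loaded files):

```python
def new_keys(current_listing, seen):
    """Return S3 object keys that have not yet been loaded into ByteHouse.

    current_listing: iterable of object keys from the bucket
    seen: set of keys already processed by earlier pipeline runs
    """
    return sorted(k for k in current_listing if k not in seen)

# Example: two files already loaded, one new file appears in the bucket
seen = {"exports/2024-01-01.csv", "exports/2024-01-02.csv"}
listing = ["exports/2024-01-01.csv", "exports/2024-01-02.csv", "exports/2024-01-03.csv"]
print(new_keys(listing, seen))  # ['exports/2024-01-03.csv']
```

A scheduled DAG would run a check like this each interval and only proceed to the load step when the result is non-empty.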
Summary
By integrating Apache Airflow with ByteHouse, Data Insight Ltd. achieves a smooth, automated workflow from S3 to the warehouse, fully exploiting ByteHouse’s analytics, machine‑learning, and dashboard features to drive data‑driven decisions.
Step 1: Prerequisites
Install pip in your virtual/local environment and install the ByteHouse CLI, then log in to your ByteHouse account. Example for macOS using Homebrew:
<code>brew install bytehouse-cli</code>

Step 2: Install Apache Airflow
Install Airflow via pip:
<code># Airflow requires a directory, ~/airflow is default
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.1.3
PYTHON_VERSION="$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"</code>

After installation, run <code>airflow info</code> for details about the environment.
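The shell pipeline above derives the constraint URL from the running Python's "major.minor" version; the same logic can be reproduced in Python, which makes the version handling explicit (a sketch, using the same version numbers as the command above):

```python
import sys

AIRFLOW_VERSION = "2.1.3"
# Same "major.minor" string the shell pipeline extracts from `python --version`
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{python_version}.txt"
)
print(constraint_url)
```

Pinning against this constraint file keeps Airflow's transitive dependencies at versions the release was tested with.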
Step 3: Initialize Airflow
Initialize the metadata database and create an admin user, then start the web server (default port 8080):
<code># Initialize database
airflow db init
airflow users create \
--username admin \
--firstname admin \
--lastname admin \
--role Admin \
--email admin@example.com
# Start web server
airflow webserver --port 8080</code>

Access the UI at http://localhost:8080/ and log in with the credentials created above.
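Once the web server is running, Airflow exposes a /health endpoint; a small check like the following can confirm it is reachable before moving on (a sketch using only the standard library; the host and port match the command above):

```python
import urllib.request
import urllib.error

def webserver_up(url="http://localhost:8080/health", timeout=3):
    """Return True if the Airflow webserver answers on its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: server not reachable
        return False

print(webserver_up())
```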
Step 4: Configuration
Navigate to the Airflow folder (<code>cd ~/airflow</code>) and edit <code>airflow.cfg</code> to connect to a database (SQLite by default, MySQL optional):
<code># Default is SQLite; can also use MySQL
sql_alchemy_conn = mysql+pymysql://airflow:airflow@127.0.0.1:3306/airflow
# sql_alchemy_pool_enabled = False
# dags_folder = /home/admin/airflow/dags</code>

Step 5: Create a DAG
Create a <code>dags</code> directory and a file <code>test_bytehouse.py</code> with the following content:
<code>from datetime import timedelta
from textwrap import dedent
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'test_bytehouse',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    tags=['example'],
) as dag:
    tImport = BashOperator(
        task_id='ch_import',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "INSERT INTO korver.cell_towers_1 FORMAT csv INFILE \'/opt/bytehousecli/data.csv\'"',
    )
    tSelect = BashOperator(
        task_id='ch_select',
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "select * from korver.cell_towers_1 limit 10 into outfile \'/opt/bytehousecli/dataout.csv\' format csv"',
    )
    # Load the CSV into ByteHouse first, then run the query against it
    tImport >> tSelect
</code>

Run <code>python test_bytehouse.py</code> to check that the file parses without errors; Airflow then picks the DAG up from the <code>dags</code> folder. Refresh the Airflow UI to see it.
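The <code>>></code> operator only records dependencies; Airflow's scheduler then runs tasks in an order that respects them, so ch_select runs after ch_import (the load must precede the query). Conceptually this is a topological sort, which can be sketched in plain Python (an illustration only, not Airflow's actual scheduler code):

```python
def topo_order(deps):
    """Return task ids so that each task comes after all of its upstream deps.

    deps maps task_id -> list of upstream task_ids it depends on.
    """
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for upstream in deps[task]:
            visit(upstream)  # ensure upstream tasks are placed first
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# ch_import >> ch_select means ch_select depends on ch_import
deps = {"ch_import": [], "ch_select": ["ch_import"]}
print(topo_order(deps))  # ['ch_import', 'ch_select']
```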
Step 6: Execute the DAG
List tasks and view the hierarchy:
<code># List tasks
airflow tasks list test_bytehouse
# Show the task hierarchy
airflow tasks list test_bytehouse --tree
# Trigger a run of the DAG
airflow dags trigger test_bytehouse</code>

After execution, verify the query history and data-load results in the ByteHouse console.
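Beyond the ByteHouse console, the CSV exported by the ch_select task can be spot-checked locally; a small sketch with the standard csv module (the sample rows below are made up, standing in for the contents of the exported file):

```python
import csv
import io

def row_count(csv_text):
    """Count data rows in a CSV export (no header line, matching FORMAT csv)."""
    return sum(1 for _ in csv.reader(io.StringIO(csv_text)))

# Hypothetical sample resembling exported cell_towers rows
sample = "262,GSM,123\n262,UMTS,456\n"
print(row_count(sample))  # 2
```

Since the DAG's select is limited to 10 rows, a count between 1 and 10 suggests the export succeeded.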
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.