Big Data 10 min read

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.

Big Data Technology & Architecture

Mar 22, 2022

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

Background – The company stores analytical data in a Hive data warehouse and uses Presto for OLAP queries, but rising real‑time requirements demand faster query performance. ClickHouse is introduced for its high‑performance analytics, and Seatunnel is used to bridge Hive and ClickHouse.

01 Environment Preparation

Recommended versions: seatunnel‑1.5.7, spark‑2.4.8, scala‑2.11. Install all packages under /u/module:

unzip /u/software/19.Seatunnel/seatunnel-1.5.7.zip -d /u/module/

tar -zxvf /u/software/19.Seatunnel/spark-2.4.8-bin-hadoop2.7.tgz -C /u/module

tar -zxvf /u/software/19.Seatunnel/scala-2.11.8.tgz -C /u/module

cp $HADOOP_CONF/hive-site.xml /u/module/spark-2.4.8-bin-hadoop2.7/conf

# Set Spark2 environment variables (example)
export SPARK2_HOME=/u/module/spark-2.4.8-bin-hadoop2.7
export SCALA_HOME=/u/module/scala-2.11.8/
export PATH=$PATH:$SPARK2_HOME/bin:$SCALA_HOME/bin

source /etc/profile

Create a jobs directory to store Seatunnel configuration files: mkdir /u/module/seatunnel-1.5.7/jobs 02 Data Preparation

Hive – Create a partitioned Parquet table:

drop table if exists prod_info;
create table prod_info (
    prod_sn    string comment 'sn',
    create_time string comment '创建时间'
) COMMENT '产品信息表'
PARTITIONED BY (dt string)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "lzo");

Insert sample data:

insert into prod_info values ('F0001','2022-01-18 00:00:00.0','2022-01-18');
insert into prod_info values ('F00012','2022-01-19 00:00:00.0','2022-01-19');

ClickHouse – Create a matching table:

drop table if exists prod_info;
create table prod_info (
    prod_sn    String,
    create_time DateTime
) engine = MergeTree
    partition by toYYYYMMDD(create_time)
    primary key (prod_sn)
    ORDER BY (prod_sn);

03 Full and Incremental Data Import to ClickHouse

Write a Bash script ( mytest.sh) that generates a Seatunnel configuration file on the fly and runs Seatunnel. The script supports three modes: first (full import), all (incremental import), and tmp (debug).

#!/bin/bash
# Environment variables
unset SPARK_HOME
export SPARK_HOME=$SPARK2_HOME
SEATUNNEL_HOME=/u/module/seatunnel-1.5.7
CLICKHOUSE_CLIENT=/usr/bin/clickhouse-client

# Parameter handling (first / all / other)
... (parameter parsing logic) ...

import_conf(){
    cat > $SEATUNNEL_HOME/jobs/hive2ck_test.conf <<!EOF
spark {
    spark.sql.catalogImplementation = "hive"
    spark.app.name = "hive2clickhouse"
    spark.executor.instances = 4
    spark.executor.cores = 4
    spark.executor.memory = "4g"
    spark.sql.hive.manageFilesourcePartitions = false
}

input {
    hive {
        pre_sql = "$1"
        table_name = "$2"
    }
}

filter {}

output {
    clickhouse {
        host = "$3"
        database = "$4"
        table = "$5"
        fields = $6
        username = "default"
        password = ""
    }
}
!EOF
    $SEATUNNEL_HOME/bin/start-seatunnel.sh --config $SEATUNNEL_HOME/jobs/hive2ck_test.conf -e client -m 'local[4]'
}

import_prod_info_first(){
    $CLICKHOUSE_CLIENT --host hadoop101 --database test --query "truncate table test.prod_info"
    import_conf "select prod_sn,substring(create_time,1,19) as create_time from default.prod_info" "prod_info" "hadoop101:8123" "test" "prod_info" "[\"prod_sn\",\"create_time\"]"
}

import_prod_info(){
    do_date_2=$(echo $do_date | sed 's/-//g')
    $CLICKHOUSE_CLIENT --host hadoop101 --database test --query "alter table test.prod_info drop partition '$${do_date_2}'"
    import_conf "select prod_sn,substring(create_time,1,19) as create_time from default.prod_info where dt='$${do_date}'" "prod_info" "hadoop101:8123" "test" "prod_info" "[\"prod_sn\",\"create_time\"]"
}

case $1 in
    "prod_info_first") import_prod_info_first ;;
    "prod_info") import_prod_info ;;
    "first") import_prod_info_first ;;
    "all") import_prod_info ;;
    "tmp") import_prod_info ;;
esac

03.1 Full Import (first day)

[hadoop@hadoop101 bin]$ mytest.sh first 2022-01-19

Verify data in ClickHouse and view partitions using:

select * from system.parts p where table = 'prod_info' order by partition desc;

03.2 Daily Incremental Import

Insert a new record into Hive for testing:

hive> insert into prod_info values ('F000123','2022-01-20 00:00:00.0','2022-01-20');

Run the incremental script:

[hadoop@hadoop101 bin]$ mytest.sh all 2022-01-20

The script correctly clears the target partition before loading, ensuring no duplicate data.

04 Summary

The guide demonstrates how to build a production‑grade Seatunnel script that connects Hive and ClickHouse, enabling fast, reliable data transfer for both full and incremental loads, and can be orchestrated with schedulers such as Dolphin Scheduler or Azkaban.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data ClickHouse Hive Data Integration Spark SeaTunnel

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.