Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide
This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.
Background – The company stores analytical data in a Hive data warehouse and uses Presto for OLAP queries, but rising real‑time requirements demand faster query performance. ClickHouse is introduced for its high‑performance analytics, and Seatunnel is used to bridge Hive and ClickHouse.
01 Environment Preparation
Recommended versions: seatunnel‑1.5.7, spark‑2.4.8, scala‑2.11. Install all packages under /u/module:
unzip /u/software/19.Seatunnel/seatunnel-1.5.7.zip -d /u/module/
tar -zxvf /u/software/19.Seatunnel/spark-2.4.8-bin-hadoop2.7.tgz -C /u/module
tar -zxvf /u/software/19.Seatunnel/scala-2.11.8.tgz -C /u/module
cp $HADOOP_CONF/hive-site.xml /u/module/spark-2.4.8-bin-hadoop2.7/conf
# Set Spark2 environment variables (example)
export SPARK2_HOME=/u/module/spark-2.4.8-bin-hadoop2.7
export SCALA_HOME=/u/module/scala-2.11.8/
export PATH=$PATH:$SPARK2_HOME/bin:$SCALA_HOME/bin
source /etc/profileCreate a jobs directory to store Seatunnel configuration files: mkdir /u/module/seatunnel-1.5.7/jobs 02 Data Preparation
Hive – Create a partitioned Parquet table:
drop table if exists prod_info;
create table prod_info (
prod_sn string comment 'sn',
create_time string comment '创建时间'
) COMMENT '产品信息表'
PARTITIONED BY (dt string)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "lzo");Insert sample data:
insert into prod_info values ('F0001','2022-01-18 00:00:00.0','2022-01-18');
insert into prod_info values ('F00012','2022-01-19 00:00:00.0','2022-01-19');ClickHouse – Create a matching table:
drop table if exists prod_info;
create table prod_info (
prod_sn String,
create_time DateTime
) engine = MergeTree
partition by toYYYYMMDD(create_time)
primary key (prod_sn)
ORDER BY (prod_sn);03 Full and Incremental Data Import to ClickHouse
Write a Bash script ( mytest.sh) that generates a Seatunnel configuration file on the fly and runs Seatunnel. The script supports three modes: first (full import), all (incremental import), and tmp (debug).
#!/bin/bash
# Environment variables
unset SPARK_HOME
export SPARK_HOME=$SPARK2_HOME
SEATUNNEL_HOME=/u/module/seatunnel-1.5.7
CLICKHOUSE_CLIENT=/usr/bin/clickhouse-client
# Parameter handling (first / all / other)
... (parameter parsing logic) ...
import_conf(){
cat > $SEATUNNEL_HOME/jobs/hive2ck_test.conf <<!EOF
spark {
spark.sql.catalogImplementation = "hive"
spark.app.name = "hive2clickhouse"
spark.executor.instances = 4
spark.executor.cores = 4
spark.executor.memory = "4g"
spark.sql.hive.manageFilesourcePartitions = false
}
input {
hive {
pre_sql = "$1"
table_name = "$2"
}
}
filter {}
output {
clickhouse {
host = "$3"
database = "$4"
table = "$5"
fields = $6
username = "default"
password = ""
}
}
!EOF
$SEATUNNEL_HOME/bin/start-seatunnel.sh --config $SEATUNNEL_HOME/jobs/hive2ck_test.conf -e client -m 'local[4]'
}
import_prod_info_first(){
$CLICKHOUSE_CLIENT --host hadoop101 --database test --query "truncate table test.prod_info"
import_conf "select prod_sn,substring(create_time,1,19) as create_time from default.prod_info" "prod_info" "hadoop101:8123" "test" "prod_info" "[\"prod_sn\",\"create_time\"]"
}
import_prod_info(){
do_date_2=$(echo $do_date | sed 's/-//g')
$CLICKHOUSE_CLIENT --host hadoop101 --database test --query "alter table test.prod_info drop partition '$${do_date_2}'"
import_conf "select prod_sn,substring(create_time,1,19) as create_time from default.prod_info where dt='$${do_date}'" "prod_info" "hadoop101:8123" "test" "prod_info" "[\"prod_sn\",\"create_time\"]"
}
case $1 in
"prod_info_first") import_prod_info_first ;;
"prod_info") import_prod_info ;;
"first") import_prod_info_first ;;
"all") import_prod_info ;;
"tmp") import_prod_info ;;
esac03.1 Full Import (first day)
[hadoop@hadoop101 bin]$ mytest.sh first 2022-01-19Verify data in ClickHouse and view partitions using:
select * from system.parts p where table = 'prod_info' order by partition desc;03.2 Daily Incremental Import
Insert a new record into Hive for testing:
hive> insert into prod_info values ('F000123','2022-01-20 00:00:00.0','2022-01-20');Run the incremental script:
[hadoop@hadoop101 bin]$ mytest.sh all 2022-01-20The script correctly clears the target partition before loading, ensuring no duplicate data.
04 Summary
The guide demonstrates how to build a production‑grade Seatunnel script that connects Hive and ClickHouse, enabling fast, reliable data transfer for both full and incremental loads, and can be orchestrated with schedulers such as Dolphin Scheduler or Azkaban.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
