Comparative Analysis of Elasticsearch and ClickHouse with Deployment Guide for a Private Data Platform

This article compares Elasticsearch and ClickHouse on write throughput, query speed, and cost, then provides a detailed deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse, including configuration snippets and solutions to common setup issues.

Architecture Digest

Sep 4, 2022

SaaS offerings increasingly face data security and compliance requirements, so the ability to deploy privately has become important for staying competitive.

The platform also needs a data system for operational analysis and for measuring the effect of activities, without the heavy server overhead of a full big-data stack.

Elasticsearch vs ClickHouse

ClickHouse, a high‑performance columnar distributed DBMS, offers several advantages:

High write throughput: A single server ingests logs at 50 to 200 MB/s (more than 600k records/s), over five times the rate of Elasticsearch, with fewer write rejections.

Fast queries: Queries served from the page cache scan data at 2 to 30 GB/s; overall, queries run 5 to 30 times faster than on Elasticsearch.

Lower server cost: Better compression (data occupies 1/3 to 1/30 of the space it takes in Elasticsearch) reduces disk usage and I/O, and lower memory and CPU consumption can cut server costs roughly in half.
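As a rough illustration of the compression claim, the sketch below turns the quoted 1/3 to 1/30 ratio into a disk-size range. The function name and the 3000 GB input are hypothetical; no such figure appears in the article.

```python
from typing import Tuple


def clickhouse_disk_range_gb(es_disk_gb: float) -> Tuple[float, float]:
    """Return (best_case, worst_case) ClickHouse disk usage in GB.

    Applies the ratio quoted above: the same data occupies between
    1/30 and 1/3 of the space it takes in Elasticsearch.
    """
    if es_disk_gb < 0:
        raise ValueError("disk size must be non-negative")
    return es_disk_gb / 30, es_disk_gb / 3


if __name__ == "__main__":
    # Hypothetical input: 3000 GB of Elasticsearch indices.
    best, worst = clickhouse_disk_range_gb(3000)
    print(f"{best:.0f} GB to {worst:.0f} GB")
```

For the hypothetical 3000 GB input this prints a range of 100 GB to 1000 GB. Where a real data set falls within that range depends on the data itself.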

Images illustrate performance and cost comparisons.

Cost Analysis

Based on Aliyun pricing without discounts, the analysis shows significant savings when using ClickHouse.

Environment Deployment

1. Zookeeper Cluster Deployment

# Install the JDK, then configure the Java environment variables in /etc/profile
yum install java-1.8.0-openjdk-devel.x86_64

# Synchronize the clock on every node
yum install ntpdate
ntpdate asia.pool.ntp.org

# Create the install, data, and log directories (paths match zoo.cfg below)
mkdir -p /usr/zookeeper/data /usr/zookeeper/logs

# Download and unpack ZooKeeper 3.7.1
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz
tar -zvxf apache-zookeeper-3.7.1-bin.tar.gz -C /usr/zookeeper

# Add to /etc/profile
export ZOOKEEPER_HOME=/usr/zookeeper/apache-zookeeper-3.7.1-bin
export PATH=$ZOOKEEPER_HOME/bin:$PATH

# Create zoo.cfg with the following content
cd $ZOOKEEPER_HOME/conf
vi zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/zookeeper/data
dataLogDir=/usr/zookeeper/logs
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

# Write the node ID; run only the line that matches the current node
echo "1" > /usr/zookeeper/data/myid   # on zk1
echo "2" > /usr/zookeeper/data/myid   # on zk2
echo "3" > /usr/zookeeper/data/myid   # on zk3

# Start ZooKeeper on every node
cd $ZOOKEEPER_HOME/bin
sh zkServer.sh start
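The relationship between the server.N entries in zoo.cfg and each node's myid file is easy to get wrong, so here is a minimal sketch of it. The helper names are hypothetical; the host names and ports are the ones used in the configuration above.

```python
from typing import Dict, List


def render_server_lines(hosts: List[str], peer_port: int = 2888,
                        election_port: int = 3888) -> List[str]:
    """Build the server.N lines for zoo.cfg; N starts at 1."""
    return [f"server.{i}={host}:{peer_port}:{election_port}"
            for i, host in enumerate(hosts, start=1)]


def myid_by_host(hosts: List[str]) -> Dict[str, str]:
    """Map each host to the value its dataDir/myid file must contain."""
    return {host: str(i) for i, host in enumerate(hosts, start=1)}


def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of the ensemble."""
    return ensemble_size // 2 + 1


if __name__ == "__main__":
    hosts = ["zk1", "zk2", "zk3"]
    print("\n".join(render_server_lines(hosts)))
    print(myid_by_host(hosts))
    print("quorum:", quorum_size(len(hosts)))
```

The N in server.N must equal the number stored in that host's myid file. With three nodes the majority is two, so the ensemble keeps serving requests with one node down.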

2. Kafka Cluster Deployment

# Create the install directory, then download and unpack Kafka 3.2.0
mkdir -p /usr/kafka
chmod 777 -R /usr/kafka
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/3.2.0/kafka_2.12-3.2.0.tgz
tar -zvxf kafka_2.12-3.2.0.tgz -C /usr/kafka
export KAFKA_HOME=/usr/kafka/kafka_2.12-3.2.0

# config/server.properties (broker.id must be unique on each broker; replace ip with the broker's address)
broker.id=1
listeners=PLAINTEXT://ip:9092
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dir=/usr/kafka/logs
num.partitions=5
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=30000
group.initial.rebalance.delay.ms=0

# Start the broker in the background
nohup /usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-start.sh /usr/kafka/kafka_2.12-3.2.0/config/server.properties >/usr/kafka/logs/kafka.log 2>&1 &

# Stop the broker
/usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-stop.sh

# List topics
$KAFKA_HOME/bin/kafka-topics.sh --list --bootstrap-server ip:9092

# Consume a topic from the beginning
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server ip:9092 --topic test --from-beginning

# Create a topic
$KAFKA_HOME/bin/kafka-topics.sh --create --bootstrap-server ip:9092 --replication-factor 2 --partitions 3 --topic xxx_data
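The replication settings above interact with the number of brokers, and a small check makes that visible. This is only a sketch: the function names are hypothetical, and the three-broker count is an assumption taken from the kafka1 to kafka3 hosts used elsewhere in the article.

```python
from typing import List


def tolerated_replica_losses(replication_factor: int, min_isr: int) -> int:
    """Replicas that may drop out while acks=all writes still succeed."""
    if not 1 <= min_isr <= replication_factor:
        raise ValueError("need 1 <= min_isr <= replication_factor")
    return replication_factor - min_isr


def review(name: str, brokers: int, replication_factor: int,
           min_isr: int) -> List[str]:
    """Return human-readable notes about a replication setting."""
    notes = []
    if replication_factor > brokers:
        notes.append(f"{name}: replication factor {replication_factor} "
                     f"exceeds the {brokers} available brokers")
    if tolerated_replica_losses(replication_factor, min_isr) == 0:
        notes.append(f"{name}: min ISR equals the replication factor, "
                     "so one offline replica blocks acks=all writes")
    return notes


if __name__ == "__main__":
    # Values from server.properties above; brokers=3 is an assumption.
    for note in review("transaction.state.log", brokers=3,
                       replication_factor=3, min_isr=3):
        print(note)
```

With transaction.state.log.replication.factor=3 and transaction.state.log.min.isr=3, the check reports that the transaction state log cannot tolerate any offline replica. That only matters if producers use transactions; lowering the min ISR to 2 would allow one broker to be down.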

3. FileBeat Deployment

# Register the Elastic yum repository
sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
cat > /etc/yum.repos.d/elastic.repo <<EOF
[elastic-8.x]
name=Elastic repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF

# Install Filebeat and enable it at boot (systemctl on systemd hosts, chkconfig on SysV hosts)
yum install filebeat
systemctl enable filebeat
chkconfig --add filebeat

# filebeat.yml excerpt
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /root/logs/xxx/inner/*.log
  json:
    keys_under_root: true
output.kafka:
  hosts: ["kafka1:9092","kafka2:9092","kafka3:9092"]
  # Must be the same topic that the ClickHouse Kafka engine table consumes
  topic: 'xxx_data_clickhouse'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
processors:
  - drop_fields:
      fields: ["input","agent","ecs","log","metadata","timestamp"]

# Run Filebeat in the background
nohup ./filebeat -e -c /etc/filebeat/filebeat.yml > /user/filebeat/filebeat.log &
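Because Filebeat decodes each log line as JSON and ClickHouse later parses the messages as JSONEachRow, every log entry has to be a single-line JSON object. The sketch below shows one way an application could produce such a line. The field names and integer ranges come from the ClickHouse table definitions later in the article; the function name and the sample values are hypothetical.

```python
import json

# Integer columns and the ranges implied by their ClickHouse types.
INT_RANGES = {
    "date_partition": (0, 2**32 - 1),      # UInt32
    "credits_bring": (-2**15, 2**15 - 1),  # Int16
    "activity_id": (0, 2**16 - 1),         # UInt16
}
STRING_FIELDS = ("log_uuid", "event_name", "activity_name", "activity_type")


def make_log_line(record: dict) -> str:
    """Validate a record and serialize it as one line of JSON."""
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            raise ValueError(f"{name} must be a string")
    for name, (low, high) in INT_RANGES.items():
        value = record.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in [{low}, {high}]")
    # json.dumps without indent never emits a raw newline.
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    print(make_log_line({
        "log_uuid": "a1b2c3",
        "date_partition": 20220904,
        "event_name": "click",
        "activity_name": "demo",
        "credits_bring": 10,
        "activity_type": "promo",
        "activity_id": 7,
    }))
```

Validating ranges before writing the log keeps out-of-range values, such as an activity_id above 65535, from reaching the Kafka consumer, where they are harder to diagnose.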

4. ClickHouse Deployment

# Check that the CPU supports SSE 4.2
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

# Create the data directory
mkdir -p /data/clickhouse

# add host entries for clickhouse nodes
echo "10.190.85.92 bigdata-clickhouse-01" >> /etc/hosts
echo "10.190.85.93 bigdata-clickhouse-02" >> /etc/hosts

# Tune the OS: performance CPU governor, heuristic memory overcommit, transparent huge pages off
echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 0 | tee /proc/sys/vm/overcommit_memory
echo 'never' | tee /sys/kernel/mm/transparent_hugepage/enabled

# Install ClickHouse from the yum repository
yum install yum-utils
rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64
yum -y install clickhouse-server clickhouse-client

# modify /etc/clickhouse-server/config.xml to set <level>information</level>

# start/stop commands
sudo clickhouse stop
sudo clickhouse start
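The host checks above can also be verified programmatically before installing. The following is a minimal sketch with hypothetical helper names: it parses the text of /proc/cpuinfo for the sse4_2 flag and reads the active transparent huge pages mode, which the kernel marks with square brackets.

```python
from pathlib import Path


def has_sse42(cpuinfo_text: str) -> bool:
    """True if any 'flags' line in /proc/cpuinfo lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and "sse4_2" in value.split():
            return True
    return False


def active_bracketed_option(text: str) -> str:
    """Return the option in [brackets], e.g. 'never' for THP."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed option found")


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled")
    if cpuinfo.exists():
        print("SSE 4.2:", has_sse42(cpuinfo.read_text()))
    if thp.exists():
        print("THP mode:", active_bracketed_option(thp.read_text()))
```

After running the tuning commands, the THP mode is expected to read never. Note that values written through /proc and /sys do not persist across reboots on their own.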

ClickHouse Table Creation and Issues

The pipeline needs a Kafka engine table, a local replicated table, a distributed table, and a materialized view that connects them. Common errors during setup include missing shard and replica macros and authentication failures. The fixes are to pass --stream_like_engine_allow_direct_select 1 to clickhouse-client when querying the Kafka engine table directly, to configure a distinct shard name on each node, and to remove stale ZooKeeper nodes.
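The shard and replica macros are defined in each node's server configuration. The sketch below generates one such fragment per node. Several things here are assumptions rather than details from the article: that each of the two hosts is its own shard, the shard names 01 and 02, and the use of the host name as the replica name. The clickhouse root element is accepted by recent releases; older ones use yandex instead.

```python
import xml.etree.ElementTree as ET


def render_macros(shard: str, replica: str, root_tag: str = "clickhouse") -> str:
    """Return a config fragment defining the shard and replica macros."""
    root = ET.Element(root_tag)
    macros = ET.SubElement(root, "macros")
    ET.SubElement(macros, "shard").text = shard
    ET.SubElement(macros, "replica").text = replica
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed layout: one shard per host, replica named after the host.
    nodes = {
        "bigdata-clickhouse-01": "01",
        "bigdata-clickhouse-02": "02",
    }
    for host, shard in nodes.items():
        print(host, "->", render_macros(shard, host))
```

Each fragment would typically be saved as a file under /etc/clickhouse-server/config.d/ on the matching node. If two nodes that are meant to be separate shards share the same shard value, their replicated tables resolve to the same ZooKeeper path.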

CREATE TABLE default.kafka_clickhouse_inner_log ON CLUSTER clickhouse_cluster
(
    log_uuid String,
    date_partition UInt32,
    event_name String,
    activity_name String,
    credits_bring Int16,
    activity_type String,
    activity_id UInt16
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka1:9092,kafka2:9092,kafka3:9092',
    kafka_topic_list = 'data_clickhouse',
    kafka_group_name = 'clickhouse_xxx',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_num_consumers = 1;

CREATE TABLE default.bi_inner_log_local ON CLUSTER clickhouse_cluster
(
    log_uuid String,
    date_partition UInt32,
    event_name String,
    activity_name String,
    credits_bring Int16,
    activity_type String,
    activity_id UInt16
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/default/bi_inner_log_local/{shard}', '{replica}')
PARTITION BY date_partition
ORDER BY (event_name, date_partition, log_uuid)
SETTINGS index_granularity = 8192;
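The ReplacingMergeTree family removes rows that share the same sorting key, here (event_name, date_partition, log_uuid), and without a version column it keeps the most recently inserted one. The sketch below models that end state in plain Python; it illustrates the semantics only and is not ClickHouse's implementation, and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Sorting key taken from the ORDER BY clause above.
SORTING_KEY = ("event_name", "date_partition", "log_uuid")


def key_of(row: dict) -> Tuple:
    return tuple(row[column] for column in SORTING_KEY)


def replace_duplicates(rows: Iterable[dict]) -> List[dict]:
    """Keep the last-inserted row per sorting key, sorted by that key."""
    latest: Dict[Tuple, dict] = {}
    for row in rows:  # rows arrive in insertion order
        latest[key_of(row)] = row
    return sorted(latest.values(), key=key_of)


if __name__ == "__main__":
    rows = [
        {"event_name": "click", "date_partition": 20220904,
         "log_uuid": "u1", "credits_bring": 1},
        {"event_name": "click", "date_partition": 20220904,
         "log_uuid": "u1", "credits_bring": 5},  # same key, inserted later
        {"event_name": "view", "date_partition": 20220904,
         "log_uuid": "u2", "credits_bring": 2},
    ]
    for row in replace_duplicates(rows):
        print(row)
```

In ClickHouse the deduplication happens during background merges, so a query may still see duplicates until the relevant parts have been merged.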

CREATE TABLE default.bi_inner_log_all ON CLUSTER clickhouse_cluster
AS default.bi_inner_log_local
ENGINE = Distributed(clickhouse_cluster, default, bi_inner_log_local, xxHash32(log_uuid));
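The last argument of the Distributed engine is the sharding expression: the value modulo the total shard weight selects the shard, with each shard owning a contiguous range of remainders proportional to its weight. A minimal sketch of that selection follows. The two equal-weight shards are an assumption, and zlib.crc32 is only a stand-in hash, since xxHash32 is not in the Python standard library, so the shard numbers it produces will not match ClickHouse's.

```python
import zlib
from typing import List


def pick_shard(sharding_value: int, weights: List[int]) -> int:
    """Return the 0-based index of the shard that receives the row."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    remainder = sharding_value % total
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight  # shard owns remainders in [upper - weight, upper)
        if remainder < upper:
            return index
    raise AssertionError("unreachable: remainder is always below total")


def stand_in_hash(log_uuid: str) -> int:
    """Stand-in for xxHash32(log_uuid); NOT the same hash function."""
    return zlib.crc32(log_uuid.encode("utf-8"))


if __name__ == "__main__":
    weights = [1, 1]  # assumed: two shards with equal weight
    for uuid in ("u1", "u2", "u3"):
        print(uuid, "-> shard", pick_shard(stand_in_hash(uuid), weights))
```

Sharding on a hash of log_uuid means every row with a given log_uuid lands on the same shard, which is what allows the local ReplacingMergeTree tables to collapse duplicates of that row.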

CREATE MATERIALIZED VIEW default.view_bi_inner_log ON CLUSTER clickhouse_cluster
TO default.bi_inner_log_all
AS SELECT
    log_uuid,
    date_partition,
    event_name,
    activity_name,
    credits_bring,
    activity_type,
    activity_id
FROM default.kafka_clickhouse_inner_log;

After resolving the listed issues, the data pipeline from Kafka to ClickHouse functions correctly, demonstrating a cost‑effective, high‑performance analytics solution.
