Building a Cost‑Effective Data Analysis Platform: ClickHouse vs Elasticsearch and Deployment Guide for Zookeeper, Kafka, Filebeat, and ClickHouse

This article compares Elasticsearch and ClickHouse for log analytics, presents cost‑benefit calculations, and provides a step‑by‑step deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse to build a scalable, low‑cost data analysis platform for SaaS services.

Selected Java Interview Questions

Oct 23, 2022

Background

SaaS services face growing data security and compliance requirements, so our product needs a private‑deployment option to stay competitive. Each private deployment still needs a data system for operational analysis, but a full‑blown big‑data stack would impose heavy server costs, so we chose a lighter, balanced solution.

Elasticsearch vs ClickHouse

ClickHouse is a high‑performance column‑oriented distributed DBMS. Our tests revealed the following advantages over Elasticsearch:

Write throughput: a single server can ingest 50‑200 MB/s (over 600 k records/s), more than 5× the throughput of Elasticsearch, with far fewer write rejections and latency spikes.

Query speed: ClickHouse can scan 2‑30 GB/s when the data resides in the page cache. When reading from disk it is 5‑30× faster than Elasticsearch, depending on compression.

Server cost: ClickHouse compresses data more aggressively, so the same data occupies 1/3 to 1/30 of the disk space Elasticsearch needs, which also reduces disk I/O. Combined with its lower memory and CPU consumption, this can cut server costs by roughly 50%.
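
To make the compression range concrete, the sketch below estimates the disk footprint for a retention window. Only the 1/3 to 1/30 ratio comes from the comparison above; the daily volume (100 GB) and the 30‑day retention are hypothetical inputs, and estimate_disk_gb is a helper name invented for this example.

```shell
#!/bin/sh
# Estimate ClickHouse disk usage from an Elasticsearch footprint.
# usage: estimate_disk_gb <es_daily_gb> <retention_days> <ratio_denominator>
# ratio_denominator=3 means ClickHouse needs 1/3 of the Elasticsearch space.
estimate_disk_gb() {
    es_daily_gb=$1
    retention_days=$2
    ratio_denominator=$3
    total=$((es_daily_gb * retention_days))
    # Integer division rounded up, so the estimate is never too small.
    echo $(( (total + ratio_denominator - 1) / ratio_denominator ))
}

# Hypothetical workload: 100 GB/day on Elasticsearch, kept for 30 days.
echo "Elasticsearch:     $((100 * 30)) GB"
echo "ClickHouse (1/3):  $(estimate_disk_gb 100 30 3) GB"
echo "ClickHouse (1/30): $(estimate_disk_gb 100 30 30) GB"
```

The real ratio depends heavily on the data and the codecs in use, so treat the two results as the ends of a planning range rather than a prediction.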

Cost Analysis

Cost estimates are based on Alibaba Cloud pricing without any discounts.

Environment Deployment

Zookeeper Cluster Deployment

Install Java and configure environment variables.

yum install java-1.8.0-openjdk-devel.x86_64
# Then configure the Java environment variables in /etc/profile

Synchronize system time.

yum install ntpdate
ntpdate asia.pool.ntp.org

Create the data and log directories, then download and extract Zookeeper:

mkdir -p /usr/zookeeper/data
mkdir -p /usr/zookeeper/logs

wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz
tar -zvxf apache-zookeeper-3.7.1-bin.tar.gz -C /usr/zookeeper

Set the Zookeeper environment variables (for example in /etc/profile):

export ZOOKEEPER_HOME=/usr/zookeeper/apache-zookeeper-3.7.1-bin
export PATH=$ZOOKEEPER_HOME/bin:$PATH

Enter the configuration directory and create zoo.cfg:

cd $ZOOKEEPER_HOME/conf

vi zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/zookeeper/data
dataLogDir=/usr/zookeeper/logs
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
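
Because zoo.cfg must be identical on every node, it can help to generate it rather than edit it by hand. The following is a minimal sketch that prints the configuration above for any list of ensemble hostnames; render_zoo_cfg is a hypothetical helper, not part of Zookeeper.

```shell
#!/bin/sh
# Print a zoo.cfg for the given ensemble members (one hostname per argument).
# The fixed values mirror the configuration shown above.
render_zoo_cfg() {
    cat <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/zookeeper/data
dataLogDir=/usr/zookeeper/logs
clientPort=2181
EOF
    id=1
    for host in "$@"; do
        # 2888 is the quorum port, 3888 the leader-election port.
        echo "server.${id}=${host}:2888:3888"
        id=$((id + 1))
    done
}

render_zoo_cfg zk1 zk2 zk3
```

On a real node the output would be redirected into the configuration file, for example render_zoo_cfg zk1 zk2 zk3 > "$ZOOKEEPER_HOME/conf/zoo.cfg".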

Create a myid file on each node:

# on the first node
echo "1" > /usr/zookeeper/data/myid
# on the second node
echo "2" > /usr/zookeeper/data/myid
# on the third node
echo "3" > /usr/zookeeper/data/myid
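
The number in myid has to match the server.N line for that host in zoo.cfg. As a sketch, the id can be looked up from the configuration instead of typed on each node; myid_for_host is a hypothetical helper, and it assumes the hostname passed in is spelled exactly as in zoo.cfg.

```shell
#!/bin/sh
# Look up a node's id in zoo.cfg by hostname.
# usage: myid_for_host <zoo.cfg path> <hostname>
# Exits non-zero when no server.N line names that host.
myid_for_host() {
    cfg=$1
    host=$2
    awk -F= -v h="$host" '
        /^server\.[0-9]+=/ {
            split($2, a, ":")            # $2 looks like "zk2:2888:3888"
            if (a[1] == h) {
                sub(/^server\./, "", $1) # "server.2" becomes "2"
                print $1
                found = 1
            }
        }
        END { exit found ? 0 : 1 }' "$cfg"
}

# Demonstration against a throwaway file.
tmp=$(mktemp)
printf 'clientPort=2181\nserver.1=zk1:2888:3888\nserver.2=zk2:2888:3888\n' > "$tmp"
myid_for_host "$tmp" zk2   # prints 2
rm -f "$tmp"
```

On a node whose hostname matches its zoo.cfg entry, this could be used as myid_for_host "$ZOOKEEPER_HOME/conf/zoo.cfg" "$(hostname)" > /usr/zookeeper/data/myid.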

Start Zookeeper:

cd $ZOOKEEPER_HOME/bin
sh zkServer.sh start
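
After starting all three nodes, zkServer.sh status reports each node's role on a line beginning with "Mode:". The sketch below extracts that role so a script can check that the ensemble has formed; zk_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the role from `zkServer.sh status` output read on stdin.
# Prints leader, follower, or standalone; fails if no "Mode:" line is present.
zk_mode() {
    sed -n 's/^Mode: *//p' | grep .
}

# Demonstration with sample status output.
printf 'Using config: /usr/zookeeper/conf/zoo.cfg\nMode: follower\n' | zk_mode   # prints follower
```

Against a live node it would be fed from the real command, for example sh "$ZOOKEEPER_HOME/bin/zkServer.sh" status 2>&1 | zk_mode. A healthy three‑node ensemble should show one leader and two followers.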

Kafka Cluster Deployment

mkdir -p /usr/kafka
chmod 777 -R /usr/kafka
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/3.2.0/kafka_2.12-3.2.0.tgz
tar -zvxf kafka_2.12-3.2.0.tgz -C /usr/kafka

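Configure each broker in config/server.properties. The example below is for broker.id=1; broker.id must be unique per broker, and listeners must use that broker's own address: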

broker.id=1
listeners=PLAINTEXT://ip:9092
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dir=/usr/kafka/logs
num.partitions=5
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=30000
group.initial.rebalance.delay.ms=0
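
Only broker.id and listeners differ between brokers, so one template can serve the whole cluster. The following sketch rewrites those two keys with sed; render_broker_config is a hypothetical helper, and the kafka2 hostname is an assumption borrowed from the Filebeat hosts list later in this guide.

```shell
#!/bin/sh
# Rewrite broker.id and listeners in a server.properties template read on stdin.
# usage: render_broker_config <broker_id> <host> < template > output
render_broker_config() {
    broker_id=$1
    host=$2
    sed -e "s|^broker\.id=.*|broker.id=${broker_id}|" \
        -e "s|^listeners=.*|listeners=PLAINTEXT://${host}:9092|"
}

# Demonstration on a three-line excerpt of the template above.
printf 'broker.id=1\nlisteners=PLAINTEXT://ip:9092\nnum.partitions=5\n' |
    render_broker_config 2 kafka2
```

All other lines pass through unchanged, so the shared settings stay in a single file.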

Run Kafka as a background daemon:

nohup /usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-start.sh /usr/kafka/kafka_2.12-3.2.0/config/server.properties > /usr/kafka/logs/kafka.log 2>&1 &

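Stop Kafka:

/usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-stop.sh

Common topic commands (KAFKA_HOME points at the directory extracted above):

export KAFKA_HOME=/usr/kafka/kafka_2.12-3.2.0
# list topics
$KAFKA_HOME/bin/kafka-topics.sh --list --bootstrap-server ip:9092
# consume a topic from the beginning
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server ip:9092 --topic test --from-beginning
# create a topic with 3 partitions and 2 replicas
$KAFKA_HOME/bin/kafka-topics.sh --create --bootstrap-server ip:9092 --replication-factor 2 --partitions 3 --topic xxx_data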

FileBeat Deployment

sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
# Create elastic.repo in /etc/yum.repos.d/
[elastic-8.x]
name=Elastic repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

yum install filebeat
systemctl enable filebeat
chkconfig --add filebeat

Key FileBeat configuration. Set json.keys_under_root: true so that the fields of each JSON log line are placed at the top level of the event sent to Kafka rather than nested under a json key:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /root/logs/xxx/inner/*.log
  json:
    keys_under_root: true
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: 'xxx_data_clickhouse'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
processors:
- drop_fields:
    fields: ["input", "agent", "ecs", "log", "metadata", "timestamp"]
    ignore_missing: false
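
The json options above decode each log line as one JSON object, so the application has to write exactly one object per line. The sketch below emits a line in that shape. The field names (ts, event, tenant) and the log_event helper are hypothetical and only illustrate the format.

```shell
#!/bin/sh
# Emit one JSON object per line, the shape Filebeat's json decoding expects.
# Field names are hypothetical. Values must not contain quotes or backslashes,
# because this sketch does no JSON escaping.
log_event() {
    event=$1
    tenant=$2
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf '{"ts":"%s","event":"%s","tenant":"%s"}\n' "$ts" "$event" "$tenant"
}

log_event page_view tenant_42
```

Appending such lines to a file matched by the paths setting gives Filebeat something to ship. Before starting the service, filebeat test config -c /etc/filebeat/filebeat.yml can be used to check the YAML itself.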

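Start FileBeat in the background. The -e flag sends logs to stderr, so stderr must be redirected to the file as well:

nohup ./filebeat -e -c /etc/filebeat/filebeat.yml > /user/filebeat/filebeat.log 2>&1 &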

ClickHouse Deployment

Check CPU for SSE 4.2 support:

grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

Create a data directory on a high‑capacity disk:

mkdir -p /data/clickhouse

Add ClickHouse host entries to /etc/hosts:

10.190.85.92 bigdata-clickhouse-01
10.190.85.93 bigdata-clickhouse-02
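
When the host entries are added by a provisioning script that may run more than once, it is worth making the step idempotent. This is a minimal sketch under that assumption; add_host_entry is a hypothetical helper, and the demonstration writes to a temporary file rather than /etc/hosts.

```shell
#!/bin/sh
# Append "ip hostname" to a hosts file only if the hostname is not mapped yet.
# usage: add_host_entry <hosts_file> <ip> <hostname>
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # Look for the hostname as a whole field on any non-comment line.
    if awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }' "$hosts_file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Demonstration: the second call for the same host adds nothing.
tmp=$(mktemp)
add_host_entry "$tmp" 10.190.85.92 bigdata-clickhouse-01
add_host_entry "$tmp" 10.190.85.93 bigdata-clickhouse-02
add_host_entry "$tmp" 10.190.85.92 bigdata-clickhouse-01
cat "$tmp"
rm -f "$tmp"
```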

Optimize server performance:

echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

echo 0 | tee /proc/sys/vm/overcommit_memory

echo 'never' | tee /sys/kernel/mm/transparent_hugepage/enabled
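
Values written through /sys and /proc this way do not survive a reboot, so it is useful to be able to check them. The transparent hugepage file marks the active mode with square brackets, and the sketch below extracts it; thp_active_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Print the active transparent-hugepage mode from text like "always madvise [never]".
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

echo 'always madvise [never]' | thp_active_mode   # prints: never
```

On the server itself the check would read the real file: thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled.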

Install ClickHouse from the official repository:

yum install yum-utils
rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64

yum list | grep clickhouse

yum -y install clickhouse-server clickhouse-client

Set log level to information in /etc/clickhouse-server/config.xml:

<level>information</level>

Log locations:

Normal log: /var/log/clickhouse-server/clickhouse-server.log

Error log: /var/log/clickhouse-server/clickhouse-server.err.log

Verify the ClickHouse version, connect with the client, and manage the service:

clickhouse-server --version
clickhouse-client --password

sudo clickhouse stop
sudo clickhouse start
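
After a restart, a quick scan of the server log shows whether startup was clean. ClickHouse tags each log line with its level in angle brackets, such as <Error>. The sketch below counts the serious entries; count_errors is a hypothetical helper, and the sample lines are made up for the demonstration.

```shell
#!/bin/sh
# Count entries at <Error> level or worse in a ClickHouse server log.
# usage: count_errors <log_file>
# Note: grep -c exits non-zero when the count is 0, but still prints 0.
count_errors() {
    grep -c -E '<(Error|Critical|Fatal)>' "$1"
}

# Demonstration on two made-up log lines.
tmp=$(mktemp)
printf '%s\n' \
    '2022.10.23 12:00:00.000000 [ 1 ] {} <Information> Application: Ready for connections.' \
    '2022.10.23 12:00:01.000000 [ 1 ] {} <Error> Application: sample error line' > "$tmp"
count_errors "$tmp"   # prints 1
rm -f "$tmp"
```

On the server it would be pointed at /var/log/clickhouse-server/clickhouse-server.err.log.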

Conclusion

The deployment process involved many pitfalls, especially the FileBeat yml parameters. I will publish a follow‑up article detailing ClickHouse configuration issues. Beyond the technical work, continuous learning and writing up what you learn remain essential for building a personal moat, whether as a technical expert, architect, or manager.

If your company lacks strong industry influence, staying on the front line and later seeking new opportunities can be a pragmatic path; consider industry impact, commercial sense, and architectural skills when planning your career.
