Understanding ClickHouse: From OLAP Basics to Advanced Table Engines and Deployment
This guide explains ClickHouse fundamentals, OLAP versus OLTP concepts, columnar storage benefits, core performance techniques, the MergeTree family and its indexing, specialized table engines, installation on Linux, Docker deployment, and integration with HDFS, MySQL, and Kafka for modern analytical workloads.
What is ClickHouse?
ClickHouse is an open‑source column‑oriented DBMS for online analytical processing (OLAP) developed by Yandex and released in 2016. It is written in C++ and runs on Linux, FreeBSD and macOS.
OLAP vs OLTP
OLAP – read‑heavy analytical queries, multidimensional aggregation, decision‑support.
OLTP – transaction‑heavy CRUD operations, strong consistency, typical for banking or e‑commerce.
Key Characteristics of OLAP Workloads
Mostly read requests.
Large batches of rows (>1000) are written.
No updates to already inserted rows.
Queries read many rows but only a few columns.
Wide tables with many columns.
Low query concurrency (hundreds per second per server).
Latency tolerance around 50 ms for simple queries.
Data values are small (numbers, short strings).
High throughput required for single queries (hundreds of millions of rows per second).
Transactions are not mandatory; consistency requirements are low.
Why Columnar Storage Suits OLAP
Columnar layout allows reading only the columns required by a query, provides compression ratios of ten‑fold or higher, and reduces I/O because unrelated columns are never touched.
Performance Techniques in ClickHouse
Vectorized execution (SIMD) for data‑parallel computation, delivering several‑fold speedups.
Multithreaded and distributed processing; computation is pushed to the data nodes.
Multiple storage policies: default, JBOD (round‑robin across disks), and HOT/COLD (SSD for hot data, HDD for cold data).
MergeTree Family Overview
MergeTree is the core engine. Data is stored in partitions, each partition contains a set of data parts. A sparse primary index (primary.idx) is generated every index_granularity rows (default 8192). Columns are stored in separate .bin files with accompanying .mrk marker files.
Partitioning and Indexing
Partitions are defined with PARTITION BY. The index granularity controls the size of MarkRange blocks. Primary key can be declared with ORDER BY or PRIMARY KEY. The index directory name follows the pattern PartitionID_MinBlockNum_MaxBlockNum_Level.
Data Insertion and Compression
During insertion ClickHouse groups rows into batches of 64 KB–1 MB, compresses each column independently, and writes a .mrk file that records offsets for the corresponding .bin block. Small batches (<64 KB) are merged into the next block; batches larger than 1 MB are split.
Specialized MergeTree Engines
ReplacingMergeTree – removes duplicate rows during merge; optional version column determines which row survives.
SummingMergeTree – aggregates numeric columns on merge, useful for pre‑summarized data.
AggregatingMergeTree – stores aggregate function states (e.g., uniq, sum) and merges them; often paired with materialized views.
CollapsingMergeTree – supports row‑level updates and deletes using a signed sign column (1 = insert, -1 = delete).
VersionedCollapsingMergeTree – adds a version column to control collapse order across partitions.
Other Table Engines
HDFS – maps external CSV files stored in HDFS as a ClickHouse table.
MySQL – creates a remote table that reads from and writes to a MySQL table.
Kafka – consumes messages from Kafka topics as a virtual table.
Installation on Linux
ClickHouse requires a CPU with SSE 4.2 support. Example commands to verify support and install the Debian package:
$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported" sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
echo "deb https://repo.clickhouse.com/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get -y install clickhouse-server clickhouse-client
sudo service clickhouse-server startDocker Deployment
Run ClickHouse server and client containers:
docker pull yandex/clickhouse-server
docker pull yandex/clickhouse-client
docker run -d --name ch-server -p 8123:8123 -p 9000:9000 -p 9009:9009 yandex/clickhouse-server
docker run -it --rm yandex/clickhouse-client clickhouse-clientHDFS Setup Example
Deploy a three‑node HDFS cluster with Docker Compose (image singularities/hadoop) and create a directory for ClickHouse data.
# Pull Hadoop image
docker pull singularities/hadoop
# Start cluster (docker‑compose.yml omitted for brevity)
docker-compose up -d
# Scale datanodes to three
docker-compose scale datanode=3After the cluster is running, create a ClickHouse table that reads CSV files from HDFS:
CREATE TABLE hdfs_table10 (
id UInt32,
code String,
name String
) ENGINE = HDFS('hdfs://namenode:8020/clickhouse/hdfs_table10', 'CSV');MySQL Engine Example
CREATE TABLE mysql_demo (
id UInt32,
name String
) ENGINE = MySQL('127.0.0.1:3306', 'test', 'demo', 'root', '');
INSERT INTO mysql_demo VALUES (1, 'example');Kafka Engine Example
CREATE TABLE kafka_test (
id UInt32,
code String,
name String
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'broker1:9092',
kafka_topic_list = 'sales-queue',
kafka_group_name = 'chgroup',
kafka_format = 'JSONEachRow',
kafka_skip_broken_messages = 100;Best Practices
Avoid SELECT * in production; query only the columns you need because ClickHouse reads whole columns from disk, and selecting unnecessary columns can degrade performance by orders of magnitude.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
