Databases 30 min read

Understanding ClickHouse: From OLAP Basics to Advanced Table Engines and Deployment

This guide explains ClickHouse fundamentals, OLAP versus OLTP concepts, columnar storage benefits, core performance techniques, the MergeTree family and its indexing, specialized table engines, installation on Linux, Docker deployment, and integration with HDFS, MySQL, and Kafka for modern analytical workloads.

IT Architects Alliance

Jun 19, 2022

Understanding ClickHouse: From OLAP Basics to Advanced Table Engines and Deployment

What is ClickHouse?

ClickHouse is an open‑source column‑oriented DBMS for online analytical processing (OLAP) developed by Yandex and released in 2016. It is written in C++ and runs on Linux, FreeBSD and macOS.

OLAP vs OLTP

OLAP – read‑heavy analytical queries, multidimensional aggregation, decision‑support.

OLTP – transaction‑heavy CRUD operations, strong consistency, typical for banking or e‑commerce.

Key Characteristics of OLAP Workloads

Mostly read requests.

Large batches of rows (>1000) are written.

No updates to already inserted rows.

Queries read many rows but only a few columns.

Wide tables with many columns.

Low query concurrency (hundreds per second per server).

Latency tolerance around 50 ms for simple queries.

Data values are small (numbers, short strings).

High throughput required for single queries (hundreds of millions of rows per second).

Transactions are not mandatory; consistency requirements are low.

Why Columnar Storage Suits OLAP

Columnar layout allows reading only the columns required by a query, provides compression ratios of ten‑fold or higher, and reduces I/O because unrelated columns are never touched.

Performance Techniques in ClickHouse

Vectorized execution (SIMD) for data‑parallel computation, delivering several‑fold speedups.

Multithreaded and distributed processing; computation is pushed to the data nodes.

Multiple storage policies: default, JBOD (round‑robin across disks), and HOT/COLD (SSD for hot data, HDD for cold data).

MergeTree Family Overview

MergeTree is the core engine. Data is stored in partitions, each partition contains a set of data parts. A sparse primary index (primary.idx) is generated every index_granularity rows (default 8192). Columns are stored in separate .bin files with accompanying .mrk marker files.

Partitioning and Indexing

Partitions are defined with PARTITION BY. The index granularity controls the size of MarkRange blocks. Primary key can be declared with ORDER BY or PRIMARY KEY. The index directory name follows the pattern PartitionID_MinBlockNum_MaxBlockNum_Level.

Data Insertion and Compression

During insertion ClickHouse groups rows into batches of 64 KB–1 MB, compresses each column independently, and writes a .mrk file that records offsets for the corresponding .bin block. Small batches (<64 KB) are merged into the next block; batches larger than 1 MB are split.

Specialized MergeTree Engines

ReplacingMergeTree – removes duplicate rows during merge; optional version column determines which row survives.

SummingMergeTree – aggregates numeric columns on merge, useful for pre‑summarized data.

AggregatingMergeTree – stores aggregate function states (e.g., uniq, sum) and merges them; often paired with materialized views.

CollapsingMergeTree – supports row‑level updates and deletes using a signed sign column (1 = insert, -1 = delete).

VersionedCollapsingMergeTree – adds a version column to control collapse order across partitions.

Other Table Engines

HDFS – maps external CSV files stored in HDFS as a ClickHouse table.

MySQL – creates a remote table that reads from and writes to a MySQL table.

Kafka – consumes messages from Kafka topics as a virtual table.

Installation on Linux

ClickHouse requires a CPU with SSE 4.2 support. Example commands to verify support and install the Debian package:

$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

sudo apt-get install apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
echo "deb https://repo.clickhouse.com/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get -y install clickhouse-server clickhouse-client
sudo service clickhouse-server start

Docker Deployment

Run ClickHouse server and client containers:

docker pull yandex/clickhouse-server
docker pull yandex/clickhouse-client
docker run -d --name ch-server -p 8123:8123 -p 9000:9000 -p 9009:9009 yandex/clickhouse-server
docker run -it --rm yandex/clickhouse-client clickhouse-client

HDFS Setup Example

Deploy a three‑node HDFS cluster with Docker Compose (image singularities/hadoop) and create a directory for ClickHouse data.

# Pull Hadoop image
docker pull singularities/hadoop
# Start cluster (docker‑compose.yml omitted for brevity)
docker-compose up -d
# Scale datanodes to three
docker-compose scale datanode=3

After the cluster is running, create a ClickHouse table that reads CSV files from HDFS:

CREATE TABLE hdfs_table10 (
    id UInt32,
    code String,
    name String
) ENGINE = HDFS('hdfs://namenode:8020/clickhouse/hdfs_table10', 'CSV');

MySQL Engine Example

CREATE TABLE mysql_demo (
    id UInt32,
    name String
) ENGINE = MySQL('127.0.0.1:3306', 'test', 'demo', 'root', '');
INSERT INTO mysql_demo VALUES (1, 'example');

Kafka Engine Example

CREATE TABLE kafka_test (
    id UInt32,
    code String,
    name String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'broker1:9092',
    kafka_topic_list = 'sales-queue',
    kafka_group_name = 'chgroup',
    kafka_format = 'JSONEachRow',
    kafka_skip_broken_messages = 100;

Best Practices

Avoid SELECT * in production; query only the columns you need because ClickHouse reads whole columns from disk, and selecting unnecessary columns can degrade performance by orders of magnitude.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Docker SQL ClickHouse OLAP Columnar Storage data ingestion MergeTree Table Engines

Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.