Databases 18 min read

ClickHouse: Architecture, Core Features, and Limitations for Interactive Analytics

ClickHouse is a PB‑scale, open‑source columnar OLAP database that uses a ZooKeeper‑coordinated sharded cluster, columnar storage, vectorized execution, advanced compression, data‑skipping indexes, and materialized views to deliver high‑performance interactive analytics, yet it requires manual shard management, lacks a mature MPP optimizer, and handles real‑time single‑row writes poorly.

Tencent Cloud Developer

Oct 20, 2020

ClickHouse: Architecture, Core Features, and Limitations for Interactive Analytics

1. Interactive Analytics Overview

Interactive analytics, also known as OLAP (Online Analytical Processing), enables users to perform multi‑dimensional, ad‑hoc statistical analysis on massive datasets, supporting reporting, real‑time dashboards, and on‑the‑fly queries.

Typical characteristics include read‑heavy workloads, append‑only writes, no strong transactional guarantees, and queries that scan many rows but only a subset of columns.

2. ClickHouse in the Landscape

ClickHouse is a PB‑scale, open‑source analytical database originally developed by Yandex for Yandex.Metrica. Since its open‑source release in 2016, it has gained rapid adoption (12.4K Stars on GitHub) and is used by companies such as Cisco, Splunk, Tencent, and Alibaba.

Benchmark results show ClickHouse leading most OLAP workloads, outperforming Vertica by 6× and Greenplum by 18×.

3. ClickHouse Architecture

3.1 Cluster Architecture

ClickHouse uses a sharded distributed architecture coordinated by ZooKeeper. Each shard contains multiple node processes (replicas) for reliability.

3.2 Data Model

Logical Model

Clusters contain databases, databases contain tables. Users first create local tables on each shard (replicas) and then a distributed table that maps to those local tables.

# 首先，创建本地表
CREATE TABLE table_local ON CLUSTER cluster_test (
    OrderKey UInt32,        # 列定义
    OrderDate Date,
    Quantity UInt8,
    TotalPrice UInt32,
    ……
) ENGINE = MergeTree()                    # 表引擎
PARTITION BY toYYYYMM(OrderDate)        # 分区方式
ORDER BY (OrderDate, OrderKey);         # 排序方式
SETTINGS index_granularity = 8192;       # 数据块大小

# 然后，创建分布式表
CREATE TABLE table_distribute ON CLUSTER cluster_test AS table_local
ENGINE = Distributed(cluster_test, default, table_local, rand());   # 关系映射引擎

Physical Model

Data is partitioned by the PARTITION BY column, stored in column‑wise files (.bin for data, .mrk2 for indexes). Within each partition, rows are sorted by the ORDER BY key, and data is stored in blocks (default 8192 rows) with a sparse primary‑key index for fast block pruning.

4. Core Features

4.1 Columnar Storage

Only the required columns are read from disk, reducing I/O by up to tenfold compared with row‑store systems.

4.2 Vectorized Execution

Operations are performed on batches of rows (vectors) using SIMD instructions, dramatically lowering function‑call overhead and cache misses.

4.3 Encoding & Compression

ClickHouse supports generic compressors (LZ4, ZSTD) and specialized codecs (Delta, DoubleDelta, Gorilla) that achieve compression ratios around 10:1.

4.4 Multiple Indexes

Besides the primary‑key sparse index, ClickHouse offers a variety of data‑skipping indexes to prune unnecessary rows during query execution.

4.5 Materialized Views (Cube/Rollup)

ClickHouse combines ROLAP flexibility with MOLAP performance by using MergeTree engines and MATERIALIZED VIEW statements to pre‑aggregate data.

4.6 Other Features

SQL dialect compatible with ANSI SQL; JDBC/ODBC support.

Role‑Based Access Control (RBAC).

Multi‑node, multi‑core parallel execution.

Approximate query algorithms and data sampling.

Colocated Join for sharded tables with identical sharding keys.

5. Limitations

5.1 Distributed Control

Users must manually manage shards and replicas; there is no automatic data rebalancing when scaling out, and fault recovery requires manual node replacement.

5.2 Compute Engine

ClickHouse lacks a mature MPP engine and optimizer for complex multi‑table joins or nested sub‑queries, and does not provide built‑in UDF support.

5.3 Real‑time Writes

The system is optimized for batch ingestion (≥ 1000 rows per batch). Real‑time single‑row writes generate many small files, degrading performance; a buffering layer is recommended.

6. Conclusion

ClickHouse delivers exceptional OLAP performance through its columnar storage, vectorized execution, and advanced compression, making it a leading choice for PB‑scale interactive analytics. While it still has gaps in distributed management, compute optimization, and real‑time ingestion, ongoing development aims to address these shortcomings.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance distributed architecture ClickHouse OLAP data compression Columnar Storage Materialized Views

Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.