Understanding ClickHouse: Architecture, Principles, and Performance

This article introduces ClickHouse, an open‑source columnar OLAP database, explains its architecture—including columnar storage, block processing, LSM, indexing and vectorized execution—highlights its performance advantages over other engines, and discusses its limitations such as write‑amplification, concurrency constraints, and ZooKeeper dependency.

JD Tech

Jan 18, 2024

ClickHouse Overview

ClickHouse is an open‑source, column‑oriented database management system for online analytical processing (OLAP). It was developed at Yandex and released as open source in 2016, is written in C++, and is queried with SQL.

Core Design Concepts

ClickHouse adopts a columnar storage model, storing each column separately so that queries needing only a few columns read far less data than they would in a row‑store system. Data is processed in blocks, groups of rows (up to 8192 by default, which matches the default index granularity of MergeTree tables), and column data is compressed with LZ4 by default, enabling efficient batch processing and reducing I/O operations.
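The effect of a column layout with compressed blocks can be sketched in a few lines of Python. This is a toy model, not ClickHouse's on-disk format: the table, the column names, and the comma-joined encoding are invented for illustration, and `zlib` stands in for LZ4 because LZ4 is not in the Python standard library.

```python
import zlib

# Hypothetical table with three columns and 20,000 rows.
rows = [(i, f"user_{i % 100}", i % 7) for i in range(20_000)]

# Column layout: each column is stored on its own.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "weekday": [r[2] for r in rows],
}

BLOCK_ROWS = 8192  # block size quoted in the text


def to_blocks(values, block_rows=BLOCK_ROWS):
    """Split one column into compressed blocks (zlib stands in for LZ4)."""
    blocks = []
    for start in range(0, len(values), block_rows):
        chunk = values[start:start + block_rows]
        raw = ",".join(map(str, chunk)).encode()
        blocks.append(zlib.compress(raw))
    return blocks


def read_column(blocks):
    """Decompress every block of one column and return its values as strings."""
    out = []
    for blob in blocks:
        out.extend(zlib.decompress(blob).decode().split(","))
    return out


stored = {name: to_blocks(vals) for name, vals in columns.items()}

# A query such as SUM(weekday) touches only the 'weekday' blocks.
weekday_total = sum(int(v) for v in read_column(stored["weekday"]))
bytes_read = sum(len(b) for b in stored["weekday"])
bytes_total = sum(len(b) for blocks in stored.values() for b in blocks)
```

Comparing `bytes_read` with `bytes_total` shows how much of the stored data the single-column query leaves untouched.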

The system uses an LSM‑like approach: each inserted batch is sorted in memory and then written to disk as an immutable file; periodic background merges combine these files into larger sorted ones, and the resulting pre‑ordering further cuts disk reads.
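A minimal sketch of this write path, assuming each insert becomes one sorted immutable part and a merge step combines all parts into one. The `PartStore` class and its methods are hypothetical names for illustration; real merges are selective and run in the background.

```python
import heapq


class PartStore:
    """Toy model of sorted immutable parts plus a merge step."""

    def __init__(self):
        self.parts = []  # each part is an immutable, sorted tuple of (key, value)

    def insert(self, batch):
        # Sort the batch by key and keep it as one new part.
        self.parts.append(tuple(sorted(batch)))

    def merge(self):
        # k-way merge of already-sorted parts into one larger sorted part.
        merged = tuple(heapq.merge(*self.parts))
        self.parts = [merged]

    def scan(self, lo, hi):
        # Range scan across all parts; fewer parts means fewer files to visit.
        return sorted(kv for part in self.parts for kv in part if lo <= kv[0] <= hi)


store = PartStore()
store.insert([(5, "e"), (1, "a"), (9, "i")])
store.insert([(4, "d"), (2, "b")])
store.insert([(7, "g"), (3, "c")])
parts_before = len(store.parts)  # three small parts
store.merge()
parts_after = len(store.parts)   # one large sorted part
```

Because every part is already sorted, the merge is a linear pass over the inputs rather than a full re-sort.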

Indexing combines a sparse primary index, which records the first row of each block rather than every row, with secondary skip indexes that store aggregated information about each range of rows (for example minimum and maximum values), allowing queries to prune large data ranges without scanning every row.
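The pruning logic can be sketched as follows, assuming unique, sorted keys and a tiny block of 4 rows so the numbers stay readable. The column names and helper functions are invented for illustration; the min/max summary is one simple form of skip index.

```python
from bisect import bisect_right

GRANULE = 4  # tiny block for readability; the text quotes 8192 rows

keys = list(range(0, 40, 2))  # sorted primary-key column: 0, 2, ..., 38 (20 rows)

# Sparse primary index: one entry per block, holding that block's first key.
marks = keys[::GRANULE]


def granules_for_range(lo, hi):
    """Return indexes of blocks that may contain keys in [lo, hi]."""
    first = max(bisect_right(marks, lo) - 1, 0)
    last = bisect_right(marks, hi) - 1
    return list(range(first, last + 1))


# Secondary column that is not sorted, so the primary index cannot prune it.
latency = [3, 5, 4, 6, 50, 52, 51, 49, 7, 8, 6, 5, 90, 91, 95, 92, 4, 3, 5, 6]

# Skip index: (min, max) of the column for each block.
minmax = [
    (min(latency[i:i + GRANULE]), max(latency[i:i + GRANULE]))
    for i in range(0, len(latency), GRANULE)
]


def granules_with_value_at_least(threshold):
    """Return blocks whose max value could satisfy `latency >= threshold`."""
    return [g for g, (_, high) in enumerate(minmax) if high >= threshold]
```

For a key range of 10 to 18, only two of the five blocks need to be read; for `latency >= 90`, the min/max summary rules out all but one.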

Vectorized execution leverages CPU SIMD instructions (e.g., SSE4.2) to perform the same operation on multiple data items simultaneously, dramatically speeding up computation compared with traditional row‑by‑row loops.
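Pure Python cannot issue SIMD instructions, so the sketch below only illustrates the control-flow shape that makes vectorization possible: one operator call per block of column values instead of one per row. The column names and block size are assumptions for illustration.

```python
from array import array

# Two hypothetical columns stored as contiguous typed arrays.
prices = array("d", [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0])
quantities = array("q", [2, 1, 4, 3, 1, 2, 1, 5])
BLOCK = 4


def revenue_row_by_row(prices, quantities):
    """Traditional loop: one multiply-and-add step per row."""
    total = 0.0
    for i in range(len(prices)):
        total += prices[i] * quantities[i]
    return total


def multiply_block(p_block, q_block):
    """One operator call handles a whole block of values."""
    return array("d", (p * q for p, q in zip(p_block, q_block)))


def revenue_block_at_a_time(prices, quantities, block=BLOCK):
    """Block-at-a-time loop: the operator is invoked once per block."""
    total = 0.0
    for start in range(0, len(prices), block):
        products = multiply_block(
            prices[start:start + block], quantities[start:start + block]
        )
        total += sum(products)
    return total
```

In a compiled engine, the body of `multiply_block` is the tight loop over contiguous values that SIMD instructions can process several elements at a time.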

Performance Highlights

Benchmark comparisons show ClickHouse delivering 2.3× the speed of Presto, 3× that of Impala, 7× that of Greenplum, and up to 48× that of Hive for single‑table SQL queries, largely due to its columnar layout, block compression, and parallel processing (MPP architecture).

Known Limitations

High‑frequency real‑time writes generate many small files, leading to merge overhead and degraded query performance; the recommended pattern is batch writes of larger volumes.
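One way to follow the batch-write recommendation is to buffer rows on the client and hand them over in large groups. The sketch below assumes a hypothetical `sink` callable that performs one bulk insert per call; it is not tied to any particular ClickHouse client library, and a production version would also flush on a timer.

```python
class BatchingWriter:
    """Buffers rows and hands them to `sink` in large batches."""

    def __init__(self, sink, max_rows=1000):
        self.sink = sink          # hypothetical callable: one bulk insert per call
        self.max_rows = max_rows  # flush threshold, chosen arbitrarily here
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


inserts = []  # records the size of each bulk insert
writer = BatchingWriter(sink=lambda batch: inserts.append(len(batch)), max_rows=1000)
for i in range(2500):
    writer.write((i, "event"))
writer.flush()  # push the final partial batch
```

Here 2,500 individual writes turn into three inserts, so the server creates three files to merge instead of 2,500.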

Because each query is executed in parallel and can by default occupy about half of the server's CPU cores, only a limited number of queries can run at once. The server caps concurrency with settings such as max_concurrent_queries, and under heavy load clients may see “too many simultaneous queries” errors.
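The behaviour of such a concurrency cap can be sketched with a non-blocking semaphore: a query either gets a slot immediately or is rejected. The `QueryGate` class and the exception name are hypothetical and only mimic the idea behind max_concurrent_queries; they are not ClickHouse code.

```python
import threading


class TooManySimultaneousQueries(Exception):
    """Raised when no query slot is free."""


class QueryGate:
    """Admission control: reject a query instead of queueing it."""

    def __init__(self, max_concurrent_queries):
        self._slots = threading.BoundedSemaphore(max_concurrent_queries)

    def __enter__(self):
        # Non-blocking acquire: fail fast when every slot is taken.
        if not self._slots.acquire(blocking=False):
            raise TooManySimultaneousQueries("too many simultaneous queries")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._slots.release()
        return False


gate = QueryGate(max_concurrent_queries=2)
outcomes = []
with gate:          # first running query
    with gate:      # second running query
        try:
            with gate:  # third query arrives while both slots are busy
                outcomes.append("ran")
        except TooManySimultaneousQueries:
            outcomes.append("rejected")
with gate:          # slots are free again
    outcomes.append("ran")
```

A client facing this kind of rejection typically retries with backoff or routes the query to another replica.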

The open‑source version relies on ZooKeeper for replica coordination; heavy ZooKeeper load can become a bottleneck, prompting some users to replace it with Raft‑based alternatives such as ClickHouse Keeper.

Resource management is limited to per‑user memory caps; exceeding thresholds results in query termination, so external resource‑group components are often added to enforce finer‑grained controls.

Conclusion

Understanding ClickHouse’s architectural choices—columnar storage, block processing, LSM‑style writes, sparse indexing, and vectorized execution—helps practitioners exploit its strengths while mitigating drawbacks such as write amplification, concurrency limits, and external coordination dependencies.
