Understanding ClickHouse: Architecture, Principles, and Performance
This article introduces ClickHouse, an open‑source columnar OLAP database, explains its architecture—including columnar storage, block processing, LSM, indexing and vectorized execution—highlights its performance advantages over other engines, and discusses its limitations such as write‑amplification, concurrency constraints, and ZooKeeper dependency.
ClickHouse Overview
ClickHouse is an open‑source column‑oriented database management system developed by Yandex and open‑sourced in 2016 for online analytical processing (OLAP) scenarios. It is written in C++ and queried via SQL.
Core Design Concepts
ClickHouse adopts a columnar storage model, storing each column separately so that queries needing only a few columns read far less data than they would in a row‑store system. Data is processed in blocks, which are groups of rows (up to 8192 by default) compressed with LZ4. This enables efficient batch processing and reduces I/O operations.
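To make the I/O saving concrete, here is a minimal Python sketch, not ClickHouse code. The table, its column names, and the helper functions are hypothetical; only the 8192-row block size comes from the text. It stores three columns as separate typed arrays and counts the bytes an aggregate over one column has to touch.

```python
from array import array

BLOCK_ROWS = 8192  # block size mentioned in the text

# Hypothetical table (user_id, price, quantity) with 10,000 rows,
# stored column by column: one typed array per column.
N = 10_000
columns = {
    "user_id": array("q", range(N)),
    "price": array("d", (i * 0.5 for i in range(N))),
    "quantity": array("q", (i % 7 for i in range(N))),
}

def iter_blocks(column, block_rows=BLOCK_ROWS):
    """Yield consecutive slices of at most block_rows values."""
    for start in range(0, len(column), block_rows):
        yield column[start:start + block_rows]

def sum_column(cols, name):
    """Aggregate one column block by block and count the bytes touched."""
    total = 0.0
    bytes_read = 0
    for block in iter_blocks(cols[name]):
        total += sum(block)
        bytes_read += len(block) * block.itemsize
    return total, bytes_read

total, bytes_read = sum_column(columns, "price")
table_bytes = sum(len(c) * c.itemsize for c in columns.values())
print(f"read {bytes_read} of {table_bytes} bytes")
```

In this toy layout the query reads one third of the table. A row store would have to read every field of every row to compute the same sum.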
The system uses an LSM‑like approach: incoming data is first collected in memory, sorted, and then flushed to disk as immutable files (parts). Periodic background merges combine these into larger sorted files, and the resulting pre‑ordering further cuts disk reads.
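The write path described above can be sketched in a few lines of Python. This is an illustration of the idea only. `PartStore` and its methods are hypothetical names, and the real merge scheduling is far more involved.

```python
import heapq

class PartStore:
    """Toy LSM-like store: each insert becomes one sorted, immutable part."""

    def __init__(self):
        self.parts = []  # list of tuples; each tuple is sorted by key

    def insert(self, rows):
        # Sort the incoming batch in memory, then "flush" it as a new part.
        self.parts.append(tuple(sorted(rows)))

    def merge_all(self):
        # Merging sorted parts is a k-way merge; no full re-sort is needed.
        merged = tuple(heapq.merge(*self.parts))
        self.parts = [merged]
        return merged

store = PartStore()
store.insert([(5, "e"), (1, "a")])
store.insert([(4, "d"), (2, "b")])
store.insert([(3, "c")])

parts_before = len(store.parts)  # three small parts, one per insert
merged = store.merge_all()       # one larger sorted part
```

Each small insert adds a part that later has to be merged, which is why many tiny inserts are costly and larger batches are preferred.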
Indexing combines a sparse primary index (recording the first row of each block) with secondary skip indexes that store aggregated information, allowing queries to prune large data ranges without scanning every row.
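The following Python sketch shows how both kinds of index prune data, assuming a tiny granule of 4 rows so the output is easy to follow. The column names, data, and function names are made up for illustration, and the min/max summary is just one possible form of "aggregated information".

```python
from bisect import bisect_left, bisect_right

GRANULE_ROWS = 4  # tiny for illustration; the text cites 8192 by default

# Sorted key column of a hypothetical table: 0, 2, 4, ..., 38 (20 rows).
keys = list(range(0, 40, 2))
# A second, unsorted column used to illustrate a min/max skip index.
latency = [7, 9, 8, 6, 50, 55, 52, 51, 3, 2, 4, 1,
           90, 95, 91, 99, 20, 21, 22, 23]

# Sparse primary index: the key of the first row of each granule.
marks = keys[::GRANULE_ROWS]  # [0, 8, 16, 24, 32]

def granules_for_key_range(marks, lo, hi):
    """Granules that may hold keys in [lo, hi].

    The lower bound is deliberately conservative (bisect_left) so that
    duplicate keys straddling a granule boundary are not missed.
    """
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi) - 1
    return list(range(first, last + 1))

# Skip index: one (min, max) summary per granule of the latency column.
minmax = [
    (min(latency[i:i + GRANULE_ROWS]), max(latency[i:i + GRANULE_ROWS]))
    for i in range(0, len(latency), GRANULE_ROWS)
]

def granules_for_value_range(minmax, lo, hi):
    """Granules whose [min, max] interval overlaps [lo, hi]."""
    return [g for g, (mn, mx) in enumerate(minmax) if mx >= lo and mn <= hi]

key_hits = granules_for_key_range(marks, 10, 17)        # granules 1 and 2
value_hits = granules_for_value_range(minmax, 50, 60)   # granule 1 only
```

In both cases only the selected granules are read and filtered row by row. The rest of the column is never touched.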
Vectorized execution leverages CPU SIMD instructions (e.g., SSE4.2) to perform the same operation on multiple data items simultaneously, dramatically speeding up computation compared with traditional row‑by‑row loops.
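Python cannot issue SIMD instructions directly, so the sketch below only contrasts the two loop shapes: one row at a time versus one operator applied to a whole column batch. The column names and functions are hypothetical. In an engine written in C++, the second shape is what lets the inner loop map onto SIMD instructions.

```python
from array import array
import operator

# Two columns of a hypothetical block.
price = array("d", [1.5, 2.0, 3.25, 4.0])
quantity = array("q", [2, 1, 4, 3])

def revenue_row_at_a_time(price, quantity):
    # One row per iteration: multiply and accumulate are interleaved,
    # and per-row overhead is paid on every step.
    total = 0.0
    for i in range(len(price)):
        total += price[i] * quantity[i]
    return total

def revenue_column_at_a_time(price, quantity):
    # One operator per batch: first multiply all pairs into a new column,
    # then reduce that column. Each step is a tight loop over one array.
    products = array("d", map(operator.mul, price, quantity))
    return sum(products)

row_result = revenue_row_at_a_time(price, quantity)
batch_result = revenue_column_at_a_time(price, quantity)
```

Both functions compute the same value. The difference lies in how the work is organized, not in the result.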
Performance Highlights
Benchmark comparisons show ClickHouse delivering 2.3× the speed of Presto, 3× that of Impala, 7× that of Greenplum, and up to 48× that of Hive for single‑table SQL queries, largely due to its columnar layout, block compression, and parallel processing (MPP architecture).
Known Limitations
High‑frequency real‑time writes generate many small files, leading to merge overhead and degraded query performance; the recommended pattern is batch writes of larger volumes.
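One common way to follow the batch-write recommendation is to buffer rows on the client and send them in large chunks. The sketch below assumes a hypothetical `send` callable that performs the actual insert. The class name and the batch size are illustrative, not taken from any ClickHouse client library.

```python
class BatchingWriter:
    """Buffer rows in memory and hand them to `send` in large batches."""

    def __init__(self, send, max_rows=10_000):
        self.send = send          # hypothetical callable doing the real insert
        self.max_rows = max_rows
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        # Send whatever is buffered as a single insert, then start fresh.
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

# Usage: collect the batches in a list instead of talking to a server.
batches = []
writer = BatchingWriter(send=batches.append, max_rows=1000)
for i in range(2500):
    writer.write((i, "event"))
writer.flush()  # flush the final partial batch

batch_sizes = [len(b) for b in batches]
```

Here 2,500 single-row writes turn into three inserts. A production writer would usually also flush on a timer so that rows do not wait indefinitely during quiet periods.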
Because ClickHouse parallelizes each query across CPU cores, a single query can consume as much as half of the CPU. The server therefore caps concurrency (e.g., max_concurrent_queries), and clients can hit “too many simultaneous queries” errors under heavy load.
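A client-side gate is one way to stay under the server's limit instead of hitting the error. The sketch below is an assumption about how an application might do this, not a ClickHouse feature. `QueryGate` and `fake_query` are hypothetical, and the limit of 4 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class QueryGate:
    """Allow at most `limit` queries in flight; extra callers wait."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, query_fn, *args):
        with self._sem:  # blocks while `limit` queries are already running
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return query_fn(*args)
            finally:
                with self._lock:
                    self.in_flight -= 1

def fake_query(n):
    # Stand-in for a real query call; sleeps to simulate server work.
    time.sleep(0.01)
    return n * n

gate = QueryGate(limit=4)
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda n: gate.run(fake_query, n), range(32)))
```

Sixteen worker threads submit queries, but the gate never lets more than four run at once. The remaining callers queue on the client rather than being rejected by the server.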
The open‑source version relies on ZooKeeper for replica coordination; heavy ZooKeeper load can become a bottleneck, prompting some users to replace it with Raft‑based solutions.
Resource management is limited to per‑user memory caps; exceeding thresholds results in query termination, so external resource‑group components are often added to enforce finer‑grained controls.
Conclusion
Understanding ClickHouse’s architectural choices—columnar storage, block processing, LSM‑style writes, sparse indexing, and vectorized execution—helps practitioners exploit its strengths while mitigating drawbacks such as write amplification, concurrency limits, and external coordination dependencies.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.