ClickHouse Overview: Features, Performance, Engines, and Comparison with Hadoop
This article introduces ClickHouse as a high‑performance, column‑oriented database designed for real‑time big‑data analytics, outlines its key features, performance characteristics, supported interfaces, differences from Hadoop, and explains its main storage engines—MergeTree and Distributed—while also noting its current limitations.
Background
When discussing big data, Hadoop is often mentioned, but the traditional Hadoop stack (HDFS + MapReduce) mainly serves offline workloads with limited timeliness: results are typically available only the next day (T+1). ClickHouse was created to address the need for low‑latency processing of massive data volumes.
1. Features
Column‑oriented storage
Data compression
Disk‑based storage that avoids excessive memory consumption
High CPU utilization, leveraging all CPU cores during queries
Sharding support with parallel execution across shards
SQL support, lowering the entry barrier for big‑data analysis
Join capabilities
Real‑time data updates
Automatic multi‑replica synchronization
Index support
Distributed query execution
2. Performance
Low latency (≈50 ms) for short queries on cached data with primary key lookup
Supports concurrent queries; for short queries, plan for a recommended maximum of about 100 queries per second per server
Write speed of 50‑200 MB/s with the MergeTree engine (≈50 k‑200 k rows per second for 1 KB records)
3. Interfaces
External HTTP and JDBC interfaces
Internal modules communicate via TCP connections
4. Differences from Hadoop
Hadoop is primarily an offline system; ClickHouse supports ad‑hoc (instant) queries
Hadoop does not support real‑time updates; ClickHouse does
Hadoop typically stores data row by row, so a query reads whole rows even when it needs only a few columns; ClickHouse stores data by column and reads only the columns a query references
Engine Overview
ClickHouse offers a variety of storage engines; the two most important are MergeTree and Distributed.
1. MergeTree
MergeTree is the most advanced engine in ClickHouse and serves as the basis for a family of related engines.
Features
Supports primary‑key and date indexes
Provides real‑time data updates
Optimized for fast reads and writes
Requires a Date‑type column for default time‑based partitioning
Partitioning
Default partitioning is monthly; data from different months is never merged into the same partition
Each partition is stored in separate sub‑folders
New data is written to new folders; background threads periodically merge them
Each folder contains data sorted by primary key and an index file for that folder
Partition Enhancements
Older versions only allowed month‑level partitions
Since version 1.1.54310, custom partitioning is supported
Partition information can be inspected via the system.parts table
Example: a table storing data for January, February, and March 2018 is physically split into three partitions (201801, 201802, 201803) across four folders (so one partition still holds two folders that have not yet been merged), as shown in the following diagrams.
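The partition layout described above can be reproduced with a small sketch. The table name visits, its columns, and the sample rows are hypothetical and chosen only for illustration; the PARTITION BY clause assumes a ClickHouse version with custom partitioning (1.1.54310 or later).

```sql
-- Hypothetical table: one partition per calendar month of EventDate.
CREATE TABLE IF NOT EXISTS visits
(
    EventDate Date,
    CounterID UInt32,
    URL       String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

-- Each INSERT writes a new folder (part); two of them land in March.
INSERT INTO visits VALUES ('2018-01-15', 34, 'https://example.com/a');
INSERT INTO visits VALUES ('2018-02-10', 34, 'https://example.com/b');
INSERT INTO visits VALUES ('2018-03-05', 35, 'https://example.com/c');
INSERT INTO visits VALUES ('2018-03-20', 36, 'https://example.com/d');

-- List the folders and the partition each one belongs to.
SELECT partition, name, rows, active
FROM system.parts
WHERE database = currentDatabase() AND table = 'visits'
ORDER BY partition, name;
```

Right after the inserts, the query would be expected to list four folders across partitions 201801, 201802, and 201803. Once a background merge combines the two March folders, the older ones remain visible for a while with active = 0.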
Indexes
Each partition sub‑folder has its own index
Indexes are used when WHERE clauses involve equality, inequality, range, IN, or boolean conditions on indexed columns or the Date column
LIKE predicates do not use the index
Example of a non‑indexed query:
SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'
For date indexes, queries run only on partitions containing the relevant dates
Specifying the primary key is recommended to avoid scanning large amounts of data within a partition
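To contrast the two cases, here is a minimal sketch. It assumes a hypothetical table visits partitioned by month and sorted by (CounterID, EventDate); the names and date range are illustrative. The EXPLAIN form with indexes = 1 is only available in recent ClickHouse releases.

```sql
-- Hypothetical table, partitioned by month and sorted by the primary key.
CREATE TABLE IF NOT EXISTS visits
(
    EventDate Date,
    CounterID UInt32,
    URL       String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

-- Index-friendly: the date range limits the partitions to scan,
-- and the equality on CounterID narrows the range inside each one.
SELECT count()
FROM visits
WHERE CounterID = 34
  AND EventDate >= toDate('2018-02-01')
  AND EventDate <  toDate('2018-03-01');

-- Not index-friendly: the OR branch with LIKE on a non-key column
-- can match any row, so nothing can be skipped.
SELECT count()
FROM visits
WHERE CounterID = 34 OR URL LIKE '%upyachka%';

-- On recent versions, this reports which partitions and index ranges
-- the first query would read.
EXPLAIN indexes = 1
SELECT count()
FROM visits
WHERE CounterID = 34
  AND EventDate >= toDate('2018-02-01')
  AND EventDate <  toDate('2018-03-01');
```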
2. Distributed
The Distributed engine does not store actual data; it acts as a proxy that forwards reads and writes to underlying data nodes (e.g., MergeTree tables).
Typical cluster topology: a Distributed table in front of several MergeTree data nodes.
Distributed acts as a proxy, storing only the table schema
MergeTree acts as a data node, storing the actual rows
Distributed requires parameters:
<cluster_name, remote_database, remote_table, sharding_key>
Cluster name: the identifier of the cluster, as defined in the server configuration
Remote database name: the database where the MergeTree tables reside
Remote table name: the actual table storing data
Sharding key: optional; defines how rows are distributed across shards
A Distributed table can be split into multiple shards; each shard can have multiple replicas with identical data.
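The four parameters map onto a table definition roughly as follows. This is a sketch under assumptions: example_cluster stands for a cluster already declared in the server configuration, and visits_local, visits_all, and the choice of cityHash64(CounterID) as sharding key are hypothetical.

```sql
-- Data node side (run on every shard): the table that stores the rows.
CREATE TABLE IF NOT EXISTS default.visits_local
(
    EventDate Date,
    CounterID UInt32,
    URL       String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

-- Proxy side: copies the column list from the local table, stores no rows.
-- Arguments: cluster name, remote database, remote table, sharding key.
CREATE TABLE IF NOT EXISTS default.visits_all AS default.visits_local
ENGINE = Distributed(example_cluster, default, visits_local, cityHash64(CounterID));
```

Hashing CounterID keeps all rows for one counter on the same shard. When no such locality is needed, a random key such as rand() spreads rows evenly instead.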
Data Retrieval Process
Query is dispatched to remote shards and executed in parallel
If a replica fails, the query is retried on another replica of the same shard
Remote engines’ indexes are used during query execution
Aggregations are performed on data nodes first, then intermediate results are merged on the Distributed node
Data Ingestion Process
Direct write to data nodes (optimal, parallel writes across nodes)
Write to a Distributed table, which then routes data to the appropriate shard based on the sharding key
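The two ingestion paths, and a read through the proxy, might look like this. The sketch assumes two hypothetical tables that already exist with columns (EventDate Date, CounterID UInt32, URL String): a MergeTree table default.visits_local on each shard and a Distributed table default.visits_all in front of them.

```sql
-- Path 1: connect to a specific data node and write to its local table.
-- The client decides which shard receives which rows.
INSERT INTO default.visits_local VALUES ('2018-01-15', 34, 'https://example.com/a');

-- Path 2: write to the Distributed table; each row is routed to a shard
-- according to the table's sharding key.
INSERT INTO default.visits_all VALUES ('2018-02-10', 35, 'https://example.com/b');

-- Read through the proxy: each shard aggregates its own rows,
-- and the node that received the query merges the partial results.
SELECT CounterID, count() AS hits
FROM default.visits_all
GROUP BY CounterID
ORDER BY CounterID;
```

By default, rows written through a Distributed table are forwarded to the shards asynchronously, so a read issued immediately afterwards may not yet include them.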
ClickHouse Limitations
No transaction support
Aggregations are limited by the RAM of a single machine
Documentation and operational guidance are relatively scarce, making maintenance challenging
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
