Big Data 11 min read

ClickHouse Overview: Features, Performance, Engines, and Comparison with Hadoop

This article introduces ClickHouse as a high‑performance, column‑oriented database designed for real‑time big‑data analytics, outlines its key features, performance characteristics, supported interfaces, differences from Hadoop, and explains its main storage engines—MergeTree and Distributed—while also noting its current limitations.

JD Tech

Jul 4, 2018

ClickHouse Overview: Features, Performance, Engines, and Comparison with Hadoop

Background

When discussing big data, Hadoop is often mentioned, but the traditional Hadoop stack (HDFS + MapReduce) mainly serves offline workloads with limited timeliness (typically T+1). ClickHouse was created to address the need for low‑latency processing of massive data volumes.

1. Features

Column‑ariented storage

Data compression

Disk‑based storage that avoids excessive memory consumption

High CPU utilization, leveraging all CPU cores during queries

Sharding support with parallel execution across shards

SQL support, lowering the entry barrier for big‑data analysis

Join capabilities

Real‑time data updates

Automatic multi‑replica synchronization

Index support

Distributed query execution

2. Performance

Low latency (≈50 ms) for short queries on cached data with primary key lookup

Supports high concurrency; recommended 100 queries per second for short queries

Write speed of 50‑200 MB/s with the MergeTree engine (≈50 k‑200 k rows per second for 1 KB records)

3. Interfaces

External HTTP and JDBC interfaces

Internal modules communicate via TCP connections

4. Differences from Hadoop

Hadoop is primarily an offline system; ClickHouse supports ad‑hoc (instant) queries

Hadoop does not support real‑time updates; ClickHouse does

Hadoop stores rows, requiring full‑column scans; ClickHouse stores columns, reading only needed columns

Engine Overview

ClickHouse offers a variety of storage engines; the two most important are MergeTree and Distributed.

1. MergeTree

MergeTree is the most advanced engine in ClickHouse and serves as the basis for a family of related engines.

Features

Supports primary‑key and date indexes

Provides real‑time data updates

Optimized for fast reads and writes

Requires a Date‑type column for default time‑based partitioning

Partitioning

Default partitioning is monthly; data from the same month is never merged across partitions

Each partition is stored in separate sub‑folders

New data is written to new folders; background threads periodically merge them

Each folder contains data sorted by primary key and an index file for that folder

Partition Enhancements

Older versions only allowed month‑level partitions

Since version 1.1.54310, custom partitioning is supported

Partition information can be inspected via the system.parts table

Example: a table storing data for January, February, and March 2018 is physically split into three partitions (201801, 201802, 201803) across four folders, as shown in the following diagrams.

Indexes

Each partition sub‑folder has its own index

Indexes are used when WHERE clauses involve equality, inequality, range, IN, or boolean conditions on indexed columns or the Date column

LIKE predicates do not use the index

Example of a non‑indexed query:

SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'

For date indexes, queries run only on partitions containing the relevant dates

Specifying the primary key is recommended to avoid scanning large amounts of data within a partition

2. Distributed

The Distributed engine does not store actual data; it acts as a proxy that forwards reads and writes to underlying data nodes (e.g., MergeTree tables).

Typical cluster topology: a Distributed table in front of several MergeTree data nodes.

Distributed acts as a proxy, storing only the table schema

MergeTree acts as a data node, storing the actual rows

Distributed requires parameters:

<cluster_name, remote_database, remote_table, sharding_key>

Cluster name: the current cluster identifier. Remote database name: the database where the MergeTree tables reside. Remote table name: the actual table storing data. Sharding rule: optional, defines how rows are distributed across shards. A Distributed table can be split into multiple shards; each shard can have multiple replicas with identical data.

Data Retrieval Process

Query is dispatched to remote shards and executed in parallel

If a replica fails, the query retries other replicas

Remote engines’ indexes are used during query execution

Aggregations are performed on data nodes first, then intermediate results are merged on the Distributed node

Data Ingestion Process

Direct write to data nodes (optimal, parallel writes across nodes)

Write to a Distributed table, which then routes data to the appropriate shard based on the sharding key

ClickHouse Limitations

No transaction support

Aggregations are limited by the RAM of a single machine

Documentation and operational guidance are relatively scarce, making maintenance challenging

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems ClickHouse Hadoop Columnar Database

Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.