Databases 31 min read

Why ClickHouse Is the Ideal Choice for Massive Data Storage and Real‑Time Analytics

This article examines the massive‑scale data requirements of an activity‑tracking platform, compares MySQL, Elasticsearch and HBase, and explains why ClickHouse—with its columnar storage, MergeTree engine, vectorized execution, and distributed architecture—offers the best combination of storage capacity, write performance, real‑time analysis, and query speed for billions of records.

JD Cloud Developers

Jul 19, 2023

Why ClickHouse Is the Ideal Choice for Massive Data Storage and Real‑Time Analytics

Background

The Magic Flute activity platform needs to record user behavior data for customer service, operations, product, and R&D, requiring storage and analysis of tens to hundreds of billions of rows, with real‑time write performance and analytical capabilities.

Store massive data

High write performance

Support real‑time analytics

Good query performance

Technology Options

2.1 MySQL Single Table

MySQL is single‑node; a single table can hold from a few hundred thousand to a few hundred million rows, far below the required tens of billions, so it is unsuitable.

2.2 MySQL Sharding

Sharding increases capacity and write throughput but adds resource consumption, complex OLAP queries, and latency due to distributed aggregation; it still struggles with tens of billions of rows.

2.3 Elasticsearch

Distributed search engine with inverted index, capable of storing large data and real‑time analysis, but suffers from write throughput bottlenecks, high latency, and higher storage cost.

2.4 HBase

Column‑family store built on HDFS, good write performance, but not a real‑time analytics engine, has limited query performance and high read latency for analytical workloads.

2.5 ClickHouse

ClickHouse is a high‑speed, scalable, low‑cost columnar DBMS supporting OLAP, real‑time analytics, and massive data volumes. It uses SQL, supports many data formats, and offers excellent compression and partitioning.

ClickHouse Detailed Introduction

3.1 Origin

Developed by Yandex in 2016, written in C++, originally for Yandex.Metrica click‑stream analytics.

3.2 Architecture

Multi‑Master architecture with equal‑role nodes; supports both single‑master and multi‑master modes, horizontal scaling via segment nodes, and automatic sharding.

3.3 Features

3.3.1 Columnar Storage

Compared with row storage, columnar storage stores each column’s values contiguously, reducing I/O for queries that need only a subset of columns and enabling better compression.

3.3.2 Complete DBMS Functions

DDL/DML support

Fine‑grained permission control

Backup and restore

Distributed cluster management

3.3.3 Data Compression

Default LZ4 compression achieves up to 8:1 ratio; compression works on blocks of 8192 rows, reducing CPU overhead.

压缩前：ABCDE_BCD 压缩后：ABCDE_(5,3)

3.3.4 Vectorized Execution Engine

Eliminates loops by using SIMD instructions; example:

for (size_t i = 0; i < 100; ++i) c[i] = a[i] + b[i];

c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
...

3.3.5 Sharding and Distributed Queries

Data is horizontally sharded across nodes; local tables hold data shards, while Distributed tables act as proxies to query all shards in parallel.

3.3.6 Multithreading

Combines vectorized execution (data‑level parallelism) with multithreaded processing (core‑level parallelism) for high throughput.

3.3.7 Relational Model and Standard SQL

Supports standard SQL (SELECT, GROUP BY, JOIN, etc.) and many storage engines; MergeTree family provides primary key, partitioning, and sampling.

3.5 Table Engine – MergeTree

MergeTree supports primary‑key sorting, partitioning, replication, and sampling. Example creation:

CREATE TABLE IF NOT EXISTS db.table (
    col1 UInt64,
    col2 String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (col1, date);

Key parameters: PARTITION BY, ORDER BY, optional PRIMARY KEY, SAMPLE BY, and SETTINGS such as index_granularity.

3.6 Indexes

3.6.1 Primary (Sparse) Index

One index entry per index_granularity rows (default 8192), stored in primary.idx, enabling fast range scans.

3.6.2 Secondary (Skip) Indexes

Optional skip indexes (minmax, set, ngrambf_v1, tokenbf_v1) reduce scan ranges further.

3.6.3 Index Granularity

Controls both primary index and data block size; smaller granularity gives finer‑grained pruning at the cost of more index overhead.

3.7 Query and Write Paths

Queries use partition, primary, and secondary indexes plus mark files to minimize data read. Writes generate new partition directories, primary index entries, and compressed data blocks; background merges compact data.

ClickHouse Application Scenarios

Data warehouses and BI reporting

Real‑time analytics and monitoring

Time‑series storage

Data visualization dashboards

Pros and Cons

Advantages

High‑speed columnar storage with strong compression

Vectorized execution and multithreading

Distributed architecture with automatic sharding

Rich SQL support and mature DBMS features

Drawbacks

No full ACID transactions

No built‑in full‑text tokenization

Limited low‑latency row‑level updates/deletes

Joins are not a strength

Not optimized for very high QPS workloads (recommended ~100 QPS)

Why Queries Are Fast

ClickHouse minimizes disk I/O through columnar storage, pre‑sorting, compression, vectorized functions, and avoiding joins; it performs GROUP BY in memory using hash tables and leverages multithreading and distributed execution.

Why Writes Are Fast

Writes are appended sequentially in large blocks (LSM‑tree‑like), compressed, and merged in the background, achieving 50‑200 MB/s (≈500 k‑2 M rows/s) on commodity hardware.

-end-

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

real-time analytics ClickHouse Data Warehouse OLAP scalable storage Columnar Database MergeTree

Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.