
How ClickHouse Executes GROUP BY and Handles Real‑Time Analytics on Billions of Rows

This article explains ClickHouse’s core architecture—including its storage‑compute integration, MPP parallelism, columnar storage, vectorized execution, data pre‑sorting, table engines, sparse and auxiliary indexes, and the two‑stage aggregation pipeline—then walks through the exact GROUP BY execution flow for both local and distributed tables, illustrating each step with SQL demos and code snippets.

Tech Freedom Circle

ClickHouse Core Architecture Overview

ClickHouse is a distributed column‑oriented OLAP database whose performance stems from two design pillars: storage‑compute integration and massively parallel processing (MPP). The system stores data locally in a columnar format, allowing queries to read only the required columns, and executes queries in parallel across all nodes.

1. Storage Layer vs. Compute Layer

The storage layer handles data persistence, compression, and pre‑sorting, while the compute layer performs query planning and execution. Unlike engines such as Spark that separate storage from compute, ClickHouse keeps computation on the nodes that hold the data, avoiding a network shuffle between layers and reducing I/O overhead.

2. MPP Parallelism

Every node in a ClickHouse cluster is a peer: queries are split into sub‑tasks that run concurrently on all shards, fully utilizing the multi‑core CPUs of each node.
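
The shards and replicas that would execute those sub‑tasks can be inspected directly from a system table; a minimal sketch, assuming a hypothetical cluster named analytics_cluster defined in the server's remote_servers configuration:

-- List the shards and replicas of a (hypothetical) cluster that a Distributed query fans out to.
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    is_local
FROM system.clusters
WHERE cluster = 'analytics_cluster';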

Key Performance Optimizations

Columnar Storage

Data is stored per column, so only the columns needed for a query are read. This also improves compression ratios (often 8:1).
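
Both effects can be observed on a real table by comparing the compressed and uncompressed sizes ClickHouse tracks per column. A sketch, assuming a hypothetical MergeTree table orders_local in the default database:

-- Per-column on-disk (compressed) vs. logical (uncompressed) size for a MergeTree table.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / nullIf(data_compressed_bytes, 0), 1) AS ratio
FROM system.columns
WHERE database = 'default' AND table = 'orders_local';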

Vectorized Execution

Rows are processed in fixed‑size blocks (tens of thousands of rows per block by default, controlled by the max_block_size setting) using SIMD instructions (AVX2, AVX‑512) to achieve high CPU utilization.
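
Block size and parallelism are ordinary query settings rather than fixed constants. A quick way to inspect or override them (orders_local is a hypothetical local table used throughout these sketches):

-- Current values of the settings that shape vectorized execution.
SELECT name, value
FROM system.settings
WHERE name IN ('max_block_size', 'max_threads');

-- Both can also be overridden per query.
SELECT sum(order_amount)
FROM orders_local
SETTINGS max_block_size = 65536, max_threads = 8;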

Data Pre‑Sorting (MergeTree)

When data is inserted, it is sorted in memory by the table’s ORDER BY key and written as an immutable part; background merges later combine these parts into larger sorted parts within each partition. Sorted data enables fast range scans (e.g., WHERE date > '2024-01-01').
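
The resulting parts can be observed per table; a sketch against the hypothetical orders_local table, where each row of the result is one sorted on-disk part:

-- Active data parts of a MergeTree table: one row per sorted, immutable part.
SELECT
    partition,
    name,
    rows,
    formatReadableSize(bytes_on_disk) AS size_on_disk,
    active
FROM system.parts
WHERE database = 'default' AND table = 'orders_local'
ORDER BY partition, name;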

Table Engines

MergeTree: Primary engine for massive structured tables; supports pre‑sorting and partitioning.

Distributed: Logical engine that routes queries to underlying MergeTree tables on each shard; does not store data itself.

Memory, Log: Specialized engines for temporary or log data.

The Distributed engine works together with MergeTree: the former provides a unified entry point, while the latter holds the actual data.
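
A minimal sketch of how the two engines are typically paired, using the orders schema from the demo later in this article; the cluster name, the monthly partitioning expression, and the cityHash64 sharding expression are assumptions:

-- Local MergeTree table on every shard: holds the actual data, pre-sorted by the primary key.
CREATE TABLE orders_local
(
    user_id      UInt64,
    order_amount Decimal(18, 2),
    create_date  Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(create_date)
ORDER BY (user_id, create_date);

-- Distributed "umbrella" table: stores nothing, routes queries to orders_local on each shard.
CREATE TABLE orders AS orders_local
ENGINE = Distributed('analytics_cluster', currentDatabase(), 'orders_local', cityHash64(user_id));

Queries sent to orders fan out to orders_local on every shard, which is exactly the path the distributed GROUP BY demo below follows.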

Index Design

Sparse (primary) index: Stores the primary‑key values of the first row of each granule (default 8192 rows) in primary.idx, with the granule’s offsets into the column files kept in .mrk mark files. Queries first locate candidate granules via this index, dramatically reducing the number of rows scanned.

Auxiliary indexes (minmax, bloom filter, set): Optional indexes that further skip blocks when the primary index does not filter enough.
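
As a sketch of how such indexes are declared on the hypothetical orders_local table (index names, columns, and GRANULARITY values here are illustrative):

-- minmax: skips granule blocks whose value range cannot satisfy the filter.
ALTER TABLE orders_local ADD INDEX idx_amount_minmax order_amount TYPE minmax GRANULARITY 4;

-- bloom_filter: helps equality filters on high-cardinality columns (0.01 is the false-positive rate).
ALTER TABLE orders_local ADD INDEX idx_user_bloom user_id TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the new indexes for data parts that already exist on disk.
ALTER TABLE orders_local MATERIALIZE INDEX idx_amount_minmax;
ALTER TABLE orders_local MATERIALIZE INDEX idx_user_bloom;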

Aggregation Engine

ClickHouse’s aggregation consists of two phases:

Pre‑aggregate: Multiple threads process data partitions in parallel, inserting intermediate results into a thread‑local HashMap.

Final‑aggregate: A single thread merges all thread‑local maps. To avoid the single‑thread bottleneck, ClickHouse can automatically switch to a Two‑Level HashMap (256 buckets) when the number of groups exceeds group_by_two_level_threshold, enabling parallel merging.

The engine chooses the appropriate hash map implementation based on key characteristics (e.g., FixedHashMap for UInt8, StringHashMap for strings, TwoLevelHashMap for high‑cardinality keys).
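
The two‑phase shape (parallel pre‑aggregation feeding a merge stage) can be observed with EXPLAIN PIPELINE; the exact operator names vary by version, but a hash GROUP BY typically shows several parallel aggregating transforms followed by a merging step. A sketch against the hypothetical local table:

EXPLAIN PIPELINE
SELECT user_id, sum(order_amount)
FROM orders_local
GROUP BY user_id
SETTINGS max_threads = 4;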

GROUP BY Execution Flow

Local Table (single node)

Filter & read: Use the partition key and the primary (sparse) index to locate only the matching granules, then read just the columns the query needs.

Group calculation: For low cardinality (< 1 M groups) use an in‑memory hash table; for very high cardinality, spill intermediate results to disk and merge them (external aggregation).

Result output: Return the grouped rows directly.
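
To confirm the first step (index‑based pruning) on a concrete query, EXPLAIN with index information enabled reports which partitions and primary‑key ranges survive the filter; a sketch against the hypothetical local table:

-- Shows partition pruning and primary-key range selection for the filtered GROUP BY.
EXPLAIN indexes = 1
SELECT user_id, sum(order_amount) AS total_amount
FROM orders_local
WHERE create_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY user_id;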

Distributed Table (multiple nodes)

Local pre‑aggregation on each shard (same steps as the local table).

Global aggregation on the coordinating node: partial results are merged; decomposable functions (SUM, COUNT) simply combine their partial values, while functions such as count(DISTINCT ...) must merge the full distinct‑value state from every shard, which is far more expensive.

Result return to the client.

Key optimizations include partition pruning, parallel pre‑aggregation, two‑level hash maps, and using approximate functions (e.g., uniq or uniqCombined instead of an exact count(DISTINCT ...)) to reduce work.
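
For reference, the exact and approximate forms in ClickHouse are uniqExact (equivalent to count(DISTINCT ...)) and uniq / uniqCombined; an illustrative comparison on the demo table:

SELECT
    uniqExact(user_id) AS exact_distinct_users,  -- exact: full distinct-key state merged from every shard
    uniq(user_id)      AS approx_distinct_users  -- approximate: small fixed-size state, cheap to merge
FROM orders
WHERE create_date BETWEEN '2024-01-01' AND '2024-01-31';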

Demo

Calculate total order amount per user for January 2024 on a Distributed table orders (partitioned by create_date, primary key (user_id, create_date), shard key user_id):

SELECT
  user_id,
  SUM(order_amount) AS total_amount
FROM orders
WHERE create_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY user_id
SETTINGS
  group_by_two_level_threshold = 100000;

The coordinator forwards the query to all shards, each shard performs the local aggregation described above, and the coordinator merges the partial results into the final per‑user totals.

Performance Impact

Partition pruning reduces I/O by > 90 % for time‑range queries.

Parallel pre‑aggregation (with group_by_two_level_threshold = 100000) cuts execution time from 8 s to 3 s on a single node.

Local pre‑aggregation shrinks data transferred between nodes from billions of rows to a few hundred thousand groups, saving > 99 % of network traffic.
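
Numbers like these can be checked on any cluster via the query log, which records rows read, bytes read, duration, and peak memory per query (assuming query logging is enabled, as it is by default):

-- Recent finished queries touching the demo GROUP BY, with their cost metrics.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS bytes_read,
    formatReadableSize(memory_usage) AS peak_memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%GROUP BY user_id%'
ORDER BY event_time DESC
LIMIT 5;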

Conclusion

ClickHouse achieves real‑time analytics on billions of rows by tightly coupling storage and compute, leveraging columnar layout, vectorized processing, intelligent indexing, and a flexible two‑level aggregation architecture that scales from a single node to large clusters.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
