Can ClickHouse Distributed Tables Outperform Single-Node Tables? A Real-World Benchmark

This article presents a systematic benchmark comparing ClickHouse local (single-node) tables and distributed tables at three data volumes (roughly 6 billion, 500 million, and 50 million rows) using a range of aggregation and filter queries. Distributed tables dominate at large scale, and the gap narrows as the dataset shrinks.

ITPUB

May 21, 2024

In response to a claim that ClickHouse (CK) local tables are "invincible" compared to distributed tables, the author designs a comprehensive performance test to verify the statement.

Test Design

The experiment loads the same CSV-based web-log dataset into two tables with identical schemas: a local table on a single server and a distributed table spanning three identical nodes. Each CK server has 128 GB RAM, a 56-thread E5-2680 CPU, SATA disks in RAID 0, and a 1 Gbps NIC.
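
The source does not show how the distributed table was declared. A common ClickHouse pattern is a table with the Distributed engine layered over identically named MergeTree tables on each node. The sketch below only builds such a DDL string. The cluster, database, and table names are hypothetical, and the `rand()` sharding key is an assumption rather than the author's setting.

```python
def distributed_ddl(cluster, database, local_table, dist_table, sharding_key="rand()"):
    """Build DDL for a Distributed table that fans out to `local_table` on every shard.

    The cluster name must match an entry in the servers' remote_servers configuration.
    """
    return (
        f"CREATE TABLE {database}.{dist_table} ON CLUSTER {cluster} "
        f"AS {database}.{local_table} "
        f"ENGINE = Distributed({cluster}, {database}, {local_table}, {sharding_key})"
    )


# Hypothetical names, not taken from the source.
ddl = distributed_ddl("bench_cluster", "default", "dns_log_local", "dns_log_all")
print(ddl)
```

Queries against the Distributed table are sent to every shard and the partial results are merged on the initiating node, which is why the benchmark can run the same SQL against both table types.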

Three data‑volume scenarios are prepared:

Large: ~6 billion rows

Medium: 500 million rows

Small: 50 million rows

For each volume the same set of queries is executed on both table types, and execution time is recorded.
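
The source does not describe how the timings were captured. As a minimal sketch of the procedure above, the harness below runs every query against every table and keeps the median wall-clock time. `run_query` is a hypothetical callable standing in for whatever client executes the SQL, and the table names and repeat count are assumptions.

```python
import statistics
import time


def time_query(run_query, sql, repeats=3):
    """Run `sql` `repeats` times via `run_query`; return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def benchmark(run_query, queries, tables, repeats=3):
    """Return {(query_name, table): median_seconds} for every query/table pair."""
    results = {}
    for name, template in queries.items():
        for table in tables:
            sql = template.format(table=table)
            results[(name, table)] = time_query(run_query, sql, repeats)
    return results


# Stub executor that only records the SQL it receives (no server involved).
executed = []
queries = {"distinct_ips": "SELECT count(DISTINCT client_ip) FROM {table}"}
results = benchmark(executed.append, queries, ["dns_log_local", "dns_log_all"], repeats=2)
```

Taking the median of several runs reduces the effect of cold caches on the first execution, which matters when comparing sub-second queries.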

Schema Definition

CREATE TABLE table_name (
    `client_ip` String,
    `domain` String,
    `time` String,
    `target_ip` String,
    `rcode` String,
    `query_type` String,
    `authority_record` String,
    `add_msg` String,
    `dns_ip` String
) ENGINE = MergeTree
PRIMARY KEY client_ip
ORDER BY client_ip

Queries Executed

All queries are written in plain ClickHouse SQL; key examples include:

SELECT count(distinct client_ip) FROM table_name

SELECT lower(domain) AS domain, count(domain) AS count
FROM table_name
WHERE target_ip = ''
GROUP BY domain

SELECT client_ip, count(client_ip) AS count
FROM table_name
GROUP BY client_ip
ORDER BY count DESC
LIMIT 10

SELECT target_ip, count(target_ip) AS count
FROM table_name
WHERE (client_ip = '192.168.200.124') AND (isIPv4String(target_ip) = 1)
GROUP BY target_ip
ORDER BY count DESC
LIMIT 10

SELECT *
FROM table_name
WHERE (client_ip = '1.0.125.208')
  AND (domain = 'PuLL-hls-F6.DOuYiNcdN.COM.')
  AND (time = '20230522072154')

A complex window‑function query is also included to find the top‑10 client IPs with the most consecutive minute‑level accesses.
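
The window-function query itself is not reproduced in the source. To make its intent concrete, here is a small Python sketch of one reading of it: for each client IP, find the longest run of consecutive minutes containing at least one access, then keep the top N. The function name and the `YYYYMMDDHHMMSS` format (inferred from the `time` literal in the filter query above) are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def longest_minute_streaks(rows, top_n=10):
    """rows: iterable of (client_ip, time) with time as 'YYYYMMDDHHMMSS'.

    Returns [(client_ip, longest_streak_in_minutes)], longest streak first.
    """
    minutes_by_ip = defaultdict(set)
    for client_ip, ts in rows:
        # Truncate each access to its minute so repeated hits in one minute count once.
        minute = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(second=0)
        minutes_by_ip[client_ip].add(minute)

    streaks = {}
    for client_ip, minutes in minutes_by_ip.items():
        best = run = 0
        previous = None
        for minute in sorted(minutes):
            if previous is not None and minute - previous == timedelta(minutes=1):
                run += 1  # extends the current streak
            else:
                run = 1   # gap found, start a new streak
            best = max(best, run)
            previous = minute
        streaks[client_ip] = best

    # Sort by streak length descending, then by IP so ties are deterministic.
    return sorted(streaks.items(), key=lambda item: (-item[1], item[0]))[:top_n]


sample = [
    ("1.1.1.1", "20230522072154"),
    ("1.1.1.1", "20230522072210"),
    ("1.1.1.1", "20230522072405"),
    ("2.2.2.2", "20230522075900"),
    ("2.2.2.2", "20230522080000"),
    ("2.2.2.2", "20230522080130"),
]
top = longest_minute_streaks(sample)
```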

Results – Large Data Volume (~6 billion rows)

For the distinct-client-IP count, the distributed table finishes in ~7 seconds versus ~14 seconds for the local table (≈2× faster). For the domain-group-by query, the distributed table is ~3× faster (232 s vs 726 s). For the top-10 client-IP query, the distributed table is ~2.4× faster (0.12 s vs 0.29 s). Even the simplest filter query runs about 2.2× faster on the distributed table (0.18 s vs 0.40 s). The conclusion: distributed tables win decisively on massive data.

Results – Medium Data Volume (500 million rows)

Execution times shrink for both table types, but the distributed table remains consistently faster. The distinct-client-IP count takes ~1 s (distributed) vs ~1.6 s (local), the domain-group-by query drops to 31 s vs 101 s, and the top-10 client-IP query takes 0.46 s vs 0.89 s. Overall, the gap narrows on the cheaper queries, but the distributed table still leads.

Results – Small Data Volume (50 million rows)

When the dataset is small, the absolute timing differences become marginal for most queries. Distinct-client-IP count: 0.24 s (distributed) vs 0.28 s (local). Domain-group-by: 4.6 s (distributed) vs 12.9 s (local), still a clear win for distributed. Top-10 client-IP query: 0.12 s (distributed) vs 0.15 s (local). Simple filter queries differ by less than 0.1 s (0.15 s vs 0.24 s). The author notes that at this scale the two approaches are almost indistinguishable in practice.
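
To compare the three volumes on one scale, the reported timings can be turned into speedup ratios (local time divided by distributed time). The sketch below uses only the figures quoted in the results sections. The dictionary keys are illustrative labels, and no simple-filter timing is reported for the medium volume, so it is omitted.

```python
# Timings in seconds as quoted in the text: (distributed, local).
TIMINGS = {
    "large": {
        "distinct_client_ip": (7.0, 14.0),
        "domain_group_by": (232.0, 726.0),
        "top10_client_ip": (0.12, 0.29),
        "simple_filter": (0.18, 0.40),
    },
    "medium": {
        "distinct_client_ip": (1.0, 1.6),
        "domain_group_by": (31.0, 101.0),
        "top10_client_ip": (0.46, 0.89),
    },
    "small": {
        "distinct_client_ip": (0.24, 0.28),
        "domain_group_by": (4.6, 12.9),
        "top10_client_ip": (0.12, 0.15),
        "simple_filter": (0.15, 0.24),
    },
}


def speedups(timings):
    """Return {volume: {query: local_seconds / distributed_seconds}}."""
    return {
        volume: {query: local / dist for query, (dist, local) in per_query.items()}
        for volume, per_query in timings.items()
    }


ratios = speedups(TIMINGS)
for volume, per_query in ratios.items():
    for query, ratio in per_query.items():
        print(f"{volume:<6} {query:<20} {ratio:.2f}x")
```

Read this way, the domain-group-by query keeps a roughly threefold advantage at every volume, while the cheaper queries converge toward parity as the data shrinks.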

Conclusions

The benchmark contradicts the claim that local tables are unbeatable: for ClickHouse tables in the hundred-million-row range and above, distributed tables provide a clear performance advantage; for smaller tables the benefit diminishes and local tables may be sufficient. The author also mentions an anomaly in one complex query where the distributed result differed from the local result, flagging it for future investigation.
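
The source does not say what caused the mismatch. As a hedged sketch, a benchmark can at least detect such cases automatically by comparing the two result sets as multisets, ignoring row order. The function name and sample rows below are illustrative, not taken from the author's test.

```python
from collections import Counter


def diff_results(local_rows, distributed_rows):
    """Compare two result sets as multisets of rows, ignoring row order."""
    local_counts = Counter(map(tuple, local_rows))
    dist_counts = Counter(map(tuple, distributed_rows))
    return {
        "only_local": sorted((local_counts - dist_counts).elements()),
        "only_distributed": sorted((dist_counts - local_counts).elements()),
    }


# Illustrative rows: (client_ip, count) pairs from the same query on both table types.
local_rows = [("1.1.1.1", 5), ("2.2.2.2", 3)]
distributed_rows = [("2.2.2.2", 3), ("1.1.1.1", 4)]
mismatch = diff_results(local_rows, distributed_rows)
```

An empty result in both lists means the tables agree. One possible but unconfirmed source of harmless differences is an `ORDER BY ... LIMIT` query with tied values, where which tied rows are returned is not guaranteed.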

Overall recommendation: use distributed tables when a single‑node table would hold > 1 billion rows; otherwise, a local table is acceptable.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
