Can ClickHouse Distributed Tables Outperform Single-Node Tables? A Real-World Benchmark

This article presents a systematic benchmark comparing ClickHouse local (single‑node) tables and distributed tables across three data volumes (≈6 billion, 500 million and 50 million rows) using a variety of aggregation and filter queries. It finds that distributed tables dominate at large scale, while the gap narrows as the dataset shrinks.

ITPUB

In response to a claim that ClickHouse (CK) local tables are "invincible" compared to distributed tables, the author designs a comprehensive performance test to verify the statement.

Test Design

The experiment loads the same CSV‑based web‑log dataset into two tables with an identical schema: a local table on a single server and a distributed table spanning three identical nodes. The single ClickHouse (CK) server has 128 GB of RAM, a 56‑thread E5‑2680 CPU, SATA disks in RAID‑0, and a 1 Gbps NIC.

Three data‑volume scenarios are prepared:

Large: ~6 billion rows

Medium: 500 million rows

Small: 50 million rows

For each volume the same set of queries is executed on both table types, and execution time is recorded.
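
The article does not say how the timings were captured. One possible way to collect them, assuming query logging is enabled on the server (the `log_queries` setting) and that the benchmark queries reference the placeholder name `table_name`, is to read them back from `system.query_log`. This is a sketch, not the author's procedure:

```sql
-- Flush buffered log entries so the most recent queries are visible.
SYSTEM FLUSH LOGS;

-- List today's finished queries against the benchmark table, newest first.
SELECT
    event_time,
    query_duration_ms / 1000 AS elapsed_s,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    substring(query, 1, 80) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
  AND query ILIKE '%table_name%'               -- placeholder table name (assumption)
  AND query NOT ILIKE '%system.query_log%'     -- skip this inspection query itself
ORDER BY event_time DESC
LIMIT 20;
```

Running each query several times and comparing the logged durations helps separate cold-cache from warm-cache behavior.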

Schema Definition

CREATE TABLE table_name (
    `client_ip` String,
    `domain` String,
    `time` String,
    `target_ip` String,
    `rcode` String,
    `query_type` String,
    `authority_record` String,
    `add_msg` String,
    `dns_ip` String
) ENGINE = MergeTree
PRIMARY KEY client_ip
ORDER BY client_ip
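
The article shows only the shared column definition, not how the distributed variant was declared. A minimal sketch of a typical setup follows. The cluster name `bench_cluster`, the database `default`, the table names `dns_log_local` and `dns_log_all`, and the `rand()` sharding key are all assumptions for illustration and do not come from the source. `ON CLUSTER` also assumes the cluster is defined in the server configuration and that distributed DDL is available.

```sql
-- Per-shard storage table, created on every node of the (hypothetical) cluster.
CREATE TABLE default.dns_log_local ON CLUSTER bench_cluster
(
    `client_ip` String,
    `domain` String,
    `time` String,
    `target_ip` String,
    `rcode` String,
    `query_type` String,
    `authority_record` String,
    `add_msg` String,
    `dns_ip` String
)
ENGINE = MergeTree
ORDER BY client_ip;

-- Distributed "umbrella" table: stores no data itself and fans
-- reads and writes out to dns_log_local on each shard.
CREATE TABLE default.dns_log_all ON CLUSTER bench_cluster
AS default.dns_log_local
ENGINE = Distributed(bench_cluster, default, dns_log_local, rand());
```

Queries against `dns_log_all` run on every shard in parallel, and the initiating node merges the partial results.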

Queries Executed

All queries are written in plain ClickHouse SQL; key examples include:

SELECT count(distinct client_ip) FROM table_name;

SELECT lower(domain) AS domain, count(domain) AS count
FROM table_name
WHERE target_ip = ''
GROUP BY domain;

SELECT client_ip, count(client_ip) AS count
FROM table_name
GROUP BY client_ip
ORDER BY count DESC
LIMIT 10;

SELECT target_ip, count(target_ip) AS count
FROM table_name
WHERE (client_ip = '192.168.200.124') AND (isIPv4String(target_ip) = 1)
GROUP BY target_ip
ORDER BY count DESC
LIMIT 10;

SELECT *
FROM table_name
WHERE (client_ip = '1.0.125.208')
  AND (domain = 'PuLL-hls-F6.DOuYiNcdN.COM.')
  AND (time = '20230522072154');

A complex window‑function query is also included to find the top‑10 client IPs with the most consecutive minute‑level accesses.
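
The article does not reproduce the window‑function query. The sketch below is one possible formulation (a "gaps and islands" approach), not the author's query. It assumes `time` always holds a `YYYYMMDDhhmmss` string, as in the filter example above, and that the server version supports window functions and common table expressions.

```sql
WITH active_minutes AS
(
    -- One row per client and minute in which it made at least one request.
    SELECT DISTINCT
        client_ip,
        intDiv(toUnixTimestamp(parseDateTimeBestEffort(time)), 60) AS minute_no
    FROM table_name
),
islands AS
(
    -- Consecutive minutes share the same (minute_no - row_number) value.
    SELECT
        client_ip,
        minute_no - row_number() OVER (PARTITION BY client_ip ORDER BY minute_no) AS island_id
    FROM active_minutes
),
streaks AS
(
    -- Length of each run of consecutive minutes.
    SELECT client_ip, island_id, count() AS run_length
    FROM islands
    GROUP BY client_ip, island_id
)
SELECT client_ip, max(run_length) AS longest_run_minutes
FROM streaks
GROUP BY client_ip
ORDER BY longest_run_minutes DESC
LIMIT 10;
```

The subtraction trick works because, within one client, the row number and the minute number both advance by one across a run with no gaps, so their difference stays constant until a minute is skipped.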

Results – Large Data Volume (~6 billion rows)

For the distinct‑client‑IP count, the distributed table finishes in ~7 seconds versus ~14 seconds for the local table (≈2× faster). For the domain‑group‑by query, the distributed table is ~3× faster (232 s vs 726 s). For the top‑10 client‑IP query, the reported times are 0.12 s vs 0.29 s, which works out to about 2.4× faster. Even the simplest filter query shows the distributed table roughly 2× ahead (0.18 s vs 0.40 s). The conclusion: distributed tables win decisively on massive data.

Results – Medium Data Volume (500 million rows)

Execution times shrink for both table types, but the distributed table remains consistently faster. Example: distinct‑client‑IP count takes ~1 s (distributed) vs ~1.6 s (local). The domain‑group‑by query drops to 31 s (distributed) vs 101 s (local). The top‑10 client‑IP query is 0.46 s vs 0.89 s. Overall, the gap narrows in absolute terms, but the distributed table still leads by roughly 1.6× to 3.3×.

Results – Small Data Volume (50 million rows)

When the dataset is small, the absolute timing difference becomes marginal for most queries. Distinct‑client‑IP count: 0.24 s (distributed) vs 0.28 s (local). Top‑10 client‑IP query: 0.12 s (distributed) vs 0.15 s (local). Simple filter queries differ by less than a tenth of a second (0.15 s vs 0.24 s). The domain‑group‑by query is the exception: 4.6 s (distributed) vs 12.9 s (local), still a clear win for the distributed table. The author notes that at this scale the two approaches are almost indistinguishable in practice.
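
The speedup ratios can be recomputed directly from the timings reported in the three results sections. The following sketch uses ClickHouse's `values` table function with the figures quoted in this article; the column names are illustrative.

```sql
SELECT
    volume,
    query_name,
    local_s,
    distributed_s,
    round(local_s / distributed_s, 2) AS speedup   -- local time divided by distributed time
FROM values(
    'volume String, query_name String, local_s Float64, distributed_s Float64',
    ('large',  'distinct client_ip', 14.0,  7.0),
    ('large',  'domain group-by',    726.0, 232.0),
    ('large',  'top-10 client_ip',   0.29,  0.12),
    ('large',  'simple filter',      0.40,  0.18),
    ('medium', 'distinct client_ip', 1.6,   1.0),
    ('medium', 'domain group-by',    101.0, 31.0),
    ('medium', 'top-10 client_ip',   0.89,  0.46),
    ('small',  'distinct client_ip', 0.28,  0.24),
    ('small',  'domain group-by',    12.9,  4.6),
    ('small',  'top-10 client_ip',   0.15,  0.12),
    ('small',  'simple filter',      0.24,  0.15)
);
```

Comparing ratios rather than raw seconds shows which queries benefit most from sharding at each volume.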

Conclusions

The benchmark does not support the claim that local tables are "invincible": for ClickHouse tables in the hundreds of millions of rows and beyond, distributed tables provide a clear performance advantage, while for smaller tables the benefit diminishes and local tables may be sufficient. The author also mentions an anomaly in one complex query, where the distributed result differed from the local result, and flags it for future investigation.

Overall recommendation: use distributed tables when a single‑node table would hold > 1 billion rows; otherwise, a local table is acceptable.

[Figure: Large data volume chart]
[Figure: Local table result (large)]
[Figure: Distributed table result (large)]
[Figure: Medium data volume chart]
[Figure: Small data volume chart]