Can ClickHouse Distributed Tables Outperform Single-Node Tables? A Real-World Benchmark
This article presents a systematic benchmark comparing ClickHouse local (single-node) tables and distributed tables across three data volumes (about 6 billion, 500 million, and 50 million rows) using a variety of aggregation and filter queries. Distributed tables win clearly at large scale, and the gap narrows as the dataset shrinks.
In response to a claim that ClickHouse (CK) local tables are "invincible" compared to distributed tables, the author designs a comprehensive performance test to verify the statement.
Test Design
The experiment loads the same CSV-based web-log dataset into two tables with identical schemas: a local table on a single server and a distributed table spanning three identical nodes. The single CK server has 128 GB of RAM, a 56-thread E5-2680 CPU, SATA disks in RAID 0, and a 1 Gbps NIC.
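The article does not show the DDL for the distributed table. A minimal sketch of how such a table is commonly declared follows, assuming a three-shard cluster named `bench_cluster` in the server configuration, the `default` database, and a per-node MergeTree table called `table_name`. The cluster name, the `table_name_dist` name, and the `rand()` sharding key are illustrative assumptions, not details from the benchmark.

```sql
-- Hypothetical names: bench_cluster, table_name_dist. Not taken from the article.
-- Assumed precondition: the MergeTree table `table_name` already exists on every node.
-- The Distributed table stores no data itself; it routes reads and writes to the shards.
CREATE TABLE default.table_name_dist ON CLUSTER bench_cluster
AS default.table_name
ENGINE = Distributed(
    bench_cluster,   -- cluster name from the server's remote_servers config (assumed)
    default,         -- database holding the local tables
    table_name,      -- local table on each shard
    rand()           -- sharding key for inserts; a random spread is an assumption
);
```

Queries would then target `table_name_dist` instead of `table_name`: each shard processes its own part of the data and the initiating node merges the partial results.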
Three data‑volume scenarios are prepared:
Large: ~6 billion rows
Medium: 500 million rows
Small: 50 million rows
For each volume the same set of queries is executed on both table types, and execution time is recorded.
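The article does not say how execution time was captured. One possible approach, sketched here under the assumption that query logging is enabled on the server (the `log_queries` setting), is to read server-side timings back from the `system.query_log` table. The `ILIKE` filter on the table name is a hypothetical way to pick out the benchmark queries.

```sql
-- Make sure buffered log entries are written before reading them.
SYSTEM FLUSH LOGS;

SELECT
    event_time,
    query_duration_ms,                           -- server-side duration of the query
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    substring(query, 1, 80) AS query_head        -- first 80 characters, enough to identify the query
FROM system.query_log
WHERE type = 'QueryFinish'                       -- completed queries only
  AND event_date = today()
  AND is_initial_query = 1                       -- skip per-shard sub-queries of a distributed query
  AND query ILIKE '%table_name%'                 -- hypothetical filter on the benchmark table name
ORDER BY event_time DESC
LIMIT 20;
```

Reading `query_duration_ms` for the initial query gives one comparable number per run for both the local and the distributed table.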
Schema Definition
CREATE TABLE table_name (
    `client_ip` String,
    `domain` String,
    `time` String,
    `target_ip` String,
    `rcode` String,
    `query_type` String,
    `authority_record` String,
    `add_msg` String,
    `dns_ip` String
) ENGINE = MergeTree
PRIMARY KEY client_ip
ORDER BY client_ip;

Queries Executed
All queries are written in plain ClickHouse SQL; key examples include:
SELECT count(DISTINCT client_ip) FROM table_name;

SELECT lower(domain) AS domain, count(domain) AS count
FROM table_name
WHERE target_ip = ''
GROUP BY domain;

SELECT client_ip, count(client_ip) AS count
FROM table_name
GROUP BY client_ip
ORDER BY count DESC
LIMIT 10;

SELECT target_ip, count(target_ip) AS count
FROM table_name
WHERE (client_ip = '192.168.200.124') AND (isIPv4String(target_ip) = 1)
GROUP BY target_ip
ORDER BY count DESC
LIMIT 10;

SELECT *
FROM table_name
WHERE (client_ip = '1.0.125.208')
  AND (domain = 'PuLL-hls-F6.DOuYiNcdN.COM.')
  AND (time = '20230522072154');

A complex window-function query is also included to find the top-10 client IPs with the most consecutive minute-level accesses.
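The window-function query itself is not reproduced in the article. The sketch below shows one common way to express the idea, the gaps-and-islands technique, and is an assumption about the intent rather than the author's query. It assumes `time` is formatted as `YYYYMMDDhhmmss`, as in the filter example above, and a ClickHouse version with window-function support.

```sql
WITH per_minute AS (
    -- One row per client IP per minute in which it made at least one request.
    SELECT DISTINCT
        client_ip,
        toStartOfMinute(parseDateTimeBestEffort(time)) AS minute
    FROM table_name
),
numbered AS (
    -- Within a run of consecutive minutes, the minute index and the row number
    -- grow together, so their difference stays constant and identifies the run.
    SELECT
        client_ip,
        intDiv(toUnixTimestamp(minute), 60)
            - row_number() OVER (PARTITION BY client_ip ORDER BY minute) AS run_id
    FROM per_minute
),
runs AS (
    -- Length of each run of consecutive minutes, per client IP.
    SELECT client_ip, run_id, count() AS run_length
    FROM numbered
    GROUP BY client_ip, run_id
)
SELECT client_ip, max(run_length) AS longest_run
FROM runs
GROUP BY client_ip
ORDER BY longest_run DESC
LIMIT 10;
```

The `DISTINCT` step matters: without it, several requests in the same minute would inflate the row numbers and split one run into many.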
Results – Large Data Volume (~6 billion rows)
For the distinct-client-IP count, the distributed table finishes in ~7 s versus ~14 s for the local table (≈2× faster). For the domain group-by query, the distributed table is ~3× faster (232 s vs 726 s). For the top-10 client-IP query, the reported timings are 0.12 s vs 0.29 s (≈2.4× faster). Even the simplest filter query runs about 2.2× faster on the distributed table (0.18 s vs 0.40 s). The conclusion: distributed tables win decisively on massive data.
Results – Medium Data Volume (500 million rows)
Execution times shrink for both table types, but the distributed table remains consistently faster. Example: distinct‑client‑IP count takes ~1 s (distributed) vs ~1.6 s (local). The domain‑group‑by query drops to 31 s (distributed) vs 101 s (local). The top‑10 client‑IP query is 0.46 s vs 0.89 s. Overall, the performance gap narrows but the distributed table still leads.
Results – Small Data Volume (50 million rows)
When the dataset is small, the absolute timing differences become marginal for most queries. Distinct-client-IP count: 0.24 s (distributed) vs 0.28 s (local). Top-10 client-IP query: 0.12 s (distributed) vs 0.15 s (local). Simple filter queries differ by less than 0.1 s (0.15 s vs 0.24 s). The domain group-by is the exception: 4.6 s (distributed) vs 12.9 s (local), still a clear win for the distributed table. The author notes that at this scale the two approaches are almost indistinguishable in practice.
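To compare the three scales on equal footing, the reported timings can be turned into speedup ratios (local time divided by distributed time). The query below is a small worked calculation over the figures quoted in the three results sections. It uses ClickHouse's `values` table function and touches no benchmark data; the `scale` and `query_name` labels are illustrative.

```sql
SELECT
    scale,
    query_name,
    dist_s,
    local_s,
    round(local_s / dist_s, 2) AS speedup   -- >1 means the distributed table was faster
FROM values(
    'scale String, query_name String, dist_s Float64, local_s Float64',
    ('large',  'distinct client_ip', 7.0,   14.0),
    ('large',  'domain group-by',    232.0, 726.0),
    ('large',  'top-10 client_ip',   0.12,  0.29),
    ('large',  'simple filter',      0.18,  0.40),
    ('medium', 'distinct client_ip', 1.0,   1.6),
    ('medium', 'domain group-by',    31.0,  101.0),
    ('medium', 'top-10 client_ip',   0.46,  0.89),
    ('small',  'distinct client_ip', 0.24,  0.28),
    ('small',  'domain group-by',    4.6,   12.9),
    ('small',  'top-10 client_ip',   0.12,  0.15),
    ('small',  'simple filter',      0.15,  0.24)
)
ORDER BY speedup DESC;
```

Ratios make it easier to separate queries whose relative advantage persists at every scale from those where only a few hundredths of a second are at stake.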
Conclusions
The benchmark does not support the claim that local tables are unbeatable: for ClickHouse tables in the hundred-million-row range and above, distributed tables provide a clear performance advantage, while for smaller tables the benefit diminishes and a local table may be sufficient. The author also reports an anomaly in one complex query, where the distributed table returned a different result from the local table, and flags it for future investigation.
Overall recommendation: use distributed tables once a single table reaches the hundred-million-row range or more; below that, a local table is acceptable.
ITPUB