Why ClickHouse Beats Presto for Real‑Time Metrics: A Deep Dive
This article examines the shortcomings of a Storm‑based real‑time metric platform, outlines the requirements for a stable, SQL‑driven, fast engine, and explains why ClickHouse was chosen over Presto, detailing performance benchmarks, architectural advantages, cluster configuration, engine options, best practices, and common operational issues.
Background and Requirements
Rapid growth of business lines and metric volume exposed the limits of a Storm‑based real‑time computation platform: unintuitive metric definitions, no data back‑tracking, and frequent out‑of‑memory crashes. The new engine had to be:
Stable (avoid OOM and node failures)
SQL‑compatible (no custom topology conversion)
Capable of data back‑tracking
Fast enough to return results within five minutes on tables with billions of rows
Evaluation and Choice of ClickHouse
Presto and Druid were benchmarked. Presto could not meet the sub‑5‑minute latency, and Druid required fixed dimensions and high maintenance. ClickHouse offered sub‑second query speed, flexible dimensions, and could reuse existing single‑node deployments, so it was selected.
Performance Comparison
Cluster specifications:
ClickHouse: 60 nodes (32 CPU cores, 128 GB RAM each) using ReplicatedMergeTree with two replicas per shard.
Presto: 100 nodes with identical hardware.
Results on a 110 billion‑row table:
Count(*) – ClickHouse < 2 s on 30 machines (half of the 60‑node cluster); Presto ~80 s on 100 machines.
Count + GROUP BY + ORDER BY + LIMIT – ClickHouse ~5 s on the same dataset; Presto ~100 s on a 400 billion‑row dataset.
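For reference, the two benchmark query shapes look roughly like the following (table and column names are illustrative, not taken from the original benchmark):

```sql
-- Shape 1: full-table count
SELECT count() FROM events_all;

-- Shape 2: count with grouping, ordering, and a limit
SELECT dim, count() AS cnt
FROM events_all
GROUP BY dim
ORDER BY cnt DESC
LIMIT 100;
```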
Why ClickHouse Is Fast
Optimized C++ code – modern C++ features accelerate execution.
Vectorized columnar engine – SIMD instructions and LLVM‑based JIT compilation.
Sparse indexes – partition‑level and column‑level indexes reduce scanned data.
Storage‑execution coupling – eliminates network overhead between compute and storage.
SSD with high IOPS – fast disk I/O.
Table Types and Index Mechanics
ClickHouse distinguishes between local tables (actual storage) and distributed tables (routing layer). Queries are usually issued against distributed tables, which forward the request to the appropriate local tables and aggregate the results.
Indexes are sparse, with a default granularity of 8192 rows per index entry, so a table with 1 billion rows needs only about 122,000 index entries. Primary and secondary indexes can be combined to narrow the data blocks that must be read; columns without indexes require full scans.
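A minimal sketch of the two table types (table names, columns, and the cluster name are hypothetical): the local MergeTree table holds the data and its sparse index, while the Distributed table only routes queries to it.

```sql
-- Local table: actual storage, partitioned and sparsely indexed
CREATE TABLE metrics_local
(
    event_date Date,
    metric_id  UInt32,
    value      Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)    -- partition-level pruning
ORDER BY (metric_id, event_date)     -- sparse primary index
SETTINGS index_granularity = 8192;   -- one index entry per 8192 rows (the default)

-- Distributed table: a routing layer over metrics_local on every shard of "my_cluster"
CREATE TABLE metrics_all AS metrics_local
ENGINE = Distributed(my_cluster, default, metrics_local, rand());
```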
Insert and Select Workflow
Inserts are batch‑oriented. Data is written directly to disk, sorted, partitioned, and merged in the background by ReplicatedMergeTree. The merge process may interact with ZooKeeper for coordination.
Selects on distributed tables are routed to local tables. The query engine uses the sparse indexes to locate relevant data blocks, then reads only those blocks from SSD.
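A sketch of the resulting write and read paths, reusing the hypothetical tables above:

```sql
-- Writes go to the local table in batches
INSERT INTO metrics_local (event_date, metric_id, value) VALUES
    ('2021-06-01', 1, 0.5),
    ('2021-06-01', 2, 1.7);

-- Reads go through the distributed table: the query fans out to each shard's
-- local table, sparse indexes prune data blocks, and partial results are merged
SELECT metric_id, count() AS cnt, avg(value) AS avg_value
FROM metrics_all
WHERE event_date >= '2021-06-01'
GROUP BY metric_id;
```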
Cluster Replication Strategies
Single replica – simple but any node or disk failure makes the cluster unavailable and risks data loss.
Multiple replicas – implemented via RAID or ClickHouse’s native ReplicatedMergeTree (ZooKeeper‑based). Provides fault tolerance, load‑balanced query routing, and higher concurrency.
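With native replication, the local table from the earlier sketch would instead use a ReplicatedMergeTree engine whose ZooKeeper path identifies the shard and replica (the {shard} and {replica} macros are defined per node in the server configuration; names remain hypothetical):

```sql
-- ZooKeeper coordinates the replicas of each shard through this path
CREATE TABLE metrics_local
(
    event_date Date,
    metric_id  UInt32,
    value      Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/metrics_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (metric_id, event_date);
```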
MergeTree Engine Family
(Replicated)MergeTree – stores raw data; the baseline engine.
(Replicated)SummingMergeTree – additive pre‑aggregation during merges.
(Replicated)ReplacingMergeTree – deduplicates rows; useful for "latest‑value" scenarios.
(Replicated)AggregatingMergeTree – pre‑aggregates data; often paired with materialized views for fast UV/PV.
(Replicated)CollapsingMergeTree – similar to ReplacingMergeTree but uses a sign column to collapse duplicates.
All variants perform aggregation in the merge phase, so queries may need special handling (e.g., using FINAL or materialized views).
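For example, a pre‑aggregated unique‑visitor (UV) rollup might pair AggregatingMergeTree with a materialized view along these lines (all table, view, and column names are illustrative):

```sql
-- Raw event stream (hypothetical)
CREATE TABLE events_local
(
    event_date Date,
    page       String,
    user_id    UInt64
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_date, page);

-- Materialized view keeps a partial aggregation state per page per day
CREATE MATERIALIZED VIEW page_uv_mv
ENGINE = AggregatingMergeTree
PARTITION BY event_date
ORDER BY (event_date, page)
AS SELECT
    event_date,
    page,
    uniqState(user_id) AS uv_state
FROM events_local
GROUP BY event_date, page;

-- Queries finish the aggregation at read time with the -Merge combinator
SELECT event_date, page, uniqMerge(uv_state) AS uv
FROM page_uv_mv
GROUP BY event_date, page;
```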
Recommended Parameter Tuning
max_concurrent_queries – increase from the default 100 to 150+ based on workload.
max_bytes_before_external_sort – enable to spill large ORDER BY operations to disk.
background_pool_size – raise from 16 to 32 to accelerate merges.
max_memory_usage and max_memory_usage_for_all_queries – set close to the physical memory limit, leaving headroom for the OS.
max_bytes_before_external_group_by – typically half of max_memory_usage, to trigger disk‑based GROUP BY when needed.
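A sketch of how the per‑query limits might be applied at the session level (the absolute values are illustrative for a 128 GB node, not recommendations):

```sql
SET max_memory_usage = 100000000000;                   -- ~100 GB per query, leaving OS headroom
SET max_bytes_before_external_group_by = 50000000000;  -- spill GROUP BY to disk at ~half of max_memory_usage
SET max_bytes_before_external_sort = 50000000000;      -- spill large ORDER BY operations to disk
-- max_concurrent_queries and background_pool_size are server-level knobs and are
-- normally set in the server configuration rather than per session.
```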
Common Operational Issues and Solutions
Too many parts (e.g., 304) – merges falling behind inserts. High insert frequency creates small parts faster than background merges can combine them. Mitigate by reducing the insert rate, increasing background_pool_size, or redesigning partitioning to limit the number of new parts created per second.
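To see whether merges are keeping up, the active part count per partition can be checked in the system tables, for example:

```sql
-- Partitions with the most active parts; a steadily growing count means merges lag behind inserts
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;
```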
DB::NetException – connection reset. Often caused by the ClickHouse server being killed after a memory spike. Ensure max_memory_usage and related limits are properly configured.
Memory limit exceeded. Queries that exceed max_memory_usage are terminated. Enable external GROUP BY/ORDER BY or increase the limits.
ZooKeeper snapshot size and sync timeouts. Large snapshots (>2 GB) and too many znodes (>4 M) strain ZooKeeper. Clean up old snapshots, keep the znode count below 4 M, and enable use_minimalistic_part_header_in_zookeeper to shrink the part metadata stored in ZooKeeper.
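The setting can be enabled per table (or globally in the server configuration); a sketch against the hypothetical table used earlier:

```sql
-- Store compact part headers in ZooKeeper instead of separate znodes per part's columns and checksums
ALTER TABLE metrics_local MODIFY SETTING use_minimalistic_part_header_in_zookeeper = 1;
```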
ZooKeeper pressure → read‑only mode. Put ZooKeeper data on separate storage, use SSDs for ZooKeeper logs, or deploy a dedicated ZooKeeper cluster per ClickHouse cluster.
Best Practices
Use local tables for real‑time ingestion; avoid distributed tables for low‑latency writes.
Prefer the MergeTree family of engines for stability and performance.
Be cautious with ON CLUSTER statements; they may hang and sometimes require manual intervention.
