ClickHouse Best Practices: Table Engines, Cluster Architecture, and Operational Guidelines
This guide gives an overview of ClickHouse: its core use cases, table-engine choices, cluster design, ZooKeeper integration, query and data-loading best practices, client tools, and the key configuration parameters that keep OLAP workloads fast and reliable.
ClickHouse is an open‑source columnar DBMS widely used for OLAP analytics; the author shares practical development and usage standards based on experience.
Application scenarios include user behavior analysis, real-time log monitoring, data warehousing, A/B testing, and analysis of machine-generated logs, handling billions of rows daily with sub-second query latency.
Table engine selection – ClickHouse offers four engine families (Log, MergeTree, Integration, Special) with Replicated and Distributed variants. The MergeTree family is the most versatile and is the primary choice for most workloads.
The MergeTree engine stores data sorted by the ORDER BY key and supports partitioning and sampling; replication comes with the Replicated variants of the family. Example DDL:
CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]
(
name1 type1 [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
name2 type2 [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
...
INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]
Key options: ORDER BY (required), PARTITION BY (strongly recommended), PRIMARY KEY (optional), SAMPLE BY, TTL (highly recommended for large tables), and SETTINGS (e.g., index_granularity=8192).
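To make the key options concrete, here is a minimal sketch of a MergeTree table. The database, table, and column names are hypothetical, the 90-day retention is an illustrative value, and the statement assumes a database named demo already exists.

```sql
-- Hypothetical names throughout; adjust types and retention to your data.
CREATE TABLE IF NOT EXISTS demo.events_local
(
    event_date  Date,
    event_time  DateTime,
    user_id     UInt64,
    event_type  LowCardinality(String),
    duration_ms UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)           -- one partition per month
ORDER BY (event_type, user_id, event_time)  -- sorting key; also the primary key when PRIMARY KEY is omitted
TTL event_date + INTERVAL 90 DAY DELETE     -- expire rows older than 90 days
SETTINGS index_granularity = 8192;
```

Monthly partitioning is chosen here only to keep the partition count low; a daily key may fit better when queries always filter on a single day.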
ReplicatedMergeTree adds high‑availability via ZooKeeper; suitable for production but can stress ZooKeeper at very large scales. Example DDL:
CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]
(
`id` Int64,
`ymd` Int64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/replicated/{shard}/test', '{replica}')
PARTITION BY ymd
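ORDER BY id
ReplacingMergeTree deduplicates rows that share the same sorting key (ORDER BY) during background merges; an optional version column controls which row is kept. Because merges run at an unspecified time, duplicates can remain visible until then.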
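A minimal sketch of that deduplication behavior, assuming a hypothetical demo.user_profile_local table in an existing demo database, with updated_at used as the version column:

```sql
-- Hypothetical table; the row with the largest updated_at is kept per user_id.
CREATE TABLE IF NOT EXISTS demo.user_profile_local
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO demo.user_profile_local VALUES (1, 'old@example.com', '2024-01-01 00:00:00');
INSERT INTO demo.user_profile_local VALUES (1, 'new@example.com', '2024-02-01 00:00:00');

-- Until a merge runs, a plain SELECT may return both rows.
-- FINAL collapses duplicates at query time, at extra query cost.
SELECT user_id, email, updated_at
FROM demo.user_profile_local FINAL
WHERE user_id = 1;
```

FINAL is convenient for small lookups; on large scans it can be noticeably slower than a plain read. The general form of the statement follows.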
CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]
(
name1 type1,
name2 type2,
...
) ENGINE = ReplacingMergeTree([ver])
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
SummingMergeTree sums the numeric columns of rows that share the same sorting key when parts are merged, similar to a GROUP BY operation.
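As a hedged illustration of the summing behavior, the sketch below uses a hypothetical demo.page_views_daily_local table in an existing demo database. The columns to sum are listed explicitly.

```sql
-- Hypothetical table; views and clicks are summed per (day, page) during merges.
CREATE TABLE IF NOT EXISTS demo.page_views_daily_local
(
    day    Date,
    page   String,
    views  UInt64,
    clicks UInt64
)
ENGINE = SummingMergeTree((views, clicks))
PARTITION BY toYYYYMM(day)
ORDER BY (day, page);

INSERT INTO demo.page_views_daily_local VALUES ('2024-01-01', '/home', 10, 1);
INSERT INTO demo.page_views_daily_local VALUES ('2024-01-01', '/home', 5, 2);

-- Merges happen in the background, so aggregate explicitly when reading.
SELECT day, page, sum(views) AS views, sum(clicks) AS clicks
FROM demo.page_views_daily_local
GROUP BY day, page;
```

Keeping sum() and GROUP BY in the read query makes the result independent of whether the parts have already been merged. The general form of the statement follows.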
CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]
(
name1 type1,
name2 type2,
...
) ENGINE = SummingMergeTree([columns])
[PARTITION BY expr]
[ORDER BY expr]
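[SETTINGS name=value, ...]
AggregatingMergeTree stores intermediate aggregate-function states and combines them per sorting key during merges, which supports incremental statistics with custom aggregate functions.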
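The following sketch shows one way this can look in practice, assuming a hypothetical demo.visits_agg_local table in an existing demo database and synthetic input generated with numbers(). States are written with the -State combinator and read back with -Merge.

```sql
-- Hypothetical table holding aggregate states rather than final values.
CREATE TABLE IF NOT EXISTS demo.visits_agg_local
(
    day       Date,
    page      String,
    uv_state  AggregateFunction(uniq, UInt64),
    dur_state AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, page);

-- Synthetic input: 1000 rows, 100 distinct visitor ids.
-- The explicit casts keep the state types identical to the column types.
INSERT INTO demo.visits_agg_local
SELECT
    toDate('2024-01-01')          AS day,
    '/home'                       AS page,
    uniqState(toUInt64(number % 100)) AS uv_state,
    avgState(toUInt32(number))    AS dur_state
FROM numbers(1000)
GROUP BY day, page;

-- Finalize the states at read time.
SELECT
    day,
    page,
    uniqMerge(uv_state) AS unique_visitors,
    avgMerge(dur_state) AS avg_duration
FROM demo.visits_agg_local
GROUP BY day, page;
```

In production such a table is often fed by a materialized view instead of manual INSERT ... SELECT statements. The general form of the statement follows.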
CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]
(
name1 type1,
name2 type2,
...
) ENGINE = AggregatingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SETTINGS name=value, ...]
The Distributed engine provides a logical view that routes queries to the underlying local shards; it does not store data itself.
Distributed(cluster_name, database_name, table_name[, sharding_key])
Development standards cover three areas:
SQL queries – prefer IN over JOIN when the result comes from a single table, keep the right-hand table of a JOIN small, avoid SELECT *, use LIMIT, include the partition key in queries, and limit string columns, among others.
Data writes – insert in batches of 50k-100k rows, avoid high-concurrency writes, specify partition keys, and limit the number of partitions.
Naming and housekeeping – naming conventions for local (*_local) and distributed (*_shard) tables and for materialized views, plus TTL usage.
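A minimal sketch that combines the naming convention with the query guidelines. The cluster name demo_cluster, the demo database, the table names, and the cityHash64(user_id) sharding key are all assumptions made for illustration.

```sql
-- Local table: holds the data on every shard.
CREATE TABLE IF NOT EXISTS demo.orders_local ON CLUSTER demo_cluster
(
    ymd      Date,
    order_id UInt64,
    user_id  UInt64,
    amount   Decimal(18, 2)
)
ENGINE = MergeTree()
PARTITION BY ymd
ORDER BY (user_id, order_id);

-- Distributed table: stores nothing, routes queries to orders_local on each shard.
CREATE TABLE IF NOT EXISTS demo.orders_shard ON CLUSTER demo_cluster
AS demo.orders_local
ENGINE = Distributed(demo_cluster, demo, orders_local, cityHash64(user_id));

-- Name the columns, filter on the partition key, and cap the result size.
SELECT user_id, sum(amount) AS total
FROM demo.orders_shard
WHERE ymd = '2024-01-01'
GROUP BY user_id
ORDER BY total DESC
LIMIT 100;
```

Hashing on user_id keeps all rows of one user on the same shard, which helps when later queries group or join on that column.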
Cluster architecture – a typical setup is 2 shards with 2 replicas each on four machines, which scales out by adding shards. Local tables handle writes and distributed tables handle reads. The guide recommends avoiding replicas for tables larger than 100 billion rows.
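To check how a server sees such a layout, the system.clusters table can be queried. The cluster name demo_cluster is a placeholder; a 2-shard, 2-replica cluster should list one row per replica.

```sql
-- demo_cluster is a hypothetical name; use the one from your remote_servers config.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'demo_cluster'
ORDER BY shard_num, replica_num;
```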
ZooKeeper role – manages distributed DDL and replication state; can become a bottleneck; recommended configuration: three 32 GB/4 CPU nodes with 10 GbE and 80‑200 GB disks; enable use_minimalistic_part_header_in_zookeeper=1 to reduce load.
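One way to see whether the compact part-header setting is in effect on a given server is to read it from system.merge_tree_settings. This is only a check, not a way to change the value, and recent releases may already enable the setting by default.

```sql
-- Shows the effective value and whether it was changed from the built-in default.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name = 'use_minimalistic_part_header_in_zookeeper';
```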
chproxy – Go‑based HTTP proxy and load balancer for ClickHouse; provides routing, caching, rate limiting, and automatic SSL renewal. Example test command:
echo 'show databases;' | curl 'http://10.200.161.49:9009/?user=writeuser&password=xxxx' --data-binary @-
Client tools – DBeaver (open-source database administration), Superset (BI dashboards), Tabix (similar to Superset).
Availability considerations – replication ensures high availability; sharding without replication reduces resilience; ZooKeeper outage impacts writes regardless of replication.
Key configuration parameters – max_concurrent_queries (default 100, recommended 150); background_pool_size (default 16, recommended 32); max_memory_usage and max_memory_usage_for_all_queries to cap memory; and max_bytes_before_external_sort and max_bytes_before_external_group_by (typically half of max_memory_usage) to control when sorting and grouping spill to disk.
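The memory-related parameters can be tried at session level before they are written into a settings profile. The sketch below uses illustrative values (20 GB per query, spilling at half of that). Applying the same half-of-memory ratio to external sort is an assumption, not something the guide states. max_concurrent_queries is a server-level setting and is not changed with SET.

```sql
-- Illustrative values only; size them to the memory of your servers.
SET max_memory_usage = 20000000000;
SET max_bytes_before_external_group_by = 10000000000;
SET max_bytes_before_external_sort = 10000000000;

-- Inspect the effective values for the current session.
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage',
               'max_bytes_before_external_group_by',
               'max_bytes_before_external_sort');
```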
Overall, the guide consolidates ClickHouse deployment, engine selection, operational best practices, and performance tuning to help teams build reliable, high‑throughput OLAP solutions.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
