Databases 24 min read

ClickHouse Practical Guide: Engine Selection, Cluster Architecture, and Operational Best Practices

This article provides a comprehensive overview of ClickHouse, covering its core use cases, detailed explanations of the various table engines, recommended schema and deployment patterns, performance‑tuning parameters, tooling choices, and operational guidelines for building and maintaining high‑availability OLAP clusters.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
ClickHouse Practical Guide: Engine Selection, Cluster Architecture, and Operational Best Practices

ClickHouse is an open‑source columnar DBMS designed for high‑performance OLAP workloads; the article introduces its popularity, typical application scenarios such as user behavior analysis, real‑time log processing, and data warehousing, and shares the experience of a large‑scale deployment handling billions of rows daily.

Table Engine Overview

The system offers four engine families—Log, MergeTree, Integration, and Special. The MergeTree family is the most widely used, supporting partitioning, primary keys, sampling, and TTL. Other engines include Log (lightweight), Integration (Kafka, MySQL, HDFS), and Special (Distributed, MaterializedView, Dictionary).

Key MergeTree Variants

• MergeTree : basic engine with sorting key, optional partitioning, primary key, sampling, and TTL. Example creation syntax:

<span>CREATE TABLE [IF NOT EXISTS] db.table_name [ON CLUSTER cluster]</span></code><code><span>(   name1 type1 [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],</span></code><code><span>    name2 type2 [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],</span></code><code><span>    ...</span></code><code><span>    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,</span></code><code><span>    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2</span></code><code><span>) ENGINE = MergeTree()</span></code><code><span>ORDER BY expr</span></code><code><span>[PARTITION BY expr]</span></code><code><span>[PRIMARY KEY expr]</span></code><code><span>[SAMPLE BY expr]</span></code><code><span>[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]</span></code><code><span>[SETTINGS name=value, ...]</span>

• ReplicatedMergeTree : adds Zookeeper‑based replication for high availability. Example:

<span>CREATE TABLE IF NOT EXISTS ads.ads_af_city_complaint_1d_local ON CLUSTER ycdata_3shards_3replicas</span></code><code><span>(`id` UInt64 COMMENT '序号', `order_id` UInt64 COMMENT '订单号', `gross_weight` UInt64 COMMENT '权重', `create_time` Date COMMENT '创建时间', `event` String COMMENT '事件')</span></code><code><span>ENGINE = ReplicatedMergeTree('/clickhouse/table/{shared}/ads_af_city_complaint_1d_local', '{replica}')</span></code><code><span>PARTITION BY create_time</span></code><code><span>ORDER BY id</span></code><code><span>TTL create_time + toIntervalDay(90)</span></code><code><span>SETTINGS index_granularity = 8192, use_minimalistic_part_header_in_zookeeper = 1;</span>

• ReplacingMergeTree : deduplicates rows with the same primary key during merges; optional version column controls which row is kept.

• SummingMergeTree : aggregates numeric columns of rows sharing the same sorting key, similar to a GROUP BY.

• AggregatingMergeTree : stores pre‑computed aggregate states, allowing custom aggregate functions.

• Distributed : a virtual table that forwards queries to underlying local shards; definition syntax:

<span>Distributed(cluster_name, database_name, table_name[, sharding_key])</span>

Development and Write Guidelines

SQL writing recommendations include preferring IN over JOIN for multi‑table queries, keeping the right table small in joins, avoiding SELECT *, limiting result sets, using partition keys, and minimizing string columns. Write operations should use batch inserts (5‑10 w rows) and avoid long‑living connections.

Table‑naming conventions: local tables end with _local, distributed tables with _shard; materialized view names end with _mv. Avoid Nullable types and set appropriate TTLs.

Cluster Architecture

A typical deployment uses 2‑shard × 2‑replica clusters (expandable to more shards). Each shard consists of a ClickHouse node, a Zookeeper ensemble (3 nodes), and chproxy instances for load balancing. For tables exceeding 100 billion rows, a sharding‑only architecture without replicas is suggested.

Zookeeper Role

Zookeeper coordinates distributed DDL and replicates state for ReplicatedMergeTree tables; heavy use can become a bottleneck, so the use_minimalistic_part_header_in_zookeeper=1 parameter is recommended.

chproxy

chproxy is a Go‑based HTTP proxy and load balancer for ClickHouse, providing routing, caching, rate limiting, and SSL management. Example query via chproxy:

<span>echo 'show databases;' | curl 'http://10.200.161.49:9009/?user=writeuser&password=xxxx' --data-binary @-</span>

Client Tools

Recommended clients are DBeaver (open‑source GUI) and Apache Superset (web‑based BI). Tabix is an alternative similar to Superset.

Availability and Configuration

High availability relies on replicated tables; sharding without replication reduces resilience. Critical ClickHouse server parameters include max_concurrent_queries, max_bytes_before_external_sort, background_pool_size, max_memory_usage, and max_bytes_before_external_group_by. Proper tuning of these values improves stability and performance.

Learning Resources

Official documentation: https://clickhouse.com/docs/en/engines/table-engines/integrations/; Chinese community: http://clickhouse.com.cn/.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performance tuningClickHouseOLAPCluster ArchitectureTable Engines
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.