
Cloud‑Native ClickHouse Architecture and Design Overview

This article presents a comprehensive design of a cloud‑native ClickHouse OLAP system, detailing its three‑layer architecture, storage‑compute separation, unified metadata management, high‑availability mechanisms, elastic scaling, cost reductions, and future enhancements for multi‑replica and MPP query support.

Tencent Database Technology

Overview

A high‑performance cloud‑native OLAP platform is built on ClickHouse, drawing on Snowflake's design, to provide a one‑stop data analysis solution for a wide range of scenarios: it is simple to use, highly available, low cost, and fully compatible with open‑source ClickHouse protocols, syntax, and storage formats.

Current ClickHouse Limitations

High usage complexity: users must understand local tables, distributed tables, and Zookeeper.

Stability issues due to heavy Zookeeper dependence, which becomes a central bottleneck.

Lack of distributed management features such as node addition/removal and replica balancing.

Insufficient MPP query layer and non‑standard syntax.

Architecture Design

The system consists of three layers:

Cluster Management Layer: Provides metadata management and a shared distributed task scheduler, built on a consistency protocol.

Compute Layer: Users create compute clusters; each cluster contains multiple nodes that execute queries, and all clusters share the management layer.

Storage Layer: Shared storage holds all data and is accessible to multiple compute clusters, offering cheap, on-demand, effectively unlimited capacity.

Data Flow

Clients connect directly to ClickHouse nodes rather than through a master node, avoiding a central bottleneck.

Data flows between ClickHouse nodes without passing through a master.
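Because the system stays protocol-compatible with open-source ClickHouse, any standard client can talk to any compute node. A minimal illustration using the open-source Python driver (the host name and table are placeholders):

from clickhouse_driver import Client

# Connect directly to any compute node; there is no master in the data path,
# so the choice of entry node does not create a central bottleneck.
node = Client(host="compute-node-1.internal", port=9000)

# The contacted node executes the query and streams results straight back.
rows = node.execute("SELECT count() FROM t1")
print(rows)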

Control Flow

Control flow is coordinated by a global Master Node instead of Zookeeper.

All distributed DDL commands are routed to the Master, which persists metadata and coordinates execution across nodes.

The Master stores global schema information for future MPP query layers.

Node join and exit operations are managed by the Master, removing the cost of deploying and operating Zookeeper.

Storage‑Compute Separation Mechanism

Strong consistency is achieved by placing data on shared storage, giving all nodes a consistent view.

Conflicts between Insert, Mutation, and Alter tasks are handled through a shared Commit Log.

Merge/Mutation can run on any replica, accelerating merge speed under high‑throughput ingestion.
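A minimal sketch of the write-side idea, using hypothetical names (SharedCommitLog is not the real class): every Insert, Merge, or Mutation commits an entry to a shared Commit Log, and the conflict check rejects an entry whose input Parts have already been consumed, which is what lets Merge/Mutation run safely on any replica.

import uuid

class SharedCommitLog:
    """Toy model of a Commit Log kept on shared storage (hypothetical name)."""
    def __init__(self):
        self.entries = []          # ordered, durable log of committed operations
        self.live_parts = set()    # Parts currently visible to all replicas

    def commit(self, op, added=(), removed=()):
        # Conflict handling: an operation that consumes a Part another replica
        # has already merged away is rejected and must be retried.
        if any(p not in self.live_parts for p in removed):
            return False
        self.live_parts -= set(removed)
        self.live_parts |= set(added)
        self.entries.append((op, tuple(added), tuple(removed)))
        return True

log = SharedCommitLog()

# Inserts from any replica create new Parts, identified by UUIDs.
p1, p2 = uuid.uuid4().hex, uuid.uuid4().hex
log.commit("insert", added=[p1])
log.commit("insert", added=[p2])

# Two replicas both try to merge (p1, p2); only the first commit wins,
# the loser simply retries against the new Part set.
print(log.commit("merge", added=[uuid.uuid4().hex], removed=[p1, p2]))  # True
print(log.commit("merge", added=[uuid.uuid4().hex], removed=[p1, p2]))  # False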

Unified Distributed Task Scheduling

When a user issues a CREATE TABLE on any ClickHouse node, the node forwards the DDL job to the Master, which parses it, generates DDL tasks for all nodes, and persists metadata before confirming success to the client.
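A minimal sketch of that flow with hypothetical Master and Node classes (not the internal API): the receiving node forwards the statement as a job, the Master persists the metadata, fans out one task per node, and only then reports success.

class Node:
    """A compute node that executes the per-node task generated by the Master."""
    def __init__(self, name):
        self.name = name
        self.tables = {}

    def apply_ddl(self, table, schema):
        self.tables[table] = schema     # e.g. register the table locally
        return True

class Master:
    """Hypothetical Master that schedules distributed DDL jobs."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.metadata = {}              # persisted global schema store

    def run_create_table(self, table, schema):
        # 1. Parse the DDL and generate one task per node in the cluster.
        tasks = [(n, table, schema) for n in self.nodes]
        # 2. Persist global metadata before acknowledging the client.
        self.metadata[table] = schema
        # 3. Execute every task; the job succeeds only if all tasks succeed.
        return all(n.apply_ddl(t, s) for n, t, s in tasks)

master = Master([Node(f"ch-{i}") for i in range(3)])
# A CREATE TABLE received by any node is forwarded here as a job.
print(master.run_create_table("t1", "partition_col_1 String, tc1 Int32, tc2 Int32"))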

High Availability and Multi‑Replica

The Master Node runs with multiple replicas under a consistency protocol; even if the Master becomes unavailable, reads and writes continue and only new DDL is blocked.

Concurrency control distinguishes DDL types: ALTERs execute sequentially, while Mutations can run concurrently.

A rollback mechanism aborts the entire job if any task fails, keeping node states consistent (see the sketch at the end of this section).

Garbage‑collection synchronizes node state with the Master to clean up obsolete tables.
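A minimal sketch of the rollback idea, assuming each per-node task comes with an undo action (a simplification of the real mechanism): if any task in a job fails, the tasks that already ran are undone so every node ends up in the same state.

def run_job(tasks):
    """tasks is a list of (apply, undo) callables, one pair per node."""
    applied = []
    for apply_task, undo_task in tasks:
        try:
            apply_task()
            applied.append(undo_task)
        except Exception:
            # Abort the whole job: undo what already succeeded so no node
            # is left with a half-applied DDL.
            for undo in reversed(applied):
                undo()
            return False
    return True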

Shared‑Storage Multi‑Replica Architecture

Data is stored on shared storage; a batch write creates a Part.

A Commit Log records Part changes (Add, Remove, etc.).

Each replica reads the Commit Log to update its in‑memory Part list without copying data.

Unique UUIDs replace part names to avoid naming conflicts.

Commit Log provides conflict handling and replay, with periodic snapshots for recovery.
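A minimal sketch of the replica side, continuing the toy model above (the record format and names are illustrative): a replica rebuilds its in-memory Part list by replaying Commit Log records, optionally starting from a snapshot, without copying any data files.

class Replica:
    def __init__(self):
        self.parts = set()     # in-memory Part list, keyed by UUID
        self.applied = 0       # position in the Commit Log already applied

    def load_snapshot(self, snapshot_parts, snapshot_pos):
        # Periodic snapshots let a replica recover without replaying
        # the whole log from the beginning.
        self.parts = set(snapshot_parts)
        self.applied = snapshot_pos

    def catch_up(self, log_entries):
        # Replay only the records committed since the last applied position;
        # only metadata moves, the Part files stay on shared storage.
        for op, added, removed in log_entries[self.applied:]:
            self.parts -= set(removed)
            self.parts |= set(added)
        self.applied = len(log_entries)

# Illustrative log records: (operation, Parts added, Parts removed).
log_entries = [
    ("insert", ("part-uuid-1",), ()),
    ("insert", ("part-uuid-2",), ()),
    ("merge",  ("part-uuid-3",), ("part-uuid-1", "part-uuid-2")),
]

replica = Replica()
replica.catch_up(log_entries)
print(replica.parts)   # {'part-uuid-3'}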

Cost Reduction

Physical storage is shared among replicas rather than duplicated: with two replicas, one physical copy replaces two, cutting storage cost by at least 50%.

Eliminates the need for dedicated Zookeeper clusters, saving at least three nodes.

The multi-read/multi-write model means replicas no longer sit idle as standbys, improving resource utilization.

Stateless Data Service Layer

All metadata is detached from local storage: the Master holds schema information, nodes cache it and re‑sync from the Master if the cache is lost, and Part management lives entirely on shared storage. Nodes are therefore stateless, which improves overall service availability.

Elastic Scaling

New nodes fetch schema from the Master and Part metadata from shared storage, joining the cluster within seconds.

Removing nodes requires no data migration, allowing instant decommissioning.
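A minimal sketch of why this is fast, under the assumed model of schemas held by the Master and Part metadata on shared storage (names are illustrative): joining is a metadata pull, and leaving requires no data movement.

class JoiningNode:
    def __init__(self, name):
        self.name = name
        self.schemas = {}
        self.parts = {}

    def join(self, master_schemas, shared_part_metadata):
        # 1. Fetch table schemas from the Master and cache them locally.
        self.schemas = dict(master_schemas)
        # 2. Fetch Part metadata from shared storage; the Part files themselves
        #    never move, so the node can serve queries within seconds.
        self.parts = {t: list(ps) for t, ps in shared_part_metadata.items()}
        return True

master_schemas = {"t1": "partition_col_1 String, tc1 Int32, tc2 Int32"}
shared_parts = {"t1": ["part-uuid-3"]}

node = JoiningNode("ch-4")
node.join(master_schemas, shared_parts)
print(node.name, "now serves", list(node.schemas))

# Decommissioning is the inverse: unregister the node; since it held no
# exclusive data, nothing has to be migrated.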

Compatibility with Open‑Source Ecosystem

The design minimizes intrusion into ClickHouse code, preserving protocol, syntax, and storage format compatibility, enabling seamless upgrades with upstream community releases.

Future Work

Develop an MPP query engine with distributed join, aggregation, and support for multiple data sources (HDFS, S3, MySQL, Hive, Elasticsearch).

Eliminate sharding to provide a fully distributed, user‑transparent system, simplifying scaling and fault recovery.

Example Commands

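-- Add a backend node at 'ip:port' to shard 2 of an existing compute cluster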
ALTER CLUSTER cluster_name ADD BACKEND 'ip:port' TO SHARD 2;
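
-- List the clusters and nodes known to the cluster management layer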
SELECT * FROM system.clusters;
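
-- Create a MergeTree table; the DDL is routed through the Master and applied on every node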
CREATE TABLE t1 (
    partition_col_1 String,
    tc1 int,
    tc2 int
) ENGINE = MergeTree()
PARTITION BY partition_col_1
ORDER BY tc1;
Tags: distributed systems, cloud native, ClickHouse, OLAP, storage architecture, elastic scaling
Written by Tencent Database Technology

Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.
