Design of Cloud‑Native ClickHouse: Architecture, Storage‑Compute Separation, and MPP Query Layer
This article presents a cloud‑native redesign of ClickHouse: it surveys the current technical limitations in storage and computation, then proposes storage‑compute separation with DDL task management, multi‑replica and CommitLog mechanisms, and a new MPP query layer, all aimed at future data‑warehouse demands such as real‑time analytics, flexibility, high throughput, low cost, and support for semi‑structured data.
The presentation introduces the topic "Cloud‑Native ClickHouse Design" and outlines the motivation of Tencent TEG's Cloud Architecture Platform team for improving its ClickHouse offering.
Technical Status
Storage Pain Points: ClickHouse uses ZooKeeper for coordination, forming a peer‑to‑peer cluster of multiple shards and replicas. Users must know the local tables and the hash distribution to write data, and scaling out requires manually recreating tables and rebalancing data, leading to high maintenance costs and heavy metadata pressure on ZooKeeper.
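To make the write‑path pain concrete, here is a minimal Python sketch (not ClickHouse code; the shard names and sharding key are invented for illustration) of the client‑side hash routing that users must implement and keep in sync with the cluster topology:

```python
import hashlib

# Hypothetical shard hosts; the client must know this list.
SHARDS = ["ch-node-1", "ch-node-2", "ch-node-3"]

def route(user_id: str) -> str:
    """Pick the shard that owns this row; every writer must agree on this."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Adding a fourth shard changes h % len(SHARDS) for most keys, which is
# why scaling out forces manual data rebalancing.
```

Because the modulus changes when the shard list grows, nearly every key maps to a new owner after scaling, which is exactly the rebalancing burden described above.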
Computation Pain Points: ClickHouse aggregates in two stages: local pre‑aggregation on each node, then final aggregation on a single node. This limits concurrency, concentrates memory pressure on the final node, provides no shuffle stage, and is compounded by a limited SQL optimizer.
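The two‑stage flow can be sketched in Python (a toy model of the dataflow, not ClickHouse internals): stage one runs on every node in parallel, while stage two lands on a single node, which is where the memory pressure concentrates for high‑cardinality GROUP BY queries:

```python
from collections import Counter

def local_preaggregate(rows):
    """Stage 1: each node pre-aggregates its own local data."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial

def final_aggregate(partials):
    """Stage 2: a single node merges every partial result. With a
    high-cardinality GROUP BY, this one node must hold all distinct
    keys in memory -- the bottleneck described above."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

node_a = local_preaggregate([("us", 1), ("de", 2)])
node_b = local_preaggregate([("us", 3), ("fr", 1)])
result = final_aggregate([node_a, node_b])
# merged totals: us=4, de=2, fr=1
```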
Future Data‑Warehouse Requirements
Real‑time analytics (minute‑ or second‑level reporting).
Flexibility for diverse query patterns.
Unified data‑analysis system to reduce component redundancy.
High throughput for large‑scale data ingestion.
Low‑cost storage with hot/cold tiers and elastic scaling.
Support for semi‑structured and unstructured data.
Goals
Build a cloud‑native data warehouse based on ClickHouse's strong single‑node engine, featuring six characteristics: MySQL‑compatible simple SQL, multi‑load support, semi‑structured data handling, extreme performance, low cost with elastic storage, and SaaS‑style user experience.
Architecture
Inspired by Snowflake, the architecture consists of three layers: a shared storage layer on Tencent COS, a middle layer of isolated CVM instances, and compute clusters that can read/write shared storage concurrently.
Storage‑Compute Separation
1. DDL Task Management: Introduce a Master node to coordinate DDL operations, ensuring atomic table creation across all nodes with rollback on failure, and building a catalog for automatic scaling and MPP planning.
2. Multi‑Replica Mechanism: With shared storage providing durability, the role of replicas shifts from data reliability to compute availability; each replica can serve both reads and writes, with a CommitLog resolving write conflicts.
3. CommitLog Mechanism: Replace part filenames with UUIDs, record writes in a Log, and use the Log to resolve concurrent writes and enable fast recovery via periodic snapshots stored in S3.
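The Master‑coordinated DDL described in item 1 can be sketched as follows, assuming a simple create‑then‑rollback protocol; the `Node` API and catalog shape here are hypothetical, not the actual implementation:

```python
class Node:
    """Stand-in for a compute node that can create/drop local tables."""
    def __init__(self, name, fail=False):
        self.name, self.fail, self.tables = name, fail, set()

    def create_table(self, table):
        if self.fail:
            raise RuntimeError(f"{self.name}: create failed")
        self.tables.add(table)

    def drop_table(self, table):
        self.tables.discard(table)

def ddl_create(catalog, nodes, table):
    """Master-driven DDL: create on every node, roll back on any failure,
    and record the table in the catalog only after full success."""
    done = []
    try:
        for n in nodes:
            n.create_table(table)
            done.append(n)
    except RuntimeError:
        for n in done:  # undo partial creation so the DDL stays atomic
            n.drop_table(table)
        return False
    catalog[table] = [n.name for n in nodes]  # catalog feeds MPP planning
    return True
```

The catalog entry written on success is what later enables automatic scaling and MPP plan generation without each client knowing the topology.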
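The CommitLog idea in item 3 can be sketched as a toy Python model; in the real design the log and snapshots live in shared object storage, and the entry format below is invented for illustration:

```python
import json
import uuid

class CommitLog:
    """Parts are named by UUID (no filename collisions between writers),
    every write is appended to the log, and recovery is: load the latest
    snapshot, then replay the log tail written after it."""
    def __init__(self):
        self.entries = []  # stand-in for the append-only log

    def write_part(self, table):
        part_id = str(uuid.uuid4())  # UUID instead of a positional part name
        self.entries.append({"table": table, "part": part_id})
        return part_id

    def snapshot(self):
        """Periodic snapshot of log state, persisted to object storage."""
        return json.dumps({"entries": self.entries})

    @classmethod
    def recover(cls, snap, tail):
        """Fast recovery: snapshot + entries appended after the snapshot."""
        log = cls()
        log.entries = json.loads(snap)["entries"] + tail
        return log
```

Because part names are UUIDs rather than positions, two replicas writing concurrently never fight over the same filename; the log order, not the name, decides visibility.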
MPP Query Layer
Distributed Query Issues: Existing ClickHouse lacks shuffle and suffers from single‑node aggregation, leading to memory bottlenecks for large group‑by operations.
Proposed Solution: Integrate an MPP executor (based on Apache Doris) within the ClickHouse process, sharing the same Block data structure to achieve zero‑copy data transfer, vectorized operators, and high performance without cross‑process serialization.
The new MPP layer distributes query planning via a Master node, executes parallel shuffles, and avoids bottlenecks, while the Master can be scaled horizontally.
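A hash shuffle of the kind the MPP layer introduces can be sketched in Python: partial aggregates are repartitioned by group key so that each worker finalizes a disjoint key set, instead of one node merging everything (a toy model; the worker count and data are illustrative):

```python
from collections import Counter

def shuffle(partials, num_workers):
    """Repartition partial aggregates by hash of the group key.
    Every occurrence of a key lands in the same bucket, so each worker
    can produce final results for its keys independently."""
    buckets = [Counter() for _ in range(num_workers)]
    for part in partials:
        for key, value in part.items():
            buckets[hash(key) % num_workers][key] += value
    return buckets

# Two nodes' partial results, shuffled across two workers:
parts = [Counter({"us": 1, "de": 2}), Counter({"us": 3, "fr": 1})]
buckets = shuffle(parts, 2)
```

Memory for the final aggregation now scales with the number of workers rather than being capped by a single node, which is the bottleneck the shuffle removes.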
Future Planning
Enhance support for complex and semi‑structured data types.
Implement an asynchronous execution engine.
Improve query planning for wide tables.
Add ETL capabilities on top of the MPP framework.
Introduce local caching for the storage layer.
Hide data distribution details from users.
Enable multi‑cluster isolation.
In summary, the redesign aims to combine ClickHouse's high‑performance single‑node engine with cloud‑native storage‑compute separation and an integrated MPP query layer to meet modern data‑warehouse requirements.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.