How REDck Transformed ClickHouse into a Scalable Cloud‑Native Real‑Time Data Warehouse
REDck, a cloud‑native real‑time data warehouse built on open‑source ClickHouse, overcomes the original MPP architecture’s scaling and maintenance limits by separating compute and storage, introducing unified metadata, multi‑level caching, bucket‑based sharding, and distributed transaction support, delivering petabyte‑scale, 99.9% availability and ten‑fold cost and performance gains for Xiaohongshu’s diverse workloads.
Background
ClickHouse is a high‑performance OLAP engine used extensively at Xiaohongshu for advertising, community, live streaming, and e‑commerce workloads. The native shared‑nothing MPP design couples compute and storage on each node, which leads to high operational cost, limited elasticity, and fragile failure recovery.
Key Limitations of the Original ClickHouse Deployment
Elastic scaling difficulty : Adding nodes requires manual data rebalancing; migration windows are long and disruptive.
Low resource utilization : Multi‑replica safety inflates CPU and storage consumption; storage demand far exceeds compute needs.
Stability issues : Weak resource isolation causes query failures under load; Zookeeper becomes a bottleneck as the cluster grows.
Data‑sync inconsistencies : Absence of a distributed transaction system leads to occasional data divergence.
Maintainability : Managing distributed tables and replica balancing becomes increasingly complex with node count.
Design Goal
Build a cloud‑native, compute‑storage‑separated real‑time data warehouse (REDck) that retains ClickHouse’s query performance while providing elastic scaling, cost efficiency, and high availability for petabyte‑scale data.
Architecture Overview
REDck consists of three logical layers:
Unified Metadata Service : A stateless central service (MySQL for internal metadata, Hive/Iceberg for external) replaces per‑node local metadata, eliminates Zookeeper dependence, and provides consistent schema access.
Compute Layer : Distributed compute groups (Server + Worker nodes) run in Kubernetes/YARN. A Master role elects a global coordinator; Server nodes schedule tasks to Workers, enabling horizontal and vertical scaling.
Storage Layer : Object storage (e.g., Alibaba Cloud OSS, AWS S3) holds data in columnar formats such as Parquet, ORC, Avro, offering near‑infinite capacity and low cost.
Object‑Storage Access Optimizations
Cache acceleration : Multi‑threaded part download, parallel range fetching, and aggressive local caching improve query speed up to 100× for cached data and 10× for uncached data.
Query‑plan reordering : Adaptive multi‑threaded execution groups HTTP requests to the same part, reducing round‑trip latency to negligible levels.
Robust access module : Added data integrity checks, timeout detection, and exponential‑backoff retry logic to increase stability.
Multi‑Level Cache Architecture
REDck implements a three‑tier cache hierarchy:
Memory cache – fastest, smallest.
Local‑disk cache – larger, still high‑performance.
Distributed cache – largest, lowest speed.
Two caching strategies are provided:
Passive cache : On‑demand caching for unpredictable queries.
Active cache : Pre‑warming of hot data based on query history; data is stored on local disks for near‑disk read performance.
Cache eviction uses LRU and Clock‑Sweep algorithms; an in‑memory catalog enables fast lookup of cached parts.
Distributed Task Scheduling
A global Master elects a Server to act as the task coordinator. Tasks (Compaction, Mutation, Insert, Cache) are scheduled per bucket, guaranteeing ordered execution and avoiding conflicts during scaling. Transactional guards and anomaly‑detection logic prevent duplicate or conflicting tasks.
Bucket‑Based Data Sharding
Data is partitioned by hashing a chosen key (e.g., user‑ID) into a fixed number of buckets. Benefits include:
Fast point‑lookup filtering.
Reduced shuffle during aggregation and joins.
Bucket‑level scheduling simplifies elastic scaling.
Distributed Transactions
REDck adds a two‑phase commit (2PC) protocol built on the unified metadata service. The 2PC guarantees exactly‑once semantics for data ingestion, eliminating duplicate writes from Hive/Spark pipelines. Integration with Flink checkpoints extends exactly‑once guarantees across the full streaming pipeline.
Offline Sync Improvements
The offline ingestion path was switched from Flink micro‑batches to Spark batch jobs. This simplification reduces write amplification, supports INSERT OVERWRITE semantics, and provides a cleaner query experience.
Production Impact
After two years of deployment, REDck serves more than ten business lines (advertising, e‑commerce, live streaming, etc.) with the following outcomes:
Cost reduction : Storage grew from terabytes to tens of petabytes while per‑TB storage cost dropped ~10×.
Performance boost : CPU efficiency increased ~10×; query latency matches native ClickHouse despite using object storage.
High availability : 99.9% uptime; node failures no longer affect the whole cluster.
Scalability : Cluster expansion time shrank from weeks to minutes; virtual warehouses now exceed 100 nodes with seamless load balancing.
Key Diagrams
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
