How JD Achieves ClickHouse High‑Availability for Billion‑Scale OLAP
JD's OLAP platform runs on ClickHouse and Doris across 3,000 servers, handling billions of daily queries and petabytes of data, and this article details the selection criteria, cluster deployment models, high‑availability architecture, operational challenges, and future roadmap.
Application Scenarios and Technical Challenges
JD.com processes billions of analytical queries and petabytes of data daily. Core workloads include:
Transaction data : Requires multi‑table joins, precise calculations, and frequent dimension changes (e.g., position, SKU, order). Schema must support flexible back‑filling of historical data without full table rewrites.
Traffic data : Extremely high volume with heavy deduplication and grouping, leading to high CPU and memory consumption under concurrency.
User analysis : Metrics such as retention and conversion funnels need on‑the‑fly calculations across variable dimensions, preventing pre‑aggregation.
Large‑scale dashboards : Real‑time, sub‑second updates for thousands of metrics during peak events, demanding high availability and low latency.
OLAP Engine Selection Criteria
Four dimensions guided the choice:
Scale : Support hundreds of nodes and >10^11 daily analytical operations.
Adaptability : Cover most analytical use‑cases without adding extra components.
Flexibility : Allow dynamic column addition/removal and arbitrary dimension combinations.
Timeliness : End‑to‑end latency at minute or sub‑second level.
ClickHouse was selected as the primary engine for its performance, scalability, and stability; Doris was added for operational simplicity.
Cluster Deployment Models
Large shared cluster : Hundreds to thousands of nodes serving multiple product lines; high resource efficiency but low isolation.
Independent per‑business cluster : Strong isolation at higher operational cost.
Small multi‑tenant cluster : Medium‑sized shared clusters with quota‑based isolation, balancing cost and reliability.
High availability is achieved through physical isolation, multi‑tenant quotas, and workload segregation (analytical vs. reporting, high‑concurrency vs. low‑latency, business‑critical vs. background, real‑time vs. offline).
Resource Management and Configuration
Memory usage capped at 80% of physical RAM.
Parallel query count = 2‑5 × CPU cores.
Per‑query memory ≤ 20% of system memory; thread count ≤ 20% of CPU cores.
Query timeout: 10‑30 s (writes 60‑180 s).
Account‑level quotas for query, import, and download roles.
SQL tracing via a /*TraceID*/ comment for fast debugging.
Hardware is chosen per workload: CPU‑heavy nodes for aggregation, large HDD for long‑term storage, SSD/NVMe for low‑latency real‑time writes.
High‑Availability Architecture
Three‑layer design:
DNS resolution provides a stable endpoint.
CHProxy performs request routing, load balancing, and health‑check based failover.
ClickHouse nodes store data; ZooKeeper holds distributed DDL metadata and replica synchronization.
Management Panel centralizes cluster lifecycle operations (creation, scaling, node up/down, alerts). CHCopier handles inter‑cluster data migration.
Failure Handling
Multi‑replica storage guarantees data safety.
Read requests are automatically redirected to healthy replicas by CHProxy.
Writes succeed when any replica acknowledges; remaining replicas sync asynchronously.
DDL failures are mitigated by manual per‑node execution or node‑down‑then‑up strategies.
Node Lifecycle Operations
Decommission, addition, and replacement follow these steps:
Update cluster configuration (e.g., via Salt or Ansible).
Distribute the updated config to all nodes.
Synchronize ZooKeeper metadata (remove stale replica entries or add new ones).
Dual‑Active (Hot‑Standby) Cluster Design
A standby cluster is kept in sync via double‑write pipelines: Spark writes offline data and Flink writes real‑time streams to both primary and standby clusters, achieving eventual consistency and minute‑level failover.
Parameter Tuning
Key tuning areas:
Linux kernel parameters (e.g., vm.max_map_count, file descriptor limits).
ZooKeeper settings to reduce GC pauses and keep Znode count < 20 million.
JVM options for ZooKeeper and management services.
ClickHouse settings such as max_memory_usage, max_threads, and max_partitions_per_insert_block to prevent OOM and improve concurrency.
Columnar storage adjustments (e.g., increasing max_bytes_before_external_group_by) and query rewrite techniques (sub‑queries, local joins, materialized views).
Operational Issues and Mitigations
OOM and memory pressure : Enforce per‑query memory limits; monitor max_memory_usage.
Large part files : Keep part size 1‑10 GB; avoid excessive inode consumption.
Import frequency : Batch inserts to reduce part proliferation.
ZooKeeper‑metadata inconsistencies : Periodic checksum scripts; limit Znode growth.
Query skew and join performance : Rewrite joins to sub‑queries, use distributed_group_by_no_merge, and create materialized views for hot dimensions.
Concurrency limits : Apply quota rules (e.g., 5 queries per node, 20 queries per node per 10 s) and isolate high‑concurrency workloads.
Future Roadmap
Short‑term : Replace ZooKeeper with a Raft‑based consensus service to increase throughput and integrate tightly with ClickHouse metadata.
Mid‑term : Productize operational scripts into a self‑service OLAP control plane for resource provisioning, monitoring, and automated fault diagnosis.
Long‑term : Move toward cloud‑native OLAP using container orchestration, external storage (HDFS/object stores), and elastic scaling of compute nodes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
