JD OLAP High‑Availability Practices: ClickHouse and Doris Deployment, Architecture, and Future Plans
This article details JD's OLAP implementation using ClickHouse as the primary engine and Doris as a secondary engine, covering business scenarios, selection criteria, multi‑tenant deployment, high‑availability architecture, encountered challenges, and future roadmap for cloud‑native, scalable analytics.
OLAP (On‑Line Analytical Processing) supports enterprise decision‑making and is the foundation for many reporting, BI, and analytics systems. JD's OLAP journey has evolved through Druid, Kylin, Doris, and now ClickHouse, serving numerous sub‑groups and scenarios with zero incidents during major sales events.
Business Scenarios and Engine Selection – JD handles massive transaction data, high‑volume traffic data, and real‑time dashboards. Requirements include massive data volume, timeliness, flexibility, and adaptability. ClickHouse and Doris were chosen as a dual‑engine strategy because they together satisfy these criteria, with ClickHouse offering strong performance and scalability, and Doris providing ease of use and lower operational cost.
Deployment and Operations Scheme – JD adopts a "small‑cluster, multi‑tenant" model, deploying dozens of clusters each serving multiple business lines. Deployment is customized per scenario (large storage, high concurrency, heavy queries). Isolation strategies include separating offline and real‑time workloads, separating reporting from analytical workloads, and separating high‑concurrency tasks.
Multi‑tenant support is achieved by allocating separate accounts with quota controls for query volume, concurrency, memory, and timeout limits. Queries are throttled based on per‑account limits, and resources are isolated via containerized, Kubernetes‑based deployments. An internal OLAP control plane allows users to request resources, configure monitoring, and manage clusters.
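The per-account quota and limit scheme described above maps naturally onto ClickHouse's built-in users, settings profiles, and quotas. The excerpt below is an illustrative sketch of a `users.xml` fragment, not JD's actual configuration; the tenant name and all numeric limits are assumptions.

```xml
<!-- Hypothetical users.xml excerpt for one tenant account.
     Names and limits are illustrative, not JD's real values. -->
<clickhouse>
    <quotas>
        <tenant_a_quota>
            <interval>
                <duration>3600</duration>   <!-- 1-hour window -->
                <queries>10000</queries>    <!-- query-volume cap -->
                <errors>100</errors>
            </interval>
        </tenant_a_quota>
    </quotas>
    <profiles>
        <tenant_a_profile>
            <max_memory_usage>10000000000</max_memory_usage>          <!-- per-query memory cap -->
            <max_concurrent_queries_for_user>20</max_concurrent_queries_for_user>
            <max_execution_time>60</max_execution_time>               <!-- timeout, seconds -->
        </tenant_a_profile>
    </profiles>
    <users>
        <tenant_a>
            <profile>tenant_a_profile</profile>
            <quota>tenant_a_quota</quota>
        </tenant_a>
    </users>
</clickhouse>
```

Each tenant account gets its own profile and quota, so one noisy tenant exhausts its own budget rather than the cluster's.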
Resource Planning – For compute‑intensive workloads, CPUs with many cores (e.g., 32‑core) and ample memory (64‑128 GB) are recommended; storage choices depend on workload (HDD for offline, SSD/NVMe for real‑time). Single‑node performance is prioritized when possible, and shard/replica counts are calculated based on data size and QPS requirements.
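The shard/replica sizing rule above can be sketched as a back-of-envelope calculation: shard count follows data volume, replica count follows peak QPS. All thresholds in this sketch (TB per shard, QPS per replica) are illustrative assumptions, not JD's actual capacity numbers.

```python
import math

def plan_cluster(total_data_tb: float, peak_qps: float,
                 tb_per_shard: float = 2.0,
                 qps_per_replica: float = 50.0,
                 min_replicas: int = 2) -> tuple[int, int]:
    """Return (shards, replicas_per_shard) for a rough capacity plan.

    Assumed rule of thumb: one shard per ~2 TB of data; since every
    query fans out to all shards, throughput scales with replicas,
    so replicas are sized from peak QPS with a floor of 2 for HA.
    """
    shards = max(1, math.ceil(total_data_tb / tb_per_shard))
    replicas = max(min_replicas, math.ceil(peak_qps / qps_per_replica))
    return shards, replicas

# Example: 10 TB of data at a 120 QPS peak.
print(plan_cluster(10, 120))
```

Real plans would also factor in headroom for merges, reloads, and failover traffic; the point is that shards and replicas are sized from independent inputs.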
High‑Availability Architecture – JD uses a three‑layer setup: DNS → CHProxy → ClickHouse nodes. CHProxy balances queries across healthy nodes and routes around failed nodes. For write operations, CHProxy directs writes to appropriate shards and replicas, skipping faulty replicas. Doris offers built‑in metadata consistency and replica health checks, providing smoother failover.
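The CHProxy layer in this three-tier setup is configured declaratively: tenants are mapped to clusters, and the proxy health-checks nodes and routes around dead ones. Below is an illustrative CHProxy config sketch; the user name, cluster name, node addresses, and limits are all hypothetical.

```yaml
# Illustrative CHProxy config sketch (names and addresses are made up).
server:
  http:
    listen_addr: ":9090"

users:
  - name: "report_tenant"        # external account seen by clients
    to_cluster: "ck_reports"     # which ClickHouse cluster it hits
    to_user: "default"           # ClickHouse user used for forwarding
    max_concurrent_queries: 10
    max_execution_time: 2m

clusters:
  - name: "ck_reports"
    nodes: ["ck-01:8123", "ck-02:8123", "ck-03:8123", "ck-04:8123"]
```

Because limits live at the proxy as well as in ClickHouse itself, throttling can happen before a query ever reaches a data node.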
Additional HA measures include dual‑stream primary‑backup data centers, automated fault detection, regular smoke tests, and a metadata consistency checking tool. In case of node or data‑center failures, traffic is switched to standby clusters with minimal impact.
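A metadata consistency checker of the kind mentioned above boils down to comparing what each replica believes it holds. This minimal sketch compares the sets of active data-part names reported by two replicas of the same table (in ClickHouse these could be read from `system.parts`); the part names and the comparison helper are illustrative, not JD's tool.

```python
def diff_parts(parts_a: set[str], parts_b: set[str]) -> dict[str, set[str]]:
    """Compare active part sets from two replicas of one table.

    Returns parts present on one replica but missing on the other;
    both result sets empty means the replicas agree.
    """
    return {
        "missing_on_b": parts_a - parts_b,
        "missing_on_a": parts_b - parts_a,
    }

# Hypothetical part names in MergeTree style: replica B is missing one part.
replica_a = {"202406_1_10_2", "202406_11_20_1"}
replica_b = {"202406_1_10_2"}
print(diff_parts(replica_a, replica_b))
```

In practice such a check would run periodically per table and feed the automated fault-detection pipeline, flagging replicas that need repair before traffic is switched to them.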
Issues Encountered and Future Plans – Challenges include ClickHouse's concurrency limits, join optimization, Zookeeper bottlenecks, shard rebalancing, and lack of exactly‑once semantics for imports. Future work focuses on a Raft‑based replacement for Zookeeper to improve metadata management, productizing the control plane for self‑service OLAP, and advancing cloud‑native OLAP with external storage separation and elastic scaling.
JD's OLAP team operates thousands of servers across transaction, traffic, and algorithmic workloads, actively contributing to the community while continuously improving reliability, usability, and performance.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.