Evolution and Optimization of JD.com Order Center Elasticsearch Cluster Architecture
This article traces how JD.com's order center evolved its Elasticsearch cluster through multiple architectural stages (initial deployment, isolation, replica tuning, master‑standby adjustment, and real‑time dual‑cluster backup), addressing data synchronization, scaling, and performance pitfalls along the way to achieve high availability and stable query performance.
Background
JD.com’s order center handles massive read‑heavy traffic: order data is stored in MySQL, but queries are offloaded to Elasticsearch (ES) for its near‑real‑time search capabilities. The ES cluster now holds over one billion documents and serves roughly 500 million queries per day.
ES Cluster Evolution
1. Initial Stage
The cluster started with default settings on the elastic cloud, where ES nodes shared hosts with other workloads; this mixed deployment exposed it to resource contention and single points of failure.
2. Isolation Stage
To avoid resource contention, high‑load nodes were moved off the elastic cloud to dedicated physical machines, improving stability.
3. Replica Tuning Stage
Each node was then given a dedicated physical machine to fully utilize hardware resources, and the replica count was raised from 1 primary + 1 replica to 1 primary + 2 replicas per shard, boosting query throughput.
Shard count was tuned to balance single‑ID query throughput against aggregation pagination performance, resulting in an optimal shard configuration after extensive testing.
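The reason shard count matters for single‑ID queries is that ES routes each document to exactly one shard deterministically. A simplified sketch of the routing rule (ES actually hashes the _routing value with murmur3; md5 here is a dependency‑free stand‑in):

```python
# Simplified sketch of Elasticsearch's shard routing rule. ES really uses
# murmur3 on the _routing value (by default, the document ID); hashlib.md5
# is only a stand-in to keep this example self-contained.
import hashlib

def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (e.g. an order ID) to a primary shard number."""
    h = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    return h % num_primary_shards

# A single-ID lookup touches exactly one shard, while an aggregation or a
# paginated search fans out to every shard -- which is why shard count is a
# trade-off between the two workloads.
```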
4. Master‑Slave Adjustment Stage
A standby cluster was introduced for failover; data is written synchronously to the primary and asynchronously to the standby, with ZooKeeper controlling traffic switching.
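The article does not show the switching code, but the idea can be sketched with a shared flag that clients consult before each request (in production this would be a ZooKeeper node that clients watch; the class and names below are illustrative, not JD's actual implementation):

```python
# Minimal sketch of ZooKeeper-style traffic switching: a shared flag tells
# readers/writers which cluster is currently active. In production the flag
# would live in a ZooKeeper znode watched by every client.

class ClusterSwitch:
    def __init__(self):
        self.active = "primary"  # stand-in for the znode's value

    def failover(self):
        """Flip traffic to the other cluster (normally triggered by ops)."""
        self.active = "standby" if self.active == "primary" else "primary"

    def target(self) -> str:
        """Which cluster should receive traffic right now."""
        return self.active

switch = ClusterSwitch()
switch.failover()
print(switch.target())  # standby
```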
5. Real‑Time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 to 6.x by recreating its indices, with the standby temporarily serving as primary during the upgrade. The standby now stores only hot, recent data (≈10% of the primary's size) and handles most query traffic, while the primary acts as a cold‑data cluster for full‑history order searches.
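The hot/cold split implies a simple routing decision at query time. A minimal sketch, assuming a cutoff window on order age (the window length and cluster names are illustrative; the article only says the standby holds "recent" data):

```python
# Illustrative hot/cold query routing for the dual-cluster stage: recent
# orders go to the small hot standby, full-history searches go to the cold
# primary. The 90-day window is an assumption, not a figure from the article.
from datetime import date, timedelta

HOT_WINDOW_DAYS = 90  # illustrative cutoff for "recent" orders

def choose_cluster(order_date: date, today: date) -> str:
    """Pick the cluster that should serve a query for this order."""
    if today - order_date <= timedelta(days=HOT_WINDOW_DAYS):
        return "hot-standby"   # small cluster, most traffic
    return "cold-primary"      # full data set, full-history searches
```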
ES Order Data Synchronization
Two approaches were considered: (1) listening to the MySQL binlog and replaying changes into ES, which adds coupling and maintenance overhead; (2) writing to ES directly via its API from the business layer, which was chosen for its simplicity and low latency.
To ensure eventual consistency, failed writes trigger a compensating task that re‑updates ES based on the database record.
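The write‑plus‑compensation flow can be sketched as follows. All function names are illustrative stand‑ins: the ES write and the MySQL read are simulated so the example is self‑contained:

```python
# Sketch of the write path with a compensating task: write ES directly from
# the business layer; on failure, enqueue the order ID so a retry job can
# re-read the authoritative MySQL row and re-index it.

retry_queue = []

def write_to_es(order):            # stand-in for the real ES index API call
    if order.get("fail"):          # simulate a transient ES write failure
        raise ConnectionError("es write failed")

def load_from_mysql(order_id):     # stand-in for the authoritative DB read
    return {"id": order_id}

def index_order(order):
    try:
        write_to_es(order)
    except ConnectionError:
        retry_queue.append(order["id"])  # compensation: retry later

def run_compensation():
    """Re-read each failed order from MySQL and re-index it into ES."""
    while retry_queue:
        order_id = retry_queue.pop(0)
        write_to_es(load_from_mysql(order_id))

index_order({"id": 1, "fail": True})  # write fails, ID queued
run_compensation()                    # queue drains; ES catches up to MySQL
```

Because the compensation task re‑reads the database rather than replaying the failed payload, a stale retry can never overwrite a newer order state, which is what makes the scheme eventually consistent.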
Encountered Pitfalls
1. High Real‑Time Query Requirements
ES is near‑real‑time: with the default refresh interval of one second, newly written documents can be invisible to search for up to a second. Queries that cannot tolerate this staleness are therefore routed directly to MySQL.
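That routing decision reduces to comparing a query's staleness budget against the refresh interval. A minimal sketch (the function and threshold handling are illustrative):

```python
# Illustrative datastore selection: queries whose staleness budget is
# tighter than the ES refresh interval must read MySQL directly; all
# other queries can use ES.

ES_REFRESH_INTERVAL_S = 1.0  # Elasticsearch's default refresh interval

def pick_store(max_staleness_s: float) -> str:
    """Return which store can satisfy the query's freshness requirement."""
    return "mysql" if max_staleness_s < ES_REFRESH_INTERVAL_S else "es"
```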
2. Deep Pagination
Deep pagination (large from values) forces each shard to build a priority queue of from + size entries and ship it to the coordinating node, driving up CPU, memory, and network usage; it should therefore be avoided, for example by capping page depth or using cursor‑style pagination.
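The cost grows linearly in both shard count and page depth, which a back‑of‑the‑envelope calculation makes concrete (the shard count and page depth below are example figures):

```python
# Why deep pagination hurts: each shard must return its top (from + size)
# hits to the coordinating node, which sorts them all and discards nearly
# everything to produce one page.

def docs_examined(num_shards: int, page_from: int, size: int) -> int:
    """Total hits shipped to and sorted by the coordinating node."""
    return num_shards * (page_from + size)

# Page 1 vs. a deep page on an example 10-shard index:
print(docs_examined(10, 0, 10))       # 100
print(docs_examined(10, 100000, 10))  # 1000100 docs moved for just 10 hits
```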
3. FieldData vs. Doc Values
Sorting on ES 1.x relied on fielddata, which loads field values into the JVM heap and risks OutOfMemory errors; switching to doc values (the default since ES 2.x), which keep the column‑oriented sort data on disk, improved stability.
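In mapping terms, this means sorting and aggregating on keyword, numeric, and date fields, which carry doc values by default, rather than on analyzed text fields. An illustrative mapping (field names are examples, not JD's actual schema):

```python
# Illustrative index mapping: keyword/numeric/date fields use on-disk doc
# values by default since ES 2.x, so sorting them no longer fills the JVM
# heap with fielddata. Analyzed "text" fields have no doc values.
order_mapping = {
    "properties": {
        "order_id":    {"type": "keyword"},  # doc_values enabled by default
        "create_time": {"type": "date"},     # safe to sort via doc values
        "remark":      {"type": "text"},     # no doc values; don't sort on it
    }
}
```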
Conclusion
The rapid iteration of the architecture was driven by business growth. The result is a highly available, scalable ES setup with dual clusters, tuned shard and replica settings, and robust data‑synchronization mechanisms that keep JD.com's order services performant and stable.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies