Evolution and Optimization of JD.com Order Center Elasticsearch Cluster Architecture
This article traces how JD.com's order center evolved its Elasticsearch cluster through multiple architectural stages (initial deployment, isolation, replica tuning, master‑standby adjustment, and real‑time dual‑cluster backup), addressing data synchronization, scaling, and performance pitfalls along the way to achieve high availability and stable query performance.
Background
JD.com’s order center handles massive read‑heavy traffic: order data is stored in MySQL, but queries are offloaded to Elasticsearch (ES) for its near‑real‑time search capabilities. The ES cluster now holds over one billion documents and serves roughly 500 million queries per day.
ES Cluster Evolution
1. Initial Stage
The cluster started with default settings on the elastic cloud, where ES nodes shared hosts with other workloads; this mixed deployment exposed it to resource contention and single points of failure.
2. Isolation Stage
To avoid resource contention, high‑load nodes were moved off the elastic cloud to dedicated physical machines, improving stability.
3. Replica Tuning Stage
Each node was then given a dedicated physical machine to fully utilize hardware resources, and the replica count was raised from 1 primary + 1 replica to 1 primary + 2 replicas per shard, boosting query throughput.
Shard count was tuned to balance single‑ID query throughput against aggregation pagination performance, resulting in an optimal shard configuration after extensive testing.
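The reason shard count matters for single‑ID queries is that ES routes each document to exactly one shard deterministically. A simplified sketch of the routing rule (ES actually hashes the _routing value with murmur3; md5 here is a dependency‑free stand‑in):

```python
# Simplified sketch of Elasticsearch's shard routing rule. ES really uses
# murmur3 on the _routing value (by default, the document ID); hashlib.md5
# is only a stand-in to keep this example self-contained.
import hashlib

def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (e.g. an order ID) to a primary shard number."""
    h = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    return h % num_primary_shards

# A single-ID lookup touches exactly one shard, while an aggregation or a
# paginated search fans out to every shard -- which is why shard count is a
# trade-off between the two workloads.
```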
4. Master‑Slave Adjustment Stage
A standby cluster was introduced for failover; data is written synchronously to the primary and asynchronously to the standby, with ZooKeeper controlling traffic switching.
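The article does not show the switching code, but the idea can be sketched with a shared flag that clients consult before each request (in production this would be a ZooKeeper node that clients watch; the class and names below are illustrative, not JD's actual implementation):

```python
# Minimal sketch of ZooKeeper-style traffic switching: a shared flag tells
# readers/writers which cluster is currently active. In production the flag
# would live in a ZooKeeper znode watched by every client.

class ClusterSwitch:
    def __init__(self):
        self.active = "primary"  # stand-in for the znode's value

    def failover(self):
        """Flip traffic to the other cluster (normally triggered by ops)."""
        self.active = "standby" if self.active == "primary" else "primary"

    def target(self) -> str:
        """Which cluster should receive traffic right now."""
        return self.active

switch = ClusterSwitch()
switch.failover()
print(switch.target())  # standby
```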
5. Real‑Time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 to 6.x by recreating its indices, with the standby temporarily serving as primary during the upgrade. The standby now stores only hot, recent data (≈10% of the primary's size) and handles most query traffic, while the primary acts as a cold‑data cluster for full‑history order searches.
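The hot/cold split implies a simple routing decision at query time. A minimal sketch, assuming a cutoff window on order age (the window length and cluster names are illustrative; the article only says the standby holds "recent" data):

```python
# Illustrative hot/cold query routing for the dual-cluster stage: recent
# orders go to the small hot standby, full-history searches go to the cold
# primary. The 90-day window is an assumption, not a figure from the article.
from datetime import date, timedelta

HOT_WINDOW_DAYS = 90  # illustrative cutoff for "recent" orders

def choose_cluster(order_date: date, today: date) -> str:
    """Pick the cluster that should serve a query for this order."""
    if today - order_date <= timedelta(days=HOT_WINDOW_DAYS):
        return "hot-standby"   # small cluster, most traffic
    return "cold-primary"      # full data set, full-history searches
```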
ES Order Data Synchronization
Two approaches were considered: (1) listening to the MySQL binlog and replaying changes into ES, which adds coupling and maintenance overhead; (2) writing to ES directly via its API from the business layer, which was chosen for its simplicity and low latency.
To ensure eventual consistency, failed writes trigger a compensating task that re‑updates ES based on the database record.
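The write‑plus‑compensation flow can be sketched as follows. All function names are illustrative stand‑ins: the ES write and the MySQL read are simulated so the example is self‑contained:

```python
# Sketch of the write path with a compensating task: write ES directly from
# the business layer; on failure, enqueue the order ID so a retry job can
# re-read the authoritative MySQL row and re-index it.

retry_queue = []

def write_to_es(order):            # stand-in for the real ES index API call
    if order.get("fail"):          # simulate a transient ES write failure
        raise ConnectionError("es write failed")

def load_from_mysql(order_id):     # stand-in for the authoritative DB read
    return {"id": order_id}

def index_order(order):
    try:
        write_to_es(order)
    except ConnectionError:
        retry_queue.append(order["id"])  # compensation: retry later

def run_compensation():
    """Re-read each failed order from MySQL and re-index it into ES."""
    while retry_queue:
        order_id = retry_queue.pop(0)
        write_to_es(load_from_mysql(order_id))

index_order({"id": 1, "fail": True})  # write fails, ID queued
run_compensation()                    # queue drains; ES catches up to MySQL
```

Because the compensation task re‑reads the database rather than replaying the failed payload, a stale retry can never overwrite a newer order state, which is what makes the scheme eventually consistent.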
Encountered Pitfalls
1. High Real‑Time Query Requirements
ES is near‑real‑time: with the default refresh interval of one second, newly written documents can be invisible to search for up to a second. Queries that cannot tolerate this staleness are therefore routed directly to MySQL.
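That routing decision reduces to comparing a query's staleness budget against the refresh interval. A minimal sketch (the function and threshold handling are illustrative):

```python
# Illustrative datastore selection: queries whose staleness budget is
# tighter than the ES refresh interval must read MySQL directly; all
# other queries can use ES.

ES_REFRESH_INTERVAL_S = 1.0  # Elasticsearch's default refresh interval

def pick_store(max_staleness_s: float) -> str:
    """Return which store can satisfy the query's freshness requirement."""
    return "mysql" if max_staleness_s < ES_REFRESH_INTERVAL_S else "es"
```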
2. Deep Pagination
Deep pagination (large from values) forces each shard to build a priority queue of from + size entries and ship it to the coordinating node, driving up CPU, memory, and network usage; it should therefore be avoided, for example by capping page depth or using cursor‑style pagination.
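The cost grows linearly in both shard count and page depth, which a back‑of‑the‑envelope calculation makes concrete (the shard count and page depth below are example figures):

```python
# Why deep pagination hurts: each shard must return its top (from + size)
# hits to the coordinating node, which sorts them all and discards nearly
# everything to produce one page.

def docs_examined(num_shards: int, page_from: int, size: int) -> int:
    """Total hits shipped to and sorted by the coordinating node."""
    return num_shards * (page_from + size)

# Page 1 vs. a deep page on an example 10-shard index:
print(docs_examined(10, 0, 10))       # 100
print(docs_examined(10, 100000, 10))  # 1000100 docs moved for just 10 hits
```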
3. FieldData vs. Doc Values
Sorting on ES 1.x relied on fielddata, which loads field values into the JVM heap and risks OutOfMemory errors; switching to doc values (the default since ES 2.x), which keep the column‑oriented sort data on disk, improved stability.
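In mapping terms, this means sorting and aggregating on keyword, numeric, and date fields, which carry doc values by default, rather than on analyzed text fields. An illustrative mapping (field names are examples, not JD's actual schema):

```python
# Illustrative index mapping: keyword/numeric/date fields use on-disk doc
# values by default since ES 2.x, so sorting them no longer fills the JVM
# heap with fielddata. Analyzed "text" fields have no doc values.
order_mapping = {
    "properties": {
        "order_id":    {"type": "keyword"},  # doc_values enabled by default
        "create_time": {"type": "date"},     # safe to sort via doc values
        "remark":      {"type": "text"},     # no doc values; don't sort on it
    }
}
```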
Conclusion
The rapid iteration of the architecture was driven by business growth. The result is a highly available, scalable ES setup with dual clusters, tuned shard and replica settings, and robust data‑synchronization mechanisms that keep JD.com's order services performant and stable.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies