Scaling JD.com Order Search: Real‑Time Dual‑Cluster Elasticsearch Architecture
This article details how JD.com’s order center evolved its Elasticsearch deployment from a single, default‑configured cluster to a real‑time, dual‑cluster architecture with replica tuning, master‑slave failover, version upgrades, and optimized data synchronization to handle billions of documents and hundreds of millions of daily queries.
Background
JD.com’s order center processes massive read‑heavy traffic, storing order data in MySQL while delegating most query load to Elasticsearch (ES). The ES cluster eventually held over 1 billion documents and served 500 million queries per day.
1. Initial Stage
In the early days the ES cluster ran on the elastic cloud with default settings and an unplanned node layout, which left single points of failure that were unacceptable for order processing.
2. Cluster Isolation Stage
Mixed‑tenant deployments caused resource contention, degrading query stability. JD.com first migrated high‑resource‑consuming nodes away from the elastic cloud, then moved the ES cluster onto dedicated high‑spec physical machines, improving performance.
3. Node Replica Tuning Stage
Running multiple ES nodes on the same physical host still caused resource competition. The solution was to allocate one ES node per physical machine. To increase throughput, the replica factor was changed from 1 primary + 1 replica to 1 primary + 2 replicas, and additional machines were added.
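The replica change described above maps to the number_of_replicas index setting, which Elasticsearch allows to be updated on a live index through the index settings API. The following is a minimal sketch that only builds such a request; the index name "orders" and the shard count used in the arithmetic are assumptions, not values from the article.

```python
import json


def replica_settings_request(index_name: str, replicas: int) -> dict:
    """Describe a PUT /<index>/_settings call that changes the replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {
        "method": "PUT",
        "path": f"/{index_name}/_settings",
        "body": {"index": {"number_of_replicas": replicas}},
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard has `replicas` extra copies; every copy can serve reads."""
    return primary_shards * (1 + replicas)


# "orders" is a hypothetical index name used only for illustration.
request = replica_settings_request("orders", 2)
print(request["method"], request["path"])
print(json.dumps(request["body"]))

# Assumed 5 primary shards: moving from 1 to 2 replicas raises the
# number of shard copies that can answer queries from 10 to 15.
print(total_shard_copies(5, 1), total_shard_copies(5, 2))
```

More shard copies only help if they land on separate machines, which is why the replica increase was paired with additional hardware.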
4. Master‑Slave Adjustment Stage
To avoid service disruption when a node fails, a standby cluster was introduced. Data is written to both clusters (primary synchronously, standby asynchronously). Older orders are archived to keep the standby’s data volume about one‑tenth of the primary, and ZooKeeper controls traffic switching.
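The dual-write and switching behaviour can be sketched as follows. This is an illustration under stated assumptions rather than JD.com's implementation: InMemoryCluster stands in for a real ES client, and TrafficSwitch stands in for a flag that would live in ZooKeeper. All class and method names are hypothetical.

```python
import queue
import threading


class InMemoryCluster:
    """Stand-in for an ES cluster; a real client would index over HTTP."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc


class DualClusterWriter:
    """Writes to the primary synchronously and to the standby in the background."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.failed = []  # standby writes that raised, kept for later repair
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, doc_id, doc):
        self.primary.index(doc_id, doc)   # synchronous: the caller waits for this
        self._pending.put((doc_id, doc))  # asynchronous: applied by the worker thread

    def _drain(self):
        while True:
            doc_id, doc = self._pending.get()
            try:
                self.standby.index(doc_id, doc)
            except Exception:
                self.failed.append((doc_id, doc))
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued standby write has been attempted."""
        self._pending.join()


class TrafficSwitch:
    """Stand-in for a ZooKeeper-held flag naming the cluster that serves queries."""

    def __init__(self, clusters, active):
        self._clusters = clusters
        self._active = active
        self._lock = threading.Lock()

    def switch_to(self, name):
        if name not in self._clusters:
            raise KeyError(name)
        with self._lock:
            self._active = name

    def query_cluster(self):
        with self._lock:
            return self._clusters[self._active]


primary = InMemoryCluster("primary")
standby = InMemoryCluster("standby")
writer = DualClusterWriter(primary, standby)
writer.write("order-1", {"status": "paid"})
writer.flush()

switch = TrafficSwitch({"primary": primary, "standby": standby}, active="primary")
switch.switch_to("standby")  # e.g. the primary is being repaired
print(switch.query_cluster().name)
```

Keeping the standby write off the request path means a slow or failed standby does not delay order writes, at the cost of a short window in which the two clusters differ.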
5. Current Real‑Time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 directly to 6.x, requiring index recreation. During upgrades the standby cluster temporarily served all traffic. The standby now stores hot recent data (≈10 % of primary size) and handles most query load, while the primary stores the full dataset for occasional full‑order searches, effectively becoming a cold data cluster.
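One way to picture the hot/cold split is a router that sends a query to the standby when the requested orders fall inside the hot window and to the primary otherwise. The article does not state how long the hot window is or how routing is decided, so the 90-day value and the function name below are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Assumed hot window; the article only says the standby holds about 10% of the data.
HOT_WINDOW = timedelta(days=90)


def choose_cluster(oldest_order_time, now, hot_window=HOT_WINDOW):
    """Return the cluster that should serve a query reaching back to oldest_order_time."""
    if now - oldest_order_time <= hot_window:
        return "standby"  # hot, recent data: the common case
    return "primary"      # full history: occasional full-order searches


now = datetime(2020, 1, 1)
print(choose_cluster(now - timedelta(days=3), now))
print(choose_cluster(now - timedelta(days=400), now))
```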
6. Data Synchronization Strategies
Two approaches were considered for syncing MySQL data to ES:
Listening to MySQL binlog and pushing changes to ES (asynchronous).
Directly writing to ES via its API (synchronous).
Given the high real‑time requirement, the team chose the API‑based method for simplicity and lower latency. To guarantee eventual consistency, a compensation mechanism records failed writes in MySQL and a worker process retries them.
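The compensation mechanism can be sketched as a failed-write table plus a retry worker. This is a minimal illustration, not the team's code: sqlite3 stands in for MySQL so the example runs on its own, FlakyIndex simulates ES write failures, and the table and function names are hypothetical.

```python
import json
import sqlite3


class FlakyIndex:
    """Stand-in for ES that fails a configurable number of writes."""

    def __init__(self, failures):
        self.failures = failures
        self.docs = {}

    def index(self, doc_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated ES write failure")
        self.docs[doc_id] = doc


def init_retry_table(db):
    db.execute(
        "CREATE TABLE IF NOT EXISTS es_failed_writes ("
        "doc_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )


def write_order(db, es, doc_id, doc):
    """Write to ES synchronously; on failure, record the write for later retry."""
    try:
        es.index(doc_id, doc)
        return True
    except ConnectionError:
        db.execute(
            "INSERT OR REPLACE INTO es_failed_writes (doc_id, payload) VALUES (?, ?)",
            (doc_id, json.dumps(doc)),
        )
        db.commit()
        return False


def retry_failed_writes(db, es):
    """One pass of the worker: replay recorded writes, delete those that succeed."""
    rows = db.execute("SELECT doc_id, payload FROM es_failed_writes").fetchall()
    succeeded = 0
    for doc_id, payload in rows:
        try:
            es.index(doc_id, json.loads(payload))
        except ConnectionError:
            continue  # leave the row in place for the next pass
        db.execute("DELETE FROM es_failed_writes WHERE doc_id = ?", (doc_id,))
        succeeded += 1
    db.commit()
    return succeeded


db = sqlite3.connect(":memory:")
init_retry_table(db)
es = FlakyIndex(failures=1)

first_attempt = write_order(db, es, "order-1", {"status": "paid"})
retried = retry_failed_writes(db, es)
remaining = db.execute("SELECT COUNT(*) FROM es_failed_writes").fetchone()[0]
print(first_attempt, retried, remaining)
```

A production worker would probably re-read the current order row from the database before retrying, so that a stale payload cannot overwrite a newer successful write; that step is omitted here for brevity.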
7. Encountered Pitfalls
7.1 High‑Real‑Time Queries
ES is near real-time rather than real-time: with the default refresh interval of 1 s, a newly indexed document becomes searchable only after the next refresh, which can be up to about a second after the write. For the most time-sensitive queries, JD.com still queries MySQL directly.
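The visibility delay can be illustrated with a toy model in which writes go into a buffer and only become searchable when a refresh runs. This is a deliberately simplified sketch and not how Lucene segments actually work; the class and its methods are hypothetical.

```python
class NearRealTimeIndex:
    """Toy model: documents are buffered on write and searchable after a refresh."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval
        self._buffer = {}
        self._searchable = {}
        self._last_refresh = 0.0

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def tick(self, now):
        """Advance the clock; refresh if the interval has elapsed."""
        if now - self._last_refresh >= self.refresh_interval:
            self._searchable.update(self._buffer)
            self._buffer.clear()
            self._last_refresh = now

    def search(self, doc_id):
        return self._searchable.get(doc_id)


idx = NearRealTimeIndex(refresh_interval=1.0)
idx.index("order-1", {"status": "paid"})  # written at roughly t = 0.2 s

idx.tick(0.5)
before_refresh = idx.search("order-1")    # not visible yet

idx.tick(1.0)
after_refresh = idx.search("order-1")     # visible after the refresh
print(before_refresh, after_refresh)
```

A read that must see a write immediately therefore has to go to a store that is consistent at write time, which is the role MySQL plays here.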
7.2 Deep Pagination
Paging with a large from value forces every shard to build a priority queue of from + size entries and the coordinating node to merge all of them, consuming CPU and memory. The team advises avoiding deep pagination and using alternatives such as search_after.
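The cost of deep pagination and the shape of a search_after request can be sketched as follows. The arithmetic reflects how from/size paging works in general; the field names order_time and order_id and the shard count are assumptions chosen for illustration.

```python
def docs_collected_for_page(from_, size, shards):
    """Hits the coordinating node must merge for one from/size page.

    Every shard returns its own top (from + size) hits, so the total
    grows with both the page depth and the number of shards.
    """
    return (from_ + size) * shards


def search_after_body(size, last_sort_values=None):
    """Build a search body that pages by sort values instead of an offset.

    The sort needs a unique tiebreaker (here the hypothetical order_id)
    so that every document has a distinct position.
    """
    body = {
        "size": size,
        "sort": [{"order_time": "desc"}, {"order_id": "asc"}],
    }
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(last_sort_values)
    return body


# Page starting at offset 10,000 with 10 hits per page on an assumed 5 shards.
print(docs_collected_for_page(10_000, 10, 5))

first_page = search_after_body(10)
next_page = search_after_body(10, [1577836800000, "order-42"])
print(first_page)
print(next_page)
```

With search_after each shard only needs its top size hits after the given sort values, so the work per page stays constant no matter how far the reader has paged.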
7.3 FieldData vs. Doc Values
Sorting on ES 1.x relied on FieldData, which lives in JVM heap and can cause OOM or GC pauses. Switching to Doc Values (column‑oriented storage on disk) eliminated heap pressure and improved stability.
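In later ES versions doc values are enabled by default for keyword, numeric, and date fields, while analyzed text fields do not support them. The sketch below shows field mappings for a hypothetical order index and a helper that lists which fields can be sorted on without loading FieldData into the heap; the field names are assumptions, not the real order schema.

```python
def order_index_properties():
    """Field mappings for a hypothetical order index."""
    return {
        "order_id": {"type": "keyword"},
        "order_time": {"type": "date"},
        # Explicit here for clarity; True is already the default for this type.
        "amount": {"type": "double", "doc_values": True},
        # Analyzed text has no doc values; sorting on it would need fielddata.
        "remark": {"type": "text"},
    }


def fields_sortable_from_doc_values(properties):
    """Return the fields whose sort can be served from on-disk doc values."""
    sortable = []
    for name, mapping in properties.items():
        if mapping["type"] == "text":
            continue
        if mapping.get("doc_values", True):
            sortable.append(name)
    return sorted(sortable)


print(fields_sortable_from_doc_values(order_index_properties()))
```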
Conclusion
The ES architecture continuously evolved alongside JD.com’s rapid business growth. By iteratively isolating resources, tuning replicas, introducing a standby cluster, upgrading versions, and optimizing data sync, the order center achieved higher throughput, lower latency, and stronger resilience, illustrating that the best architecture is the one that fits current needs.
ITPUB
