
Scaling JD.com Order Search: Real‑Time Dual‑Cluster Elasticsearch Architecture

This article details how JD.com’s order center evolved its Elasticsearch deployment from a single, default‑configured cluster to a real‑time, dual‑cluster architecture with replica tuning, master‑slave failover, version upgrades, and optimized data synchronization to handle billions of documents and hundreds of millions of daily queries.

ITPUB

Background

JD.com’s order center processes massive read‑heavy traffic, storing order data in MySQL while delegating most query load to Elasticsearch (ES). The ES cluster eventually held over 1 billion documents and served 500 million queries per day.

1. Initial Stage

In the early days, the ES cluster ran on the elastic cloud with default settings and an unplanned node layout. This left single points of failure in the cluster, which was unacceptable for order processing.

2. Cluster Isolation Stage

Mixed‑tenant deployments caused resource contention, degrading query stability. JD.com first migrated high‑resource‑consuming nodes away from the elastic cloud, then moved the ES cluster onto dedicated high‑spec physical machines, improving performance.

3. Node Replica Tuning Stage

Running multiple ES nodes on the same physical host still caused resource competition. The solution was to allocate one ES node per physical machine. To increase throughput, the replica factor was changed from 1 primary + 1 replica to 1 primary + 2 replicas, and additional machines were added.
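
The replica change can be illustrated with a minimal sketch. It builds the request body for Elasticsearch's update index settings API (`PUT /<index>/_settings`) and counts the resulting shard copies. The primary shard count of 5 is a hypothetical figure; the article does not state the real one.

```python
import json


def replica_settings_body(replicas: int) -> dict:
    """Body for ES's update index settings API (PUT /<index>/_settings)."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard is stored once plus `replicas` additional copies."""
    return primary_shards * (1 + replicas)


if __name__ == "__main__":
    # Hypothetical index with 5 primary shards (not a figure from the article).
    print(json.dumps(replica_settings_body(2)))
    print(total_shard_copies(5, 1), "->", total_shard_copies(5, 2))
```

More copies of each shard give searches more nodes to spread across, at the cost of extra disk space and extra indexing work, which is consistent with the article adding machines at the same time.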

Order center ES cluster diagram

4. Master‑Slave Adjustment Stage

To avoid service disruption when nodes in the primary cluster fail, a standby cluster was introduced. Every write goes to both clusters: synchronously to the primary and asynchronously to the standby. Older orders are archived out of the standby so that it holds roughly one‑tenth of the primary’s data volume, and a switch managed in ZooKeeper controls which cluster receives traffic.
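
The dual‑write and switching behavior described here might look roughly like the sketch below. `FakeCluster` and `DualWriter` are hypothetical in‑memory stand‑ins, and the `active` attribute plays the role of the flag the article keeps in ZooKeeper; none of this is JD.com's actual code.

```python
import queue
import threading


class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def get(self, doc_id):
        return self.docs.get(doc_id)


class DualWriter:
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = "primary"  # assumption: stands in for the ZooKeeper flag
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        # Background worker: applies queued writes to the standby cluster.
        while True:
            doc_id, doc = self._pending.get()
            self.standby.index(doc_id, doc)
            self._pending.task_done()

    def write(self, doc_id, doc):
        self.primary.index(doc_id, doc)   # synchronous write to the primary
        self._pending.put((doc_id, doc))  # asynchronous write to the standby

    def flush(self):
        """Block until every queued standby write has been applied."""
        self._pending.join()

    def switch_to(self, name):
        if name not in ("primary", "standby"):
            raise ValueError(f"unknown cluster: {name}")
        self.active = name

    def read(self, doc_id):
        cluster = self.primary if self.active == "primary" else self.standby
        return cluster.get(doc_id)
```

For example, after `writer.write("o1", {"status": "paid"})` and `writer.flush()`, calling `writer.switch_to("standby")` makes `writer.read("o1")` return the same document from the standby.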

Master‑slave cluster diagram

5. Current Real‑Time Dual‑Cluster Stage

The primary cluster was upgraded from ES 1.7 directly to 6.x, requiring index recreation. During upgrades the standby cluster temporarily served all traffic. The standby now stores hot recent data (≈10 % of primary size) and handles most query load, while the primary stores the full dataset for occasional full‑order searches, effectively becoming a cold data cluster.
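
One way to picture the hot/cold split is a router that sends a query to the standby when its time range falls inside the hot window, and to the primary otherwise. This is only a sketch: the article does not say how queries are routed, and the 90‑day window and the `pick_cluster` function are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical hot window; the article only says the standby holds about 10 %
# of the primary's data.
HOT_WINDOW = timedelta(days=90)


def pick_cluster(query_start: datetime, now: datetime,
                 hot_window: timedelta = HOT_WINDOW) -> str:
    """Return 'standby' if the oldest order the query can match is still hot."""
    return "standby" if now - query_start <= hot_window else "primary"


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    print(pick_cluster(now - timedelta(days=7), now))
    print(pick_cluster(now - timedelta(days=365), now))
```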

Dual‑cluster architecture

6. Data Synchronization Strategies

Two approaches were considered for syncing MySQL data to ES:

Listening to MySQL binlog and pushing changes to ES (asynchronous).

Directly writing to ES via its API (synchronous).

Given the high real‑time requirement, the team chose the API‑based method for simplicity and lower latency. To guarantee eventual consistency, a compensation mechanism records failed writes in MySQL and a worker process retries them.
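
The compensation mechanism can be sketched as follows, assuming a retry table and a periodic worker. Here `sqlite3` stands in for MySQL so the example is self‑contained, and `FlakyES`, the `es_retry` table, and the function names are all hypothetical.

```python
import sqlite3


class FlakyES:
    """Hypothetical ES stand-in that fails its first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.docs = {}

    def index(self, order_id, body):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated ES write failure")
        self.docs[order_id] = body


def init_db():
    # sqlite3 in memory stands in for the MySQL retry table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE es_retry (order_id TEXT PRIMARY KEY, body TEXT)")
    return db


def write_order(db, es, order_id, body):
    """Write to ES synchronously; on failure, record the write for retry."""
    try:
        es.index(order_id, body)
        return True
    except ConnectionError:
        db.execute("INSERT OR REPLACE INTO es_retry VALUES (?, ?)",
                   (order_id, body))
        db.commit()
        return False


def retry_failed(db, es):
    """One worker pass; returns how many recorded writes were replayed."""
    rows = db.execute("SELECT order_id, body FROM es_retry").fetchall()
    replayed = 0
    for order_id, body in rows:
        try:
            es.index(order_id, body)
        except ConnectionError:
            continue  # leave the row in place for the next pass
        db.execute("DELETE FROM es_retry WHERE order_id = ?", (order_id,))
        replayed += 1
    db.commit()
    return replayed
```

A real worker would probably re‑read the current order row from MySQL before replaying, so that a stale recorded body cannot overwrite a newer successful write.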

7. Encountered Pitfalls

7.1 High‑Real‑Time Queries

ES is near real‑time rather than real‑time: with the default refresh interval of 1 s, a newly indexed document becomes searchable only after the next refresh, which can be up to about a second later. For the most time‑sensitive queries, JD.com still queries MySQL directly.

7.2 Deep Pagination

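With a large from value, every shard has to build a priority queue of from + size entries, and the coordinating node then merges the queues from all shards before discarding everything except the requested page. This consumes CPU, memory, and network bandwidth in proportion to the page depth. The team advises avoiding deep pagination and using alternatives such as search‑after.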
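
As a sketch of the search‑after alternative, the functions below build request bodies that use Elasticsearch's `search_after` parameter in place of `from`. The field names `order_time` and `order_id` are hypothetical; the second sort key is a unique tiebreaker so that pages neither skip nor repeat documents.

```python
def first_page_body(size: int) -> dict:
    """Search body for the first page; sorted, with no `from`."""
    return {
        "size": size,
        # Hypothetical fields: newest orders first, order_id as tiebreaker.
        "sort": [{"order_time": "desc"}, {"order_id": "asc"}],
    }


def next_page_body(size: int, last_hit: dict) -> dict:
    """Search body for the page that follows `last_hit`.

    `last_hit` is the final hit of the previous page; ES returns the sort
    values of each hit in its "sort" array when the request specifies a sort.
    """
    body = first_page_body(size)
    body["search_after"] = last_hit["sort"]
    return body


if __name__ == "__main__":
    last = {"_id": "A1", "sort": [1700000000000, "A1"]}
    print(next_page_body(10, last))
```

Each shard then only needs to collect the top `size` documents after the given sort values, no matter how far into the result set the caller has paged.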

7.3 FieldData vs. Doc Values

Sorting on ES 1.x relied on FieldData, which is loaded into the JVM heap and can cause OOM errors or long GC pauses. Switching to Doc Values, a column‑oriented structure stored on disk, relieved that heap pressure and improved stability.
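
For illustration, a mapping fragment for two sortable fields might look like the sketch below. The field names are hypothetical, and the `keyword` type assumes ES 5 or later. Since ES 2.0, doc values are enabled by default on most non‑analyzed field types, so the explicit `doc_values` flag appears here only to make the setting visible.

```python
def sortable_fields_mapping() -> dict:
    """Mapping properties for fields that are used in sorts (hypothetical)."""
    return {
        "properties": {
            "order_time": {"type": "date", "doc_values": True},
            "order_id": {"type": "keyword", "doc_values": True},
        }
    }


if __name__ == "__main__":
    for name, spec in sortable_fields_mapping()["properties"].items():
        print(name, spec)
```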

Conclusion

The ES architecture continuously evolved alongside JD.com’s rapid business growth. By iteratively isolating resources, tuning replicas, introducing a standby cluster, upgrading versions, and optimizing data sync, the order center achieved higher throughput, lower latency, and stronger resilience, illustrating that the best architecture is the one that fits current needs.
