How JD.com Scaled Its Order Search with a Real‑Time Dual Elasticsearch Cluster
This article details JD.com’s order center journey from a simple Elasticsearch deployment to a highly available, dual‑cluster architecture, covering isolation, replica tuning, hot‑cold data separation, version upgrades, and practical lessons on pagination, field data, and doc values.
ES Cluster Architecture Evolution
JD.com’s order center stores order data in MySQL but relies on Elasticsearch (ES) for the majority of query traffic. With billions of documents and hundreds of millions of daily queries, the ES cluster has undergone several architectural upgrades to ensure stability and performance.
1. Initial Stage
The early ES deployment used default settings on elastic cloud instances, resulting in a chaotic node layout and a single‑point‑of‑failure risk for the order system.
2. Cluster Isolation Stage
To prevent resource contention, high‑load nodes were moved off the shared cloud, and eventually the ES cluster was migrated to dedicated high‑spec physical machines, improving performance and stability.
3. Node Replica Tuning Stage
Each ES node was placed on its own physical server to avoid resource competition between nodes on the same machine. The replica count was raised from 1 to 2, so each primary shard has two replica copies instead of one, and machines were added to host the extra copies. Because a search can be served by any copy of a shard, this increased query throughput.
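The effect of the replica change on cluster size can be sketched with simple arithmetic. This is a minimal illustration: the primary shard count of 5 is a hypothetical value (the article does not state it), and the helper names are invented for the example. The settings body uses the standard dynamic index setting number_of_replicas.

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies in an index: each primary plus its replicas."""
    if primaries < 1 or replicas < 0:
        raise ValueError("primaries must be >= 1 and replicas >= 0")
    return primaries * (1 + replicas)


def replica_settings(replicas: int) -> dict:
    """Request body for an index settings update that changes the replica count."""
    return {"index": {"number_of_replicas": replicas}}


PRIMARIES = 5  # hypothetical; not taken from the article

before = shard_copies(PRIMARIES, replicas=1)  # 1 primary + 1 replica per shard
after = shard_copies(PRIMARIES, replicas=2)   # 1 primary + 2 replicas per shard
print(before, after, replica_settings(2))
```

With these assumed numbers the cluster goes from 10 to 15 shard copies, which is why the extra replicas came with extra machines.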
4. Primary‑Backup Cluster Adjustment Stage
A standby cluster was introduced to take over queries when the primary cluster experienced issues. Data is written to both clusters (dual‑write) with asynchronous writes to the backup. The backup stores recent hot data (≈10% of primary size) and can be promoted instantly during upgrades.
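The dual-write and failover idea can be sketched as follows. This is an illustration under stated assumptions, not JD.com's implementation: InMemoryCluster is a hypothetical stand-in for an ES client, the queue-plus-thread design is one possible way to make the backup write asynchronous, and failed backup writes are simply dropped here.

```python
import queue
import threading


class InMemoryCluster:
    """Hypothetical stand-in for an ES cluster client."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return self.docs.get(doc_id)


class DualWriter:
    """Writes to the primary synchronously and to the backup in the background."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, doc_id, doc):
        self.primary.index(doc_id, doc)   # caller waits for the primary
        self._pending.put((doc_id, doc))  # backup write happens later

    def _drain(self):
        while True:
            doc_id, doc = self._pending.get()
            try:
                self.backup.index(doc_id, doc)
            except ConnectionError:
                pass  # dropped in this sketch; a real system would retry
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until all queued backup writes have been attempted."""
        self._pending.join()

    def read(self, doc_id):
        """Query the primary, falling back to the backup if it is down."""
        try:
            return self.primary.get(doc_id)
        except ConnectionError:
            return self.backup.get(doc_id)


primary = InMemoryCluster("primary")
backup = InMemoryCluster("backup")
writer = DualWriter(primary, backup)
writer.write("order-1", {"status": "paid"})
writer.flush()

primary.healthy = False            # simulate a primary outage
result = writer.read("order-1")    # served by the backup
```

Because the backup only holds recent hot data in the design described above, a fallback read like this would only succeed for recent orders.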
5. Current Real‑Time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 directly to ES 6.x via index recreation to avoid downtime. During the upgrade, the backup cluster temporarily served all queries, ensuring uninterrupted service.
Data Synchronization to ES
Two approaches exist: listening to MySQL binlog or writing directly via the ES API. JD.com chose the API method for simplicity and lower latency, with a compensation mechanism that inserts a remedial task into the database if an ES write fails, later processed by a worker to ensure eventual consistency.
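The compensation flow described above can be sketched like this. Everything here is a hypothetical stand-in: FlakyES simulates an ES client that fails a fixed number of times, and plain Python containers stand in for the MySQL order table and the remedial-task table. Re-reading the order from the database in the worker is an assumption of this sketch, chosen so that a retry never replays a stale payload.

```python
class FlakyES:
    """Hypothetical ES client that fails the first `fail_times` writes."""

    def __init__(self, fail_times):
        self.fail_times = fail_times
        self.docs = {}

    def index(self, doc_id, doc):
        if self.fail_times > 0:
            self.fail_times -= 1
            raise ConnectionError("ES write failed")
        self.docs[doc_id] = doc


class OrderSync:
    def __init__(self, es):
        self.es = es
        self.db_orders = {}       # stand-in for the MySQL order table
        self.remedial_tasks = []  # stand-in for the remedial-task table

    def save_order(self, order_id, order):
        """Persist to the database, then write to ES through its API."""
        self.db_orders[order_id] = order
        try:
            self.es.index(order_id, order)
        except ConnectionError:
            # Record a remedial task instead of failing the order write.
            self.remedial_tasks.append(order_id)

    def run_worker_once(self):
        """Retry pending tasks; return how many are still outstanding."""
        remaining = []
        for order_id in self.remedial_tasks:
            try:
                self.es.index(order_id, self.db_orders[order_id])
            except ConnectionError:
                remaining.append(order_id)
        self.remedial_tasks = remaining
        return len(remaining)


es = FlakyES(fail_times=1)
sync = OrderSync(es)
sync.save_order("o1", {"status": "created"})
pending_before = list(sync.remedial_tasks)  # the failed write left a task
outstanding = sync.run_worker_once()        # the worker retry succeeds
```

After the worker pass, ES holds the same document as the database, which is the eventual-consistency guarantee the mechanism aims for.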
Common Pitfalls
1. High Real‑Time Query Requirements
ES is near real time rather than real time: with the default refresh interval of 1 second (index.refresh_interval), a newly indexed document may not be searchable until the next refresh. For the most time-sensitive queries, the system falls back to MySQL to guarantee up-to-date results.
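A routing rule for this fallback might look like the sketch below. The source only says that the most time-sensitive queries go to MySQL; the age-based condition, the safety margin, and the function name are assumptions made for illustration.

```python
REFRESH_INTERVAL_S = 1.0  # ES default for index.refresh_interval


def choose_store(order_age_s: float, needs_realtime: bool,
                 safety_margin_s: float = 1.0) -> str:
    """Pick the data store for a query (hypothetical rule).

    A query that must see the latest state, for an order written so recently
    that an ES refresh may not have happened yet, is routed to MySQL.
    """
    too_fresh = order_age_s < REFRESH_INTERVAL_S + safety_margin_s
    if needs_realtime and too_fresh:
        return "mysql"
    return "es"


just_placed = choose_store(order_age_s=0.2, needs_realtime=True)
list_page = choose_store(order_age_s=0.2, needs_realtime=False)
old_order = choose_store(order_age_s=30.0, needs_realtime=True)
```

Keeping the rule narrow matters because every query sent to MySQL is load that ES was introduced to absorb.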
2. Deep Pagination
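A large from value forces every shard to collect and sort from + size hits and send them to the coordinating node, which merges the results from all shards and discards everything before the requested page. CPU, memory, and network cost therefore grow with page depth. The recommendation is to avoid deep pagination whenever possible.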
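The cost of a deep page can be estimated with a few lines of arithmetic. This is a rough model, not a benchmark: the shard count of 5 is hypothetical, and the function name is invented for the example. The 10,000 limit is the default value of index.max_result_window, beyond which ES rejects from + size requests unless the setting is raised.

```python
MAX_RESULT_WINDOW = 10_000  # default index.max_result_window


def pagination_cost(from_: int, size: int, shards: int) -> dict:
    """Estimate how many sorted hits a from/size search has to handle."""
    per_shard = from_ + size  # each shard builds a queue of this many hits
    return {
        "per_shard": per_shard,
        "coordinator": per_shard * shards,  # entries merged by the coordinator
        "exceeds_default_window": per_shard > MAX_RESULT_WINDOW,
    }


SHARDS = 5  # hypothetical; not taken from the article

first_page = pagination_cost(from_=0, size=10, shards=SHARDS)
deep_page = pagination_cost(from_=100_000, size=10, shards=SHARDS)
print(first_page)
print(deep_page)
```

Both requests return 10 documents, but under these assumptions the deep page makes the coordinating node merge 500,050 entries instead of 50.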
3. FieldData vs. Doc Values
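FieldData is built in the JVM heap at query time and can cause OOM errors or heavy GC when sorting or aggregating over large result sets. Doc Values are a column-oriented structure built at index time and stored on disk, read through the OS file system cache, which moves this data off the heap. They have been the default since ES 2.0 for all fields except analyzed strings.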
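In mapping terms, the distinction looks roughly like the sketch below, written for the keyword and text field types available in ES 5.x and 6.x. The field names are hypothetical, not taken from JD.com's index, and sortable_fields is a helper invented for this example.

```python
def order_mapping_properties() -> dict:
    """Hypothetical field mappings for an order index."""
    return {
        # keyword and date fields get doc values by default
        "order_id": {"type": "keyword"},
        "created_at": {"type": "date"},
        # analyzed text has no doc values; sorting on it would need fielddata,
        # which is disabled by default for text fields
        "remark": {"type": "text"},
        # doc values can be switched off for fields never sorted or aggregated
        "trace_token": {"type": "keyword", "doc_values": False},
    }


def sortable_fields(properties: dict) -> list:
    """Fields that can be sorted on without loading fielddata into the heap."""
    return sorted(
        name
        for name, spec in properties.items()
        if spec["type"] != "text" and spec.get("doc_values", True)
    )


fields = sortable_fields(order_mapping_properties())
print(fields)
```

Sorting order lists on fields such as the two returned here stays off the JVM heap, which is the behavior the switch to Doc Values was meant to secure.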
Conclusion
The rapid business growth of JD.com’s “to‑home” service drove continuous evolution of the ES architecture, from a single cluster with default settings to a real‑time dual‑cluster setup with tuned replicas, appropriate shard counts, and a modern ES version. The lesson is that the best architecture is the one that fits current performance, stability, and scalability needs.
ITPUB
