
How JD.com Scaled Its Order Search with a Real‑Time Dual Elasticsearch Cluster

This article details JD.com’s order center journey from a simple Elasticsearch deployment to a highly available, dual‑cluster architecture, covering isolation, replica tuning, hot‑cold data separation, version upgrades, and practical lessons on pagination, field data, and doc values.

ITPUB

ES Cluster Architecture Evolution

JD.com’s order center stores order data in MySQL but relies on Elasticsearch (ES) for the majority of query traffic. With billions of documents and hundreds of millions of daily queries, the ES cluster has undergone several architectural upgrades to ensure stability and performance.

1. Initial Stage

The early ES deployment used default settings on elastic cloud instances, resulting in a chaotic node layout and a single‑point‑of‑failure risk for the order system.

2. Cluster Isolation Stage

To prevent resource contention, high‑load nodes were moved off the shared cloud, and eventually the ES cluster was migrated to dedicated high‑spec physical machines, improving performance and stability.

3. Node Replica Tuning Stage

Each ES node was placed on its own physical server to avoid intra‑machine resource competition. The replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, and additional machines were added, boosting throughput.
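
As a rough illustration of the replica change, the sketch below builds the request body that Elasticsearch's update index settings API (`PUT /<index>/_settings`) accepts for `number_of_replicas`. The helper names are hypothetical, and the source does not say how JD.com actually applied the setting.

```python
import json


def replica_settings_body(replicas: int) -> str:
    """Build the JSON body for PUT /<index>/_settings (hypothetical helper)."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return json.dumps({"index": {"number_of_replicas": replicas}})


def copies_per_shard(replicas: int) -> int:
    """Each shard exists as one primary plus its replicas."""
    return 1 + replicas


# Moving from 1 primary + 1 replica to 1 primary + 2 replicas.
body = replica_settings_body(2)
print(body)
print(copies_per_shard(1), "->", copies_per_shard(2))
```

Every extra replica is another full copy of each shard that can answer searches, which is why the change goes hand in hand with adding machines.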

4. Primary‑Backup Cluster Adjustment Stage

A standby (backup) cluster was introduced to take over queries when the primary cluster has problems. Every write goes to both clusters (dual-write): synchronously to the primary and asynchronously to the backup. The backup stores only recent hot data (about 10% of the primary cluster's data volume), and query traffic can be switched to it immediately when the primary fails or is being upgraded.
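
The dual-write and query-switch behavior described above can be sketched as follows. This is a minimal in-memory model, not JD.com's implementation: `InMemoryCluster` and `DualClusterStore` are hypothetical stand-ins for real ES clients, and the hot-data limit on the backup is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryCluster:
    """Hypothetical stand-in for an ES cluster client; keeps documents in a dict."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.docs.get(doc_id)


class DualClusterStore:
    """Writes to both clusters; reads fall back to the backup on failure."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self._pool = ThreadPoolExecutor(max_workers=2)

    def write(self, doc_id, doc):
        # Synchronous write to the primary cluster.
        self.primary.index(doc_id, doc)
        # Asynchronous write to the backup; the caller gets a Future back.
        return self._pool.submit(self.backup.index, doc_id, doc)

    def read(self, doc_id):
        try:
            return self.primary.get(doc_id)
        except ConnectionError:
            # Primary is down: serve the query from the backup cluster.
            return self.backup.get(doc_id)

    def close(self):
        self._pool.shutdown(wait=True)


primary = InMemoryCluster("primary")
backup = InMemoryCluster("backup")
store = DualClusterStore(primary, backup)

store.write("order-1", {"status": "paid"}).result()  # wait for the backup write
primary.healthy = False                               # simulate a primary outage
print(store.read("order-1"))                          # served by the backup
store.close()
```

Returning the Future from `write` keeps the backup write off the request path while still letting tests or callers wait for it when they need to.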

5. Current Real‑Time Dual‑Cluster Stage

The primary cluster was upgraded from ES 1.7 directly to ES 6.x by recreating its indices on the new version. While the upgrade was in progress, the backup cluster temporarily served all queries, so the service was never interrupted.

Data Synchronization to ES

There are two ways to sync MySQL order data to ES: listen to the MySQL binlog, or write to ES directly through the ES API. JD.com chose the API method because it is simpler and has lower latency. To cover failed writes, a compensation mechanism is in place: if an ES write fails, a remedial task is inserted into the database, and a worker processes it later so that the two stores become consistent eventually.
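
The compensation flow can be sketched roughly as below. This is an illustration under stated assumptions: `FlakyIndexer` simulates an ES client that fails a set number of times, and a `deque` stands in for the database task table. A real worker would more likely re-read the latest order state from MySQL than replay a stored document.

```python
from collections import deque


class FlakyIndexer:
    """Hypothetical ES writer that fails a configurable number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.docs = {}

    def index(self, doc_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated ES write failure")
        self.docs[doc_id] = doc


def sync_order(indexer, task_table, doc_id, doc):
    """Write to ES; on failure, record a remedial task instead of raising."""
    try:
        indexer.index(doc_id, doc)
        return True
    except ConnectionError:
        task_table.append((doc_id, doc))
        return False


def run_compensation_worker(indexer, task_table):
    """Retry each pending task once; tasks that fail again are re-queued."""
    retried = 0
    for _ in range(len(task_table)):
        doc_id, doc = task_table.popleft()
        retried += 1
        sync_order(indexer, task_table, doc_id, doc)
    return retried


indexer = FlakyIndexer(failures=1)
tasks = deque()  # stands in for the remedial-task table in the database

ok = sync_order(indexer, tasks, "order-1", {"status": "paid"})  # fails, gets queued
retried = run_compensation_worker(indexer, tasks)               # retry succeeds
print(ok, retried, indexer.docs, len(tasks))
```

Because the write path only records the failure and moves on, the order flow is not blocked by an ES outage; consistency is restored whenever the worker next runs.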

Common Pitfalls

1. High Real‑Time Query Requirements

ES is near real-time rather than real-time: with the default refresh interval of 1 second, a newly indexed document is not searchable until the next refresh. For the most time-sensitive queries, the system therefore falls back to MySQL to guarantee up-to-date results.
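
One way to picture this fallback is a small router that sends freshness-critical queries to MySQL and everything else to ES. The sketch below is illustrative only: `OrderQuery`, `needs_fresh_read`, and `route` are hypothetical names, and the source does not describe how JD.com decides which queries count as time-sensitive.

```python
from dataclasses import dataclass


@dataclass
class OrderQuery:
    order_id: str
    needs_fresh_read: bool  # hypothetical flag set by the calling service


def route(query: OrderQuery) -> str:
    """Pick the store that should answer this query."""
    # A document written within the last refresh interval may not be
    # searchable in ES yet, so freshness-critical reads go to MySQL.
    return "mysql" if query.needs_fresh_read else "elasticsearch"


print(route(OrderQuery("order-1", needs_fresh_read=True)))
print(route(OrderQuery("order-1", needs_fresh_read=False)))
```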

2. Deep Pagination

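With from/size pagination, every shard must collect and sort its top from + size hits, and the coordinating node then merges the results from all shards before discarding everything except the requested page. A large from value therefore consumes CPU, memory, and bandwidth far out of proportion to the page actually returned. The recommendation is to avoid deep pagination whenever possible.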
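
A quick back-of-the-envelope calculation shows why the cost grows with the from value. The function below is a simplified model, and the shard count of 5 is an assumed figure, not JD.com's actual configuration.

```python
def docs_collected(from_: int, size: int, shards: int) -> int:
    """Upper bound on hits the coordinating node merges for one from/size page.

    Each shard returns up to from_ + size candidates, so the total grows
    linearly with the page offset. A shard holding fewer matching
    documents returns fewer, hence "upper bound".
    """
    return shards * (from_ + size)


shallow = docs_collected(from_=0, size=10, shards=5)
deep = docs_collected(from_=100_000, size=10, shards=5)
print(shallow)  # 50
print(deep)     # 500050
```

Both requests return the same 10 documents to the user, but the deep page makes the cluster handle roughly ten thousand times as many candidates.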

3. FieldData vs. Doc Values

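FieldData is built in the JVM heap, so sorting or aggregating over large result sets can trigger heavy GC or even OOM errors. Doc Values store the same per-field data in a column-oriented structure on disk and are served through the operating system's file cache, which takes that pressure off the heap. They are the default for non-analyzed fields in ES 2.x and later.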
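
In mapping terms, the difference shows up per field. The fragment below is a hypothetical `properties` block for an order index, written as a Python dict; the field names are invented for illustration. `doc_values` and `fielddata` are real Elasticsearch mapping parameters, but the source does not show JD.com's actual mapping.

```python
# Hypothetical field mappings for an order index.
order_properties = {
    # keyword fields have doc_values enabled by default
    "order_id": {"type": "keyword"},
    # stated explicitly here for clarity; it is also the default for dates
    "created_at": {"type": "date", "doc_values": True},
    # analyzed text has no doc values, and fielddata is off by default,
    # so sorting on this field would require opting in to heap-based fielddata
    "remark": {"type": "text"},
}


def sortable_fields(properties):
    """Return fields that can be sorted on using doc values alone."""
    return sorted(
        name
        for name, spec in properties.items()
        if spec["type"] != "text" and spec.get("doc_values", True)
    )


print(sortable_fields(order_properties))  # ['created_at', 'order_id']
```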

Conclusion

The rapid business growth of JD.com’s “to-home” service drove continuous evolution of the ES architecture, from a single, default-configured cluster to a real-time dual-cluster setup with tuned replicas, appropriate shard counts, and a modern ES version. The lesson is that the best architecture is the one that fits current performance, stability, and scalability needs.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
