How JD.com Scaled Its Order Search with Elasticsearch: From Chaos to Real‑Time Dual Clusters
This article details how JD.com’s order center migrated from a MySQL‑only design to a high‑performance Elasticsearch cluster, evolving through isolation, replica tuning, master‑slave adjustments, and real‑time dual‑cluster architecture to achieve billions of documents, hundreds of millions of daily queries, and robust fault tolerance.
In JD.com’s order‑to‑home business, both external merchant orders and internal system dependencies generate massive query traffic, creating a read‑heavy, write‑light workload that MySQL alone could not sustain.
The team initially stored order data in MySQL but quickly adopted Elasticsearch (ES) to offload the majority of query load, as ES provides near‑real‑time search capabilities.
Today the ES cluster holds over 1 billion documents and handles about 500 million queries per day. As the business grew, the ES architecture evolved through several stages:
1. Initial Stage
The cluster was first deployed on elastic cloud with default settings, leading to single‑point failures and resource contention.
2. Isolation Stage
To avoid resource competition, the ES nodes were moved from shared cloud instances to dedicated high‑spec physical machines, improving stability and performance.
3. Replica Optimization Stage
Each node was placed on its own physical server, and the replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, adding more machines to boost throughput.
4. Master‑Slave Adjustment Stage
To guarantee high availability, a standby cluster was introduced. Data is written synchronously to the primary cluster and asynchronously to the standby. When the primary fails, the standby takes over, and a ZooKeeper‑controlled switch directs traffic.
5. Real‑Time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 to ES 6.x, requiring index recreation. During the upgrade, the standby cluster served all queries, ensuring zero downtime. The standby now stores hot recent data (≈10 % of total docs) and handles the majority of query traffic, while the primary stores the full cold dataset.
Data Synchronization
Two approaches were considered: listening to MySQL binlog or using the ES API directly. JD.com chose the API method for lower latency and simpler maintenance, writing orders to ES in real time.
When write failures occur, a compensating task is recorded in the database; a worker later re‑processes these entries to ensure eventual consistency.
Common Pitfalls
High‑real‑time queries should bypass ES and hit the database directly. ES refreshes every second, so newly indexed documents may not be immediately searchable.
Avoid deep pagination. Large from values cause each shard to build huge priority queues, consuming CPU and memory.
FieldData vs. Doc Values. Sorting on older ES versions used FieldData, which occupies JVM heap and can cause OOM; newer versions default to Doc Values, storing data on disk and improving stability.
Conclusion
The rapid iteration of the order‑center architecture mirrors JD.com’s fast‑growing business. Continuous upgrades—isolating resources, expanding replicas, introducing dual clusters, and migrating to newer ES versions—have dramatically increased throughput, performance, and reliability, illustrating that there is no “best” architecture, only the one that best fits current needs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
