How JD.com Scaled Its Order System with Elasticsearch: Architecture Evolution
This article details how JD.com's order center migrated from MySQL‑only reads to a high‑throughput Elasticsearch cluster, describing each architectural phase—from the initial bare‑metal setup, through isolation, replica tuning, primary‑secondary adjustments, to the current real‑time dual‑cluster—while sharing synchronization strategies and performance pitfalls.
JD.com’s order center stores order data in MySQL, but the massive query volume makes a pure DB solution impractical, so Elasticsearch (ES) is used to handle the majority of order queries.
ES Cluster Architecture Evolution
1. Initial Stage
The ES cluster started with default configurations on elastic cloud instances, resulting in a chaotic node layout and single‑point failures that were unacceptable for order processing.
2. Cluster Isolation Stage
Mixed‑deployment caused resource contention; high‑resource nodes were moved to dedicated physical machines, improving stability and performance.
3. Node Replica Optimization
Each ES node was placed on its own physical server to avoid intra‑node resource competition, and the replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, boosting throughput.
4. Primary‑Secondary Cluster Adjustment
A standby cluster was introduced for failover. Data is written to the primary cluster synchronously and to the standby asynchronously; during primary upgrades, the standby takes over to ensure zero‑downtime service.
5. Current Real‑time Dual‑Cluster Stage
The primary cluster was upgraded from ES 1.7 to 6.x, and both clusters now operate with a dual‑write strategy, where the standby holds recent hot data (≈10% of total) and the primary stores the full dataset, effectively separating hot and cold workloads.
ES Order Data Synchronization
Two approaches were considered: listening to MySQL binlog or directly using the ES API. JD.com chose the API method for its simplicity and low latency, writing orders to ES in real time.
When a write fails, a compensating task is created; a worker later reprocesses these tasks to ensure eventual consistency between MySQL and ES.
Pitfalls Encountered
1. High Real‑time Requirements
Because ES refreshes shards every second, queries needing absolute real‑time data are routed directly to MySQL.
2. Avoid Deep Pagination
Deep pagination (large from values) forces each shard to generate huge result sets, consuming CPU and memory; thus it should be avoided.
3. FieldData vs. Doc Values
FieldData, stored in JVM heap, caused OOM and slowdowns during sorting; switching to Doc Values (column‑oriented storage on disk) resolved the issue.
Conclusion
The rapid iteration of the order center’s architecture was driven by business growth; continuous optimization of ES clusters—through isolation, replica scaling, dual‑cluster deployment, and version upgrades—has significantly improved throughput, performance, and stability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
