Operations 13 min read

Evolution of JD.com Order Center Elasticsearch Cluster Architecture and Lessons Learned

This article traces how the Elasticsearch cluster behind JD.com's order center evolved: from an initial default deployment, through isolation onto dedicated machines, replica tuning, and a master‑slave adjustment, to real‑time dual‑cluster backup. It covers the architectural decisions, scaling strategies, MySQL‑to‑ES synchronization methods, and operational pitfalls encountered along the way.

Selected Java Interview Questions

In JD.com's to‑home order business, order queries far outnumber order writes. MySQL alone could not serve this read‑heavy workload efficiently, so Elasticsearch was adopted as the primary engine for querying order data.

Initially the ES cluster was deployed with default settings on elastic cloud instances, resulting in a chaotic node layout and single‑point‑of‑failure risks.

To improve stability, the cluster was isolated onto dedicated physical machines, eliminating resource contention with other services.

Subsequently, replica tuning was performed: the default one‑primary‑one‑replica configuration was expanded to one‑primary‑two‑replicas, and additional nodes were added, boosting throughput and query performance.
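As a rough illustration of the replica change, the sketch below builds the JSON body that Elasticsearch's update index settings API (`PUT /<index>/_settings`) accepts for `number_of_replicas`, and counts how many shard copies result. The figure of 5 primary shards is an assumption for the example; the article does not state the real shard count.

```python
import json


def replica_settings_body(replicas: int) -> str:
    """Return the JSON body for an index settings update that sets the replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return json.dumps({"index": {"number_of_replicas": replicas}})


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard exists once, plus once more per replica."""
    return primaries * (1 + replicas)


# Hypothetical index with 5 primary shards (not a figure from the article).
before = total_shard_copies(primaries=5, replicas=1)  # one primary, one replica
after = total_shard_copies(primaries=5, replicas=2)   # one primary, two replicas
body = replica_settings_body(2)
```

Because a search can be served by any copy of a shard, going from two copies to three gives the cluster more places to run each query, which is why the extra nodes were added at the same time.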

Later, a master‑slave architecture was introduced, with a standby cluster that receives writes synchronously while the primary handles most traffic; the standby stores recent hot data (≈10% of primary volume) and can take over instantly during primary failures.
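The dual write with a smaller standby can be sketched as follows. This is a minimal in-memory model, not JD.com's implementation: `InMemoryCluster`, `dual_write`, and the 7-day `HOT_WINDOW` are all hypothetical, since the article only says the standby keeps "recent hot data".

```python
from datetime import datetime, timedelta


class InMemoryCluster:
    """Stand-in for an ES cluster: documents in a dict keyed by order id."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, order_id, doc):
        self.docs[order_id] = doc

    def delete_older_than(self, cutoff):
        """Drop documents created before `cutoff`; return how many were removed."""
        stale = [oid for oid, d in self.docs.items() if d["created_at"] < cutoff]
        for oid in stale:
            del self.docs[oid]
        return len(stale)


def dual_write(primary, standby, order_id, doc):
    """Write the same document to both clusters."""
    primary.index(order_id, doc)
    standby.index(order_id, doc)


HOT_WINDOW = timedelta(days=7)  # assumption: the real retention window is not given

primary = InMemoryCluster("primary")
standby = InMemoryCluster("standby")
now = datetime(2024, 1, 31)

dual_write(primary, standby, "A1", {"created_at": now - timedelta(days=30), "status": "done"})
dual_write(primary, standby, "A2", {"created_at": now - timedelta(days=1), "status": "paid"})

# A periodic cleanup keeps only hot data on the standby.
removed = standby.delete_older_than(now - HOT_WINDOW)
```

After the cleanup the primary still holds both orders while the standby holds only the recent one, which mirrors the trade-off in the text: a much smaller standby that can still answer queries about recent orders.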

Finally, after the primary cluster was upgraded from ES 1.7 to ES 6.x, the system evolved into a real‑time dual‑cluster setup. Writes go to both clusters (dual write), and a failover mechanism switches traffic between them so that service continues without interruption.
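The article does not describe how the failover switch is implemented. One possible shape, shown here purely as an assumption, is a small router that sends queries to the active cluster and swaps roles when that cluster fails. `FakeCluster`, `ClusterDown`, and `FailoverRouter` are hypothetical names.

```python
class ClusterDown(Exception):
    """Raised by the stand-in cluster when it is unavailable."""


class FakeCluster:
    """Stand-in for an ES cluster whose health can be toggled."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def search(self, query):
        if not self.healthy:
            raise ClusterDown(self.name)
        return {"served_by": self.name, "query": query}


class FailoverRouter:
    """Send queries to the active cluster; swap roles if it fails."""

    def __init__(self, active, backup):
        self.active = active
        self.backup = backup

    def search(self, query):
        try:
            return self.active.search(query)
        except ClusterDown:
            # Swap roles so later queries go straight to the healthy cluster.
            self.active, self.backup = self.backup, self.active
            return self.active.search(query)


es_a = FakeCluster("es-a")
es_b = FakeCluster("es-b")
router = FailoverRouter(active=es_a, backup=es_b)

first = router.search({"order_id": "A2"})
es_a.healthy = False  # simulate an outage of the active cluster
second = router.search({"order_id": "A2"})
```

If both clusters are down, the second `search` call raises `ClusterDown` to the caller; a production router would also need health checks and a way to switch back, which are omitted here.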

The article also compares two data‑sync approaches from MySQL to ES: (1) listening to binlog events and pushing changes to ES, which decouples the systems but adds a new service and maintenance overhead; (2) directly using the ES API in business code, which is simpler and lower‑latency but tightly couples the application to ES. JD.com chose the latter, supplemented by a compensation worker that retries failed writes.
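The direct-write approach with a compensation worker might look like the sketch below. It is a simplified model under stated assumptions: `FlakyEsClient` is a hypothetical stand-in for a real ES client, the retry queue is an in-memory `deque` (a real system would need durable storage), and the MySQL write is omitted.

```python
from collections import deque


class FlakyEsClient:
    """Hypothetical ES client stand-in that fails its first `failures` index calls."""

    def __init__(self, failures):
        self.failures = failures
        self.indexed = {}

    def index(self, order_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("ES write failed")
        self.indexed[order_id] = doc


retry_queue = deque()  # assumption: in-memory only, for illustration


def save_order(es, order_id, doc):
    """Write to ES from business code; queue the write for compensation on failure."""
    # The MySQL write would happen first in real business code (omitted here).
    try:
        es.index(order_id, doc)
        return True
    except ConnectionError:
        retry_queue.append((order_id, doc))
        return False


def run_compensation(es, max_attempts=3):
    """Retry each queued write up to `max_attempts` times; return how many succeeded."""
    recovered = 0
    for _ in range(len(retry_queue)):
        order_id, doc = retry_queue.popleft()
        for _attempt in range(max_attempts):
            try:
                es.index(order_id, doc)
                recovered += 1
                break
            except ConnectionError:
                continue
        else:
            # Still failing: put it back for the next compensation run.
            retry_queue.append((order_id, doc))
    return recovered


es = FlakyEsClient(failures=2)
ok = save_order(es, "A1", {"status": "paid"})  # first attempt fails and is queued
recovered = run_compensation(es)               # worker retries until it succeeds
```

One design point worth noting: a worker that replays a queued copy of the document can overwrite a newer version, so re-reading the current row from MySQL before retrying is a safer variant.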

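The article discusses three operational pitfalls. First, ES is only near‑real‑time: a newly written document becomes searchable after the next index refresh, so queries that need the very latest data are better served by the database. Second, deep pagination is expensive: to return a page at a large from offset, every shard must collect and sort from + size hits, and the coordinating node must then merge the results from all shards. Third, fielddata is held in JVM heap and creates memory pressure, whereas doc values are stored on disk and are the more efficient choice for sorting and aggregations.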
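The cost of deep pagination is simple arithmetic, sketched below. The shard count and the field names `created_at` and `order_id` are assumptions. The second function builds a request body for `search_after`, a cursor-style alternative available in ES 5.0 and later; the article does not say whether JD.com adopted it, so it is shown only as one common way to avoid large from values.

```python
def hits_per_shard(from_: int, size: int) -> int:
    """With from/size paging, each shard must collect the top from + size hits."""
    return from_ + size


def hits_at_coordinator(shards: int, from_: int, size: int) -> int:
    """The coordinating node merges the candidate hits returned by every shard."""
    return shards * hits_per_shard(from_, size)


def search_after_body(size, last_sort_values=None):
    """Build a search body that pages by cursor instead of by offset.

    Field names are hypothetical; order_id acts as a tiebreaker so the sort is total.
    """
    body = {"size": size, "sort": [{"created_at": "desc"}, {"order_id": "desc"}]}
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(last_sort_values)
    return body


# Hypothetical index with 5 shards, 10 results per page.
shallow = hits_at_coordinator(shards=5, from_=0, size=10)
deep = hits_at_coordinator(shards=5, from_=10000, size=10)

first_page = search_after_body(size=10)
next_page = search_after_body(size=10, last_sort_values=(1706659200000, "A2"))
```

Under these assumptions the first page makes the coordinating node handle 50 candidate hits, while the page at offset 10,000 makes it handle 50,050 to return the same 10 results.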

Overall, the rapid architectural iterations driven by business growth illustrate that there is no single “best” design—only the most suitable one for current scale and requirements, with continuous optimization needed to handle ever‑increasing throughput and stability demands.

Tags: scalability, Elasticsearch, High Availability, data synchronization, Cluster Architecture
Written by

Selected Java Interview Questions
