Evolution of Elasticsearch Cluster Architecture for JD Daojia Order Center
This article details how JD Daojia's order center migrated its Elasticsearch cluster through multiple architectural stages—from an initial loosely configured setup to a real‑time dual‑cluster solution—addressing scalability, high availability, data synchronization, and performance optimization for billions of documents and hundreds of millions of daily queries.
In JD Daojia's order center, massive order query traffic required moving from MySQL‑only reads to Elasticsearch (ES) to handle high‑volume, low‑latency searches, with the ES cluster eventually storing over 1 billion documents and processing 500 million queries per day.
1. Initial Stage : The ES cluster was deployed on elastic cloud with default settings and mixed‑node placement, leading to single‑point failures and unstable performance.
2. Cluster Isolation Stage : To avoid resource contention with other services, the ES nodes were migrated off the shared cloud and later onto dedicated high‑spec physical machines, improving stability and performance.
3. Node Replica Optimization Stage : Each ES node was placed on its own physical server, and the replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, adding more machines to boost throughput.
The architecture uses a VIP for load balancing, with one primary shard set and two replica sets, distributing queries via round‑robin routing, which significantly increased query performance.
4. Primary‑Backup Cluster Adjustment Stage : To ensure high availability, a standby ES cluster was added. Data is written to the primary cluster synchronously and to the backup asynchronously; older orders are archived, and Zookeeper controls traffic switching, allowing seamless failover.
5. Current Real‑Time Dual‑Cluster Stage : The primary cluster was upgraded from ES 1.7 to 6.x, with the backup temporarily serving as primary during migration. The backup now holds hot recent data (≈10% of total), while the primary stores the full dataset, both capable of mutual failover.
ES Data Synchronization Solutions : Two approaches were considered—listening to MySQL binlog or using ES API directly. The team chose direct ES API writes for simplicity and lower latency, supplemented by a compensation worker that retries failed writes based on database records.
Encountered Pitfalls :
High‑real‑time queries are better served by MySQL due to ES's default 1‑second refresh interval.
Deep pagination (large from values) causes excessive memory and CPU usage; it should be avoided.
FieldData can cause heap pressure and timeouts; switching to Doc Values (default from ES 2.x) mitigates this issue.
Conclusion : Rapid business growth drove continuous ES architecture evolution, improving throughput, stability, and scalability. While no single solution is perfect, the current dual‑cluster design balances hot and cold data, ensuring the order center can handle future growth.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
