Evolution of JD Daojia Order System Elasticsearch Cluster Architecture
This article details the step‑by‑step evolution of the JD Daojia order‑center Elasticsearch cluster—from an initial loosely configured deployment to a real‑time dual‑cluster architecture with replica tuning, master‑slave adjustments, data‑sync strategies, and lessons learned about pagination, fielddata, and doc values—highlighting how each phase improved query throughput, stability, and scalability for billions of documents.
In JD Daojia's order center, the massive volume of order queries makes MySQL alone insufficient, so Elasticsearch (ES) is used to handle the primary query load, currently storing over 1 billion documents and serving 500 million queries per day.
1. Initial Stage: The ES cluster was deployed on the elastic cloud with default settings and a haphazard node layout, creating single points of failure that were unacceptable for order processing.
2. Cluster Isolation Stage: Mixed deployment on shared hosts caused resource contention, so resource-hungry nodes were migrated off the elastic cloud, and the cluster was eventually moved to dedicated high-spec physical machines for better isolation and performance.
3. Node Replica Tuning Stage: To make full use of the hardware, each ES node was given its own physical server. The replica count was raised from one primary and one replica to one primary and two replicas, and more machines were added, improving query throughput.
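Raising the replica count is a single live settings update on the index; new replica shards are copied over without reindexing. A minimal sketch of building that request body (the index name and the helper are illustrative; the `_settings` endpoint and `number_of_replicas` key are standard ES API):

```python
import json

def replica_settings(replicas: int) -> str:
    """Build the body for PUT /order_index/_settings that changes
    the replica count on a live index, no reindex required."""
    return json.dumps({"index": {"number_of_replicas": replicas}})

# Going from one primary + one replica to one primary + two replicas:
body = replica_settings(2)
```

Each extra replica multiplies read capacity (any copy can serve a query) at the cost of disk and write amplification, which is why the team added machines alongside the replica bump.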
4. Master-Slave Adjustment Stage: A standby cluster was introduced for high availability. Data is written to the primary cluster synchronously and to the standby asynchronously; a ZooKeeper-controlled switch lets traffic fail over to the standby when the primary runs into trouble.
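The article does not show how the switch is implemented; one common shape is a routing flag stored in a ZooKeeper node that every client watches, with queries dispatched by the flag's current value. A sketch under that assumption (endpoints and names are hypothetical; the ZooKeeper read is simulated by passing the flag in):

```python
# Hypothetical endpoints; in production the flag would come from a
# watched ZooKeeper node, so every client sees a flip within one watch event.
PRIMARY = "http://es-primary:9200"
STANDBY = "http://es-standby:9200"

def pick_cluster(zk_flag: str) -> str:
    """Route queries by the current failover flag; anything other
    than an explicit 'standby' keeps traffic on the primary."""
    return STANDBY if zk_flag == "standby" else PRIMARY
```

When the primary degrades, an operator (or a health checker) flips the flag once, and all clients drain over to the standby without a redeploy.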
5. Current Real-Time Dual-Cluster Stage: The primary cluster was upgraded from ES 1.7 to 6.x, which required recreating the indices; during the upgrade the standby cluster temporarily served all queries. The standby now holds only recent hot data (roughly 10% of the primary's volume) and acts as the hot-data cluster, while the primary keeps the full historical dataset as the cold-data cluster.
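A hot/cold split like this implies a routing rule on the query path: recent-order queries go to the small hot cluster, historical lookups to the full-history cold cluster. The article does not state the cutoff, so the 90-day window below is purely an assumption for illustration:

```python
from datetime import datetime, timedelta

HOT_DAYS = 90  # assumed hot-data retention window; not from the article

def route_query(order_time: datetime, now: datetime) -> str:
    """Send queries for recent orders to the hot (standby) cluster
    and older ones to the cold (full-history) cluster."""
    if now - order_time <= timedelta(days=HOT_DAYS):
        return "hot"
    return "cold"
```

Because most order lookups target recent orders, the small hot cluster absorbs the bulk of the traffic while the cold cluster stays lightly loaded.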
Data Synchronization: Two approaches were considered: listening to the MySQL binlog, or writing through the ES API directly. The team chose direct ES API writes for their simplicity and lower latency, backed by a compensation worker that retries failed writes from a task table to guarantee eventual consistency.
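The write path plus compensation worker can be sketched as follows. This is a simplified model of the pattern the article describes, not JD's actual code: the ES write is an injected callable, and the "task table" is a plain list standing in for the real persistent table:

```python
def write_with_compensation(doc_id, doc, write_fn, task_table):
    """Attempt the direct ES write; on failure, record a retry task
    so the compensation worker can replay it later."""
    try:
        write_fn(doc_id, doc)
    except Exception:
        task_table.append((doc_id, doc))

def compensation_pass(write_fn, task_table):
    """One sweep of the compensation worker: retry every queued
    task, keeping only the ones that still fail."""
    remaining = []
    for doc_id, doc in task_table:
        try:
            write_fn(doc_id, doc)
        except Exception:
            remaining.append((doc_id, doc))
    task_table[:] = remaining
```

The worker runs on a schedule, so any write lost to a transient ES failure converges on the next pass, which is exactly the eventual-consistency guarantee the article cites.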
Pitfalls Encountered:
Queries with strict real-time requirements are routed to MySQL, because ES's refresh cycle (1 s by default) makes it near-real-time rather than truly real-time.
Deep pagination (large `from` values) forces every shard to build a priority queue of `from + size` entries, burning CPU and memory; it should be avoided.
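The standard ES remedy for this (not spelled out in the article, but the usual recommendation and available in the 6.x cluster described) is `search_after`: instead of a growing `from` offset, each request passes the sort values of the previous page's last hit, so every shard only keeps `size` entries regardless of depth. A sketch of building such a query body, with illustrative field names:

```python
def page_query(size, sort_field="order_time", last_sort_values=None):
    """Build a search body that pages with search_after instead of a
    deep `from`; the `_id` sort is a tiebreaker so paging is stable."""
    body = {"size": size, "sort": [{sort_field: "desc"}, {"_id": "asc"}]}
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

first = page_query(50)
# Feed the last hit's sort values back in to fetch the next page:
next_page = page_query(50, last_sort_values=[1700000000000, "order#9f3"])
```

The trade-off is that `search_after` only supports forward paging, which fits "load more" style order lists well.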
Fielddata lives in the JVM heap and can trigger OOM errors or slow queries; switching to doc values (an on-disk columnar store) cuts heap usage and improves stability.
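Doc values are enabled by default for keyword and numeric fields since ES 2.x, so after the 6.x upgrade the fix is largely a mapping concern: make sure sort/aggregation fields use doc-values-capable types rather than heap-resident fielldata on `text`. A mapping sketch with illustrative field names:

```python
# Mapping sketch: keyword/numeric fields store sort and aggregation
# data in on-disk doc values (the ES default since 2.x), keeping it
# out of the JVM heap. Field names are illustrative, not JD's schema.
order_mapping = {
    "properties": {
        "order_id":   {"type": "keyword"},   # doc_values on by default
        "amount":     {"type": "long"},      # doc_values on by default
        "note":       {"type": "text"},      # full-text search only
        "trace_blob": {"type": "keyword",
                       "doc_values": False,  # never sorted or aggregated,
                       "index": False},      # so skip both structures
    }
}
```

Disabling `doc_values` on fields that are only ever returned, never sorted or aggregated, also saves disk, at the cost of making those operations impossible later without a reindex.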
Summary: Rapid architectural iteration driven by business growth has transformed the ES cluster from a basic setup into a robust, dual-cluster, high-availability system with optimized replica settings, data-sync mechanisms, and performance-tuned configurations, illustrating that there is no perfect architecture, only the most suitable one for current needs.