Evolution of JD Daojia Order Center Elasticsearch Cluster: Architecture, Scaling, and Lessons Learned
This article details how JD Daojia's order center migrated from MySQL to a multi‑stage Elasticsearch cluster—covering initial deployment, isolation, replica tuning, primary‑secondary setup, real‑time dual‑cluster upgrades, data synchronization methods, and key pitfalls—to achieve massive scalability, high availability, and performance for billions of orders.
Background: JD Daojia's order center stores order data in MySQL but heavy read traffic led to using Elasticsearch for query load, handling billions of documents and hundreds of millions of daily queries.
Evolution of the ES cluster:
1. Initial stage: Deployed on elastic cloud with default settings, leading to single points of failure.
2. Cluster isolation stage: Moved nodes from resource‑contending clouds to dedicated high‑performance physical machines to improve stability.
As data grew, the elastic cloud could no longer meet requirements, prompting a migration to physical servers, which boosted performance.
3. Node replica tuning stage: Adopted one‑node‑per‑physical‑machine deployment, increased replica count from 1 to 2, and added machines, boosting throughput.
The architecture uses a VIP for external load balancing, a gateway layer acting as an ES client node, and a data layer with primary and two replica shards, achieving higher query capacity.
4. Primary‑secondary cluster stage: Built a standby cluster with asynchronous writes and Zookeeper‑controlled traffic switching, achieving higher query reliability.
The standby stores recent hot data (≈10% of primary volume) and can take over instantly when the primary fails, while the primary holds the full dataset for cold‑data queries.
5. Real‑time dual‑cluster stage: Upgraded primary from ES 1.7 to 6.x; during upgrade the standby acted as primary to avoid service disruption. The two clusters were re‑defined: the standby became a hot‑data cluster, the primary a cold‑data cluster.
Data synchronization approaches:
Option 1 – Binlog listener: low coupling but requires ROW mode and adds a new service, increasing development and maintenance cost.
Option 2 – Direct ES API writes: simple, low latency, chosen for real‑time order data. A compensation worker retries failed writes by inserting a task into the database and processing it asynchronously, ensuring eventual consistency.
Key pitfalls encountered:
1. High‑real‑time queries bypass ES and hit the DB because ES refresh interval introduces up‑to‑1‑second latency; the system uses default refresh settings, so time‑critical queries read directly from MySQL.
2. Deep pagination: using large from values forces each shard to build huge priority queues, causing CPU, memory, and bandwidth pressure; avoid deep pagination.
3. FieldData vs Doc Values: fielddata resides in JVM heap and can cause OOM; switching to doc values (default from ES 2.x) moves storage to Lucene files, improving stability.
Conclusion: Rapid architectural iteration driven by business growth has continuously optimized throughput, performance, and stability. The current dual‑cluster design balances hot and cold data, and future upgrades will further increase capacity and reliability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Dada Group Technology
Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
