Scaling JD Daojia Order Search with Elasticsearch: Cluster Evolution Journey
JD Daojia’s order center faced query loads too heavy for MySQL alone, prompting a shift to Elasticsearch and a multi‑stage evolution of its ES cluster: an initial loosely configured setup, then isolation, replica tuning, and master‑slave adjustment, and finally a real‑time dual‑cluster architecture. Each stage improved stability, throughput, or scalability.
JD Daojia's order center system experiences extremely high query volumes, making MySQL alone insufficient for order searches; therefore, Elasticsearch (ES) is adopted to handle the primary query load.
ES provides near‑real‑time storage and search. The cluster currently holds over 1 billion documents and serves around 500 million queries per day.
ES Cluster Architecture Evolution
1. Initial Stage
The ES cluster started out with default configuration on the elastic cloud. Node placement was disorganized, and the cluster had single points of failure.
2. Cluster Isolation Stage
Mixed deployment caused resource contention; high‑resource‑consuming nodes were migrated off the elastic cloud to dedicated physical machines, improving stability.
3. Node Replica Tuning Stage
Each ES node was placed on a separate physical machine to maximize resource usage. Replicas were increased from one to two, and additional machines were added, boosting throughput.
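To make the replica change concrete, the sketch below counts shard copies before and after the adjustment. The primary shard count of 5 is a hypothetical value chosen for illustration (the article does not state it); the settings body is the standard form accepted by the index settings API.

```python
import json


def total_shard_copies(primary_shards, replicas):
    """Each primary shard has `replicas` extra copies; any copy can serve a search."""
    return primary_shards * (1 + replicas)


PRIMARY_SHARDS = 5  # hypothetical; not given in the article

before = total_shard_copies(PRIMARY_SHARDS, replicas=1)
after = total_shard_copies(PRIMARY_SHARDS, replicas=2)

# Request body for PUT /<index>/_settings that raises the replica count to two.
replica_settings = {"index": {"number_of_replicas": 2}}

print(before, after)
print(json.dumps(replica_settings))
```

More copies per shard only help if there are nodes to host them, which is consistent with machines being added in the same stage.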
4. Master‑Slave Adjustment Stage
A backup (standby) cluster was introduced for high availability. The application dual‑writes every change to both the primary and the backup cluster, and an archival mechanism moves older orders to a history store. ZooKeeper controls traffic switching, so queries can fail over to the backup cluster when the primary has problems.
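The dual-write and traffic-switch idea can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: `FakeCluster` and `OrderSearchRouter` are hypothetical names, the clusters are in-memory stand-ins, and a plain attribute plays the role of the switch that the article says is held in ZooKeeper.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, for illustration only)."""
    name: str
    healthy: bool = True
    docs: dict = field(default_factory=dict)

    def index(self, order_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.docs[order_id] = doc

    def get(self, order_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.docs.get(order_id)


class OrderSearchRouter:
    """Writes to both clusters; reads from whichever cluster the switch names."""

    def __init__(self, primary, backup):
        self.clusters = {"primary": primary, "backup": backup}
        # Assumption: in production this flag would be read from ZooKeeper.
        self.active = "primary"

    def write(self, order_id, doc):
        # Application-level dual write: the same document goes to every cluster.
        for cluster in self.clusters.values():
            cluster.index(order_id, doc)

    def switch_to(self, name):
        if name not in self.clusters:
            raise ValueError(f"unknown cluster: {name}")
        self.active = name

    def read(self, order_id):
        return self.clusters[self.active].get(order_id)


primary = FakeCluster("primary")
backup = FakeCluster("backup")
router = OrderSearchRouter(primary, backup)
router.write("1001", {"status": "PAID"})

primary.healthy = False      # simulate a primary outage
router.switch_to("backup")   # operator flips the switch
print(router.read("1001"))
```

Because both clusters already hold the data, switching only changes where reads go; no data copy is needed at failover time.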
5. Current Stage: Real‑Time Dual‑Cluster
The primary ES cluster was upgraded from version 1.7 to 6.x, requiring index rebuilding. During upgrades, the backup cluster temporarily serves as the primary to avoid downtime. The backup cluster stores recent hot data (≈10% of primary size) and handles most query traffic, while the primary stores the full dataset for less frequent, full‑order searches.
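One way to picture the hot/full split is a router that picks a cluster from the age of the orders being queried. The sketch below is an assumption-laden illustration: the 7-day hot window and the function name `choose_cluster` are hypothetical, since the article only says the backup cluster holds recent hot data.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 7  # hypothetical; the article does not give the actual window


def choose_cluster(order_date, today, hot_window_days=HOT_WINDOW_DAYS):
    """Return the cluster that should serve a query for orders placed on order_date."""
    cutoff = today - timedelta(days=hot_window_days)
    # Recent orders live in the small backup cluster; older ones need the full dataset.
    return "backup" if order_date >= cutoff else "primary"


today = date(2024, 1, 10)
print(choose_cluster(date(2024, 1, 9), today))
print(choose_cluster(date(2023, 6, 1), today))
```

Since most lookups concern recent orders, a rule like this would send the bulk of traffic to the smaller cluster and leave the primary for occasional full-history searches.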
ES Order Data Synchronization
Option 1: Listen to MySQL binlog and sync to ES (asynchronous, adds system complexity).
Option 2: Directly write to ES via its API (synchronous, simpler, meets real‑time needs).
The team chose the direct API approach. If a write fails, a compensating task is recorded in the database; a worker later retries the ES update to ensure eventual consistency.
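The write path with a compensating task might look roughly like the sketch below. Everything here is a stand-in: `FlakyIndex` simulates the ES write API, a `deque` replaces the database table that the article says holds the compensating tasks, and the function names are hypothetical.

```python
from collections import deque


class FlakyIndex:
    """Stand-in for the ES write API; the first `failures` calls raise."""

    def __init__(self, failures=0):
        self.failures = failures
        self.docs = {}

    def index(self, order_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("ES write failed")
        self.docs[order_id] = doc


def save_order(es, compensation_queue, order_id, doc):
    """Write synchronously; on failure, record a compensating task instead of raising."""
    try:
        es.index(order_id, doc)
        return True
    except ConnectionError:
        compensation_queue.append((order_id, doc))
        return False


def run_compensation_worker(es, compensation_queue):
    """Retry each queued task once; tasks that fail again stay in the queue."""
    for _ in range(len(compensation_queue)):
        order_id, doc = compensation_queue.popleft()
        try:
            es.index(order_id, doc)
        except ConnectionError:
            compensation_queue.append((order_id, doc))


es = FlakyIndex(failures=1)
tasks = deque()
save_order(es, tasks, "1001", {"status": "PAID"})  # fails, task recorded
run_compensation_worker(es, tasks)                 # retry succeeds
print(es.docs, len(tasks))
```

The order write itself never blocks on ES being healthy; the index simply catches up once the worker's retry succeeds, which is the eventual consistency the article describes.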
Common Pitfalls
1. High Real‑Time Query Requirements
By default, ES refreshes each shard once per second, so a newly indexed document may take up to about a second to become searchable. Queries that need strictly real‑time results therefore go to the database instead.
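The effect of the refresh delay, and the database fallback, can be illustrated with a toy model. `NearRealTimeIndex` is a hypothetical simplification (a buffer that becomes searchable only after `refresh()`), and a plain dict stands in for the order database.

```python
class NearRealTimeIndex:
    """Toy model of refresh: indexed docs stay invisible to search until refresh()."""

    def __init__(self):
        self._buffer = {}
        self._searchable = {}

    def index(self, order_id, doc):
        self._buffer[order_id] = doc

    def refresh(self):
        # In ES this happens periodically (once per second by default).
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search(self, order_id):
        return self._searchable.get(order_id)


def find_order(order_id, needs_realtime, database, es_index):
    """Strictly real-time lookups read the database; everything else reads ES."""
    if needs_realtime:
        return database.get(order_id)
    return es_index.search(order_id)


database = {"1001": {"status": "PAID"}}
es_index = NearRealTimeIndex()
es_index.index("1001", {"status": "PAID"})

print(find_order("1001", False, database, es_index))  # not yet refreshed
print(find_order("1001", True, database, es_index))   # database sees it at once
es_index.refresh()
print(find_order("1001", False, database, es_index))  # visible after refresh
```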
2. Avoid Deep Pagination
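With a large "from" value, each shard must build a priority queue of from + size entries and send it to the coordinating node, which then merges the queues from all shards. This consumes CPU, memory, and bandwidth, so deep pagination should be avoided.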
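A quick calculation shows how fast the cost grows. The shard count of 5 and the page numbers below are hypothetical; the formula follows from each shard producing from + size sorted entries for the coordinating node to merge.

```python
def deep_paging_cost(from_, size, shards):
    """Estimate the work behind one from/size page in a sharded index."""
    per_shard = from_ + size  # entries each shard must collect and sort
    return {
        "per_shard_queue": per_shard,
        "coordinator_entries": per_shard * shards,  # entries merged centrally
        "returned": size,                           # entries the caller actually gets
    }


SHARDS = 5  # hypothetical shard count

first_page = deep_paging_cost(from_=0, size=10, shards=SHARDS)
deep_page = deep_paging_cost(from_=10000, size=10, shards=SHARDS)

print(first_page)
print(deep_page)
```

In this example the deep page makes the coordinating node merge 50,050 entries to return 10, against 50 entries for the first page.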
3. FieldData vs. Doc Values
Fielddata loads the values used for sorting into the JVM heap, which can lead to out‑of‑memory errors and latency spikes. Switching to Doc Values (column‑oriented storage on disk) mitigates this issue.
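As a rough illustration, doc values are controlled per field in the index mapping. The fragment below is a sketch only: the field names `order_time` and `order_amount` are hypothetical, and the enclosing mapping structure varies by ES version, so only the field-level properties are shown. In recent ES versions doc values are already on by default for fields like these; the flag is written out to make the choice explicit.

```python
import json

# Field-level mapping properties for fields used in sorting.
# Field names are hypothetical; they are not taken from the article.
sortable_fields = {
    "order_time": {"type": "date", "doc_values": True},
    "order_amount": {"type": "long", "doc_values": True},
}

print(json.dumps({"properties": sortable_fields}, indent=2))
```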
Conclusion
The rapid iteration of the ES architecture mirrors JD Daojia's fast business growth; continuous optimization—through isolation, replica tuning, dual‑cluster design, and version upgrades—has significantly improved throughput, performance, and stability, though the optimal solution remains context‑dependent.
Java Backend Technology
