Operations 14 min read

How JD Daojia Scaled Its Order Search with a Real‑Time Dual Elasticsearch Cluster

This article details how JD Daojia’s order center migrated from MySQL‑only reads to a multi‑stage Elasticsearch architecture, describing each evolution step, data‑sync strategies, performance pitfalls, and the final real‑time active‑passive cluster that ensures high availability for billions of daily queries.

dbaplus Community
dbaplus Community
dbaplus Community
How JD Daojia Scaled Its Order Search with a Real‑Time Dual Elasticsearch Cluster

Background

Order queries in JD Daojia’s order center generate billions of reads per day, far exceeding the capacity of a MySQL‑only architecture. To offload read‑heavy workloads and support complex search features, the system introduced Elasticsearch as the primary search layer. The ES cluster now stores on the order of 1 billion documents and handles ~5 hundred million queries daily.

Evolution of the ES Cluster

Initial stage – Deployed a single Elasticsearch cluster on elastic cloud with default settings. Nodes were mixed on shared machines, leading to single‑point failures and unstable performance.

Isolation stage – Migrated high‑resource nodes off the shared cloud and eventually moved the entire cluster to dedicated physical servers to eliminate resource contention.

Node‑replica tuning stage – Assigned each ES node to its own physical server, increased the replica count from 1 to 2 (one primary, two replicas) and added additional machines, which raised throughput and fault tolerance.

Primary‑secondary (master‑slave) stage – Added a standby cluster. Writes are dual‑written to both clusters; ZooKeeper controls traffic switching so that the standby can take over queries when the primary fails.

Real‑time mutual‑backup stage – Upgraded the primary cluster from ES 1.7 to 6.x, rebuilt indices, and re‑partitioned data: the secondary cluster stores hot (recent) data while the primary holds cold (historical) data. This enables seamless failover and balanced load.

Cluster Architecture Details

The cluster uses a two‑layer topology:

Gateway layer – client nodes that act as intelligent load balancers (VIP‑based) and forward requests to data nodes.

Data layer – data nodes that store shards and execute indexing/search operations.

Sharding strategy follows a 1‑primary‑2‑replica model. Increasing the number of replicas and adding physical machines improves query throughput. However, shard count must be balanced: more shards increase horizontal scalability and single‑ID query throughput but degrade aggregation and deep‑pagination performance. The final configuration was chosen after extensive load‑testing to achieve optimal throughput for the mixed workload of single‑ID and paginated queries.

Data Synchronization Strategy

Two approaches were evaluated for keeping Elasticsearch in sync with the MySQL order database:

Binlog listener – Capture MySQL binlog events, parse them, and push changes to ES. This decouples the systems but requires a dedicated sync service and introduces additional operational risk.

Direct ES API writes – Application code writes to ES via its REST API at the same time it writes to MySQL. This is simpler and meets the real‑time requirement.

The team selected the API‑based method. To handle occasional write failures, a compensation worker was added: when an ES write fails, a task record is inserted into a retry table; a background worker periodically scans the table and re‑issues the failed writes until they succeed, guaranteeing eventual consistency between MySQL and ES.

Common Pitfalls and Mitigations

Refresh latency – Elasticsearch refreshes shards by default every second. Queries that require strict real‑time freshness (e.g., sub‑second) are still routed to MySQL.

Deep pagination – Using large from values forces each shard to build a priority queue of size from + size, consuming CPU and network bandwidth. Avoid deep pagination; use search‑after or scroll APIs for large result sets.

FieldData vs. Doc Values – Sorting on keyword fields in ES 1.x uses FieldData, which resides in JVM heap and can cause OOM or long GC pauses. Switching to Doc Values (column‑oriented storage on disk) moves the data out of heap and resolves timeout issues. Doc Values are the default from ES 2.x onward.

Conclusion

The Elasticsearch architecture has been iteratively refined to match rapid business growth. The current dual‑cluster, hot‑cold data split with dual‑write and ZooKeeper‑controlled traffic switching provides a balance of throughput, latency, and high availability. Ongoing tuning of shard count, replica factor, and query patterns will continue to evolve as order volume increases.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Performance OptimizationElasticsearchhigh availabilitydata-syncCluster Architecture
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.