
Evolution of the Elasticsearch Cluster Architecture in JD.com Order System

This article details how JD.com’s order center migrated its Elasticsearch cluster from a basic, mixed‑node setup to a real‑time, dual‑cluster architecture with increased replicas, physical isolation, version upgrades, and a robust data‑sync strategy to handle billions of documents and hundreds of millions of daily queries.

Big Data Technology Architecture

ES Cluster Architecture Evolution

In JD.com’s order center, the volume of order queries is huge, leading to a read‑heavy, write‑light workload that cannot be supported by MySQL alone, so Elasticsearch is used to bear the main query pressure.

The order center's Elasticsearch cluster stores more than 1 billion documents and serves about 500 million queries per day. Its architecture has evolved through the following stages to keep it stable and fast.

1. Initial Stage

The cluster initially ran with default configuration on the elastic cloud, with mixed-node deployment and single points of failure.

2. Cluster Isolation Stage

To avoid resource contention, high‑resource nodes were migrated out of the elastic cloud, and eventually the cluster was moved to dedicated physical machines, improving performance.

3. Node Replica Tuning Stage

Each node was placed on a separate physical machine to avoid resource competition. The replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, and additional machines were added, boosting throughput.

The cluster uses a VIP for load‑balancing external requests, and the added replica set increases query capacity.
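As a rough illustration of the replica change, the sketch below builds the request body for Elasticsearch's update index settings API (PUT /<index>/_settings) and counts how many shard copies are available to serve searches. The primary shard count of 5 is an assumption for illustration only; the article does not state the real number.

```python
import json


def replica_settings(replicas: int) -> dict:
    """Body for the update index settings API (PUT /<index>/_settings)."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Every primary shard and each of its replicas can answer a search."""
    return primaries * (1 + replicas)


# Assumed value, not taken from the article.
PRIMARIES = 5

print(json.dumps(replica_settings(2)))
print("copies with 1 replica: ", total_shard_copies(PRIMARIES, 1))
print("copies with 2 replicas:", total_shard_copies(PRIMARIES, 2))
```

Going from one replica to two adds one more copy of every shard, which is where the extra query capacity comes from, provided there are enough machines to host the new copies.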

4. Primary‑Secondary Cluster Adjustment Stage

A standby cluster was introduced to take over when the primary cluster fails, with data synchronized via dual‑write and asynchronous backup mechanisms, ensuring high availability.

5. Current Stage: Real‑Time Mutual Backup Dual‑Cluster

The primary cluster was upgraded from ES 1.7 to 6.x, which required rebuilding its indices. During the upgrade, the standby cluster served all queries to avoid downtime.

The standby cluster stores recent hot data (≈10% of total), while the primary holds the full dataset, effectively separating hot and cold data workloads.
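The article does not describe how queries are routed between the two clusters. One plausible rule, sketched below under that assumption, sends queries for recent orders to the standby (hot) cluster and everything else to the primary (full) cluster, falling back to the primary when the standby is down. The function name, the health flags, and the 90-day window are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical hot window; the article only says hot data is about 10% of the total.
HOT_WINDOW_DAYS = 90


def choose_cluster(order_date: date, today: date,
                   primary_up: bool = True, standby_up: bool = True) -> str:
    """Pick the cluster that should answer a query for one order."""
    is_hot = today - order_date <= timedelta(days=HOT_WINDOW_DAYS)
    if is_hot and standby_up:
        return "standby"   # recent data lives in the hot cluster
    if primary_up:
        return "primary"   # the full dataset can answer any query
    # The standby holds only hot data, so it cannot cover a cold query.
    raise RuntimeError("no healthy cluster holds this order")


today = date(2024, 1, 20)
print(choose_cluster(date(2024, 1, 10), today))
print(choose_cluster(date(2023, 1, 1), today))
print(choose_cluster(date(2024, 1, 10), today, standby_up=False))
```

Because the primary holds the full dataset, it can always stand in for the standby, while the reverse is true only for queries that fall inside the hot window.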

ES Order Data Synchronization Scheme

Two approaches were considered: listening to the MySQL binlog, or writing to ES directly through the ES API from the business code. The latter was chosen for its simplicity and low latency.

When write failures occur, a compensating task is inserted into the database; a worker later retries the ES update to ensure eventual consistency.
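The write-then-compensate flow can be sketched as follows. This is a minimal in-memory model, not the order center's actual code: the `OrderIndexer` class, the `write_to_es` callable, and the `pending` queue (standing in for the compensation-task table in the database) are all hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class OrderIndexer:
    """Writes orders to ES and records a compensation task on failure."""

    def __init__(self, write_to_es: Callable[[str, Dict], None]) -> None:
        self._write = write_to_es
        # Stand-in for the compensation-task table kept in the database.
        self.pending: Deque[Tuple[str, Dict]] = deque()

    def index(self, order_id: str, doc: Dict) -> bool:
        """Try the direct ES write; queue a compensation task if it fails."""
        try:
            self._write(order_id, doc)
            return True
        except Exception:
            self.pending.append((order_id, doc))
            return False

    def run_worker_once(self) -> int:
        """Retry each pending task once; return how many succeeded."""
        succeeded = 0
        for _ in range(len(self.pending)):
            order_id, doc = self.pending.popleft()
            try:
                self._write(order_id, doc)
                succeeded += 1
            except Exception:
                self.pending.append((order_id, doc))  # keep for the next run
        return succeeded


# Demo: a writer that fails on its first call, then recovers.
store: Dict[str, Dict] = {}
attempts = []


def flaky_write(order_id: str, doc: Dict) -> None:
    attempts.append(order_id)
    if len(attempts) == 1:
        raise ConnectionError("simulated ES outage")
    store[order_id] = doc


indexer = OrderIndexer(flaky_write)
indexer.index("order-1", {"status": "PAID"})   # fails, task is queued
indexer.run_worker_once()                      # retry succeeds
print(store, len(indexer.pending))
```

A real worker would probably reload the latest order state from the database before retrying, so that a delayed retry does not overwrite a newer update with stale data.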

Encountered Pitfalls

1. Queries that need strictly real-time results should bypass ES and read from the database directly. By default ES refreshes an index once per second, so newly written documents become searchable in near real time, not instantly.

2. Avoid deep pagination. With large from/size values, every shard must build a priority queue of from + size entries and the coordinating node must merge all of them, which consumes a great deal of CPU and memory.

3. FieldData vs. Doc Values. FieldData is loaded into the JVM heap and can cause OOM errors; Doc Values are stored on disk and have been the default since ES 2.x.
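The deep-pagination cost from pitfall 2 can be made concrete with a little arithmetic. The sketch below estimates how many entries are held for one from/size page, then builds a request body that uses `search_after`, a cursor-style alternative available in ES 5.0 and later. The shard count and the field names `created_at` and `order_id` are assumptions for illustration; the article does not say which pagination method the order center adopted.

```python
from typing import Dict, List, Optional, Sequence


def entries_held_for_page(from_: int, size: int, shards: int) -> Dict[str, int]:
    """Entries each shard queues, and the total the coordinating node merges."""
    per_shard = from_ + size
    return {"per_shard": per_shard, "coordinator": per_shard * shards}


def search_after_body(size: int, sort_fields: Sequence[str],
                      last_sort_values: Optional[Sequence] = None) -> Dict:
    """Search body that pages by cursor instead of by offset.

    sort_fields should end with a unique field so the cursor is unambiguous.
    """
    body: Dict = {"size": size, "sort": [{f: "asc"} for f in sort_fields]}
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(last_sort_values)
    return body


# Page 1001 with 10 hits per page on an assumed 5-shard index.
print(entries_held_for_page(from_=10_000, size=10, shards=5))

fields: List[str] = ["created_at", "order_id"]   # hypothetical field names
print(search_after_body(10, fields))
print(search_after_body(10, fields, [1700000000000, "A1"]))
```

With `search_after`, each shard only needs to collect `size` entries past the cursor, so the cost per page stays flat no matter how deep the reader goes. The trade-off is that pages can only be walked in order, not jumped to by number.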

Conclusion

The rapid iteration of the architecture is driven by business growth. While there is no “perfect” solution, the current design balances throughput, performance, and stability, and will continue to evolve as order volume increases.

Author: Zhang Sir, JD.com R&D Engineer responsible for order center, merchant center, and billing systems.

