
Evolution and Optimization of JD.com’s Order Center Elasticsearch Cluster Architecture

This article details how JD.com’s order center migrated its Elasticsearch cluster from a basic, mixed‑cloud deployment to a real‑time dual‑cluster backup solution, covering each architectural stage, scaling decisions, data‑sync strategies, and the performance pitfalls encountered along the way.

Architecture Digest

Nov 29, 2019

Background

In JD.com's order-to-home business, order queries generate massive traffic, which makes the workload read-heavy and write-light. Orders are stored in MySQL, but the database alone cannot absorb that query volume, so Elasticsearch (ES) handles the bulk of the query traffic.

ES now stores over 1 billion documents with daily query volume reaching 500 million. The ES architecture has evolved through several stages to ensure stability and performance.

ES Cluster Evolution

1. Initial Stage

The cluster started with default settings on elastic cloud. Node placement was unplanned, and the cluster had single points of failure.

2. Cluster Isolation Stage

Mixed deployment caused resource contention. High-resource-consuming nodes were first moved off the cloud, and the cluster was eventually migrated to dedicated high-spec physical machines.

3. Node Replica Tuning Stage

To fully utilize the hardware, each ES node was placed on its own physical server. The replica count was raised from one to two per shard (from 1 primary + 1 replica to 1 primary + 2 replicas), and machines were added to host the extra copies and boost throughput.

The architecture uses a VIP for load‑balancing, a gateway layer (ES client node) as a smart balancer, and a data‑node layer for storage and processing.

Shard and replica configuration was tuned to balance single‑ID lookups (benefit from more shards) against aggregation‑heavy pagination queries (benefit from fewer shards).

4. Primary‑Backup Cluster Adjustment Stage

A standby cluster was introduced to take over when the primary fails. Data is written synchronously to the primary and asynchronously to the backup. Older closed orders are archived, and ZooKeeper controls traffic switching between the two clusters.
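The write path described above can be sketched as follows. This is a minimal illustration, not the team's implementation: in-memory dictionaries stand in for the two clusters, and a plain queue stands in for the asynchronous channel, which the source does not specify. `FakeCluster`, `PrimaryBackupWriter`, and `drain` are hypothetical names.

```python
import queue


class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, illustration only)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, order_id, doc):
        self.docs[order_id] = doc


class PrimaryBackupWriter:
    """Writes synchronously to the primary and queues the backup write."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.pending = queue.Queue()  # buffered writes destined for the backup

    def write(self, order_id, doc):
        # Synchronous: the caller waits until the primary has the document.
        self.primary.index(order_id, doc)
        # Asynchronous: the backup write is only queued at this point.
        self.pending.put((order_id, doc))

    def drain(self):
        """Apply queued writes to the backup; a real system would do this in a worker."""
        applied = 0
        while not self.pending.empty():
            order_id, doc = self.pending.get()
            self.backup.index(order_id, doc)
            applied += 1
        return applied
```

Until `drain` runs, the backup lags the primary, which is the trade-off an asynchronous backup write accepts in exchange for not slowing down the caller.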

5. Current Stage: Real‑Time Dual‑Cluster Backup

The primary cluster was upgraded from ES 1.7 to 6.x via index recreation. During upgrade, the backup acted as primary to ensure zero downtime. The backup now stores recent hot data (≈10 % of primary size) and serves most query traffic, while the primary holds the full historical dataset.

A one‑click failover mechanism allows either cluster to become primary; writes follow a dual‑write strategy with automatic role reversal on failure.
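One way to picture dual writes with automatic role reversal is the sketch below. The source does not describe the actual mechanism, so everything here is an assumption: `ClusterRoles` stands in for the shared switch, the cluster names are placeholders, and a `ConnectionError` is used as the failure signal.

```python
class ClusterRoles:
    """Tracks which cluster is primary (hypothetical stand-in for the shared switch)."""

    def __init__(self, primary="cluster_a", standby="cluster_b"):
        self.primary = primary
        self.standby = standby

    def failover(self):
        # Swap roles; either cluster can become the primary.
        self.primary, self.standby = self.standby, self.primary
        return self.primary


def dual_write(roles, writers, order_id, doc):
    """Write to both clusters; if the primary write fails, swap roles and retry.

    `writers` maps a cluster name to a callable(order_id, doc).
    Returns the name of the cluster that is primary after the write.
    """
    try:
        writers[roles.primary](order_id, doc)
    except ConnectionError:
        # Primary is unreachable: promote the standby and write there instead.
        roles.failover()
        writers[roles.primary](order_id, doc)
        return roles.primary
    try:
        writers[roles.standby](order_id, doc)
    except ConnectionError:
        # A failed standby write does not fail the request in this sketch;
        # it would be repaired later by a compensation process.
        pass
    return roles.primary
```

Calling `roles.failover()` directly corresponds to the manual, one-click switch; the `except` branch corresponds to the automatic one.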

ES Order Data Synchronization Scheme

Two approaches were considered:

1) Binlog listening – low coupling but adds a new service and risk.

2) Direct ES API writes – simple, flexible, but tightly coupled to business logic.

The team chose direct API writes, supplemented by a compensation worker that retries failed updates based on a remedial task table, ensuring eventual consistency.
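The compensation idea can be sketched as below. This is an illustration under stated assumptions, not the team's code: SQLite stands in for the database holding the remedial task table, and the table name `es_repair_task`, the `attempts` column, and the callables `es_index` and `load_order` are all hypothetical.

```python
import sqlite3


def init_db(conn):
    # Hypothetical remedial task table: one row per order that failed to sync.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS es_repair_task ("
        "order_id TEXT PRIMARY KEY, attempts INTEGER NOT NULL DEFAULT 0)"
    )


def write_order(conn, es_index, order_id, doc):
    """Try the direct ES write; on failure, record a remedial task."""
    try:
        es_index(order_id, doc)
        return True
    except Exception:
        conn.execute(
            "INSERT OR IGNORE INTO es_repair_task (order_id) VALUES (?)",
            (order_id,),
        )
        conn.commit()
        return False


def run_compensation(conn, es_index, load_order, max_attempts=5):
    """Retry pending tasks, re-reading each order from the source of truth."""
    rows = conn.execute(
        "SELECT order_id FROM es_repair_task WHERE attempts < ?",
        (max_attempts,),
    ).fetchall()
    repaired = 0
    for (order_id,) in rows:
        try:
            es_index(order_id, load_order(order_id))
            conn.execute(
                "DELETE FROM es_repair_task WHERE order_id = ?", (order_id,)
            )
            repaired += 1
        except Exception:
            conn.execute(
                "UPDATE es_repair_task SET attempts = attempts + 1 "
                "WHERE order_id = ?",
                (order_id,),
            )
    conn.commit()
    return repaired
```

Re-reading the order at retry time, rather than replaying the failed payload, means the worker always pushes the latest state, so repeated or out-of-order retries still converge.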

Pitfalls Encountered

1. High‑Real‑Time Queries Use DB

ES is near-real-time: a newly written document becomes searchable only after the next refresh, which by default happens once per second. Queries that must always see the latest data are therefore routed to MySQL.
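The toy model below illustrates why such queries go to the database. It is a simplified sketch, not how ES is implemented: `NearRealTimeIndex` just buffers writes until `refresh()` is called, and `get_order` with its `need_latest` flag is a hypothetical routing helper.

```python
class NearRealTimeIndex:
    """Toy model: indexed documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = {}      # written but not yet searchable
        self._searchable = {}  # visible to searches

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        # Make everything written so far visible to searches.
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search(self, doc_id):
        return self._searchable.get(doc_id)


def get_order(order_id, need_latest, db, es):
    """Route freshness-critical reads to the database, everything else to ES."""
    if need_latest:
        return db.get(order_id)
    return es.search(order_id)
```

A read issued between the write and the next refresh finds nothing in the index, while the database already has the row.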

2. Avoid Deep Pagination

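With from/size pagination, every shard must build a priority queue of from + size entries, and the coordinating node must merge the queues from all shards before discarding everything except the requested page. Deep pages therefore cost heavily in CPU, memory, and network bandwidth and should be avoided.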
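The cost can be made concrete with a small simulation. This is a toy model, assuming each shard is just a list of sortable values; `paged_search` is a hypothetical helper and does not call ES.

```python
import heapq


def paged_search(shards, from_, size):
    """Toy from/size search over shards that each hold sortable values.

    Returns the requested page and the number of hits the coordinating
    node had to collect in order to produce it.
    """
    window = from_ + size
    # Each shard must return its top (from + size) hits, not just `size`.
    per_shard_hits = [heapq.nsmallest(window, shard) for shard in shards]
    collected = sum(len(hits) for hits in per_shard_hits)
    # The coordinating node merges every shard's hits, then drops the first `from_`.
    merged = list(heapq.merge(*per_shard_hits))
    return merged[from_:from_ + size], collected
```

With 5 shards, fetching 10 hits at offset 0 collects 50 entries, while fetching the same 10 hits at offset 150 collects 800. The collected count grows with both the offset and the shard count, even though the page size never changes.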

3. FieldData vs. Doc Values

FieldData is built in the JVM heap and can cause out-of-memory errors or slowdowns. Doc Values, a columnar structure stored on disk, avoid these problems and have been enabled by default since ES 2.x for all fields except analyzed strings.
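As a rough illustration for the 6.x line, the sketch below shows the `properties` part of a mapping and a small check for fields that would fall back to heap-based fielddata. The field names are hypothetical and not taken from JD.com's schema; `doc_values` and `fielddata` are the standard mapping parameters.

```python
# Hypothetical field definitions for an order index (properties section only).
order_properties = {
    "order_id": {"type": "keyword"},  # doc values are on by default
    "created_at": {"type": "date"},   # doc values are on by default
    "remark": {"type": "text"},       # full-text field; fielddata is off by default
}


def heap_fielddata_fields(properties):
    """Return text fields that explicitly enable fielddata (and thus use JVM heap)."""
    return sorted(
        name
        for name, spec in properties.items()
        if spec.get("type") == "text" and spec.get("fielddata") is True
    )
```

Sorting or aggregating on a `keyword` or `date` field reads doc values from disk. Only a `text` field with `fielddata` explicitly set to true would load data into the heap, so a check like this can flag risky mappings before they are deployed.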

Conclusion

The rapid iteration of the architecture mirrors the fast growth of JD.com’s business. While there is no single “best” design, the current dual‑cluster, replica‑tuned, and version‑upgraded ES setup delivers higher throughput, better performance, and stronger stability, and will continue to evolve as demand rises.
