
Evolution of JD.com Order Center Elasticsearch Cluster Architecture

This article details how JD.com’s order center migrated its Elasticsearch cluster through multiple stages—from an initial unoptimized deployment to a real‑time dual‑cluster backup solution—addressing scalability, reliability, shard tuning, version upgrades, and data synchronization strategies to support billions of documents and hundreds of millions of daily queries.

JD Tech

Background

In JD.com’s order center, order queries vastly outnumber writes, producing a read‑heavy, write‑light workload. MySQL alone cannot sustain the query load and supports complex queries poorly, so Elasticsearch (ES) was introduced to absorb the bulk of the query pressure.

Elasticsearch, a powerful distributed search engine, now stores over 1 billion documents with a daily query volume of 500 million. The ES cluster architecture has evolved through several phases to ensure stable read/write performance.

ES Cluster Evolution

1. Initial Stage

The cluster was initially deployed on an elastic cloud platform with default settings, resulting in a disorganized node layout and single points of failure.

2. Cluster Isolation Stage

Resource contention caused service instability; high‑resource‑consuming nodes were migrated off the elastic cloud, and eventually the cluster was moved to dedicated high‑spec physical machines for better isolation and performance.

3. Node Replica Tuning Stage

To fully utilize hardware, each ES node was placed on a separate physical machine. Replica count was increased from one primary‑one replica to one primary‑two replicas, adding more physical machines and improving throughput.
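The throughput effect of the replica change can be sketched with simple arithmetic, since each replica is a full extra copy of every primary shard that can serve reads. The settings body below follows the standard ES `_settings` API; the shard counts are illustrative, not JD’s actual configuration:

```python
import json

def replica_settings(replicas: int) -> str:
    """Build the body for PUT /<index>/_settings to change the replica count."""
    return json.dumps({"index": {"number_of_replicas": replicas}})

def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard keeps `replicas` additional copies that can serve reads."""
    return primaries * (1 + replicas)

# Moving from one primary + one replica to one primary + two replicas
# adds one more read-capable copy per primary shard group:
print(total_shard_copies(primaries=5, replicas=1))  # 10 copies
print(total_shard_copies(primaries=5, replicas=2))  # 15 copies
```

Raising `number_of_replicas` on a live index only works if enough physical machines exist to host the extra copies, which is why the replica bump came with additional hardware.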

4. Primary‑Backup Adjustment Stage

A standby cluster was introduced for high availability. Data is written synchronously to the primary cluster and asynchronously to the backup cluster; older orders are archived to reduce backup size. ZooKeeper controls traffic switching between clusters.
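The ZooKeeper-based traffic switch can be sketched as a watched flag naming the active cluster: operators flip the flag, a watch fires, and clients reroute without redeploying. The class below is a minimal stand-in; in production the flag would live in a ZooKeeper znode and `on_znode_change` would be the watch callback, and all names here are hypothetical:

```python
class TrafficSwitch:
    """Minimal sketch of ZooKeeper-style cluster switching.

    In production the active-cluster flag lives in a znode and a watch
    invokes on_znode_change(); here a plain attribute stands in for it.
    """

    def __init__(self, primary: str = "es-primary", backup: str = "es-backup"):
        self.clusters = {"primary": primary, "backup": backup}
        self.active = "primary"  # value the znode would hold

    def on_znode_change(self, new_value: str) -> None:
        # Callback a ZooKeeper watch would invoke when the flag changes.
        if new_value in self.clusters:
            self.active = new_value

    def route(self) -> str:
        """Return the cluster that should receive traffic right now."""
        return self.clusters[self.active]

switch = TrafficSwitch()
print(switch.route())             # es-primary
switch.on_znode_change("backup")  # e.g. flipped during a primary outage
print(switch.route())             # es-backup
```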

5. Current Real‑Time Dual‑Cluster Stage

The primary cluster was upgraded from ES 1.7 to 6.x via index rebuilding. During upgrades, the backup cluster temporarily serves all queries. The backup now stores hot recent data (≈10% of primary size) and handles most query traffic, while the primary stores cold full‑history data.
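The hot/cold split implies a routing rule based on order age: recent orders go to the small hot backup cluster, older ones to the full-history primary. A minimal sketch of that rule follows; the 90-day window and cluster names are assumptions for illustration (the article states only that the hot cluster holds roughly 10% of the data):

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 90  # assumed retention window for the hot backup cluster

def pick_cluster(order_date: date, today: date) -> str:
    """Route recent orders to the hot backup cluster, the rest to cold primary."""
    if today - order_date <= timedelta(days=HOT_WINDOW_DAYS):
        return "hot-backup"
    return "cold-primary"

today = date(2020, 6, 1)
print(pick_cluster(date(2020, 5, 20), today))  # hot-backup
print(pick_cluster(date(2018, 1, 1), today))   # cold-primary
```

Because most order lookups target recent orders, this rule alone sends the bulk of query traffic to the much smaller hot cluster.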

ES Data Synchronization

Two approaches were considered: (1) listening to MySQL binlog and syncing to ES (low coupling but adds a service and risk), and (2) directly writing to ES via its API (tight coupling but simpler). The order center chose the API method for real‑time needs, supplemented by a compensation worker that retries failed writes based on database records.
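The write-plus-compensation pattern can be sketched as follows: the synchronous path writes to ES directly, and failures are queued for a worker that re-reads the authoritative row from the database and retries. All function names are hypothetical, and a flaky writer simulates an ES outage:

```python
def write_order(es_write, order, failed_queue):
    """Synchronous write to ES; on failure, queue the id for compensation."""
    try:
        es_write(order)
        return True
    except Exception:
        failed_queue.append(order["id"])
        return False

def compensate(es_write, load_order, failed_queue, max_retries=3):
    """Replay failed writes, re-reading authoritative data (e.g. from MySQL)."""
    still_failed = []
    for order_id in failed_queue:
        order = load_order(order_id)  # fetch the fresh row of record
        for _ in range(max_retries):
            try:
                es_write(order)
                break
            except Exception:
                continue
        else:
            still_failed.append(order_id)
    return still_failed

calls = {"n": 0}
def flaky_write(order):
    calls["n"] += 1
    if calls["n"] == 1:  # first attempt fails, later ones succeed
        raise ConnectionError("ES unavailable")

failed = []
write_order(flaky_write, {"id": 42}, failed)
print(failed)                                                # [42]
print(compensate(flaky_write, lambda i: {"id": i}, failed))  # []
```

Re-reading from the database at compensation time, rather than replaying the original payload, ensures the retried write carries the latest order state.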

Challenges Encountered

1. Queries requiring strong real‑time consistency are routed to MySQL instead, because ES’s refresh interval (≈1 s) means just‑written data may not yet be searchable.

2. Deep pagination is avoided, because a large `from` value forces every shard to build and sort `from + size` results, causing heavy memory and CPU usage across the cluster.

3. Fielddata caused occasional timeouts due to JVM heap pressure; switching to doc values (on‑disk, column‑oriented storage) resolved the issue.
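The deep-pagination cost in item 2 can be made concrete, and `search_after` (available in the ES 6.x the cluster upgraded to) is the usual cursor-based alternative. The field names and cursor values below are hypothetical:

```python
# Why deep `from` is expensive: each shard must produce from+size candidates,
# which the coordinating node then merges and mostly discards.
def docs_fetched_from_shards(shards: int, from_: int, size: int) -> int:
    return shards * (from_ + size)

# Page 1000 (from=10000, size=10) on a 5-shard index:
print(docs_fetched_from_shards(shards=5, from_=10000, size=10))  # 50050

# search_after sketch: page with the sort values of the last hit
# instead of an offset, so each shard fetches only `size` candidates.
page_body = {
    "size": 10,
    "sort": [{"order_time": "desc"}, {"_id": "asc"}],
    "search_after": ["2020-06-01T00:00:00", "order#12345"],
}
```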

Summary

The rapid architectural iterations were driven by fast business growth; each optimization improved throughput, performance, and stability, and future upgrades will continue to evolve the system to meet increasing demands.

Tags: Search Engine, Elasticsearch, Data Synchronization, JD.com, Cluster Architecture, Backend Scaling
Written by JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.