
Evolution of JD.com Order Center Elasticsearch Cluster Architecture

This article details how JD.com's order center migrated its Elasticsearch cluster from a simple, default‑configured setup to a highly available, multi‑replica, dual‑cluster architecture with version upgrades, data synchronization strategies, and performance optimizations to support billions of documents and hundreds of millions of daily queries.

Big Data Technology & Architecture

Sep 11, 2020


In JD.com’s order center, the volume of order queries is extremely high, leading to a read‑heavy, write‑light workload that cannot be efficiently served by MySQL alone.

To alleviate this, the system adopted Elasticsearch as the primary engine for order queries, handling over 1 billion documents and 500 million daily queries.

1. Initial Stage: The cluster was deployed on the elastic cloud with default settings, mixed node placement, and a single point of failure.

2. Isolation Stage: To avoid resource contention with other services, the cluster was moved from the elastic cloud to dedicated high-performance physical machines.

3. Replica Tuning Stage: Each node was placed on its own physical machine, and each shard went from 1 primary + 1 replica to 1 primary + 2 replicas. Because queries are load-balanced across all copies of a shard, the extra replica raised query throughput.

4. Master-Slave Adjustment Stage: A standby cluster was introduced. Data is written synchronously to the primary cluster and asynchronously to the standby, and Zookeeper controls traffic switching so that queries can fail over to the standby.

5. Current Real-Time Dual-Cluster Stage: The primary cluster was upgraded from ES 1.7 to ES 6.x, which required recreating the indices. During upgrades, the standby cluster temporarily serves as the primary so that queries see no downtime.
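The write path and traffic switch introduced in stage 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not JD.com's implementation: `InMemoryCluster` is a hypothetical stand-in for an Elasticsearch client, the `active` attribute stands in for the switch that the article says is held in Zookeeper, and a thread pool stands in for whatever asynchronous channel the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryCluster:
    """Hypothetical stand-in for an Elasticsearch cluster client."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def get(self, doc_id):
        return self.docs.get(doc_id)


class DualClusterRouter:
    """Writes synchronously to the primary and asynchronously to the standby."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        # Assumption: in production this flag would be read from Zookeeper.
        self.active = "primary"
        self._pool = ThreadPoolExecutor(max_workers=2)

    def write(self, doc_id, doc):
        # Synchronous write: returns only after the primary has the document.
        self.primary.index(doc_id, doc)
        # Asynchronous copy to the standby; the caller gets a Future back.
        return self._pool.submit(self.standby.index, doc_id, doc)

    def switch_to(self, name):
        if name not in ("primary", "standby"):
            raise ValueError(f"unknown cluster: {name}")
        self.active = name

    def read(self, doc_id):
        cluster = self.primary if self.active == "primary" else self.standby
        return cluster.get(doc_id)

    def close(self):
        self._pool.shutdown(wait=True)
```

Because reads consult only the `active` flag, flipping it is enough to move query traffic from one cluster to the other without touching the write path.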

The standby cluster stores recent hot data (≈10% of primary size) and handles most query traffic, while the primary cluster stores the full dataset for cold or special queries.
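One way to picture the hot/cold split is a small routing function that picks a cluster from the age of the order being queried. This is only a sketch: the article says the standby holds "recent" data without giving a window, so the 7-day `HOT_WINDOW` and the function name `choose_cluster` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the source only says "recent" data; 7 days is a placeholder.
HOT_WINDOW = timedelta(days=7)


def choose_cluster(order_created_at, now, hot_window=HOT_WINDOW):
    """Return the name of the cluster that should serve a query for this order."""
    if now - order_created_at <= hot_window:
        return "standby"  # small hot dataset, takes most of the traffic
    return "primary"      # full dataset, for cold or special queries
```

For example, `choose_cluster(datetime(2020, 9, 10), datetime(2020, 9, 11))` selects the standby cluster, while an order created months earlier is routed to the primary.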

Data Synchronization: Two approaches were considered, MySQL binlog listening and direct ES API writes. The system chose direct API writes for simplicity and low latency, supplemented by a compensation task that retries failed writes based on database records.
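The direct-write-plus-compensation pattern might look like the sketch below. The function names, the `pending` set used to track failed orders, and the `es_index` and `load_from_db` callables are all hypothetical; the article does not describe how failed writes are recorded, only that the retry is driven by database records.

```python
def write_order(order_id, doc, es_index, pending):
    """Try the direct ES write; record the order for compensation on failure."""
    try:
        es_index(order_id, doc)
    except Exception:
        # Assumption: failures are tracked per order so a later job can retry.
        pending.add(order_id)
        return False
    return True


def compensate(pending, load_from_db, es_index):
    """Retry every pending order, using the database as the source of truth."""
    for order_id in sorted(pending):  # sorted() copies, so mutation is safe
        try:
            es_index(order_id, load_from_db(order_id))
        except Exception:
            continue  # still failing; leave it for the next run
        pending.discard(order_id)
    return pending
```

Reloading the document from the database on each retry, instead of replaying the payload that failed, means the compensation task always indexes the latest state of the order.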

Pitfalls Encountered:

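Queries with strict real-time requirements still go to the database. ES is only near-real-time: a newly written document becomes searchable after the next index refresh, so it cannot guarantee freshness.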

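Deep pagination (a large from value) is expensive: every shard must collect its top from + size hits and the coordinating node must merge them all, so the cost grows with page depth. Avoid deep pagination where possible.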

FieldData is held on the JVM heap and can cause OOM errors during sorting; switching to Doc Values (the default since ES 2.x), which are stored on disk, resolves this.
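For the deep-pagination pitfall above, the sketch below estimates how many entries a from/size page forces the coordinating node to merge, and builds a request body that pages with `search_after` instead (available from ES 5.0, so usable on the upgraded 6.x cluster). The article only recommends avoiding deep pagination and does not say which alternative was used, so `search_after` is offered here as one option; the field names `order_time` and `order_id` are hypothetical.

```python
def shard_docs_collected(from_, size, num_shards):
    """Entries the coordinating node must merge for one from/size page."""
    return (from_ + size) * num_shards


def next_page_body(size, sort, last_sort_values=None):
    """Build a search body that pages with search_after instead of from."""
    body = {"size": size, "sort": sort}
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(last_sort_values)
    return body


# Hypothetical sort: newest orders first, order_id as a tiebreaker.
ORDER_SORT = [{"order_time": "desc"}, {"order_id": "asc"}]
```

With `from_=10000`, `size=10`, and 5 shards, `shard_docs_collected` returns 50050, whereas a `search_after` request only ever asks each shard for `size` hits, however deep the page is.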

Overall, the architecture evolved rapidly to meet business growth, continuously improving throughput, stability, and scalability.




Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
