Backend Development 13 min read

How JD.com Scaled Its Order Search with Elasticsearch: Lessons from a Billion-Document Cluster

This article details JD.com’s journey of evolving the Elasticsearch cluster for its order system—from a chaotic initial deployment to a real‑time dual‑cluster architecture—highlighting scaling strategies, data sync methods, and the performance pitfalls they overcame.

Java Backend Technology

Aug 27, 2020

How JD.com Scaled Its Order Search with Elasticsearch: Lessons from a Billion-Document Cluster

ES Cluster Architecture Evolution

In JD Daojia’s order center, both external merchant order creation and internal system dependencies generate massive query traffic, leading to a read‑heavy workload on MySQL. To avoid overloading MySQL and to support complex queries, the system adopted Elasticsearch as the primary query engine.

Elasticsearch now stores over 1 billion documents and handles 500 million queries per day. As business grew, the ES architecture evolved through several stages.

1. Initial Stage

The cluster was initially deployed on elastic cloud with default settings, resulting in a single‑point‑of‑failure design and a chaotic node layout.

2. Cluster Isolation Stage

Mixed‑tenant deployments caused resource contention. High‑resource‑consuming nodes were moved off the elastic cloud, and eventually the cluster was migrated to dedicated high‑spec physical machines for better stability.

3. Node Replica Optimization

To maximize hardware utilization, each ES node was placed on its own physical machine. The replica factor was increased from 1 primary + 1 replica to 1 primary + 2 replicas, and additional machines were added to boost throughput.

4. Primary‑Secondary Cluster Adjustment

A standby cluster was introduced to ensure continuity during primary failures. Data is written synchronously to the primary and asynchronously to the standby. Historical orders are archived, and the standby stores only recent hot data, while the primary holds the full dataset.

5. Current Real‑Time Dual‑Cluster Stage

The primary cluster was upgraded from ES 1.7 to ES 6.x, requiring index recreation. During upgrades, the standby temporarily assumes the primary role to avoid service disruption. The standby now handles hot‑data queries, while the primary serves cold‑data and full‑search scenarios.

ES Order Data Synchronization

Two synchronization approaches were considered:

Listening to MySQL binlog and pushing changes to ES (asynchronous, higher latency).

Directly writing to ES via its API (synchronous, lower latency).

The API‑based method was chosen for its simplicity and real‑time characteristics. A compensation mechanism records failed writes in MySQL; a worker later retries to ensure eventual consistency.

Common Pitfalls

1. High‑Latency Queries Should Use DB

Because ES refreshes shards every second, near‑real‑time visibility may not meet strict latency requirements; critical queries are therefore routed to MySQL.

2. Avoid Deep Pagination

Deep pagination (large from values) forces each shard to build large priority queues, consuming CPU and memory; it should be avoided.

3. FieldData vs. Doc Values

Sorting on older ES versions used FieldData, which occupies JVM heap and can cause OOM. Switching to Doc Values stores sorting data on disk, improving stability.

Summary

The rapid iteration of the architecture mirrors JD Daojia’s fast business growth. Continuous optimization—moving from a single cluster to a real‑time dual‑cluster, adjusting replica counts, upgrading versions, and refining sync mechanisms—has dramatically increased throughput, performance, and stability, and the journey will continue as data volumes expand.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Elasticsearch High Availability Cluster Architecture backend scaling

Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.