How JD Upgraded Its B‑Side Order Storage Architecture to Tackle Elasticsearch High‑Concurrency Pressure
Facing explosive merchant growth and soaring order volumes, JD redesigned its B‑side POP order storage by isolating large tenants, applying double‑hash routing, expanding clusters, buffering updates, and automating data archiving, ultimately delivering a high‑performance, scalable Elasticsearch platform that sustains massive traffic spikes.
System Overview
The POP order service provides merchant-side order queries (it serves merchants, not C-end consumers). It consists of a write service that consumes upstream MQ messages (order pipeline, binlog, OFW, reconciliation) and updates Elasticsearch (ES), and a read service that serves the order-list and order-detail APIs. During peak periods the write path handles up to 700,000 updates per minute and the read path up to 500,000 queries per minute.
Architecture layers:
Application layer – business processing.
Write service – consumes MQ, processes data, calls the proxy layer to update ES.
Read service – retrieves data from ES.
Proxy layer – routes orders to either the hot cluster or the archive cluster.
Storage layer – hot cluster (near‑real‑time data) and cold cluster (historical data).
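To make the read path concrete, here is a minimal sketch of an order-list query as the read service might issue it, assuming the Python elasticsearch client; the endpoint, index name (pop_orders), and field names are illustrative, not JD's actual schema:

```python
from elasticsearch import Elasticsearch

# Hypothetical gateway endpoint for the hot cluster.
es = Elasticsearch("http://es-hot-gateway:9200")

def list_orders(merchant_id: str, page: int = 0, page_size: int = 20):
    """Order-list query: routing by merchant ID sends the request to a
    single shard instead of fanning out across all of them."""
    return es.search(
        index="pop_orders",                           # hypothetical index name
        routing=merchant_id,                          # same key used at write time
        query={"term": {"merchant_id": merchant_id}},
        sort=[{"order_time": {"order": "desc"}}],
        from_=page * page_size,
        size=page_size,
    )
```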
Hot cluster details: 98 nodes (3 master, 10 gateway, 85 data) hosting 12 indices. Each of the eleven KA (large-merchant) indices has 1 primary shard plus 1 replica; the normal-merchant indices have 96 primary shards plus 1 replica. The cold cluster stores yearly indices (96 shards each) covering data from 2016 through 2024.
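The differentiated shard strategy could be expressed at index-creation time roughly as follows; the client, endpoint, and index names are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-hot-gateway:9200")    # hypothetical endpoint

# KA (large-merchant) index: 1 primary shard plus 1 replica.
es.indices.create(
    index="orders_ka_jingxi",                       # hypothetical index name
    settings={"number_of_shards": 1, "number_of_replicas": 1},
)

# Normal-merchant index: 96 primary shards plus 1 replica.
es.indices.create(
    index="orders_normal",
    settings={"number_of_shards": 96, "number_of_replicas": 1},
)
```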
Core Pain Points
Data skew: routing by merchant ID concentrates a few huge merchants (e.g., Jingxi) in single shards that grow to several terabytes. The largest shard is more than 5× the size of the smallest, causing severe performance volatility and disaster-level impact during failover.
Oversized shards: some shards have grown beyond 1 TB, far exceeding the ES-recommended 50–100 GB per shard (in the worst case ≈50× that limit).
Frequent updates: the order-ES update rate has risen to 300,000 operations per minute. Each incoming MQ message triggers at least one update, producing large numbers of version-conflict retries.
High maintenance cost: hot-to-archive migration must run twice a year. The process depends on the upstream DB's data distribution and involves heavy segment merges, making it slow and error-prone.
Solution for Data Skew
Provision dedicated clusters for large merchants. For each large merchant create an independent index with a shard count tuned to its volume. Add virtual routing logic in the proxy layer so that large‑merchant traffic is routed to the isolated cluster, eliminating cross‑merchant impact.
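A sketch of the virtual routing the proxy layer might apply; the tenant table and cluster names are hypothetical:

```python
# Hypothetical KA tenant registry; in practice this would be proxy-layer
# config/metadata, reloadable without a deploy.
KA_TENANTS = {
    "jingxi": "es-ka-cluster-1",    # dedicated cluster and index per KA merchant
}

DEFAULT_CLUSTER = "es-hot-cluster"

def route_tenant(merchant_id: str) -> str:
    """Send KA traffic to its isolated cluster so one huge merchant
    cannot degrade shards shared by normal merchants."""
    return KA_TENANTS.get(merchant_id, DEFAULT_CLUSTER)
```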
Solution for Oversized Shards
Expand the hot storage from a single ES cluster to three parallel clusters. The proxy layer hashes the merchant ID and distributes data evenly across the three clusters, keeping each shard within the recommended size range.
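This is the "double hash" from the summary: a sketch, with hypothetical cluster names, of how the proxy's first hash might pick a cluster, after which ES's own routing hash picks a shard within that cluster's index:

```python
import hashlib

HOT_CLUSTERS = ["es-hot-1", "es-hot-2", "es-hot-3"]     # hypothetical names

def pick_hot_cluster(merchant_id: str) -> str:
    """First hash: spread merchants evenly across the three hot clusters.
    md5 is used because Python's built-in hash() is not stable across
    processes; any stable hash function would do."""
    digest = int(hashlib.md5(merchant_id.encode("utf-8")).hexdigest(), 16)
    return HOT_CLUSTERS[digest % len(HOT_CLUSTERS)]

# Second hash: within the chosen cluster, writing and querying with
# routing=merchant_id lets ES map the document onto one of the 96 shards.
```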
Solution for Frequent Updates
Introduce a buffering “shield” that aggregates incoming messages before writing to ES. The shield reduces the number of ES update operations and consequently lowers version‑conflict retries.
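One plausible shape for such a shield, coalescing partial updates per order ID and flushing them in batches; a sketch only, assuming the Python client, and omitting the ordering and failure handling a production service would need:

```python
import time
from collections import OrderedDict

class UpdateShield:
    """Merge all pending partial updates for the same order, then issue
    one ES update per order instead of one per MQ message."""

    def __init__(self, es, index, flush_interval=1.0, max_pending=5000):
        self.es = es
        self.index = index
        self.flush_interval = flush_interval     # seconds between flushes
        self.max_pending = max_pending           # flush early if buffer fills
        self.pending = OrderedDict()             # order_id -> merged partial doc
        self.last_flush = time.monotonic()

    def on_message(self, order_id: str, partial_doc: dict) -> None:
        # Later fields overwrite earlier ones for the same order.
        self.pending.setdefault(order_id, {}).update(partial_doc)
        if (len(self.pending) >= self.max_pending
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self) -> None:
        for order_id, doc in self.pending.items():
            # One coalesced update; retries absorb residual version conflicts.
            self.es.update(index=self.index, id=order_id,
                           doc=doc, retry_on_conflict=3)
        self.pending.clear()
        self.last_flush = time.monotonic()
```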
Solution for Maintenance Automation
Phase 1 – reindex plus archive-message replay: migrate data with a reindex job, then replay archive messages to catch up. This cut migration time from ~20 days to ~10 days.
Phase 2 – daily archive cache: create a daily archive ES index that stores the day's order changes. A scheduled job reads the cache, compares it with the hot data, writes to the archive cluster, and deletes the original hot records, achieving fully automated migration without manual intervention.
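A sketch of what the Phase 2 scheduled job might look like; the index naming scheme, the completion criterion, and the flat (unpaginated) read are all simplifying assumptions:

```python
from datetime import date, timedelta

def archive_job(hot_es, cold_es):
    """Nightly: read yesterday's change log from the daily archive cache,
    copy terminal orders to the cold cluster, delete them from hot."""
    day = (date.today() - timedelta(days=1)).isoformat()
    cache_index = f"archive-cache-{day}"         # hypothetical naming scheme

    resp = hot_es.search(index=cache_index, query={"match_all": {}}, size=1000)
    for hit in resp["hits"]["hits"]:
        order = hit["_source"]
        # Compare with hot data; archive only orders in a terminal state.
        if order.get("order_status") == "COMPLETED":
            cold_es.index(index=f"orders-{day[:4]}",   # yearly cold index
                          id=hit["_id"], document=order)
            hot_es.delete(index="pop_orders", id=hit["_id"])
```

A real job would page through the cache with search_after or a point-in-time rather than a single capped search.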
Final Architecture
The upgraded platform combines tenant‑level isolation, double‑hash routing, differentiated shard strategies, and a dual‑active physical foundation to deliver a high‑performance, highly‑scalable, highly‑available enterprise‑grade order search and analysis system.
Appendix – Challenges of Ultra‑Large Clusters
When node count exceeds 100, ES’s Cluster State broadcast becomes a bottleneck:
Network storm – the master must push state updates to all nodes, saturating bandwidth and CPU.
Slow convergence – the cluster must wait for acknowledgments from the majority of nodes; “slow nodes” delay state changes.
Full GC risk – the Cluster State object can grow to dozens or hundreds of megabytes, increasing the chance of a stop‑the‑world Full GC on the master.
The ES update mechanism (a consequence of Lucene's immutable segments) follows a three-step "read, soft-delete, insert" flow:
Read: fetch the old document (or its _source for partial updates).
Soft delete: mark the old document as deleted in its existing segment (.del tombstone).
Insert: write the modified document as a new document in a new segment, with a new _version.
This immutable update pattern generates heavy disk I/O, CPU cost for re-analyzing documents, and frequent segment merges. The resulting pressures are:
Disk I/O & segment-merge storms: continuous merges can saturate IOPS, trigger write throttling, and cause write rejections.
Query degradation due to tombstones: deleted docs remain in segments until merged, increasing scan volume, memory for BitSet tombstone tracking, and query latency.
Cache invalidation: each new segment invalidates the filter cache, preventing cache warm-up and forcing disk reads.
GC pressure: update processing creates many temporary objects; combined with merge-induced heap growth it can trigger frequent Young GC and occasional Full GC pauses.
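On the write path, this flow is why the version-conflict retries mentioned under the pain points appear at all: every concurrent partial update replays the full read-soft-delete-insert cycle. A minimal sketch of absorbing those conflicts with the Python client (endpoint, index, and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-hot-gateway:9200")    # hypothetical endpoint

# Even a one-field change rewrites the whole document internally: ES reads
# _source, applies the patch, soft-deletes the old doc, and indexes the
# result as a new document with a new _version.
es.update(
    index="pop_orders",
    id="order-123",
    doc={"order_status": "SHIPPED"},
    retry_on_conflict=3,   # rerun the read-modify-write on version conflict
)
```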
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.