Engineering Wisdom Behind High‑Availability Architecture for E‑Commerce Storage Layers
The article analyzes how to design a high‑availability architecture for large‑scale e‑commerce systems, detailing layered risk isolation, stateful storage strategies for flow and state data, unified document‑ID routing, multi‑replica databases, multi‑datacenter synchronization, and real‑world JD case studies that demonstrate elastic scaling and disaster recovery.
This article examines the design of a high‑availability (HA) architecture for e‑commerce systems, emphasizing the construction of HA for the stateful storage layer.
HA Architecture Paradigm
The core goal of HA is to keep services running despite hardware failures, software bugs, or network interruptions, minimizing downtime while preserving business continuity and data consistency. Achieving this typically involves layered risk isolation, redundant data copies for disaster recovery, and failover mechanisms.
Layered System Overview
Frontend layer: Uses CDN or edge caching to serve static, stateless resources; redundancy improves both performance and disaster recovery.
Gateway layer: Provides load balancing and request forwarding. It is stateless and needs rate limiting and circuit breaking to prevent cascading failures.
Service layer: A micro‑service architecture with multiple instances per service; services communicate synchronously or asynchronously and remain stateless.
Storage layer: Supplies relational, NoSQL, and search storage; employs sharding for throughput and master‑slave replication for disaster recovery. It is the only stateful layer and must address throughput, read/write performance, node‑failure isolation, replication lag, and rapid backup recovery.
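The circuit breaking mentioned for the gateway layer can be illustrated with a minimal sketch. The class name CircuitBreaker, its thresholds, and the injectable clock are assumptions made for illustration; the article does not describe JD's gateway implementation.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior is testable
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling backend.
                raise RuntimeError("circuit open: request rejected")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers are rejected immediately, which keeps a failing downstream service from dragging down the gateway and everything behind it.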
Characteristics of E‑Commerce Business Data
E‑commerce generates two main data categories:
Document‑type (flow) data: Orders, payment records, logistics records, etc. These are generated sequentially without inter‑record dependencies, forming a high‑throughput, flow‑type workload.
State data: User profiles, product information, inventory, coupons, etc. These are read‑heavy with occasional writes that must be strongly consistent.
Comparing the two categories: document data is write‑dominant with a high creation‑to‑update ratio, while state data is read‑dominant and requires strong consistency in certain business scenarios.
3.1 Flow‑Data HA Upgrade
The primary objective is storage scaling that is transparent to the business, plus unified disaster recovery across the entire request chain. Because flow records do not depend on one another, new records can be written directly to a newly provisioned database when capacity runs short or a failure occurs, enabling seamless scaling or failover.
Two key upgrades are required:
Unified document‑ID generation rule: Each document receives a unique ID that embeds routing information indicating the target database.
Routing databases based on document ID: The embedded routing info directs the record to the appropriate storage node.
During runtime, the system can dynamically change the ID generation strategy to route new flow records to a new database, achieving elastic scaling and disaster recovery without affecting the business.
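The two upgrades and the runtime strategy change can be sketched as follows. The ID layout (a 3‑digit database code, a millisecond timestamp, and a 4‑digit sequence) and the names DocIdGenerator, set_active_dbs, and route are hypothetical; the article does not specify JD's actual ID format.

```python
import itertools
import time


class DocIdGenerator:
    """Hypothetical IDs: <db_code: 3 digits><epoch millis: 13 digits><sequence: 4 digits>."""

    def __init__(self, active_dbs):
        self.active_dbs = list(active_dbs)  # DB codes that accept new records
        self._round_robin = itertools.count()
        self._sequence = itertools.count()

    def set_active_dbs(self, active_dbs):
        """Change the generation strategy at runtime (scale-out or failover)."""
        self.active_dbs = list(active_dbs)

    def next_id(self):
        # Spread new records over the currently active databases.
        db_code = self.active_dbs[next(self._round_robin) % len(self.active_dbs)]
        # Single-process sketch: the sequence disambiguates IDs within a millisecond.
        seq = next(self._sequence) % 10000
        millis = int(time.time() * 1000)
        return f"{db_code:03d}{millis:013d}{seq:04d}"


def route(doc_id):
    """Return the database code embedded in the ID; no lookup table is needed."""
    return int(doc_id[:3])


gen = DocIdGenerator(active_dbs=[1, 2])
old_id = gen.next_id()
gen.set_active_dbs([3])      # e.g. DB 1 and DB 2 are full or unhealthy
new_id = gen.next_id()
```

Existing IDs keep routing to the database that holds them, while every ID issued after the strategy change points at the new database. That property is what lets capacity be added without migrating old records.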
3.2 State‑Data HA Exploration
State data is divided into two sub‑categories:
Read‑many‑write‑few data (e.g., product, inventory, user info): Implement a one‑write‑many‑read architecture where writes go to the database and reads are served primarily from cache with real‑time synchronization. Cache node failures trigger master‑slave failover; database failures trigger master‑slave switch.
Strongly consistent read‑write data (e.g., coupons, red packets): Require both reads and writes to be strongly consistent. Use sharding plus isolation of undecided data, with master‑slave replication and black‑list routing to avoid dirty writes during failover.
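One way to read the "isolation of undecided data" and black‑list routing for strongly consistent data is sketched below. This is an interpretation, not the article's stated design: it assumes that keys whose latest writes may not have reached the new master are black‑listed during failover and rejected until reconciled. CouponStore and its methods are hypothetical names.

```python
import zlib


class CouponStore:
    """Hypothetical sharded store with a blacklist for undecided keys."""

    def __init__(self, shard_count=4):
        self.shards = [dict() for _ in range(shard_count)]
        self.blacklist = set()

    def _shard(self, key):
        # Stable hash so a key always maps to the same shard.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def isolate(self, undecided_keys):
        """Called at failover: keys whose replication state is unknown."""
        self.blacklist.update(undecided_keys)

    def release(self, key):
        """Called once the key has been reconciled on the new master."""
        self.blacklist.discard(key)

    def write(self, key, value):
        if key in self.blacklist:
            # Rejecting is safer than overwriting a possibly stale value.
            raise PermissionError(f"{key} is isolated during failover")
        self._shard(key)[key] = value

    def read(self, key):
        if key in self.blacklist:
            raise PermissionError(f"{key} is isolated during failover")
        return self._shard(key)[key]
```

Under this assumption, only the small set of undecided keys becomes temporarily unavailable; all other keys continue to be served from the new master.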
3.3 Database Multi‑Replica HA Assurance
Each storage node employs a three‑replica (one master, two slaves) setup distributed across three availability zones for risk isolation. Replication uses semi‑synchronous mode: a transaction is considered committed only after the log reaches at least one slave, ensuring data durability even if the master fails. HA components monitor health and perform rapid master election and topology reconstruction to achieve sub‑second failover.
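The semi‑synchronous commit rule and the election step can be modeled in a few lines. This is a toy in‑memory simulation, not a database client; Replica, Master, and elect_new_master are hypothetical names, and choosing the reachable slave with the longest log is an assumed election policy.

```python
class Replica:
    """Toy slave: acknowledges a log entry only while reachable."""

    def __init__(self, name, zone, reachable=True):
        self.name = name
        self.zone = zone
        self.reachable = reachable
        self.log = []

    def append(self, entry):
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


class Master:
    """Toy master: a commit needs an acknowledgement from at least one slave."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []

    def commit(self, entry):
        acks = sum(1 for replica in self.replicas if replica.append(entry))
        if acks < 1:
            # No slave holds the entry, so it would be lost if the master died.
            raise RuntimeError("commit rejected: no slave acknowledged the log")
        self.log.append(entry)
        return acks


def elect_new_master(replicas):
    """Assumed policy: promote the reachable slave with the most log entries."""
    candidates = [r for r in replicas if r.reachable]
    return max(candidates, key=lambda r: len(r.log))
```

Because every committed entry exists on at least one slave, promoting the most advanced reachable slave preserves all committed transactions in this model.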
4 Multi‑Datacenter HA Construction
JD’s data centers in Beijing and Suqian operate a multi‑active architecture. Logical units are split by user dimension and placed in both sites. Traffic is fully converged within each site; inter‑site traffic is switched at the logical‑unit level.
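Splitting logical units by user and switching traffic per unit might look like the sketch below. The unit count, the CRC‑based user‑to‑unit mapping, and the UnitRouter name are assumptions for illustration; only the two site names come from the article.

```python
import zlib

SITES = ("beijing", "suqian")


class UnitRouter:
    """Hypothetical router: user -> logical unit -> site."""

    def __init__(self, unit_count=8):
        # Initially alternate units between the two sites.
        self.unit_site = {unit: SITES[unit % 2] for unit in range(unit_count)}

    def unit_for(self, user_id):
        # A stable hash keeps all of a user's traffic inside one unit.
        return zlib.crc32(user_id.encode("utf-8")) % len(self.unit_site)

    def site_for(self, user_id):
        return self.unit_site[self.unit_for(user_id)]

    def switch_unit(self, unit, target_site):
        """Move a single logical unit to another site; other units are untouched."""
        if target_site not in SITES:
            raise ValueError(f"unknown site: {target_site}")
        self.unit_site[unit] = target_site
```

Switching at unit granularity limits the blast radius of a cross‑site move: only the users mapped to the switched unit change site.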
Challenges include network latency causing replication delay and the risk of data loss or service interruption during cross‑site switchover. The solution routes newly created flow records to a brand‑new database in the target site, which avoids waiting for cross‑city replication and guarantees 100% business continuity for new records. Updates to existing data make up less than 10% of traffic and are exposed only to replication latency on the order of seconds, so their impact on continuity is minimal.
5 Business HA Architecture Upgrade Cases
5.1 Delivery System Database Upgrade
In 2025 JD expanded into food delivery. The original system had a single‑point storage bottleneck. By redesigning document IDs and routing, and applying dual‑write with gray‑release, the storage was migrated to a distributed architecture within one month, achieving elastic scaling and disaster recovery without business impact.
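Dual‑write combined with a gray release can be sketched as below. The article does not describe how JD implemented it; this sketch assumes that every write goes to both stores while a configurable percentage of reads is served from the new store. DualWriteMigrator and gray_percent are hypothetical names, and plain dictionaries stand in for the databases.

```python
import zlib


class DualWriteMigrator:
    """Hypothetical migration shim: dual-write, percentage-based gray reads."""

    def __init__(self, old_store, new_store, gray_percent=0):
        self.old_store = old_store
        self.new_store = new_store
        self.gray_percent = gray_percent  # 0..100, raised step by step

    def _in_gray(self, doc_id):
        # A stable hash bucket keeps each document on the same side of the rollout.
        return zlib.crc32(doc_id.encode("utf-8")) % 100 < self.gray_percent

    def write(self, doc_id, record):
        # Writes go to both stores so either one can serve reads.
        self.old_store[doc_id] = record
        self.new_store[doc_id] = record

    def read(self, doc_id):
        if self._in_gray(doc_id) and doc_id in self.new_store:
            return self.new_store[doc_id]
        # Fall back to the old store, which still holds every record.
        return self.old_store[doc_id]
```

Raising gray_percent gradually moves read traffic to the new store, and setting it back to 0 is an immediate rollback because the old store never stops receiving writes.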
5.2 Core‑Link Document Data Unified Upgrade
Also in 2025, JD upgraded the core‑link document data architecture, unifying the document‑ID generation and routing rules, which enabled elastic expansion and unified disaster recovery for all core‑link services.
5.3 Payment System Multi‑Active Deployment
The payment system, with strict financial HA requirements, adopted the same flow‑data routing mechanism to achieve RPO = 0 and RTO < 10 s across Beijing and Suqian. New flow records are routed to a fresh database, ensuring uninterrupted service and zero data loss during cross‑city switchover.
Conclusion
In a full‑link HA architecture, the storage layer is the sole stateful component and thus determines overall business continuity, data reliability, and scalability. By applying unified document‑ID generation, ID‑based database routing, read‑many‑write‑few caching for state data, and multi‑replica, multi‑datacenter designs, JD has built a highly extensible, highly available distributed active‑active system.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
