
Designing Suning’s Multi‑Data‑Center Active‑Active Architecture for Scalable E‑Commerce

Suning built a multi‑data‑center active‑active solution to support massive e‑commerce growth. This article compares primary‑backup, same‑city active‑active, and full multi‑active modes; defines the top‑level design goals, values, and principles; and walks through the resulting architecture, routing, high‑availability, hybrid‑cloud, and disaster‑recovery strategies.


1. Problem Overview

Suning’s offline and online businesses, together with its all‑industry, all‑format models, have grown rapidly. During major sales events such as the 818 promotion and Double 11, order volumes multiply and demand massive resource expansion. A single data center can no longer meet capacity or high‑availability needs: any failure would disrupt services and user access.

2. Solution Options

Primary‑Backup mode

Same‑city active‑active

Multi‑active mode

Key Concepts

Cell: the smallest self‑contained unit in which a business transaction can be completed end to end, partitioned by dimensions such as member or store.

LDC (Logical Data Center): a collection of cells with independent middleware (RPC, MQ, DNS, etc.) and network exits.

PDC (Physical Data Center): a physically independent building containing racks and thousands of servers.

AZ (Availability Zone): an isolated fault domain with independent network and power, composed of one or more PDCs.

Region: a set of multiple AZs with complete fault isolation between regions.
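Read bottom‑up, these concepts nest: cells sit inside LDCs, PDCs inside AZs, and AZs inside regions. A minimal containment sketch (class and field names are illustrative, not Suning's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    name: str  # e.g. a member-ID range such as "cell-00"

@dataclass
class LDC:
    name: str
    cells: list = field(default_factory=list)  # cells sharing middleware and a network exit

@dataclass
class AZ:
    name: str
    pdcs: list = field(default_factory=list)   # physical buildings in this fault domain

@dataclass
class Region:
    name: str
    azs: list = field(default_factory=list)    # AZs; regions are fully fault-isolated
```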

1. Primary‑Backup Mode

The primary data center provides services while the backup does not; when the primary fails, the backup takes over.

2. Same‑City Active‑Active

A single cluster spans two different AZs in the same city, both providing services simultaneously and allowing cross‑data‑center access to different services and databases.

3. Multi‑Active Mode

Multiple data centers provide services concurrently; each business request is converged to a single data center where possible, and if one center fails, the others take over.

4. Solution Comparison

Considering Suning’s online/offline transaction and payment characteristics and the need for geographically distributed data centers, the multi‑active mode was selected after technical evaluation.

3. Top‑Level Design

Goals

Horizontal expansion of data centers to support rapid business growth and resource demand.

City‑level and cross‑region high availability so that a single‑center failure can quickly shift traffic to other normal centers.

Values

Support fast business development by scaling data‑center capacity.

City‑level and cross‑region disaster recovery through rapid traffic switching.

Hybrid‑cloud cost reduction: during peak sales, traffic is shifted to public cloud, lowering long‑term private‑cloud costs.

Gray release: gradual, data‑center‑level traffic rollout to reduce impact of version failures.

Principles

Transactions of the same user should be completed within a single data center.

Business should be unaware of the multi‑data‑center deployment; services operate transparently across centers.

Minimize resource waste caused by multi‑center deployment.

4. Architecture Design

Related Concepts

Sharding Service: data exists only in a specific Cell (e.g., member or order service).

Shared Service: all Cells share the same data (e.g., pricing or product service).

Index Service: provides index lookups; deployed and replicated in the same way as a shared service.

Competition (Control) Service: ensures data consistency by operating within a single data center (e.g., inventory deduction).

Competition Proxy Service: front‑end for competition services (e.g., inventory pre‑allocation).
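These service types matter because each routes differently: sharding services follow the user's cell, shared services can be served locally, and competition services must converge on one authoritative center. A hedged sketch of that decision (function and parameter names are assumptions for illustration):

```python
from enum import Enum

class ServiceType(Enum):
    SHARDING = "sharding"        # data lives only in the owning cell
    SHARED = "shared"            # full data replicated to every center
    COMPETITION = "competition"  # single-DC authority, e.g. inventory deduction

def route(service_type: ServiceType, user_cell_dc: str, local_dc: str, control_dc: str) -> str:
    """Pick the target data center for a call, per service type."""
    if service_type is ServiceType.SHARDING:
        return user_cell_dc      # follow the user's cell
    if service_type is ServiceType.SHARED:
        return local_dc          # any replica works; stay local
    return control_dc            # competition services converge on one center
```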

Service Architecture

Users are distributed across different data centers; each center provides services, and traffic can be switched to another normal center when one fails.

Service Planning: split services into sharding, shared, competition, index, control, and management types, each with independent routing and gray‑release support.

Unified Service Routing: from entry layer to service layer to data layer, a unified routing strategy ensures a user’s transaction stays within one data center.

Data High Availability: all data centers hold full data sets; changes in the primary are synchronized to all replicas.

Service Routing

Routing components include:

DNS: routes users to the nearest CDN based on location.

CDN: routes requests to the appropriate data center according to defined rules.

SLB: forwards requests to the same or another data center.

RPC/MQ: distributes requests across data centers according to routing policies.

DAL: validates shard placement to avoid data anomalies.
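The point of a unified strategy is that every layer, from CDN to SLB to DAL, computes the same answer from the same rule. A toy version, assuming four cells split across two centers (the cell count and mapping are invented for illustration):

```python
# Shared routing rule: member id -> cell -> data center.
CELLS_PER_DC = {"dc-a": [0, 1], "dc-b": [2, 3]}
NUM_CELLS = 4

def cell_of(member_id: int) -> int:
    return member_id % NUM_CELLS          # shard by the member dimension

def dc_of(cell: int) -> str:
    for dc, cells in CELLS_PER_DC.items():
        if cell in cells:
            return dc
    raise ValueError(f"cell {cell} not mapped")

def dal_check(member_id: int, current_dc: str) -> None:
    # DAL-style guard: refuse requests that landed in the wrong center,
    # since writing them there would create a data anomaly
    if dc_of(cell_of(member_id)) != current_dc:
        raise RuntimeError("misrouted request; would create a data anomaly")
```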

Service Convergence

To keep a user’s transaction within a single data center and reduce cross‑center latency, traffic is precisely dispatched before entering the data center, ensuring the target center is determined by routing rules. For example, Suning performs the initial routing at the CDN layer.

Inside the data center, multiple routing strategies (entry layer, RPC, DAL, etc.) keep the same user’s requests converged in one center, avoiding cross‑center calls.

Data High Availability

All data centers contain the full data set; changes in the primary are replicated in real time to all replicas.

Inter‑center latency is higher than intra‑center latency, so replication is typically asynchronous. In extreme cases where some data is not synchronized, manual repair is performed after the failed center recovers.
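Asynchronous replication is what creates the repair window: the gap between what the primary has accepted and what the replica has applied is exactly the data at risk. A toy model of that gap (not Suning's actual replication code):

```python
class AsyncReplica:
    """Toy asynchronous replica: writes land on the primary immediately,
    the replica applies them later, and the unapplied tail is the
    potential data loss (RPO) if the primary's center fails now."""
    def __init__(self):
        self.primary_log = []   # writes accepted by the primary
        self.applied_pos = 0    # how far the replica has caught up

    def write(self, payload):
        self.primary_log.append(payload)

    def replicate(self, batch=1):
        # apply up to `batch` pending writes on the replica
        self.applied_pos = min(len(self.primary_log), self.applied_pos + batch)

    def lag(self):
        return len(self.primary_log) - self.applied_pos  # unsynchronized writes
```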

5. Technical Challenges

High‑Availability Implementation

High availability is addressed at two levels:

Within a single data center

Cluster‑level HA: stateless services use N+1 deployment; any failure is covered by other instances.

Stateful services (e.g., databases) use 2N (primary‑secondary) or 3N (primary‑two‑secondaries) deployment for second‑level failover.
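The N+1 rule reduces to simple arithmetic: the pool must still carry peak load after losing one instance. A sketch (the QPS figures in the test are illustrative):

```python
def surviving_capacity(instances: int, per_instance_qps: int, failures: int) -> int:
    """Capacity left after `failures` instance losses in a stateless pool."""
    return max(instances - failures, 0) * per_instance_qps

def n_plus_one_ok(peak_qps: int, instances: int, per_instance_qps: int) -> bool:
    # N+1 rule: the pool must still carry peak load with one instance down
    return surviving_capacity(instances, per_instance_qps, 1) >= peak_qps
```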

Across multiple data centers

Same‑city HA for a single system: planned or unplanned failures switch to another center.

Full‑link same‑city HA: entire data‑center failures switch to another center.

Full‑link cross‑region HA: extreme scenarios (e.g., earthquakes) allow remote center takeover.

Switch‑over time is generally measured in minutes.

HA Metrics

RPO (Recovery Point Objective): the window of data left unsynchronized when a data center fails; seconds under normal conditions, minutes when MySQL replication lags.

RTO (Recovery Time Objective): time to switch critical processes or systems after a failure, typically minutes.

WRT (Work Recovery Time): time to manually repair data not synchronized due to RPO, usually hours.
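The three metrics compose into a recovery timeline: service returns after the RTO, while data lost inside the RPO window is only fully repaired after the WRT. A small illustration (all durations are example values, not Suning's targets):

```python
def recovery_timeline(failure_minute: int, rpo_minutes: int,
                      rto_minutes: int, wrt_hours: int) -> dict:
    """Timeline per the metric definitions: up to `rpo_minutes` of data is
    at risk; service resumes after the RTO; manual repair ends after the WRT."""
    service_restored = failure_minute + rto_minutes
    data_repaired = service_restored + wrt_hours * 60
    return {
        "data_at_risk_minutes": rpo_minutes,
        "service_restored_at": service_restored,
        "data_repaired_at": data_repaired,
    }
```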

HA Practices

Data Replication Topology

Two main replication patterns for sharded data across data centers:

Unidirectional cross‑replication: each shard’s primary cluster replicates to a secondary cluster in the other data center (master‑slave).

Bidirectional replication: both clusters act as masters, providing write capability in both data centers.

Suning initially adopted the unidirectional cross‑replication topology to ensure data consistency.
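Under unidirectional cross-replication, shard primaries are interleaved across the two centers so each center carries live traffic while acting as standby for the other. A sketch of such a placement plan (data-center names are illustrative):

```python
def cross_replication_plan(shards, dcs=("dc-a", "dc-b")):
    """Unidirectional cross-replication: alternate shard primaries across
    the two centers; each shard's standby lives in the opposite center."""
    plan = {}
    for i, shard in enumerate(shards):
        plan[shard] = {
            "primary": dcs[i % 2],        # owns writes for this shard
            "standby": dcs[(i + 1) % 2],  # receives one-way replication
        }
    return plan
```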

Data Migration & Reorganization

To distribute data by user Cell across data centers and address legacy issues (e.g., tables exceeding 1 billion rows), Suning re‑sharded base data from snapshots, extracted binlogs in real time to capture incremental changes, and migrated traffic gradually via DAL gray routing until all data resided in the new clusters.

Service Switch

Sharded services switch according to Cell grouping rules, ensuring both service and data transition together without write anomalies.

Cross‑Data‑Center Link Switch

All multi‑data‑center traffic distribution and service scheduling are handled by the underlying middleware platform, making the switch transparent to business and drastically reducing recovery time.

Steps:

Promote the standby replica to primary.

Switch write operations to the new primary.

Redirect SLB/RPC/MQ services to the new primary.

Reroute CDN traffic to the new primary.
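The four steps above are order-sensitive: writes must not move before the standby is promoted, and traffic must not move before the writes do. A toy orchestrator that encodes that order (the state shape is an assumption for illustration):

```python
def fail_over(state: dict, shard: str) -> list:
    """Ordered cross-DC switch mirroring the steps in the text:
    promote standby -> move writes -> repoint SLB/RPC/MQ -> reroute CDN."""
    standby = state[shard]["standby"]
    # 1. promote the standby replica to primary (swap roles)
    state[shard]["primary"], state[shard]["standby"] = standby, state[shard]["primary"]
    steps = [f"promoted {standby}"]
    # 2-4. only after promotion do writes and traffic follow
    steps.append(f"writes -> {standby}")
    steps.append(f"slb/rpc/mq -> {standby}")
    steps.append(f"cdn -> {standby}")
    return steps
```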

Gray Deployment & Release

To ensure topology and configuration changes do not affect the whole system, a gray‑deployment approach is used:

Deploy core components (RPC, MQ, WAF, databases, etc.).

Deploy business systems (order, member, promotion, product, etc.).

After component and system deployment, traffic is gradually shifted from internal whitelist users to single‑system traffic, and finally to full‑link traffic.
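A gray gate typically admits whitelisted internal users first, then a deterministic, growing percentage of everyone else, so a given user never flip-flops between old and new paths mid-rollout. A hedged sketch (the bucketing hash is a toy; production would use a stable hash function):

```python
def hash_bucket(user_id: str, buckets: int = 100) -> int:
    # deterministic bucket so a user's decision never changes between requests
    return sum(user_id.encode()) % buckets  # toy hash for illustration only

def admit_to_new_path(user_id: str, whitelist: set, rollout_percent: int) -> bool:
    """Gray-release gate: internal whitelist first, then a growing percentage."""
    if user_id in whitelist:
        return True
    return hash_bucket(user_id) < rollout_percent
```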

Full‑Link Monitoring

Suning’s monitoring platform covers logs, metrics, and tracing across all data centers, using federated queries to avoid cross‑center bandwidth overhead.

Fault‑Injection Drills

Chaos engineering gradually expands the blast radius to simulate failures such as single‑system outages, full‑link failures, network device failures, and entire data‑center power failures, verifying the multi‑active system’s disaster‑recovery capability while keeping business impact controllable.
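Controlled blast-radius expansion means each failure class is drilled only after the previous, smaller one has passed. The progression can be expressed as a simple ladder (level names paraphrase the failure classes in the text):

```python
BLAST_RADIUS = [
    "single-system",       # one business system down
    "full-link",           # an entire call chain down
    "network-device",      # switch/router failure
    "data-center-power",   # whole-center power loss
]

def next_drill(passed: list):
    """Return the next failure class to inject, or None once all levels passed."""
    return BLAST_RADIUS[len(passed)] if len(passed) < len(BLAST_RADIUS) else None
```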

6. Expansion

Remote Deployment

Because same‑city resources are limited and cannot address extreme events like earthquakes, remote data‑center deployment is required. Remote deployment introduces higher latency and bandwidth constraints, so compression and traffic‑shaping are applied to keep data transfer within controllable limits.

Hybrid‑Cloud Deployment

During peak sales, traffic can burst to the public cloud, using its elasticity to reduce private‑cloud holding costs. Hybrid cloud requires additional capabilities:

Security control between public and private clouds.

Rapid provisioning to lower cost of on‑demand resources.

Asymmetric deployment to balance cost and performance between clouds.

7. Summary

The multi‑data‑center active‑active project, after roughly three years of construction, went live in 2019 and has withstood major sales events such as 818 and Double 11, smoothly transitioning critical links from a single data center to multiple centers and supporting rapid business growth. The hybrid‑cloud effort, after a year of development, is now mature and will further accelerate Suning’s business while reducing expansion costs.

Tags: e-commerce, high availability, disaster recovery, hybrid cloud, cloud architecture, multi-data center
Written by

Suning Technology

Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.
