
Ele.me's Multi‑Active Architecture: Design Principles, Core Components and Implementation Overview

This article explains how Ele.me built a multi‑active, geographically distributed system that enables elastic scaling and data‑center‑level disaster recovery by partitioning services, routing traffic, replicating data in real time, and enforcing strict consistency and availability principles.

Architecture Digest

Ele.me's technical team spent over a year implementing a fully distributed multi‑active architecture that can dynamically schedule users across multiple data‑center zones, achieving elastic expansion and cross‑site disaster recovery.

Background: Why Multi‑Active?

Rapid business growth exhausted a single data center's capacity, and frequent site-level failures threatened service continuity. Together these pressures demanded a solution that can both scale across data centers and survive whole-site outages.

The two main goals are:

Enable services to expand to multiple data‑center zones.

Ensure the system can tolerate an entire site failure.

These goals are typically addressed by deploying services in multiple locations (multi‑active). Ele.me’s specific constraints make a multi‑active approach essential.

Design: Implementation Ideas and Methods

The design follows four core principles:

Business cohesion – an order's entire lifecycle stays within one zone.

Availability first – prioritise service continuity even at the cost of temporary data inconsistency.

Data correctness – lock conflicting orders during failover rather than risk corrupting them.

Business awareness – code must recognise its own zone and handle cross-zone inconsistencies.

Service Partitioning (Sharding)

Geographic location is used as the sharding key, grouping users, merchants and couriers that are close together into the same "eZone" so that an order can be processed entirely within one zone, minimizing latency.

Custom geographic fences divide the country into shards that roughly follow provincial borders; each eZone may contain multiple shards and can be reassigned as needed.
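The fence lookup can be sketched as a point-in-region test followed by a shard-to-eZone mapping. The sketch below uses simple bounding boxes and invented shard/zone names for illustration; real fences are arbitrary polygons along provincial borders:

```python
# Sketch of geographic sharding: map a user's coordinates to a shard,
# then map the shard to its current eZone. Fence coordinates, shard
# names, and zone names are illustrative assumptions, not real data.

# shard_id -> (min_lat, min_lng, max_lat, max_lng)
FENCES = {
    "shard-sh": (30.7, 120.9, 31.9, 122.0),   # roughly Shanghai
    "shard-bj": (39.4, 115.7, 41.1, 117.4),   # roughly Beijing
}

# shard -> eZone assignment; can be changed to rebalance capacity
SHARD_TO_EZONE = {
    "shard-sh": "ezone-1",
    "shard-bj": "ezone-2",
}

def locate_shard(lat: float, lng: float) -> str:
    """Find the first fence containing the point."""
    for shard, (lo_lat, lo_lng, hi_lat, hi_lng) in FENCES.items():
        if lo_lat <= lat <= hi_lat and lo_lng <= lng <= hi_lng:
            return shard
    raise LookupError("no fence covers this point")

def route_to_ezone(lat: float, lng: float) -> str:
    return SHARD_TO_EZONE[locate_shard(lat, lng)]

print(route_to_ezone(31.2, 121.5))  # a Shanghai order -> ezone-1
```

Because the shard-to-eZone table is a separate indirection, shards can be moved between zones without recomputing the fences themselves.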

Traffic Routing

An API Router deployed in public‑cloud regions receives client requests, extracts a routing tag (usually geographic), maps it to a Shard ID, then forwards the request to the appropriate eZone. A SOA Proxy provides the same routing logic for internal service‑to‑service calls.
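The router's decision path can be sketched in a few lines: extract the routing tag, resolve it to a Shard ID, then pick the eZone upstream. Header names, table contents, and the fallback policy below are illustrative assumptions:

```python
# Minimal sketch of the API Router's decision path. The header name,
# tag values, and upstream addresses are invented for illustration.

TAG_TO_SHARD = {"shanghai": 1, "beijing": 2}          # routing tag -> Shard ID
SHARD_TO_UPSTREAM = {1: "http://ezone-1.internal",    # Shard ID -> eZone entry
                     2: "http://ezone-2.internal"}

def pick_upstream(request_headers: dict) -> str:
    """Map a request to its eZone; fall back to a default zone
    rather than failing the request outright."""
    tag = request_headers.get("X-Geo-Tag")            # usually geographic
    if tag is None or tag not in TAG_TO_SHARD:
        return SHARD_TO_UPSTREAM[1]                   # assumed default zone
    return SHARD_TO_UPSTREAM[TAG_TO_SHARD[tag]]

print(pick_upstream({"X-Geo-Tag": "beijing"}))  # -> http://ezone-2.internal
```

The same lookup, embedded in the SOA Proxy, keeps internal service-to-service calls inside the correct zone.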

Data Replication

All zones hold a full copy of the data. Real‑time replication middleware synchronises MySQL, ZooKeeper, message queues and Redis across zones. The MySQL replication tool DRC assigns each zone a unique ID space to avoid primary‑key collisions and resolves write conflicts using timestamps.
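The two DRC mechanisms described above can be sketched as follows; the zone count, striping scheme, and row shape are illustrative assumptions, not DRC's actual implementation:

```python
# Sketch of the two mechanisms: (1) give each zone a disjoint
# primary-key ID space so concurrent inserts never collide;
# (2) resolve conflicting updates to a row by last-writer-wins
# on timestamp. Parameters below are illustrative.

ZONE_ID = 1          # this zone's ID (assumed: small integer per zone)
NUM_ZONES = 4        # keys are striped across zones

_counter = 0

def next_primary_key() -> int:
    """Allocate keys congruent to ZONE_ID modulo NUM_ZONES."""
    global _counter
    _counter += 1
    return _counter * NUM_ZONES + ZONE_ID   # 5, 9, 13, ... for zone 1

def resolve_conflict(local_row: dict, replicated_row: dict) -> dict:
    """Keep whichever version carries the newer update timestamp."""
    if replicated_row["updated_at"] > local_row["updated_at"]:
        return replicated_row
    return local_row

local  = {"id": 5, "status": "paid",      "updated_at": 100}
remote = {"id": 5, "status": "cancelled", "updated_at": 105}
print(resolve_conflict(local, remote)["status"])  # -> cancelled
```

The same key-striping effect can be achieved in MySQL itself via `auto_increment_increment` and `auto_increment_offset`; timestamp-based last-writer-wins trades occasional lost updates for the ability to keep every zone writable.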

Strong consistency for critical workloads is achieved with a Global Zone service that centralises writes to a master zone while allowing reads from any zone.
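In outline, Global Zone read/write separation just routes by operation type; zone names and the endpoint format below are assumptions for illustration:

```python
# Sketch of Global Zone routing: strongly consistent tables accept
# writes only in the master zone; reads go to the local replica.
# Zone names and endpoint strings are illustrative.

MASTER_ZONE = "ezone-1"   # assumed master for Global Zone tables
LOCAL_ZONE  = "ezone-2"   # the zone this service runs in

def choose_endpoint(operation: str) -> str:
    """Route writes to the master zone, reads to the local replica."""
    if operation in ("INSERT", "UPDATE", "DELETE"):
        return f"mysql://{MASTER_ZONE}/global"   # cross-zone write
    return f"mysql://{LOCAL_ZONE}/global"        # local read, may lag

print(choose_endpoint("UPDATE"))  # -> mysql://ezone-1/global
print(choose_endpoint("SELECT"))  # -> mysql://ezone-2/global
```

Writes pay the cross-zone round trip, which is why Global Zone is reserved for the small set of workloads that genuinely need strong consistency.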

L1 cache reference......................... 0.5 ns
Branch mispredict............................ 5 ns
L2 cache reference........................... 7 ns
Mutex lock/unlock........................... 25 ns
Main memory reference...................... 100 ns
Compress 1K bytes with Zippy............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network....... 20,000 ns = 20 µs
SSD random read............................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory..... 250,000 ns = 250 µs
Round‑trip within same datacenter......... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD*......... 1,000,000 ns = 1 ms
Shanghai‑to‑Shanghai network latency...... 1,000,000 ns = 1 ms
Disk seek.................................. 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk........ 20,000,000 ns = 20 ms
Beijing‑to‑Shanghai network latency...... 30,000,000 ns = 30 ms
Send packet CA‑>Netherlands‑>CA........... 150,000,000 ns = 150 ms

Network latency between Beijing and Shanghai (~30 ms) is roughly 60× slower than intra‑datacenter latency, making cross‑zone calls prohibitively expensive for latency‑sensitive food‑delivery workflows.
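A quick budget calculation shows why this ratio matters; the 200 ms budget below is an illustrative figure, not Ele.me's actual SLA:

```python
# Back-of-the-envelope check of the 60x figure and its consequence
# for a per-request latency budget (RTTs from the table above; the
# 200 ms budget is an assumed, illustrative SLA).

intra_dc_rtt_ms = 0.5
cross_region_rtt_ms = 30.0

print(cross_region_rtt_ms / intra_dc_rtt_ms)   # -> 60.0

budget_ms = 200
print(int(budget_ms / intra_dc_rtt_ms))        # 400 intra-DC round trips
print(int(budget_ms / cross_region_rtt_ms))    # 6 cross-region round trips
```

A call chain that fans out across dozens of microservices is viable inside one data center but collapses immediately once each hop crosses regions, which is the core argument for keeping an order's whole lifecycle in one eZone.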

Overall Structure

The architecture combines service sharding, traffic routing, data replication, and a data‑access layer that enforces zone‑aware writes. The diagram below (originally in Chinese) shows the high‑level components and their interactions.

Business Adaptation

Because the system is zone‑aware, business logic can filter out data from other zones, trigger custom actions during failover, and implement repair procedures for inconsistent data.
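A zone-aware batch job might look like the following sketch; the field names and the conflict-flagging convention are illustrative assumptions:

```python
# Sketch of zone-aware business logic: a service knows its own zone,
# skips rows replicated from other zones, and flags inconsistent rows
# for a repair procedure. Field names are illustrative.

MY_ZONE = "ezone-1"

orders = [
    {"id": 1, "zone": "ezone-1", "status": "paid"},
    {"id": 2, "zone": "ezone-2", "status": "paid"},      # other zone: skip
    {"id": 3, "zone": "ezone-1", "status": "conflict"},  # needs repair
]

local = [o for o in orders if o["zone"] == MY_ZONE]
to_repair = [o for o in local if o["status"] == "conflict"]

print([o["id"] for o in local])      # -> [1, 3]
print([o["id"] for o in to_repair])  # -> [3]
```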

Core Middleware

APIRouter – HTTP reverse proxy and load balancer that maps traffic to the correct eZone.

Global Zone Service – maintains routing tables and distributes updates to all services.

SOA Proxy – internal gateway for inter‑zone service calls.

Data Replication Center (DRC, ZooKeeper, MQ, Redis) – ensures near‑real‑time data sync.

Data Access Layer – final gate that blocks illegal writes and supports Global Zone read/write separation.
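The data access layer's write guard can be sketched as a final check before any statement reaches MySQL; table names, shard mapping, and the redirect convention below are illustrative assumptions:

```python
# Sketch of the data access layer's final write guard: reject writes
# whose shard belongs to another zone, except for Global Zone tables,
# which are redirected to the master zone. All names are illustrative.

MY_ZONE = "ezone-1"
SHARD_TO_EZONE = {1: "ezone-1", 2: "ezone-2"}
GLOBAL_TABLES = {"config", "inventory"}   # assumed Global Zone tables

def check_write(table: str, shard_id: int) -> str:
    if table in GLOBAL_TABLES:
        return "forward-to-master"        # Global Zone write path
    if SHARD_TO_EZONE.get(shard_id) != MY_ZONE:
        raise PermissionError("illegal cross-zone write blocked")
    return "accept"

print(check_write("orders", 1))   # -> accept
print(check_write("config", 2))   # -> forward-to-master
```

Placing this check at the lowest layer means that even buggy or mis-routed business code cannot corrupt another zone's data during a failover window.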

Future Plans

Ele.me currently operates two active zones and plans to expand to three or four zones, including a new eZone in a public‑cloud environment, enabling global‑scale high availability and rapid capacity growth.

References

Latency numbers every programmer should know

LinkedIn Databus

Alibaba Canal Project

Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
