How Ele.me Scaled to 10M+ Daily Orders with Multi‑Active Architecture

The talk details Ele.me’s rapid growth from 300k to over 10 million daily orders, describing the challenges of high‑concurrency, multi‑active micro‑service architecture, IDC planning, database refactoring, disaster‑recovery, NOC operations, and the systematic processes that enabled stable, scalable delivery across two data centers.

Efficient Ops

May 10, 2018

Multi‑Active Scenario and Business Shape

Ele.me’s business grew from 300k daily orders in 2015 to more than 10 million by 2017, which brought massive request volume, high concurrency, and the challenges of operating a large number of micro‑services. To support this scale, the team gradually introduced a multi‑active architecture with 100% redundancy.

Implementation Background

Five key background factors drove the multi‑active effort: business characteristics, technical complexity, operational fallback, frequent failures, and data‑center capacity limits.

Business Characteristics

Three traffic entrances: user app, merchant portal, and rider app.

The order flow must respond within a minute; delays lead to complaints and push users toward competitors.

Strong regional constraints (e.g., Shanghai orders stay in Shanghai).

Clear peak periods (around 11 am and 5‑6 pm).

Technical Complexity

The system is built on an SOA architecture with components written in multiple languages (PHP, Python, Java). Supporting tracing, SDK maintenance, and cross‑language compatibility added significant overhead.

Operational Fallback

The ops team maintains about 16,000 servers, 1,600 applications, and four physical IDC sites, handling provisioning, hardware standardization, and extensive database and cache refactoring (sharding, SQL audit, DAL middleware, Redis governance).

Frequent Failures

Incidents were frequent, with events graded P2 or worse occurring daily. This led to the creation of a NOC team modeled after Google SRE and a standardized incident‑grading system (P0 to P5) based on impact, order loss ratio, monetary loss, and public opinion.
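The grading idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the thresholds, the field names, the rule that public attention raises an incident to at least P2, and the grade_incident function itself. The talk only states that grades run from P0 to P5 and weigh impact, order loss ratio, monetary loss, and public opinion.

```python
from dataclasses import dataclass

# Hypothetical thresholds (minimum value, grade). None of these numbers come from the talk.
ORDER_LOSS_THRESHOLDS = [(0.30, 0), (0.10, 1), (0.05, 2), (0.01, 3), (0.001, 4)]
MONEY_LOSS_THRESHOLDS = [(1_000_000, 0), (200_000, 1), (50_000, 2), (10_000, 3), (1_000, 4)]


@dataclass
class Incident:
    order_loss_ratio: float  # fraction of orders lost during the incident
    monetary_loss: float     # estimated loss; the currency unit is an assumption
    public_attention: bool   # whether the incident drew public attention


def _grade_from(value, thresholds):
    """Return the most severe grade whose minimum the value reaches, else 5."""
    for minimum, grade in thresholds:
        if value >= minimum:
            return grade
    return 5


def grade_incident(incident):
    """Take the most severe (lowest-numbered) grade across all dimensions."""
    grade = min(
        _grade_from(incident.order_loss_ratio, ORDER_LOSS_THRESHOLDS),
        _grade_from(incident.monetary_loss, MONEY_LOSS_THRESHOLDS),
    )
    if incident.public_attention:
        grade = min(grade, 2)  # assumption: public attention means at least P2
    return f"P{grade}"
```

Taking the worst grade across dimensions is one simple design choice: an incident that is small in monetary terms but loses a large share of orders is still treated as severe.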

Multi‑Active Technical Architecture

Core components:

API Router: request entry and routing.

GZS (Global Zone Service): manages geographic fences and shard allocation.

DRC (Data Replication Center): cross‑data‑center database sync and cache subscription.

SOA Proxy: communication between active and non‑active services.

DAL: enhanced middleware to prevent writes to the wrong data‑center.

The goal is to complete an entire order flow within a single data‑center, while still supporting zones for data that requires strong consistency (the global zones).
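The routing role of the API Router and GZS can be sketched as a two‑step lookup: a request's location maps to a shard, and the shard maps to the ezone that currently serves it. This is a simplified illustration, not the actual implementation. Real geographic fences would be polygons rather than city names, and the table contents, shard names, ezone names, and the route_request function are all hypothetical.

```python
# Hypothetical lookup tables; a real GZS would resolve coordinates against geo-fences.
CITY_TO_SHARD = {
    "shanghai": "shard-1",
    "hangzhou": "shard-1",
    "beijing": "shard-2",
}
SHARD_TO_EZONE = {
    "shard-1": "ezone-sh",
    "shard-2": "ezone-bj",
}


def route_request(city, city_to_shard=CITY_TO_SHARD, shard_to_ezone=SHARD_TO_EZONE):
    """Return (shard, ezone) for a request originating in the given city."""
    shard = city_to_shard[city.lower()]
    return shard, shard_to_ezone[shard]
```

Because users, merchants, and riders in the same region resolve to the same shard, every step of one order can stay inside one data‑center.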

IDC Planning

In late 2016, two active data‑centers (Beijing and Shanghai) were selected, with a dedicated IDC partner. A dual‑ezone test environment was built, and VPC segmentation enabled seamless traffic split and failover.

SOA Service Refactor

Three registration modes were introduced:

Orig: legacy compatibility.

Prefix: unified registration for new multi‑active services.

Route: final mode that abstracts IDC, ezone, and ops details from business teams.

Database Refactor

Database clusters were rebuilt to support active‑active replication (DRC) for multi‑active zones and native replication for global zones. DAL middleware was enhanced with validation to block writes to incorrect zones.
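The write validation in the DAL can be pictured as a guard that runs before every write. The sketch below is an assumption about the general shape of such a check, not the DAL's real interface: the ZoneGuard class, the WrongZoneWriteError exception, and the shard and ezone names are all hypothetical.

```python
class WrongZoneWriteError(Exception):
    """Raised when a write targets a shard owned by another ezone."""


class ZoneGuard:
    def __init__(self, local_ezone, shard_to_ezone):
        self.local_ezone = local_ezone        # the ezone this DAL instance runs in
        self.shard_to_ezone = shard_to_ezone  # current shard ownership table

    def check_write(self, shard):
        """Allow the write only if the shard is owned by the local ezone."""
        owner = self.shard_to_ezone.get(shard)
        if owner != self.local_ezone:
            raise WrongZoneWriteError(
                f"shard {shard!r} belongs to {owner!r}, not {self.local_ezone!r}"
            )
```

Rejecting the write at the middleware layer means a misrouted request fails loudly instead of creating conflicting rows that replication would later have to reconcile.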

Disaster Recovery Assurance

Three DR levels were defined: traffic‑entry failures, IDC‑internal failures, and complete data‑center outage. Automated failover drills simulate total zone loss, relying on experienced engineers and automated fault‑location services.
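For the most severe level, a complete data‑center outage, the core of a failover is reassigning every shard owned by the failed ezone to the surviving one. The following is a minimal sketch under that assumption; the fail_over function and the shard and ezone names are hypothetical, and a real switch would also have to deal with replication lag and in‑flight writes.

```python
def fail_over(shard_to_ezone, failed_ezone, target_ezone):
    """Return a new ownership table with the failed ezone's shards moved to the target."""
    return {
        shard: (target_ezone if ezone == failed_ezone else ezone)
        for shard, ezone in shard_to_ezone.items()
    }


# Example drill: simulate losing the (hypothetical) Beijing ezone.
before = {"shard-1": "ezone-sh", "shard-2": "ezone-bj", "shard-3": "ezone-bj"}
after = fail_over(before, failed_ezone="ezone-bj", target_ezone="ezone-sh")
```

Returning a new table instead of mutating the old one keeps the previous assignment available, which makes it straightforward to switch back once the drill ends.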

Operational System Exploration

Application Release

Two release strategies for multi‑active: treat all zones as one large cluster with staged gray releases, or treat each zone as an independent cluster with per‑zone gray and full releases.
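The difference between the two strategies is easiest to see as the sequence of release steps each one produces. The sketch below is illustrative only: the plan_release function, the strategy names, and the instance names are hypothetical, and the size of a gray batch is an arbitrary parameter.

```python
def plan_release(instances_by_zone, strategy, gray_count=1):
    """Return an ordered list of (step_name, instances) for the chosen strategy."""
    if strategy == "single_cluster":
        # All zones form one pool: one gray batch, then everything else.
        everything = [i for zone in instances_by_zone for i in instances_by_zone[zone]]
        return [("gray", everything[:gray_count]), ("full", everything[gray_count:])]
    if strategy == "per_zone":
        # Each zone is its own cluster: gray then full, zone by zone.
        steps = []
        for zone, instances in instances_by_zone.items():
            steps.append((f"{zone}:gray", instances[:gray_count]))
            steps.append((f"{zone}:full", instances[gray_count:]))
        return steps
    raise ValueError(f"unknown strategy: {strategy}")


zones = {"bj": ["bj-1", "bj-2"], "sh": ["sh-1", "sh-2"]}
single_plan = plan_release(zones, "single_cluster")
per_zone_plan = plan_release(zones, "per_zone")
```

The per‑zone plan takes more steps, but a bad build is contained in one zone while the other keeps serving the previous version.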

Monitoring System

Full‑link monitoring with ezone tags.

Business‑level monitoring per data‑center.

Infrastructure monitoring (servers, network) without ezone distinction.
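Per‑data‑center business monitoring depends on every event carrying an ezone tag. As a small illustration, the sketch below counts order events per ezone; the event layout, the metric name order.created, and the orders_per_ezone function are assumptions made for the example.

```python
from collections import Counter


def orders_per_ezone(events):
    """Count order-creation events per ezone, using the ezone tag on each event."""
    return Counter(
        event["ezone"] for event in events if event["metric"] == "order.created"
    )


# Hypothetical tagged events as a monitoring pipeline might receive them.
sample_events = [
    {"metric": "order.created", "ezone": "ezone-bj"},
    {"metric": "order.created", "ezone": "ezone-sh"},
    {"metric": "order.paid", "ezone": "ezone-sh"},
    {"metric": "order.created", "ezone": "ezone-bj"},
]
counts = orders_per_ezone(sample_events)
```

A sudden drop in one ezone's count while the other stays flat points to a problem local to that data‑center rather than a business‑wide dip.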

Pre‑plan and Drills

Incident response follows standardized playbooks that are rehearsed in regular drill cycles. An automated drill orchestration platform is planned to support this work.

Capacity Planning

CPU utilization is collected per AppId; Beijing handles about 52% of traffic and Shanghai about 48%. Weekly full‑link stress tests measure critical‑path capacity and are used to forecast how many additional servers projected traffic growth will require.
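A rough version of the forecast can be written as simple arithmetic. The sketch assumes that CPU load scales linearly with traffic, which real stress tests may not confirm; the servers_needed function and its parameter names are hypothetical.

```python
import math


def servers_needed(current_servers, peak_cpu_util, target_cpu_util, traffic_multiplier):
    """Estimate the server count that keeps peak CPU at or below the target.

    Assumes CPU load grows linearly with traffic (a simplification).
    """
    if not 0 < target_cpu_util <= 1:
        raise ValueError("target_cpu_util must be in (0, 1]")
    projected_load = current_servers * peak_cpu_util * traffic_multiplier
    return math.ceil(projected_load / target_cpu_util)
```

With 100% redundancy, either data‑center must be able to absorb all traffic. For the site that normally carries 52% of requests, that corresponds to a traffic_multiplier of 1 / 0.52 before any growth is added.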

Single‑Data‑Center Cost Analysis

IDC costs are amortized monthly and compared against order volume to compute per‑order IT cost, enabling cost‑benefit analysis between owned IDC resources and cloud services.
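The per‑order calculation can be sketched in two steps: spread the purchase cost over a service life, then divide the monthly total by the monthly order volume. The function names, the straight‑line amortization, and the figures in the example are assumptions for illustration, not numbers from the talk.

```python
def monthly_amortized(purchase_cost, lifetime_months):
    """Straight-line amortization of a one-off purchase over its service life."""
    if lifetime_months <= 0:
        raise ValueError("lifetime_months must be positive")
    return purchase_cost / lifetime_months


def cost_per_order(monthly_it_cost, monthly_orders):
    """IT cost attributed to a single order in a given month."""
    if monthly_orders <= 0:
        raise ValueError("monthly_orders must be positive")
    return monthly_it_cost / monthly_orders


# Hypothetical figures: hardware amortized over 36 months plus recurring monthly fees.
hardware = monthly_amortized(3_600_000, 36)
per_order = cost_per_order(hardware + 50_000, 300_000_000)
```

Running the same calculation against a cloud provider's monthly bill gives a like‑for‑like number for the owned‑versus‑cloud comparison.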
