How Ele.me Scaled to 10M+ Daily Orders with Multi‑Active Architecture
The talk details Ele.me’s rapid growth from 300k to over 10 million daily orders, describing the challenges of high‑concurrency, multi‑active micro‑service architecture, IDC planning, database refactoring, disaster‑recovery, NOC operations, and the systematic processes that enabled stable, scalable delivery across two data centers.
Multi‑Active Scenario and Business Shape
Ele.me’s business exploded from 300k daily orders in 2015 to over 10 million by 2017, creating massive request volume, high concurrency, and micro‑service challenges. To support this scale, a 100 % redundant multi‑active architecture was gradually introduced.
Implementation Background
Five key background factors drove the multi‑active effort: business characteristics, technical complexity, operational fallback, frequent failures, and data‑center capacity limits.
Business Characteristics
Three traffic entrances: user app, merchant portal, and rider app.
The order flow must respond within a minute; delays lead to complaints and to users switching to competitors.
Strong regional constraints (e.g., Shanghai orders stay in Shanghai).
Clear peak periods (around 11 am and 5‑6 pm).
Technical Complexity
The system is built on an SOA architecture with components written in multiple languages (PHP, Python, Java). Supporting tracing, SDK maintenance, and cross‑language compatibility added significant overhead.
Operational Fallback
The ops team maintains ~16 000 servers, 1 600 applications, and four physical IDC sites, handling provisioning, hardware standardization, and extensive database and cache refactoring (sharding, SQL audit, DAL middleware, Redis governance).
Frequent Failures
Incidents were frequent, with accidents graded P2 or above occurring daily. This led to the creation of a NOC team modeled after Google SRE and a standardized incident‑grading system (P0‑P5) based on impact, order loss ratio, monetary loss, and public opinion.
Multi‑Active Technical Architecture
Core components:
API Router: request entry and routing.
GZS (Global Zone Service): manages geographic fences and shard allocation.
DRC (Data Replication Center): cross‑data‑center database sync and cache subscription.
SOA Proxy: communication between active and non‑active services.
DAL: enhanced middleware that prevents writes to the wrong data center.
The goal is to complete an entire order flow within a single data‑center while supporting strong consistency zones.
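As a rough illustration of how the API Router and GZS might cooperate, the sketch below maps a request's coordinates to a shard through a geographic fence, then maps the shard to an ezone. The rectangular fences, the shard and ezone names, the fallback zone, and the `route` function are all hypothetical; the talk does not describe the actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fence:
    """A hypothetical rectangular geofence that belongs to one shard."""
    shard: str
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


# Illustrative data only: real fences would more likely be polygons held by GZS.
FENCES = [
    Fence("shard-shanghai", 120.8, 30.6, 122.0, 31.9),
    Fence("shard-beijing", 115.4, 39.4, 117.5, 41.1),
]
SHARD_TO_EZONE = {"shard-shanghai": "ezone-sh", "shard-beijing": "ezone-bj"}
DEFAULT_EZONE = "ezone-sh"  # assumed fallback for locations outside every fence


def route(lon: float, lat: float) -> str:
    """Return the ezone that should serve a request from this location."""
    for fence in FENCES:
        if fence.contains(lon, lat):
            return SHARD_TO_EZONE[fence.shard]
    return DEFAULT_EZONE
```

Because the user, merchant, and rider of one order are all located inside the same fence, routing on location alone keeps the whole order flow inside one data center.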
IDC Planning
In late 2016, two active data‑centers (Beijing and Shanghai) were selected, with a dedicated IDC partner. A dual‑ezone test environment was built, and VPC segmentation enabled seamless traffic split and failover.
SOA Service Refactor
Three registration modes were introduced:
Orig: legacy compatibility.
Prefix: unified registration for new multi‑active services.
Route: final mode that abstracts IDC, ezone, and ops details from business teams.
Database Refactor
Database clusters were rebuilt to support active‑active replication (DRC) for multi‑active zones and native replication for global zones. DAL middleware was enhanced with validation to block writes to incorrect zones.
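The write validation can be pictured as a guard that runs before every write. This is a minimal sketch under assumptions: the `ZoneGuard` class, the `WrongZoneWriteError` exception, and the shard‑to‑owner map are hypothetical names, not the real DAL interface.

```python
class WrongZoneWriteError(Exception):
    """Raised when a write targets a shard owned by another ezone."""


class ZoneGuard:
    """Hypothetical pre-write check, not the actual DAL API."""

    def __init__(self, local_ezone: str, shard_owner: dict[str, str]):
        self.local_ezone = local_ezone
        # shard -> ezone that is currently allowed to write it
        self.shard_owner = shard_owner

    def check_write(self, shard: str) -> None:
        """Raise unless this ezone owns the shard; unknown shards are rejected too."""
        owner = self.shard_owner.get(shard)
        if owner != self.local_ezone:
            raise WrongZoneWriteError(
                f"shard {shard!r} is owned by {owner!r}, not {self.local_ezone!r}"
            )
```

Rejecting the write at the middleware layer means a routing mistake upstream surfaces as an error instead of as conflicting rows that replication would later have to reconcile.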
Disaster Recovery Assurance
Three DR levels were defined: traffic‑entry failures, IDC‑internal failures, and complete data‑center outage. Automated failover drills simulate total zone loss, relying on experienced engineers and automated fault‑location services.
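For the most severe level, a complete data‑center outage, the core of a failover is reassigning every shard of the lost zone to the surviving one. The function below is a simplified sketch of that step only; `fail_over` and the shard map layout are assumptions, and a real switch would also need to coordinate replication state and write ownership.

```python
def fail_over(shard_to_ezone: dict[str, str], failed: str, survivor: str) -> dict[str, str]:
    """Return a new shard map with every shard of `failed` moved to `survivor`."""
    if failed == survivor:
        raise ValueError("survivor must differ from the failed ezone")
    return {
        shard: (survivor if ezone == failed else ezone)
        for shard, ezone in shard_to_ezone.items()
    }
```

Returning a new map rather than mutating the old one keeps the previous assignment available, which makes switching back after a drill a matter of restoring the saved map.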
Operational System Exploration
Application Release
Two release strategies are possible under multi‑active: treat all zones as one large cluster with staged gray releases, or treat each zone as an independent cluster that goes through its own gray and full release.
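The difference between the two strategies shows up in the order of the release batches. The sketch below is only an illustration: the host names, the fixed gray batch size, and the function names are hypothetical, and a real pipeline would add health checks between batches.

```python
def staged(hosts: list[str], gray_count: int) -> list[list[str]]:
    """Split hosts into a small gray batch followed by the remaining hosts."""
    return [hosts[:gray_count], hosts[gray_count:]]


def plan_single_cluster(zones: dict[str, list[str]], gray_count: int) -> list[list[str]]:
    """Strategy 1: pool every zone's hosts, run one gray stage, then one full stage."""
    pooled = [host for hosts in zones.values() for host in hosts]
    return staged(pooled, gray_count)


def plan_per_zone(zones: dict[str, list[str]], gray_count: int) -> list[list[str]]:
    """Strategy 2: finish gray and full release in one zone before the next zone."""
    plan: list[list[str]] = []
    for hosts in zones.values():
        plan.extend(staged(hosts, gray_count))
    return plan
```

With two zones of three hosts each and a gray batch of one, the first strategy yields two batches in total, while the second yields four, so a bad build is contained to one zone at the cost of a longer rollout.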
Monitoring System
Full‑link monitoring with ezone tags.
Business‑level monitoring per data‑center.
Infrastructure monitoring (servers, network) without ezone distinction.
Pre‑plan and Drills
Incident response follows standardized playbooks that are rehearsed on a regular cycle. An automated drill orchestration platform is planned to support these rehearsals.
Capacity Planning
CPU utilization per AppId is collected; Beijing handles ~52 % of traffic, Shanghai ~48 %. Weekly full‑link stress tests gauge critical‑path capacity and forecast additional server needs for projected traffic growth.
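The arithmetic behind such a forecast can be sketched as follows, assuming CPU load scales linearly with traffic, which real systems only approximate. The function names and every input except the 52/48 traffic split are hypothetical. Percentages are passed as integers so the rounding stays exact.

```python
def failover_multiplier(traffic_share_pct: int) -> float:
    """Load multiplier a zone sees if it must absorb 100% of the traffic."""
    if not 0 < traffic_share_pct <= 100:
        raise ValueError("traffic_share_pct must be in (0, 100]")
    return 100 / traffic_share_pct


def servers_needed(current_servers: int, peak_cpu_pct: int,
                   target_cpu_pct: int, growth_pct: int) -> int:
    """Servers required to stay at target CPU after traffic grows by growth_pct.

    Assumes linear scaling: needed = current * (peak / target) * (1 + growth).
    """
    numerator = current_servers * peak_cpu_pct * (100 + growth_pct)
    denominator = target_cpu_pct * 100
    return -(-numerator // denominator)  # integer ceiling division
```

For example, a zone carrying 48% of traffic would see roughly 2.08 times its normal load during a full failover, and an application on 100 servers peaking at 50% CPU would need 109 servers to hold a 60% target after 30% growth.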
Single‑Data‑Center Cost Analysis
IDC costs are amortized monthly and compared against order volume to compute per‑order IT cost, enabling cost‑benefit analysis between owned IDC resources and cloud services.
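The per‑order figure is a simple ratio. The sketch below assumes straight‑line amortization of hardware spend plus a recurring monthly cost; the function name, the cost breakdown, and the numbers in the usage note are hypothetical and are not figures from the talk.

```python
def per_order_it_cost(hardware_cost: float, amortization_months: int,
                      monthly_running_cost: float, monthly_orders: int) -> float:
    """Monthly amortized IDC cost divided by the month's order volume."""
    if amortization_months <= 0 or monthly_orders <= 0:
        raise ValueError("amortization_months and monthly_orders must be positive")
    monthly_cost = hardware_cost / amortization_months + monthly_running_cost
    return monthly_cost / monthly_orders
```

With 36,000,000 of hardware amortized over 36 months, 500,000 in monthly running cost, and 300,000,000 orders a month, the result is 0.005 per order. Running the same calculation against a cloud quote for equivalent capacity gives the comparison the analysis is after.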
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.