How Ele.me Achieved Cross‑Region Active‑Active MySQL: Architecture, Challenges & Lessons
This article details Ele.me's practical experience building a cross‑region active‑active database system, covering latency challenges, architectural design, extensive database refactoring, DBA operational hurdles, consistency verification tools, and future scalability plans.
1. Challenges in Active‑Active
Ele.me needed to implement cross‑region (Beijing‑Shanghai) active‑active databases while confronting a network latency of about 30 ms per round trip, which can amplify to hundreds of milliseconds across chains of frequent calls, a delay many applications cannot tolerate.
Key difficulties include:
Distinguishing between same‑city and cross‑city active‑active; same‑city latency is negligible, but cross‑city latency requires careful design.
Ensuring data safety with multiple write points, avoiding conflicts, circular replication, and data loops.
Maintaining consistency despite multiple write sources.
To mitigate latency impact, Ele.me groups user traffic so that a single user’s requests are routed to the same data center and classifies services as either active‑active capable or globally shared (e.g., user data).
Traffic routing relies on geographic fences (POI) and a virtual ShardingKey that maps logical shards to physical locations, with APIRouter directing traffic accordingly.
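The routing idea above can be sketched in a few lines. This is an illustrative model only (the names `SHARD_MAP`, `shard_key`, and `route` are assumptions, not Ele.me's actual APIRouter implementation): a stable virtual ShardingKey pins each user to one data center, and remapping a shard moves its users without touching application logic.

```python
# Virtual shard -> physical data center mapping; remapping an entry here
# migrates that shard's users without changing application code.
SHARD_MAP = {0: "beijing", 1: "shanghai", 2: "beijing", 3: "shanghai"}
NUM_SHARDS = len(SHARD_MAP)

def shard_key(user_id: int) -> int:
    """Derive a stable virtual shard from the user id."""
    return user_id % NUM_SHARDS

def route(user_id: int) -> str:
    """All requests from one user land in the same data center."""
    return SHARD_MAP[shard_key(user_id)]
```

Because the mapping is deterministic, the same user always hits the same write point, which is what keeps cross-region write conflicts rare in the first place.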
Data‑conflict prevention involves adding a DRC timestamp column to all tables to resolve conflicts by selecting the newest record.
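A minimal sketch of the last-write-wins rule that a DRC timestamp column enables (the `Row`/`resolve` names and the tie-breaking choice are illustrative assumptions, not the real Applier logic):

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: int
    value: str
    drc_ts: int  # DRC timestamp column, written on every change

def resolve(local: Row, remote: Row) -> Row:
    """Last-write-wins: keep the row with the newest DRC timestamp.
    On a tie, the local copy is kept (an illustrative tie-break)."""
    return remote if remote.drc_ts > local.drc_ts else local
```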
2. Active‑Active Architecture
The architecture consists of entry‑traffic routing, flow control, and cross‑data‑center synchronization components. A crucial component is DRC, which includes three services: Replicator (collects changes), Applier (writes changes to the remote data center), and Manager (controls the process).
Two main DB deployment models are used:
ShardingZone : both reads and writes are served locally; failover only switches traffic without changing underlying data placement.
GlobalZone : writes are centralized in one data center while reads are served locally, suitable for low‑write, high‑read workloads that tolerate higher latency.
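The two models above reduce to a simple read/write routing rule, sketched here under assumed names (`target_dc`, `global_write_dc`) purely for illustration:

```python
def target_dc(zone_type: str, op: str, local_dc: str,
              global_write_dc: str = "shanghai") -> str:
    """ShardingZone: reads and writes stay in the local data center.
    GlobalZone: reads stay local, writes go to the single write DC."""
    if zone_type == "shardingzone":
        return local_dc
    if zone_type == "globalzone":
        return local_dc if op == "read" else global_write_dc
    raise ValueError(f"unknown zone type: {zone_type}")
```

This is why GlobalZone suits low-write workloads: only writes pay the cross-city latency, while reads remain local on replicas.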
3. Database Refactoring
The migration required full data transfers of several hundred terabytes across clusters, adding DRC timestamp columns, converting primary keys from INT to BIGINT, and adjusting foreign keys, all of which involved massive DDL operations.
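The two bulk schema changes can be expressed as simple DDL templates. The column name `drc_ts` and the generator functions are hypothetical; the article does not give the actual statements:

```python
def drc_ts_ddl(table: str) -> str:
    """ALTER statement adding a DRC timestamp column (name assumed)
    that auto-updates on every write, enabling last-write-wins."""
    return (f"ALTER TABLE {table} ADD COLUMN drc_ts "
            "TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) "
            "ON UPDATE CURRENT_TIMESTAMP(6)")

def widen_pk_ddl(table: str, pk: str = "id") -> str:
    """ALTER statement widening an INT primary key to BIGINT."""
    return f"ALTER TABLE {table} MODIFY {pk} BIGINT NOT NULL AUTO_INCREMENT"
```

Run at the scale described (hundreds of terabytes, thousands of tables), statements like these are exactly what made online-DDL tooling such as mm-ost necessary.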
Business‑type segregation forced the split of over 50 databases into separate instances, and network‑segment adjustments were needed to broaden IP ranges for accounts.
HA configurations were duplicated across data centers, increasing failure‑handling capacity but also raising operational load.
4. DBA Challenges
DBAs faced consistency verification, HA management, configuration drift, capacity planning, and massive DDL workloads. To address consistency, Ele.me built the DCP platform, which performs full and incremental data checks, supports black‑/white‑list rules, and can compare table structures and multi‑dimensional data.
DCP also provides automated repair tools and scripts, handling millions of records daily across hundreds of clusters.
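The core of any such consistency check is comparing chunks of rows between the two data centers by checksum rather than row-by-row. A minimal sketch in that spirit (illustrative only; DCP itself compares live MySQL clusters and drives repairs from the diffs):

```python
import hashlib

def chunk_checksum(rows):
    """Checksum over a chunk of (pk, payload) rows, sorted by key so
    the two sides agree regardless of row order."""
    h = hashlib.sha256()
    for pk, payload in sorted(rows):
        h.update(f"{pk}:{payload}".encode())
    return h.hexdigest()

def chunks_match(src, dst):
    """True when source and destination agree for this chunk; only
    mismatching chunks then need a row-level comparison."""
    return chunk_checksum(src) == chunk_checksum(dst)
```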
For HA, the EMHA system automatically detects node changes, updates MHA configurations, notifies DRC of master switches, and synchronizes Proxy settings, reducing manual intervention.
5. DDL Automation and Tools
Traditional pt-online-schema-change‑based DDL caused high TPS spikes and latency. Ele.me developed mm‑ost, a fork of gh‑ost, enabling cross‑data‑center DDL with replication latency kept under 3‑5 seconds and supporting pause, throttling, and peak‑aware scheduling.
The release platform orchestrates mm‑ost, enforcing safety checks (DDL space, latency limits, lock handling) and can auto‑execute low‑risk changes, achieving an 8:2 ratio of automated to manual DDL deployments.
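The throttling behaviour described for mm-ost boils down to a copy loop that backs off whenever replication lag crosses a threshold. A sketch under assumed names (`copy_with_throttle`, `get_lag_ms`; not mm-ost's actual code):

```python
import time

def copy_with_throttle(chunks, copy_chunk, get_lag_ms,
                       max_lag_ms=3000, poll_s=0.01):
    """Copy row chunks for an online DDL, pausing whenever cross-DC
    replication lag exceeds max_lag_ms. Returns the number of pauses."""
    pauses = 0
    for chunk in chunks:
        while get_lag_ms() > max_lag_ms:
            pauses += 1
            time.sleep(poll_s)  # back off until replication catches up
        copy_chunk(chunk)
    return pauses
```

The same hook is where peak-aware scheduling plugs in: during traffic peaks the lag (or an explicit pause flag) stays high, so the copy loop simply waits.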
6. Benefits and Outlook
Active‑active eliminated single‑data‑center capacity bottlenecks, allowed dynamic traffic shifting during incidents, and improved overall availability. Over 20 traffic cut‑overs (including drills) have demonstrated resilience.
Future work includes adding a third data center to spread cost, implementing data sharding across regions, automating dynamic scaling, and pursuing strong consistency guarantees for critical data.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.