Meituan Database Disaster Recovery Practice: Architecture, Platform, and Future Directions
Meituan’s disaster‑recovery practice combines evolving DR architectures—from single‑active to unitized designs—with N+1 and SET patterns, dual HA models, a multi‑layer DDTP platform, high‑frequency drill frameworks, and future plans for ultra‑large‑scale automation, cross‑region resilience, overseas support, and emerging database‑mesh technologies.
This article, derived from Meituan Technology Salon Session 75, is the third part of the "Large‑Scale Database Cluster Stability" series. It presents Meituan's practical experience in building a database disaster‑recovery (DR) system, covering business architecture, DR platform capabilities, drill system, and future considerations.
1. Disaster‑Recovery Overview
Failures are classified into three categories: host‑level, datacenter‑level, and region‑level. The goal of DR is to survive large‑scale datacenter or region failures while keeping the business running. Recent high‑profile data‑center outages have made DR a hard requirement for IT enterprises.
2. Business DR Architecture
The DR architecture has evolved from single‑active (city‑level primary‑backup) to multi‑active and finally to unit‑based designs, described as DR 1.0, DR 2.0, and DR 3.0. The majority of Meituan's services are at the DR 2.0 stage (city‑level active‑active), while high‑volume, region‑level services adopt the DR 3.0 unitized approach.
Two main architectural patterns are used:
N+1 Architecture: The system is deployed across N+1 data centers, each providing at least 1/N of the total capacity, so that the loss of any single center still leaves full capacity across the remaining N. This model pushes DR capability down to PaaS components, enabling independent failover.
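The N+1 capacity rule above can be checked with simple arithmetic. This is a minimal illustrative sketch (function and data names are hypothetical, not Meituan's tooling): the binding case is always the loss of the largest data center.

```python
# Hypothetical sketch: verify that a deployment satisfies the N+1 rule,
# i.e. the loss of any single data center still leaves enough capacity
# to serve total demand.

def satisfies_n_plus_1(capacities, total_demand):
    """capacities: per-datacenter capacity; returns True if every
    single-DC failure leaves >= total_demand among the survivors."""
    total = sum(capacities)
    # The binding case is losing the largest center.
    return total - max(capacities) >= total_demand

# Four centers, each sized to 1/3 of demand (N = 3, so N+1 = 4 centers):
print(satisfies_n_plus_1([100, 100, 100, 100], 300))  # True
# Undersized: losing any one 100-unit center drops below demand.
print(satisfies_n_plus_1([100, 100, 100], 300))       # False
```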
Unitized (SET) Architecture: Applications, data, and infrastructure components are partitioned into independent units. Each unit handles a closed‑loop traffic flow and can provide intra‑city or inter‑region DR through mutual backup. This approach offers strong isolation and scalability but requires extensive application refactoring.
3. Database DR Construction
3.1 Challenges
Massive cluster scale brings performance bottlenecks, increased risk of DR failure due to complex control chains, and a higher frequency of large‑scale incidents. Additionally, drill costs are high and drill frequency is low, making real‑world validation difficult.
3.2 Basic High‑Availability
Meituan employs two HA patterns: traditional master‑slave and MySQL Group Replication (MGR). The HA layer is built on a customized Orchestrator, providing centralized control across regions.
3.3 DR Platform (DDTP)
The Database Disaster Tolerance Platform (DDTP) consists of two products: a DR control platform (defensive) and a database drill platform (offensive). Core functions include pre‑failure escape, in‑failure observation, loss mitigation, and post‑failure recovery.
Key layers:
Database Services: MySQL, Blade, MGR, etc.
Basic Capability Layer: Backup‑restore, resource management, elastic scaling, HA, monitoring.
Orchestration Layer: Operation Orchestration Service (OOS) composes capability modules into executable DR playbooks.
Platform Service Layer: DR control, observation, recovery, and plan services.
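The orchestration idea behind OOS can be sketched as composing capability modules into an ordered, executable playbook. The classes and step names below are purely illustrative assumptions, not the actual OOS API.

```python
# Illustrative sketch in the spirit of OOS: capability modules are
# composed into an ordered playbook whose steps share a context dict.
# All names here are hypothetical.

class Playbook:
    def __init__(self, name):
        self.name = name
        self.steps = []          # ordered (step_name, fn) pairs
        self.log = []            # names of steps that have run

    def step(self, name, fn):
        self.steps.append((name, fn))
        return self              # allow fluent chaining

    def run(self, ctx):
        for name, fn in self.steps:
            ctx = fn(ctx)        # each module transforms the shared context
            self.log.append(name)
        return ctx

# Compose basic-capability modules into a DR escape playbook.
escape = (Playbook("dc-escape")
          .step("detect_topology", lambda c: {**c, "replicas": 3})
          .step("switch_primary",  lambda c: {**c, "primary": c["target_dc"]})
          .step("update_config",   lambda c: {**c, "configured": True}))

result = escape.run({"target_dc": "dc-b"})
print(result["primary"])  # dc-b
```

A real orchestration service would add retries, rollback, and idempotency guarantees per step; the sketch only shows the composition pattern.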
3.4.1 Capacity Compliance
Clusters are evaluated against a six‑level N+1 standard; level 4 and above satisfy N+1 requirements, with level 5 ensuring region‑level capacity parity.
3.4.2 Pre‑Failure Escape
Batch primary‑node switch‑over and replica flow‑cutting are performed preemptively, when a fault is anticipated but before it fully materializes, reducing the impact on business.
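The batch escape described above can be sketched as iterating over clusters whose primary sits in the at‑risk data center and planning a switch to a replica elsewhere. Data shapes and names are hypothetical.

```python
# Hypothetical sketch of a pre-failure "escape": for every cluster whose
# primary lives in the at-risk data center, plan a switch-over to a
# replica hosted in a healthy center.

def plan_escape(clusters, risky_dc):
    """clusters: list of dicts with 'name', 'primary_dc', 'replica_dcs'.
    Returns (cluster_name, target_dc) pairs for primaries in risky_dc."""
    plan = []
    for c in clusters:
        if c["primary_dc"] != risky_dc:
            continue
        targets = [dc for dc in c["replica_dcs"] if dc != risky_dc]
        if targets:
            plan.append((c["name"], targets[0]))
    return plan

clusters = [
    {"name": "orders", "primary_dc": "dc-a", "replica_dcs": ["dc-b", "dc-c"]},
    {"name": "users",  "primary_dc": "dc-b", "replica_dcs": ["dc-a"]},
]
print(plan_escape(clusters, "dc-a"))  # [('orders', 'dc-b')]
```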
3.4.3 In‑Failure Observation
A real‑time DR monitoring dashboard aggregates alarms and provides a list of affected clusters for rapid manual or automated mitigation.
3.4.4 In‑Failure Loss Mitigation
When automatic HA fails, the system falls back to platform‑level mitigation, then to OOS orchestration, and finally to manual CLI tools that execute topology detection, primary election, and configuration updates.
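The last‑resort manual flow (topology detection, primary election, configuration update) might look like this in outline. The election rule used here, promoting the most caught‑up surviving replica, is a common MySQL failover convention, not necessarily Meituan's exact logic, and all data shapes are illustrative.

```python
# Hedged sketch of a last-resort failover: detect surviving replicas,
# elect the most caught-up one as the new primary, and hand the result
# to a configuration-update step. Shapes and rules are illustrative.

def elect_primary(replicas):
    """replicas: list of dicts with 'host', 'alive', 'applied_pos'
    (a monotonically comparable replication position). Returns the
    alive replica with the highest applied position."""
    candidates = [r for r in replicas if r["alive"]]
    if not candidates:
        raise RuntimeError("no surviving replica to promote")
    return max(candidates, key=lambda r: r["applied_pos"])

topology = [
    {"host": "db-1", "alive": False, "applied_pos": 120},  # failed primary
    {"host": "db-2", "alive": True,  "applied_pos": 118},
    {"host": "db-3", "alive": True,  "applied_pos": 119},  # most caught up
]
new_primary = elect_primary(topology)
print(new_primary["host"])  # db-3
```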
3.4.5 Post‑Failure Recovery
After a data‑center outage, clusters are expanded to restore N+1 capacity. Future work includes in‑place instance repair and rapid cluster scaling.
3.5 Drill System
The drill framework emphasizes multi‑environment, high‑frequency, large‑scale, and long‑chain scenarios:
Isolated Environment: Fully separated from production, allowing safe network or power cuts.
Production Environment: Large‑scale drills on live clusters (1500+ clusters) with realistic load.
Real Zone: Dedicated AZ in public cloud for authentic network partitions.
Game Day: Ongoing evaluation of feasibility for continuous production‑level drills.
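A drill campaign across the environments above could be described declaratively and filtered per environment when scheduling runs. This fragment is purely illustrative; the field names and fault types are assumptions.

```python
# Illustrative declarative description of drill scenarios across the
# environments described above; all field names are hypothetical.

DRILLS = [
    {"env": "isolated",   "fault": "power_cut",         "blast_radius": "full_dc"},
    {"env": "production", "fault": "primary_kill",      "blast_radius": "1500_clusters"},
    {"env": "real_zone",  "fault": "network_partition", "blast_radius": "one_az"},
]

def schedule(drills, env):
    """Return the faults planned for a given drill environment."""
    return [d["fault"] for d in drills if d["env"] == env]

print(schedule(DRILLS, "isolated"))  # ['power_cut']
```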
4. Future Considerations
Despite progress in automatic HA, DR governance, large‑scale fault observation, loss mitigation, and recovery, several gaps remain:
Enhance ultra‑large‑scale escape and loss‑mitigation capabilities.
Address cross‑region link failures with unitized or independent deployments.
Support overseas business requirements.
Automate DR decision‑making and reduce manual coordination.
Emerging technologies such as Database Mesh, Serverless, and new HA proxies will drive the next iteration of Meituan’s DR architecture.
Meituan Technology Team
Over 10,000 engineers powering China's leading lifestyle‑services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.