
Meituan Database Disaster Recovery Practice: Architecture, Platform, and Future Directions

Meituan’s disaster‑recovery practice combines evolving DR architectures—from single‑active to unitized designs—with N+1 and SET patterns, dual HA models, a multi‑layer DDTP platform, high‑frequency drill frameworks, and future plans for ultra‑large‑scale automation, cross‑region resilience, overseas support, and emerging database‑mesh technologies.

Meituan Technology Team

This article, derived from Meituan Technology Salon Session 75, is the third part of the "Large‑Scale Database Cluster Stability" series. It presents Meituan's practical experience in building a database disaster‑recovery (DR) system, covering business architecture, DR platform capabilities, drill system, and future considerations.

1. Disaster‑Recovery Overview

Failures are classified into three categories: host‑level, datacenter‑level, and region‑level. The objective of DR is to handle large‑scale datacenter or region failures to guarantee continuous business operation. Recent high‑profile data‑center outages have made DR a mandatory requirement for IT enterprises.

2. Business DR Architecture

The DR architecture has evolved from single‑active (city‑level primary‑backup) to multi‑active and finally to unitized designs, referred to as DR 1.0, DR 2.0, and DR 3.0. The majority of Meituan's services are at the DR 2.0 stage (city‑level active‑active), while high‑volume, region‑level services adopt the DR 3.0 unitized approach.

Two main architectural patterns are used:

N+1 Architecture: The system is deployed across N+1 data centers, each sized to carry at least 1/N of the required capacity, so the loss of any single center leaves the remaining N centers able to serve the full load. This model pushes DR capability down to PaaS components, enabling independent failover.
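The capacity rule can be sanity‑checked with a small sketch (the function and numbers below are illustrative, not Meituan's actual tooling):

```python
def survives_single_dc_loss(capacities, required_load):
    """N+1 check: the remaining centers must cover the required load
    even if the largest center is lost (the worst single failure)."""
    remaining = sum(capacities) - max(capacities)
    return remaining >= required_load

# Four centers (N = 3, so N+1 = 4), each sized to 1/3 of a 900-unit load.
print(survives_single_dc_loss([300, 300, 300, 300], 900))  # True
print(survives_single_dc_loss([300, 300, 300], 900))       # False: only N centers
```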

Unitized (SET) Architecture: Applications, data, and infrastructure components are partitioned into independent units. Each unit handles a closed‑loop traffic flow and can provide intra‑city or inter‑region DR through mutual backup. This approach offers strong isolation and scalability but requires extensive application refactoring.
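A minimal sketch of unit routing, assuming a hypothetical user‑ID partition key and two mutually backing units (the unit names and slice count are invented for illustration):

```python
SLICES = 100  # hypothetical: traffic is sharded into 100 slices by user_id

# Each unit owns a closed-loop set of slices; the two units back each other up.
unit_slices = {
    "set-a": set(range(0, 50)),
    "set-b": set(range(50, 100)),
}

def route(user_id: int) -> str:
    """Pick the unit that owns this user's traffic slice."""
    s = user_id % SLICES
    for unit, owned in unit_slices.items():
        if s in owned:
            return unit
    raise RuntimeError("slice %d is unrouted" % s)

def fail_over(failed_unit: str) -> None:
    """Mutual backup: the surviving peer absorbs the failed unit's slices."""
    orphaned = unit_slices.pop(failed_unit)
    survivor = next(iter(unit_slices))
    unit_slices[survivor] |= orphaned

print(route(1234))   # "set-a" (slice 34)
fail_over("set-a")
print(route(1234))   # "set-b" now serves slice 34
```

The closed loop is what makes the refactoring expensive: every application and data dependency inside a unit must be resolvable without crossing unit boundaries.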

3. Database DR Construction

3.1 Challenges

Massive cluster scale brings performance bottlenecks, increased risk of DR failure due to complex control chains, and a higher frequency of large‑scale incidents. Additionally, drill costs are high and drill frequency is low, making real‑world validation difficult.

3.2 Basic High‑Availability

Meituan employs two HA patterns: traditional master‑slave and MySQL Group Replication (MGR). The HA layer is built on a customized Orchestrator, providing centralized control across regions.

3.3 DR Platform (DDTP)

The Database Disaster Tolerance Platform (DDTP) consists of two products: a DR control platform (defensive) and a database drill platform (offensive). Core functions include pre‑failure escape, in‑failure observation, loss mitigation, and post‑failure recovery.

Key layers:

Database Services: MySQL, Blade, MGR, etc.

Basic Capability Layer: Backup‑restore, resource management, elastic scaling, HA, monitoring.

Orchestration Layer: Operation Orchestration Service (OOS) composes capability modules into executable DR playbooks.

Platform Service Layer: DR control, observation, recovery, and plan services.
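The orchestration idea can be sketched as composing small capability steps into an ordered playbook (illustrative only; the step names and context shape are assumptions, not the real OOS API):

```python
from typing import Callable

Step = Callable[[dict], dict]  # each capability module reads and updates a shared context

def check_replica_health(ctx: dict) -> dict:
    ctx["healthy"] = ctx.get("replica_lag_s", 999) < 5
    return ctx

def promote_replica(ctx: dict) -> dict:
    if ctx.get("healthy"):
        ctx["new_primary"] = ctx["candidate"]
    return ctx

def update_routing(ctx: dict) -> dict:
    if "new_primary" in ctx:
        ctx["routing_target"] = ctx["new_primary"]
    return ctx

def run_playbook(steps: list, ctx: dict) -> dict:
    for step in steps:  # capability modules execute in order
        ctx = step(ctx)
    return ctx

# An "escape" playbook composed from three capability modules.
escape = [check_replica_health, promote_replica, update_routing]
result = run_playbook(escape, {"candidate": "replica-2", "replica_lag_s": 1})
print(result["routing_target"])  # replica-2
```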

3.4.1 Capacity Compliance

Clusters are evaluated against a six‑level N+1 standard; level 4 and above satisfy N+1 requirements, with level 5 ensuring region‑level capacity parity.
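The article does not spell out the six level definitions, so the thresholds in this toy classifier are invented; it only illustrates the shape of such an evaluation:

```python
def n_plus_1_level(capacities, required_load, region_parity=False):
    """Toy version of the six-level N+1 standard. The real level
    criteria are internal to Meituan; these thresholds are assumed."""
    total = sum(capacities)
    survivable = total - max(capacities) if capacities else 0
    if survivable >= required_load:
        # Level 4+: capacity holds after losing any one center;
        # level 5 additionally requires region-level capacity parity.
        return 5 if region_parity else 4
    if total >= required_load:
        return 3  # full capacity only while every center is up
    return 1      # under-provisioned even with all centers up

print(n_plus_1_level([300, 300, 300, 300], 900))  # 4: N+1 compliant
print(n_plus_1_level([300, 300, 300], 900))       # 3: not N+1
```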

3.4.2 Pre‑Failure Escape

When a datacenter failure is anticipated, batch primary‑node switch‑over and replica traffic cut‑over are performed in advance, reducing the impact on business.

3.4.3 In‑Failure Observation

A real‑time DR monitoring dashboard aggregates alarms and provides a list of affected clusters for rapid manual or automated mitigation.

3.4.4 In‑Failure Loss Mitigation

When automatic HA fails, the system falls back to platform‑level mitigation, then to OOS orchestration, and finally to manual CLI tools that execute topology detection, primary election, and configuration updates.
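The escalation chain can be sketched as trying each mitigation layer in order (illustrative; the handler names are assumptions):

```python
def mitigate(cluster: str, layers) -> str:
    """Walk the fallback chain: automatic HA first, then platform-level
    mitigation, then OOS orchestration, then manual CLI tools.
    Returns the name of the first layer that succeeded."""
    for name, handler in layers:
        try:
            if handler(cluster):
                return name
        except Exception:
            pass  # a failing layer falls through to the next one
    return "unresolved"

# Simulate automatic HA failing and the platform layer succeeding.
layers = [
    ("auto-ha",  lambda c: False),
    ("platform", lambda c: True),
    ("oos",      lambda c: True),
    ("manual",   lambda c: True),
]
print(mitigate("order-db-01", layers))  # platform
```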

3.4.5 Post‑Failure Recovery

After a data‑center outage, clusters are expanded to restore N+1 capacity. Future work includes in‑place instance repair and rapid cluster scaling.

3.5 Drill System

The drill framework emphasizes multi‑environment, high‑frequency, large‑scale, and long‑chain scenarios:

Isolated Environment: Fully separated from production, allowing safe network or power cuts.

Production Environment: Large‑scale drills on live clusters (1500+ clusters) with realistic load.

Real Zone: Dedicated AZ in public cloud for authentic network partitions.

Game Day: the feasibility of running continuous production‑level drills is under ongoing evaluation.

4. Future Considerations

Despite progress in automatic HA, DR governance, large‑scale fault observation, loss mitigation, and recovery, several gaps remain:

Enhance ultra‑large‑scale escape and loss‑mitigation capabilities.

Address cross‑region link failures with unitized or independent deployments.

Support overseas business requirements.

Automate DR decision‑making and reduce manual coordination.

Emerging technologies such as Database Mesh, Serverless, and new HA proxies will drive the next iteration of Meituan’s DR architecture.
