Meituan’s Scalable Database Disaster Recovery: Architecture, Practices & Future
This article explains Meituan's multi‑stage disaster‑recovery strategy for databases, detailing the evolution from single‑active to N+1 and unit‑based architectures, the challenges of ultra‑large clusters, the DDTP platform's capabilities, and future plans to automate and extend resilience across regions.
Disaster Recovery Overview
Failures are categorized as host‑level, data‑center‑level, and region‑level. The probability of occurrence decreases from host to region, while the impact increases. The primary goal of disaster‑recovery (DR) is to sustain business continuity during large‑scale data‑center or region outages.
Business DR Architecture
Evolution of DR Architecture
DR 1.0 (Data‑centric, single‑active): Primary‑backup deployment; the standby data center does not serve traffic.
DR 2.0 (Application‑centric, same‑city dual‑active): Both data centers handle traffic; a remote cold standby provides disaster capacity.
DR 3.0 (Business‑centric, unit‑based): Each unit backs up another unit, enabling same‑city and cross‑region multi‑active deployments with strong scalability.
Meituan DR Patterns
N+1 Architecture: System capacity C is distributed across N+1 data centers, each provisioned for at least C/N. If any one center fails, the remaining N centers still provide N × C/N = C, so full capacity is sustained. DR logic is pushed down to PaaS components, which fail over independently.
SET (unit‑based) Architecture: Applications, data, and core components are split into multiple isolated units. Units mutually back up each other, achieving same‑city or cross‑region DR. This model offers strong isolation and scalability but requires extensive application refactoring and complex operations.
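The N+1 capacity rule can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the sample figures (120,000 QPS, N = 3) are assumptions, not values from Meituan.

```python
from typing import Sequence


def per_dc_capacity(total_capacity: float, n: int) -> float:
    """Minimum capacity each of the N+1 data centers must provide (C/N)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return total_capacity / n


def survives_single_failure(capacities: Sequence[float], required: float) -> bool:
    """True if losing any one data center still leaves at least `required`."""
    total = sum(capacities)
    return all(total - c >= required for c in capacities)


# Hypothetical example: C = 120,000 QPS spread over N + 1 = 4 data centers.
share = per_dc_capacity(120_000, 3)
deployment = [share] * 4
ok = survives_single_failure(deployment, 120_000)
```

Provisioning each center for only C/(N+1) would fail the same check, which is why the standard sizes each center against N rather than N+1.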
Database DR Construction
Challenges of Ultra‑Large Clusters
Performance bottlenecks: Concurrent fault handling at massive scale strains HA systems.
Control‑plane complexity: More clusters increase the risk of single points of failure.
Frequent large‑scale faults: Rare events become common as cluster count grows.
High drill cost, low frequency: Limited validation leaves many failure scenarios untested.
Basic High‑Availability Topologies
Meituan operates two primary database topologies:
Master‑slave: Applications access databases via middleware. On fault, the middleware detects the issue, adjusts the topology, pushes new configuration, and restores service.
MGR (MySQL Group Replication): Middleware adapts to MGR through “Zebra for MGR”. Automatic topology detection triggers seamless failover.
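The master‑slave flow (detect the fault, adjust the topology, push new configuration) can be sketched as follows. This is a minimal in‑memory model, not Meituan's middleware: the `Node` and `Cluster` types, the routing dictionary, and the choice of promoting the healthy replica with the lowest replication lag are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    healthy: bool = True
    replication_lag_s: float = 0.0


@dataclass
class Cluster:
    master: Node
    replicas: List[Node] = field(default_factory=list)


def pick_new_master(cluster: Cluster) -> Optional[Node]:
    """Assumed policy: promote the healthy replica with the least lag."""
    candidates = [r for r in cluster.replicas if r.healthy]
    return min(candidates, key=lambda r: r.replication_lag_s, default=None)


def failover(cluster: Cluster) -> Dict[str, object]:
    """Adjust the topology if the master is down, then return the routing
    configuration that would be pushed to clients."""
    if not cluster.master.healthy:
        new_master = pick_new_master(cluster)
        if new_master is None:
            raise RuntimeError("no healthy replica available for promotion")
        cluster.replicas.remove(new_master)
        cluster.replicas.append(cluster.master)  # old master demoted to replica
        cluster.master = new_master
    return {
        "write": cluster.master.name,
        "read": [r.name for r in cluster.replicas if r.healthy],
    }
```

In a real system the detection step would come from health probes and the configuration push would go through the middleware's own channel; both are reduced here to a boolean flag and a return value.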
DR Construction Path
Define DR objectives.
Establish DR standards (e.g., N+1 capacity grading).
Build a DR platform (DDTP).
Strengthen foundational capabilities (backup‑restore, elastic scaling, HA, monitoring).
Conduct drills and verify effectiveness.
Operate risk‑management processes.
Database Disaster Tolerance Platform (DDTP)
DDTP provides two core abilities: a defensive DR control platform and an offensive database drill platform.
Foundation layer: Backup‑restore, resource management, elastic scaling, HA, and metric monitoring.
Orchestration layer: Operation Orchestration Service (OOS) composes service‑level runbooks for pre‑escape, in‑flight fallback, damage limitation, and post‑failure recovery.
Platform service layer: Implements DR control, assessment, pre‑escape, in‑flight fallback, recovery, and pre‑plan services.
Supported database services: MySQL, Blade, MGR, etc.
Pre‑Failure Escape
When a data center is at risk, batch master switchover and diversion of replica traffic move all load out of that data center before the outage occurs. Because the switch happens while the data center is still healthy, it avoids data loss and keeps the fault invisible to the business.
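A batch escape can be thought of as two steps: build a switch plan for every cluster whose master sits in the at‑risk data center, then execute the plan in bounded batches. The sketch below assumes a simple inventory of master and replica locations; the function names, the data shapes, and the "first replica outside the data center" target choice are hypothetical and do not describe DDTP's actual interface.

```python
from typing import Dict, Iterator, List, Tuple

SwitchPlan = List[Tuple[str, str]]  # (cluster name, target data center)


def plan_escape(
    masters: Dict[str, str],
    replicas: Dict[str, List[str]],
    at_risk_dc: str,
) -> SwitchPlan:
    """List a master switch for every cluster whose master is in at_risk_dc."""
    plan: SwitchPlan = []
    for cluster, dc in sorted(masters.items()):
        if dc != at_risk_dc:
            continue
        targets = [d for d in replicas.get(cluster, []) if d != at_risk_dc]
        if not targets:
            raise ValueError(f"{cluster} has no replica outside {at_risk_dc}")
        plan.append((cluster, targets[0]))
    return plan


def batches(plan: SwitchPlan, size: int) -> Iterator[SwitchPlan]:
    """Yield the plan in chunks so that switches are not all fired at once."""
    for i in range(0, len(plan), size):
        yield plan[i:i + size]
```

Batching keeps the number of concurrent switchovers bounded, which matters at the cluster counts described earlier.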
In‑Failure Observation
A real‑time DR monitoring dashboard aggregates alarms, displays affected clusters or instances, and enables operators to trigger fallback switches quickly.
Damage Limitation During Failure
Pre‑defined runbooks handle common faults. If automatic HA fails, the platform attempts a fallback; if the platform itself is unavailable, the orchestration layer takes over. As a last resort, DBAs use a CLI tool to manually adjust topology, elect masters, and apply configurations.
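The layered fallback described above (automatic HA, then the platform, then the orchestration layer, then a manual CLI) behaves like a chain of handlers tried in order. The following is a minimal sketch of that pattern; the tier names and handler signatures are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# A tier is a name plus a handler that returns True when it restored service.
Tier = Tuple[str, Callable[[str], bool]]


def limit_damage(cluster: str, tiers: List[Tier]) -> str:
    """Try each tier in order and return the name of the one that succeeded.

    A tier that raises is treated as unavailable and skipped. If every
    tier fails, the caller is told to fall back to manual CLI operation.
    """
    for name, handler in tiers:
        try:
            if handler(cluster):
                return name
        except Exception:
            continue  # the tier itself is down; fall through to the next one
    return "manual-cli"
```

Treating an exception the same as a failed attempt reflects the case where the platform itself is unavailable, so a broken tier never blocks the ones below it.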
Post‑Failure Recovery
After a data‑center outage, the DR decision engine expands capacity in surviving zones to restore N+1 capability. Future work focuses on rapid in‑place repair and scaling to avoid reliance on large spare resources.
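To make "restore N+1 capability" concrete: after one data center is lost, the M surviving zones must again tolerate a further single failure, so under an equal‑share policy each needs at least C/(M−1). The sketch below computes the per‑zone shortfall on that assumption; the equal‑share policy, the function name, and the sample figures are illustrative and are not taken from the DR decision engine.

```python
from typing import Dict


def expansion_needed(surviving: Dict[str, float], required: float) -> Dict[str, float]:
    """Extra capacity each surviving zone needs so that losing any one of
    them still leaves `required` capacity (equal-share assumption)."""
    m = len(surviving)
    if m < 2:
        raise ValueError("at least two surviving zones are needed for N+1")
    target = required / (m - 1)
    return {zone: max(0.0, target - cap) for zone, cap in surviving.items()}


# Hypothetical example: C = 120,000 QPS, three zones of 40,000 QPS survive.
shortfall = expansion_needed(
    {"dc1": 40_000, "dc2": 40_000, "dc3": 40_000}, 120_000
)
```

In this example each surviving zone must grow by half its current size, which illustrates why recovery by expansion depends on spare resources.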
Future Considerations
Closing capability gaps: Enhance ultra‑large‑scale escape and damage‑limitation, address cross‑region link failures, and support overseas expansion challenges.
Architecture iteration: Incorporate emerging technologies such as Database Mesh, Serverless, new HA proxies, and storage‑compute separation, adapting the DDTP platform accordingly.
Continuous improvement of DR automation, drill efficiency, and cross‑region resilience remains a strategic priority.
dbaplus Community
