Meituan Database Fault‑Injection and Chaos Engineering Practice
The article details Meituan's large‑scale database fault‑injection platform, explaining its architecture, capabilities, workflow, blast‑radius controls, random unnotified drills, operational metrics, and future plans aligned with a chaos‑engineering maturity model.
01 Background
1.1 Introduction to Chaos Engineering
Chaos engineering is the practice of running experiments that inject failures into a system in order to build confidence in its ability to withstand uncontrolled conditions. Its goals are better fault tolerance, lower failure rates, and faster incident response.
The practice originated at Netflix in 2008 after a major database outage and was formalized in 2015. Open-source tools such as ChaosBlade (2019) and Chaos Mesh (2020) followed.
1.2 Current DB Operations Status
Meituan's database operations face five trends: linear growth in the size and number of clusters, continuously increasing access volume, a widening variety of incidents, a larger impact per incident, and, because of the sheer scale, a higher chance that low-probability events actually occur.
1.3 Pain Points & Role of Fault‑Injection
To meet higher stability requirements, the team focuses on five fault‑management stages: prevention, detection, analysis, recovery, and post‑mortem. Manual drills suffered from limited scenario coverage, low scalability, inability to test large‑scale failures, and poor blast‑radius control. The new platform addresses these by validating component defenses, verifying disaster‑recovery capability, testing response plans, and predicting business impact.
02 Construction Practice
2.1 System Architecture
The platform consists of six modules: Permission Management, Exercise Evaluation, Fault Injection, Metric Observation, Injection Termination, and Intelligent Analysis.
2.2 Capability Panorama
Core capabilities fall into two groups. The first is coverage: the platform supports MySQL and integrates with the surrounding components, namely middleware, data transfer, HA systems, and Binlog servers. The second is a closed-loop process that covers orchestration, injection, observation, termination, and review.
2.3 Fault Injection Capability
Supported fault types focus on node crashes (primary/replica), primary‑replica lag, and high active connections, covering >80% of observed incidents. The platform can orchestrate serial and parallel sub‑tasks and inject faults into physical machines, VMs, or containers.
Concurrent injection capacity reaches over 5,000 nodes per minute.
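The serial and parallel orchestration described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `Stage` type, `run_plan`, and `make_fault` are hypothetical names, and the "injector" only returns a string instead of touching a real node.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stage:
    """A group of sub-tasks that may run in parallel; stages themselves run serially."""
    name: str
    tasks: List[Callable[[], str]] = field(default_factory=list)


def run_plan(stages: List[Stage], max_workers: int = 8) -> List[str]:
    """Run stages one after another; run the tasks inside each stage concurrently."""
    log: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for stage in stages:
            # pool.map returns results in task order, and list() blocks
            # until every task in this stage has finished.
            results = list(pool.map(lambda task: task(), stage.tasks))
            log.extend(f"{stage.name}: {r}" for r in results)
    return log


def make_fault(kind: str, node: str) -> Callable[[], str]:
    """Hypothetical stand-in for a real injector; it only reports what it would do."""
    return lambda: f"{kind} on {node}"


plan = [
    Stage("lag", [make_fault("replica_lag", "db-1-replica"),
                  make_fault("replica_lag", "db-2-replica")]),
    Stage("crash", [make_fault("primary_crash", "db-1-primary")]),
]
log = run_plan(plan)
```

Here both lag injections run in parallel, and the primary crash starts only after they have completed.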
2.4 Exercise Workflow
The workflow is split into pre‑exercise, during‑exercise, and post‑exercise phases, with risk assessment, multi‑level approval, group notifications, pre‑checks, automatic termination on large impact, and post‑exercise cleanup and data collection.
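One way to picture the three phases is a small driver that runs pre-checks, injects, watches an impact metric, and always cleans up. The sketch below is illustrative only: the function name, the error-rate metric, and the 5% threshold are assumptions, not values from the platform.

```python
from typing import Callable, Iterable, List


def run_drill(
    pre_checks: Iterable[Callable[[], bool]],
    inject: Callable[[], None],
    terminate: Callable[[], None],
    error_rate_samples: Iterable[float],
    max_error_rate: float = 0.05,  # assumed threshold for "large impact"
) -> List[str]:
    """Run one drill and return the list of lifecycle events that occurred."""
    events: List[str] = []

    # Pre-exercise: refuse to start if any pre-check fails.
    if not all(check() for check in pre_checks):
        events.append("aborted: pre-check failed")
        return events

    inject()
    events.append("injected")
    try:
        # During exercise: stop as soon as the observed impact is too large.
        for rate in error_rate_samples:
            if rate > max_error_rate:
                events.append(f"auto-terminated: error rate {rate:.2f}")
                break
        else:
            events.append("completed")
    finally:
        # Post-exercise: cleanup runs whether the drill finished or was cut short.
        terminate()
        events.append("cleaned up")
    return events


state = {"fault_active": False}
events = run_drill(
    pre_checks=[lambda: True],
    inject=lambda: state.update(fault_active=True),
    terminate=lambda: state.update(fault_active=False),
    error_rate_samples=[0.01, 0.02, 0.20, 0.01],
)
```

Putting termination in a `finally` block means the fault is removed even if observation itself raises an error.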
2.5 Blast‑Radius Control
Control is achieved through physical isolation (single‑scenario per cluster, single‑machine selection) and traffic throttling via weighted routing in middleware. Additional checks before, during, and after the drill further limit impact.
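To see why lowering a node's routing weight limits the blast radius, the following sketch computes the share of traffic each node receives under weighted routing. The node names, weights, and helper functions are hypothetical; the real middleware's configuration format is not described in the source.

```python
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of requests per node under weighted routing."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    return {node: weight / total for node, weight in weights.items()}


def throttle(weights: Dict[str, int], target: str, new_weight: int) -> Dict[str, int]:
    """Return a copy of the weights with the drill target's weight replaced."""
    if target not in weights:
        raise KeyError(target)
    updated = dict(weights)
    updated[target] = new_weight
    return updated


weights = {"replica-a": 10, "replica-b": 10, "replica-c": 10}
before = traffic_share(weights)
after = traffic_share(throttle(weights, "replica-c", 1))
```

With these example numbers, the target replica's share drops from one third of requests to 1/21, so a fault injected there touches far fewer requests.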
2.6 Exercise Review
Review comprises four parts: overview (duration, clusters, scenarios, success rate, alarm count, request success rate), key steps with metric data, defense component metrics (e.g., detection time and switchover time for primary crash), and detailed timeline per cluster.
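As an illustration of the defense-component metrics for a primary crash, detection time and switchover time can be derived from a per-cluster timeline. The event names (`injected`, `detected`, `switched`) and the timestamps below are made up for the example.

```python
from datetime import datetime
from typing import Dict


def defense_metrics(timeline: Dict[str, str]) -> Dict[str, float]:
    """Derive detection and switchover durations (seconds) from ISO timestamps."""
    t = {name: datetime.fromisoformat(ts) for name, ts in timeline.items()}
    detection = (t["detected"] - t["injected"]).total_seconds()
    switchover = (t["switched"] - t["detected"]).total_seconds()
    return {
        "detection_s": detection,
        "switchover_s": switchover,
        "total_s": detection + switchover,
    }


timeline = {
    "injected": "2023-01-01T10:00:00",
    "detected": "2023-01-01T10:00:12",
    "switched": "2023-01-01T10:00:30",
}
metrics = defense_metrics(timeline)
```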
2.7 Random & Unnotified Exercises
Since 2022, the platform can generate random, unannounced drills across arbitrary clusters, scenarios, and times, simulating real‑world failure randomness while still applying risk checks and multi‑level approvals.
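A random drill generator of this kind might look like the sketch below: it picks a cluster, scenario, and start time at random, but only from clusters that pass a risk check, and it still marks the drill as needing approval. All names here are hypothetical, and the risk check is a placeholder predicate.

```python
import random
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional, Sequence


def plan_random_drill(
    clusters: Sequence[str],
    scenarios: Sequence[str],
    window_start: datetime,
    window_hours: int,
    rng: random.Random,
    is_allowed: Callable[[str], bool],
) -> Optional[Dict[str, object]]:
    """Pick a random drill among clusters that pass the risk check."""
    candidates = [c for c in clusters if is_allowed(c)]
    if not candidates:
        return None  # nothing is safe to drill right now
    offset_minutes = rng.randrange(window_hours * 60)
    return {
        "cluster": rng.choice(candidates),
        "scenario": rng.choice(list(scenarios)),
        "start": window_start + timedelta(minutes=offset_minutes),
        "needs_approval": True,  # randomness does not bypass approval
    }


rng = random.Random(42)  # seeded only so the example is reproducible
window_start = datetime(2023, 1, 1)
drill = plan_random_drill(
    clusters=["c1", "c2", "c3"],
    scenarios=["primary_crash", "replica_lag"],
    window_start=window_start,
    window_hours=24,
    rng=rng,
    is_allowed=lambda cluster: cluster != "c2",  # placeholder risk check
)
```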
2.8 Mode Comparison
A comparison of regular, random unnotified, and pure chaos‑engineering drills shows differences in purpose (known vs. unknown failures), human supervision, and pre‑notification requirements.
2.9 Operation System
Operational metrics cover overall drill count and cluster coverage, drill coverage and compliance rates, post‑drill feedback, large‑scale drill size, injection success rate, injection latency distribution, business integration, and platform API success/latency.
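Two of these metrics, injection success rate and injection latency distribution, can be computed from simple per-injection records. The sketch below assumes a record is a `(succeeded, latency_seconds)` pair and uses nearest-rank percentiles; both choices are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def nearest_rank(sorted_values: Sequence[float], q: int) -> Optional[float]:
    """Nearest-rank percentile of an already sorted sequence (q in 1..100)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def injection_stats(records: List[Tuple[bool, float]]) -> Dict[str, Optional[float]]:
    """Success rate over all records, latency percentiles over successful ones."""
    if not records:
        raise ValueError("no injection records")
    latencies = sorted(latency for ok, latency in records if ok)
    return {
        "success_rate": len(latencies) / len(records),
        "p50_s": nearest_rank(latencies, 50),
        "p95_s": nearest_rank(latencies, 95),
    }


records = [(True, 1.0), (True, 2.0), (True, 3.0), (True, 4.0), (False, 9.0)]
stats = injection_stats(records)
```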
03 Rollout in Practice
3.1 Promotion
Drill promotion uses three forms: fault‑driven (triggered by real incidents), proactive learning drills, and DBA‑organized large‑scale exercises.
3.2 Experiment Environments
Three environments are provided: offline (for issue discovery and validation), online rehearsal (large‑scale disaster‑recovery verification), and live production (full‑scale realistic testing).
3.3 Drill Scale
Scales are classified as single‑cluster, medium (tens of clusters), and large‑scale (hundreds of clusters) to validate different aspects of resilience.
3.4 Scenarios & Levels
Drills progress from low‑impact to high‑impact scenarios and from low‑level to high‑level clusters, building confidence gradually.
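The gradual progression can be expressed as an ordering over (cluster, scenario) pairs. In this sketch the integer ranks, the example names, and the choice to finish all scenarios on lower-level clusters before moving to higher-level ones are assumptions; the source only states the general direction.

```python
from itertools import product
from typing import Dict, List, Tuple


def rollout_order(
    cluster_levels: Dict[str, int],
    scenario_impacts: Dict[str, int],
) -> List[Tuple[str, str]]:
    """Order drills by cluster level first, then by scenario impact (low to high)."""
    pairs = product(cluster_levels, scenario_impacts)
    return sorted(
        pairs,
        # The pair itself is the final tie-breaker so the order is deterministic.
        key=lambda p: (cluster_levels[p[0]], scenario_impacts[p[1]], p),
    )


cluster_levels = {"core-orders": 2, "internal-tools": 1}    # larger = more critical
scenario_impacts = {"primary_crash": 2, "replica_lag": 1}   # larger = more impact
order = rollout_order(cluster_levels, scenario_impacts)
```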
3.5 Operational Data
Data shows low coverage for regular drills, early-stage adoption of random drills (2022 Q4), and rapid increase in large‑scale drills since late 2022, with many tasks involving over 100 clusters.
3.6 Defense Capability Verification
Verification focuses on RTO/RPO compliance, scaling decisions, and observability during large‑scale failures.
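An RTO/RPO compliance check reduces to comparing measured values against per-cluster objectives. The sketch below assumes both are expressed in seconds (recovery time, and the time span of writes lost); the `Objective` type and the example targets are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Objective:
    rto_s: float  # maximum acceptable recovery time, in seconds
    rpo_s: float  # maximum acceptable span of lost writes, in seconds


def check_compliance(recovery_s: float, data_loss_s: float, objective: Objective) -> Dict[str, bool]:
    """Compare what a drill measured against the cluster's objective."""
    return {
        "rto_ok": recovery_s <= objective.rto_s,
        "rpo_ok": data_loss_s <= objective.rpo_s,
    }


objective = Objective(rto_s=60.0, rpo_s=0.0)  # example targets, not Meituan's
fast = check_compliance(recovery_s=30.0, data_loss_s=0.0, objective=objective)
slow = check_compliance(recovery_s=90.0, data_loss_s=0.0, objective=objective)
```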
3.7 Business Benefits
Although drills may cause temporary disruption, they uncover hidden issues, reduce risk, and validate defense components. These benefits have won acceptance from business teams.
04 Future Outlook
4.1 Chaos‑Engineering Maturity Model (CEMM)
The current platform is compared against the CEMM, revealing gaps in scenario coverage, steady-state metric coverage, experiment observation, and cultural adoption.
4.2 Maturity Levels
Meituan is at the foundational level: fault drills are limited to core departments, and stability gains are still at an early stage.
4.3 Specific Plans
Five focus areas are outlined: reducing per‑drill cost, enriching fault scenarios (including link‑level failures), enhancing observability, advancing automation (steady‑state assessment, auto‑termination, drill recommendation), and achieving routine, continuous drills.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
