Meituan Database Fault‑Injection and Chaos Engineering Practice

The article details Meituan's large‑scale database fault‑injection platform, explaining its architecture, capabilities, workflow, blast‑radius controls, random unnotified drills, operational metrics, and future plans aligned with a chaos‑engineering maturity model.

Meituan Technology Team

May 25, 2023

01 Background

1.1 Introduction to Chaos Engineering

Chaos engineering is the practice of running experiments that inject failures into a system in order to build confidence in its ability to withstand uncontrolled conditions. The goals are better fault tolerance, a lower failure rate, and faster incident response.

The practice originated at Netflix after a major database outage in 2008 and was formalized in 2015. Open‑source tools such as ChaosBlade (2019) and Chaos Mesh (2020) followed.

1.2 Current DB Operations Status

Meituan's database operations face five trends: linear growth in cluster size and count, a continuous increase in access volume, a rising variety of incidents, a larger impact per incident, and, because of that scale, a higher probability that low‑frequency events actually occur.

1.3 Pain Points & Role of Fault‑Injection

To meet higher stability requirements, the team focuses on five fault‑management stages: prevention, detection, analysis, recovery, and post‑mortem. Manual drills suffered from limited scenario coverage, low scalability, inability to test large‑scale failures, and poor blast‑radius control. The new platform addresses these by validating component defenses, verifying disaster‑recovery capability, testing response plans, and predicting business impact.

02 Construction Practice

2.1 System Architecture

The platform consists of six modules: Permission Management, Exercise Evaluation, Fault Injection, Metric Observation, Injection Termination, and Intelligent Analysis.

2.2 Capability Panorama

Core capabilities fall into two areas. The first is coverage: the platform supports MySQL and integrates with the surrounding components, namely middleware, data transfer, HA systems, and Binlog servers. The second is a closed‑loop process covering orchestration, injection, observation, termination, and review.

2.3 Fault Injection Capability

Supported fault types focus on node crashes (primary or replica), primary‑replica lag, and high active connection counts, which together cover more than 80% of observed incidents. The platform can orchestrate serial and parallel sub‑tasks and inject faults into physical machines, VMs, or containers.
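The serial and parallel orchestration described above can be illustrated with a small plan tree. This is a minimal sketch, not the platform's actual design: the `Task`, `Serial`, `Parallel`, and `run` names are hypothetical, and each leaf simply calls a Python function where a real system would call an injection agent.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Task:
    """A leaf sub-task: one fault injected into one target (hypothetical shape)."""
    name: str
    action: Callable[[], str]


@dataclass
class Serial:
    """Run children one after another, in order."""
    children: List["Node"]


@dataclass
class Parallel:
    """Run children concurrently and wait for all of them."""
    children: List["Node"]


Node = Union[Task, Serial, Parallel]


def run(node: Node) -> List[str]:
    """Execute a plan tree and return the results of its leaf tasks."""
    if isinstance(node, Task):
        return [node.action()]
    if isinstance(node, Serial):
        results: List[str] = []
        for child in node.children:
            results.extend(run(child))
        return results
    # Parallel: one worker per child; map() keeps results in child order.
    with ThreadPoolExecutor(max_workers=max(1, len(node.children))) as pool:
        nested = list(pool.map(run, node.children))
    return [result for sub in nested for result in sub]


# Example plan: inject lag first, then crash two replicas at the same time.
plan = Serial([
    Task("lag", lambda: "replica lag injected"),
    Parallel([
        Task("crash-replica-1", lambda: "replica-1 crashed"),
        Task("crash-replica-2", lambda: "replica-2 crashed"),
    ]),
])
```

Calling `run(plan)` executes the lag task first and then both crash tasks concurrently. Nesting the two container types is enough to express most drill scripts without a separate scheduling language.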

Concurrent injection capacity exceeds 5,000 nodes per minute.

2.4 Exercise Workflow

The workflow is split into three phases. Before the exercise, the platform runs a risk assessment, routes the request through multi‑level approval, sends group notifications, and performs pre‑checks. During the exercise, it terminates the injection automatically if the impact grows too large. After the exercise, it cleans up and collects data for review.
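The three phases can be sketched as a single function that refuses to start when a pre‑check fails, stops early when impact crosses a threshold, and always cleans up. The names (`run_drill`, `DrillResult`) and the 5% error‑rate threshold are assumptions made for illustration; the article does not state how the platform measures impact.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class DrillResult:
    status: str                       # "rejected", "completed", or "terminated"
    log: List[str] = field(default_factory=list)


def run_drill(
    pre_checks: Iterable[Callable[[], bool]],
    inject: Callable[[], None],
    error_rates: Iterable[float],     # stream of observed business error rates
    cleanup: Callable[[], None],
    max_error_rate: float = 0.05,     # assumed abort threshold, not from the article
) -> DrillResult:
    result = DrillResult(status="completed")

    # Pre-exercise: any failing check blocks the drill before anything is injected.
    if not all(check() for check in pre_checks):
        result.status = "rejected"
        result.log.append("pre-check failed; nothing injected")
        return result

    try:
        # During exercise: inject, then watch impact and stop early if it is too large.
        inject()
        result.log.append("fault injected")
        for rate in error_rates:
            result.log.append(f"observed error rate {rate:.2%}")
            if rate > max_error_rate:
                result.status = "terminated"
                result.log.append("impact above threshold; terminating early")
                break
    finally:
        # Post-exercise: cleanup runs even if injection or observation raised.
        cleanup()
        result.log.append("cleanup done")
    return result
```

Putting cleanup in a `finally` block is the main design point: the system is restored whether the drill completes, is terminated, or fails partway through.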

2.5 Blast‑Radius Control

Control is achieved through physical isolation (single‑scenario per cluster, single‑machine selection) and traffic throttling via weighted routing in middleware. Additional checks before, during, and after the drill further limit impact.
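As a rough illustration of traffic throttling through weighted routing, the sketch below lowers the weight of the node under test so that only a small share of requests can reach it. The function names and the weight values are hypothetical; the article does not describe the middleware's actual routing interface.

```python
import random
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of requests each node receives under weighted routing."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node must have a positive weight")
    return {node: weight / total for node, weight in weights.items()}


def throttle(weights: Dict[str, int], target: str, new_weight: int) -> Dict[str, int]:
    """Return a copy of the routing table with the drill target's weight lowered."""
    updated = dict(weights)
    updated[target] = new_weight
    return updated


def pick_node(weights: Dict[str, int], rng: random.Random) -> str:
    """Route one request: choose a node with probability proportional to its weight."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


# Three replicas with equal weight; the drill target is cut to a tenth of its weight.
normal = {"replica-1": 100, "replica-2": 100, "replica-3": 100}
during_drill = throttle(normal, "replica-3", 10)
```

With these numbers, `traffic_share(during_drill)["replica-3"]` is 10/210, a little under 5% of reads, so a fault on that replica touches a bounded slice of traffic. Setting the weight to 0 removes the node from routing entirely.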

2.6 Exercise Review

Review comprises four parts: overview (duration, clusters, scenarios, success rate, alarm count, request success rate), key steps with metric data, defense component metrics (e.g., detection time and switchover time for primary crash), and detailed timeline per cluster.
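The defense‑component metrics for a primary crash can be derived from a per‑cluster timeline. The sketch below assumes three timestamped events named `injected`, `detected`, and `switched`, and assumes switchover time is measured from detection to completion of the switch; both the event names and that definition are illustrative choices, not taken from the article.

```python
from datetime import datetime
from typing import Dict


def defense_metrics(timeline: Dict[str, str]) -> Dict[str, float]:
    """Compute detection and switchover durations (seconds) from ISO timestamps."""
    t = {name: datetime.fromisoformat(ts) for name, ts in timeline.items()}
    detection = (t["detected"] - t["injected"]).total_seconds()
    switchover = (t["switched"] - t["detected"]).total_seconds()
    return {
        "detection_s": detection,
        "switchover_s": switchover,
        "total_s": detection + switchover,
    }


# Hypothetical timeline for one cluster in a primary-crash drill.
timeline = {
    "injected": "2023-05-25T10:00:00",
    "detected": "2023-05-25T10:00:08",
    "switched": "2023-05-25T10:00:20",
}
metrics = defense_metrics(timeline)
```

For this sample timeline the result is 8 seconds to detect and 12 seconds to switch over, 20 seconds in total.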

2.7 Random & Unnotified Exercises

Since 2022, the platform can generate random, unannounced drills across arbitrary clusters, scenarios, and times, simulating real‑world failure randomness while still applying risk checks and multi‑level approvals.
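One way to picture a random drill generator is a loop that draws a cluster, a scenario, and a start time, and keeps only a draw that passes the risk check. This is a sketch under assumptions: `DrillPlan`, `draw_random_drill`, and the `risk_ok` callback are hypothetical names, and the multi‑level approval step is left out.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class DrillPlan:
    cluster: str
    scenario: str
    start: datetime


def draw_random_drill(
    clusters: Sequence[str],
    scenarios: Sequence[str],
    window_start: datetime,
    window_hours: int,
    risk_ok: Callable[[DrillPlan], bool],   # stand-in for the platform's risk checks
    rng: random.Random,
    max_attempts: int = 20,
) -> Optional[DrillPlan]:
    """Draw random plans until one passes the risk check, or give up."""
    for _ in range(max_attempts):
        plan = DrillPlan(
            cluster=rng.choice(clusters),
            scenario=rng.choice(scenarios),
            start=window_start + timedelta(minutes=rng.randrange(window_hours * 60)),
        )
        if risk_ok(plan):
            return plan
    return None  # no acceptable plan found; skip this round
```

Passing a seeded `random.Random` makes a draw reproducible for review, while the participants still receive no advance notice of which cluster or scenario was chosen.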

2.8 Mode Comparison

A comparison of regular, random unnotified, and pure chaos‑engineering drills shows differences in purpose (known vs. unknown failures), human supervision, and pre‑notification requirements.

2.9 Operation System

Operational metrics cover overall drill count and cluster coverage, drill coverage and compliance rates, post‑drill feedback, large‑scale drill size, injection success rate, injection latency distribution, business integration, and platform API success/latency.
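Two of these metrics, injection success rate and injection latency distribution, can be computed from per‑injection records. The sketch below assumes each record is a `(succeeded, latency_seconds)` pair and uses nearest‑rank percentiles over successful injections only; the record shape and the choice of p50 and p99 are assumptions.

```python
import math
from typing import Dict, List, Tuple


def injection_stats(records: List[Tuple[bool, float]]) -> Dict[str, float]:
    """Summarize injection attempts given (succeeded, latency_seconds) pairs."""
    if not records:
        raise ValueError("no injection records")
    latencies = sorted(latency for ok, latency in records if ok)

    def percentile(p: float) -> float:
        # Nearest-rank percentile over successful injections only.
        if not latencies:
            return float("nan")
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "success_rate": len(latencies) / len(records),
        "p50_s": percentile(50),
        "p99_s": percentile(99),
    }


# Hypothetical sample: four successful injections and one failure.
stats = injection_stats([(True, 1.0), (True, 2.0), (True, 3.0), (True, 4.0), (False, 0.0)])
```

For this sample the success rate is 0.8, with a p50 of 2.0 seconds and a p99 of 4.0 seconds.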

03 Rollout

3.1 Promotion

Drills are promoted in three ways: fault‑driven drills (triggered by real incidents), proactive learning drills, and DBA‑organized large‑scale exercises.

3.2 Experiment Environments

Three environments are provided: offline (for issue discovery and validation), online rehearsal (large‑scale disaster‑recovery verification), and live production (full‑scale realistic testing).

3.3 Drill Scale

Scales are classified as single‑cluster, medium (tens of clusters), and large‑scale (hundreds of clusters) to validate different aspects of resilience.

3.4 Scenarios & Levels

Drills progress from low‑impact to high‑impact scenarios and from low‑level to high‑level clusters, building confidence gradually.

3.5 Operational Data

The data shows low coverage for regular drills, early‑stage adoption of random drills (starting in 2022 Q4), and a rapid increase in large‑scale drills since late 2022, with many tasks involving over 100 clusters.

3.6 Defense Capability Verification

Verification focuses on RTO/RPO compliance, scaling decisions, and observability during large‑scale failures.
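RTO/RPO compliance after a large‑scale drill can be checked by comparing each cluster's measured recovery time and data‑loss window against the targets. The `Measurement` type, the `non_compliant` function, and the sample targets are hypothetical; the article does not state Meituan's actual objectives.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Measurement:
    cluster: str
    recovery_s: float    # time until service was restored
    data_loss_s: float   # window of lost writes, expressed in seconds


def non_compliant(
    measurements: List[Measurement], rto_s: float, rpo_s: float
) -> Dict[str, List[str]]:
    """Map each violating cluster to the objectives it missed."""
    violations: Dict[str, List[str]] = {}
    for m in measurements:
        missed = []
        if m.recovery_s > rto_s:
            missed.append("RTO")
        if m.data_loss_s > rpo_s:
            missed.append("RPO")
        if missed:
            violations[m.cluster] = missed
    return violations


# Hypothetical results from a three-cluster drill, checked against RTO 30 s, RPO 0 s.
results = [
    Measurement("cluster-a", recovery_s=20, data_loss_s=0),
    Measurement("cluster-b", recovery_s=45, data_loss_s=0),
    Measurement("cluster-c", recovery_s=50, data_loss_s=2),
]
report = non_compliant(results, rto_s=30, rpo_s=0)
```

Here `report` lists `cluster-b` as missing its RTO and `cluster-c` as missing both objectives, while `cluster-a` is compliant and therefore absent.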

3.7 Business Benefits

Although drills may cause temporary disruption, they uncover hidden issues, reduce risk, and validate defense components. These benefits have earned the drills acceptance from business teams.

04 Future Outlook

4.1 Chaos‑Engineering Maturity Model (CEEM)

The current platform is compared against the CEEM model, revealing gaps in scenario coverage, steady‑state metric coverage, experiment observation, and cultural adoption.

4.2 Maturity Levels

Meituan is at the foundational level: fault drills are limited to core departments, and stability gains are still at an early stage.

4.3 Specific Plans

Five focus areas are outlined: reducing per‑drill cost, enriching fault scenarios (including link‑level failures), enhancing observability, advancing automation (steady‑state assessment, auto‑termination, drill recommendation), and achieving routine, continuous drills.
