Stability Testing Practices for Meituan Smart Payment: Fault Drills, Online Load Testing, and Continuous Operation

Meituan’s smart‑payment team faces growing system complexity and third‑party failures. Its stability‑building program raises availability through flexible degradation and rapid recovery, validated by three core QA practices: fault drills, online full‑link load testing, and a continuous operation system that standardizes processes, visualizes metrics, and automates resilience checks.

Meituan Technology Team

This article introduces the challenges faced by Meituan's smart payment business in the stability domain and highlights QA methods and practices for stability testing.

Background: Meituan Payment carries all of Meituan's transaction flows. It consists of online payment (supporting e‑commerce, food delivery, travel, and so on) and smart payment (supporting in‑store consumption through POS, QR code, and box payment). Smart payment has become one of the fastest‑growing services.

Challenges: Rapid business growth increases system complexity—more entry points, richer payment channels, vertical service layering, horizontal service splitting, and growing dependencies on external systems (marketing, membership, risk control) and internal infrastructure (queues, caches). Over 20 core service nodes are involved. Team expansion from a few members to nearly a hundred adds instability. Historical incidents show 72% of severe failures stem from third‑party services or infrastructure (e.g., unstable payment channels, message queue failures), causing cascading outages.

Solution: A stability‑building project aims to raise availability from 99% (two nines) to 99.9% (three nines) and eventually to 99.99% (four nines). Two core strategies are adopted: flexible availability (core functions stay usable under degradation) and rapid recovery (faults are located and resolved quickly). Common operations include rate limiting, circuit breaking, scaling, SOPs, and automated fault handling. QA focuses on validating these operations through three key practices: fault drills, online load testing, and a continuous operation system.
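To make availability targets concrete, the arithmetic below converts a number of nines into an allowed downtime budget. It is a minimal sketch: the 365-day year and the function names are assumptions for illustration, not figures from the Meituan team.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def availability_for_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year, in minutes, for the given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {availability_for_nines(n):.3%} available, "
              f"{downtime_minutes_per_year(n):.1f} min/year downtime budget")
```

Each additional nine cuts the budget by a factor of ten: about 8.8 hours per year at three nines versus about 53 minutes at four nines.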

Fault Drills – Origin

A real incident occurred when a payment channel became unstable. The pre‑planned response (server disables the channel, client greys out the option) failed: the client still showed the channel as active. This highlighted the need to reproduce fault scenarios to verify response plans.
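The mismatch in this incident (a channel disabled on the server but still selectable on the client) is the kind of condition a drill can assert on directly. The sketch below shows one possible check; the `ChannelView` shape and the channel names are hypothetical, not the team's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ChannelView:
    """One payment channel as rendered by the client (hypothetical shape)."""
    name: str
    selectable: bool


def find_inconsistent_channels(server_disabled: Set[str],
                               client_views: Iterable[ChannelView]) -> List[str]:
    """Return channels the server has disabled but the client still offers."""
    return sorted(v.name for v in client_views
                  if v.selectable and v.name in server_disabled)


# Example: the server disabled "channel_a", yet the client still shows it as selectable.
server_disabled = {"channel_a"}
client_views = [ChannelView("channel_a", True), ChannelView("channel_b", True)]
stale = find_inconsistent_channels(server_disabled, client_views)
print(stale)
```

A drill would fail whenever the returned list is non-empty, which is exactly the situation the response plan was supposed to prevent.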

Fault Drill Overall Plan

The plan consists of three modules:

Load generation module – reproduces real business flows, covering core processes.

Fault injection module – provides tools and a fault sample library covering external services, infrastructure, data centers, network, focusing on timeout and exception cases.

Business verification module – combines automated test cases with monitoring dashboards.
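As an illustration of how fault injection and business verification can fit together, the sketch below injects a delay or an exception into a dependency call and then checks that the core flow degrades instead of failing. Everything here is hypothetical: the in-memory fault table, the `fault_point` decorator, the `marketing` dependency, and the fallback policy are assumptions, not Meituan's tooling.

```python
import time
from functools import wraps

# Hypothetical in-memory fault table: dependency name -> fault spec.
ACTIVE_FAULTS = {}


class InjectedFault(Exception):
    """Raised when a drill injects an exception into a dependency call."""


def inject_fault(target, *, delay_s=0.0, raise_error=False):
    """Register a timeout-style delay and/or an exception for a named dependency."""
    ACTIVE_FAULTS[target] = {"delay_s": delay_s, "raise_error": raise_error}


def clear_faults():
    """Remove all injected faults, restoring normal behavior."""
    ACTIVE_FAULTS.clear()


def fault_point(target):
    """Decorator marking a dependency call as a place where faults may be injected."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            fault = ACTIVE_FAULTS.get(target)
            if fault:
                if fault["delay_s"] > 0:
                    time.sleep(fault["delay_s"])  # simulate a slow dependency
                if fault["raise_error"]:
                    raise InjectedFault(f"injected failure in {target}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@fault_point("marketing")
def get_discount(order_id):
    """Stand-in for a call to an external marketing service."""
    return 5


def amount_to_pay(order_id, price):
    """Core flow: if the marketing call fails, charge full price instead of failing."""
    try:
        discount = get_discount(order_id)
    except Exception:
        discount = 0  # assumed degradation policy, for illustration only
    return price - discount
```

A drill step then becomes: call `inject_fault("marketing", raise_error=True)`, run the automated payment cases, confirm the payment still succeeds, and call `clear_faults()` afterwards.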

Two‑stage execution: first, single‑system drills covering all protection plans; second, full‑link drills focusing on core service failures to validate end‑to‑end fault tolerance.

Fault Drill Effects

Drills uncovered hidden issues such as database master‑slave replication lag affecting transactions, lack of degradation for infrastructure failures, and unreasonable timeout or rate‑limit settings for dependent services.
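Unreasonable timeout settings of the kind listed above can sometimes be caught before a drill with a static check. The sketch below flags a service whose worst-case downstream time, including retries, exceeds its own timeout. The configuration shape, service names, and numbers are hypothetical, and the check assumes downstream calls are made sequentially.

```python
# Hypothetical config: each service's timeout as seen by its callers,
# plus the downstream calls it makes one after another.
CONFIG = {
    "gateway": {"timeout_ms": 1000, "calls": [{"to": "pay_core", "retries": 0}]},
    "pay_core": {"timeout_ms": 800, "calls": [{"to": "channel", "retries": 1}]},
    "channel": {"timeout_ms": 500, "calls": []},
}


def find_timeout_violations(config):
    """Flag services whose worst-case downstream time exceeds their own timeout."""
    violations = []
    for name, svc in config.items():
        # Each call may run (retries + 1) times, each up to the callee's timeout.
        worst_case = sum(config[c["to"]]["timeout_ms"] * (c["retries"] + 1)
                         for c in svc["calls"])
        if worst_case > svc["timeout_ms"]:
            violations.append(f"{name}: downstream worst case {worst_case} ms "
                              f"> own timeout {svc['timeout_ms']} ms")
    return violations


if __name__ == "__main__":
    for line in find_timeout_violations(CONFIG):
        print(line)
```

In this example `pay_core` is flagged: one retry against a 500 ms dependency can take 1000 ms, while callers give up on `pay_core` after 800 ms.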

Online Load Testing – Origin

Exponential business growth required precise capacity assessment. QA needed a method that reflects the full‑link complexity and bridges the gap between offline and online environments, leading to the adoption of online full‑link load testing.

Online Load Testing Overall Plan

The workflow includes:

Scenario modeling – recreates realistic online operation conditions.

Base data construction – ensures correct data types and volumes, avoiding hotspots.

Traffic generation – builds read/write traffic or replays traffic, with tagging and desensitization.

Test execution – collects node‑level business status and resource usage.

Report generation – produces detailed analysis.
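The tagging and desensitization step in the workflow above can be sketched as follows. This is an assumption-laden illustration: the header name, the sensitive field names, the masking key, and the idea of routing tagged writes to a shadow table are all hypothetical and are not described in the source.

```python
import copy
import hashlib
import hmac

TEST_FLAG_HEADER = "X-Load-Test"          # hypothetical header name
SENSITIVE_FIELDS = ("card_no", "phone")   # hypothetical field names
MASK_KEY = b"replace-with-a-real-secret"  # placeholder key, for illustration only


def mask(value: str) -> str:
    """Replace a sensitive value with a stable keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(MASK_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def prepare_for_replay(request: dict) -> dict:
    """Copy a recorded request, mask sensitive fields, and tag it as test traffic."""
    replay = copy.deepcopy(request)  # never mutate the recorded original
    for field in SENSITIVE_FIELDS:
        if field in replay["body"]:
            replay["body"][field] = mask(str(replay["body"][field]))
    replay["headers"][TEST_FLAG_HEADER] = "1"
    return replay


def table_for(request: dict, table: str) -> str:
    """Route tagged traffic to a shadow table so test writes stay isolated (assumed design)."""
    is_test = request["headers"].get(TEST_FLAG_HEADER) == "1"
    return f"{table}_shadow" if is_test else table
```

The tag has to travel with the request across every service on the link; otherwise a downstream node cannot tell test writes from real ones.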

The approach supports single‑link, layered, and full‑link testing, and enables online fault drills to validate rate‑limit and circuit‑break protection.
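For readers unfamiliar with the circuit-break protection being validated here, a minimal single-threaded sketch follows. It is a generic textbook pattern rather than Meituan's implementation; the class name, thresholds, and injectable clock are assumptions chosen to keep the example testable.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures, then
    permits a trial call once a cool-down period has elapsed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without calling the dependency
            # Cool-down elapsed: fall through and make one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

During an online drill, the check is that once the dependency starts failing, calls stop reaching it (the fallback is served) and that traffic resumes after the dependency recovers.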

Online Load Testing Effects

Full‑link testing revealed capacity limits and high‑risk issues, including infrastructure imbalances, severe DB replication lag, thread‑pool misconfigurations, and overly low rate‑limit thresholds.
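To illustrate why an overly low rate-limit threshold counts as a high-risk finding, the sketch below replays evenly spaced requests against a token-bucket limiter and reports the share that is rejected. The token bucket is a common limiter design used here as an assumption; the source does not say which algorithm the team uses, and all numbers are made up.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills at rate_per_s, holds at most capacity tokens."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def rejected_fraction(limit_qps, offered_qps, seconds=10):
    """Replay evenly spaced requests on a simulated clock; return the rejected share."""
    now = [0.0]
    bucket = TokenBucket(limit_qps, capacity=limit_qps, clock=lambda: now[0])
    total = offered_qps * seconds
    rejected = 0
    for i in range(total):
        now[0] = i / offered_qps
        if not bucket.allow():
            rejected += 1
    return rejected / total


if __name__ == "__main__":
    print(rejected_fraction(limit_qps=100, offered_qps=50))
    print(rejected_fraction(limit_qps=50, offered_qps=100))
```

With the limit above the offered load nothing is rejected; with the limit at half the offered load, a little under half of the legitimate requests are turned away once the initial burst allowance is used up.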

Continuous Operation System – Origin

After a three‑month stability project for smart payment, the approach was extended to the entire financial services platform. A virtual project team continued the effort for another three months, which highlighted the need for a long‑term, systematic operation model. QA led the SRE, DBA, and R&D teams in building an initial continuous operation framework.

Continuous Operation System – Overall Plan

Three strategic pillars:

Process standardization & tooling – treat configuration changes like code releases (PR workflow), enforce coding standards via automated tools.

Quality metric visualization – extract stability‑related KPIs (e.g., DB slow‑query count, core service latency) and monitor them in real time to drive PDCA cycles.

Routine drills & load tests – automate the triggering of drills and verify SOP effectiveness across teams.

These pillars form a closed‑loop operation system: metric collection → analysis → solution → tool/knowledge consolidation.
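The collection-to-analysis part of that loop can be sketched as a threshold check over stability KPIs. The metric names echo the examples mentioned earlier (slow-query count, core service latency), but the exact names, threshold values, and data shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical thresholds; real values would come from a team's own baselines.
THRESHOLDS = [
    Threshold("db_slow_query_count", 10),
    Threshold("core_service_p99_latency_ms", 200),
]


def evaluate(samples: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one finding per metric that is missing or above its threshold."""
    findings = []
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            findings.append(f"{t.metric}: no data")
        elif value > t.max_value:
            findings.append(f"{t.metric}: {value} exceeds {t.max_value}")
    return findings


if __name__ == "__main__":
    today = {"db_slow_query_count": 25, "core_service_p99_latency_ms": 120}
    for finding in evaluate(today, THRESHOLDS):
        print(finding)
```

Treating a missing metric as a finding is a deliberate choice in this sketch: a dashboard that silently stops reporting is itself a stability risk.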

Continuous Operation System – Effects

The system now provides risk assessment dashboards, quality overviews, issue tracking, and best‑practice documentation.

Future Plans

Three directions: 1) Enhance test effectiveness by expanding fault sample libraries, improving drill tools, and refining load‑test schemes; 2) Continue platformization (operation platform, data platform); 3) Move toward intelligent operation—progressing from manual to automated to AI‑assisted processes.

Author Introduction

Junwei Xun, Senior Test Development Engineer at Meituan, leader of smart payment QA for the financial services platform, joined Meituan in 2015.

Join the Meituan testing technical community (WeChat ID: MTDPtech02) and reply with “测试” (“test”) to be added automatically.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
