Stability Testing Practices for Meituan Smart Payment: Fault Drills, Online Load Testing, and Continuous Operation
Meituan's smart-payment team faces growing system complexity and frequent third-party failures. Its stability-building program raises availability through flexible degradation and rapid recovery, backed by three core QA practices: fault drills, online full-link load testing, and a continuous operation system that standardizes processes, visualizes metrics, and automates resilience checks.
This article introduces the challenges faced by Meituan's smart payment business in the stability domain and highlights QA methods and practices for stability testing.
Background: Meituan Payment handles all of Meituan's transaction flow. It consists of online payment (supporting e-commerce, food delivery, travel, etc.) and smart payment (supporting in-store consumption via POS terminals, QR codes, and payment boxes). Smart payment has become one of the fastest-growing services.
Challenges: Rapid business growth increases system complexity—more entry points, richer payment channels, vertical service layering, horizontal service splitting, and growing dependencies on external systems (marketing, membership, risk control) and internal infrastructure (queues, caches). Over 20 core service nodes are involved. Team expansion from a few members to nearly a hundred adds instability. Historical incidents show 72% of severe failures stem from third‑party services or infrastructure (e.g., unstable payment channels, message queue failures), causing cascading outages.
Solution: A stability-building project aims to raise availability from 99.9% (three nines) to 99.99% (four nines) and eventually 99.999% (five nines). Two core strategies are adopted: flexible availability (ensure core functions stay usable under degradation) and rapid recovery (quick fault localization and resolution). Common operations include rate limiting, circuit breaking, scaling, SOPs, and automated fault handling. QA focuses on validating these operations through three “key swords”: fault drills, online load testing, and a continuous operation system.
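To make the availability targets concrete, the short sketch below converts each percentage into a yearly downtime budget. It is plain arithmetic, not something taken from the original project, and it assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the downtime allowed per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
# Roughly 525.6, 52.6 and 5.3 minutes per year respectively.
```

Each additional nine cuts the budget by a factor of ten, which is why the later targets depend on automated degradation and fast recovery rather than manual intervention.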
Fault Drills – Origin
A real incident occurred when a payment channel became unstable. The pre‑planned response (server disables the channel, client greys out the option) failed: the client still showed the channel as active. This highlighted the need to reproduce fault scenarios to verify response plans.
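The incident suggests one check that a drill can automate: after the server disables a channel, confirm that no client view still offers it. The sketch below is a minimal illustration under assumed data shapes; `ChannelView` and `find_inconsistent_channels` are hypothetical names, not part of Meituan's tooling.

```python
from dataclasses import dataclass


@dataclass
class ChannelView:
    """What the client renders for one payment channel (hypothetical shape)."""
    name: str
    selectable: bool  # False means the option is greyed out


def find_inconsistent_channels(server_disabled, client_views):
    """Return channels the server has disabled but the client still offers."""
    return [
        view.name
        for view in client_views
        if view.name in server_disabled and view.selectable
    ]


# Example: the server disabled "channel_a", but the client still shows it as active.
views = [ChannelView("channel_a", True), ChannelView("channel_b", True)]
print(find_inconsistent_channels({"channel_a"}, views))
```

An empty result means the response plan took effect end to end; any name in the list reproduces the kind of mismatch seen in the incident.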
Fault Drill Overall Plan
The plan consists of three modules:
Load generation module – reproduces real business flows, covering core processes.
Fault injection module – provides tools and a fault sample library covering external services, infrastructure, data centers, and the network, with a focus on timeout and exception cases.
Business verification module – combines automated test cases with monitoring dashboards.
Two‑stage execution: first, single‑system drills covering all protection plans; second, full‑link drills focusing on core service failures to validate end‑to‑end fault tolerance.
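The interplay of fault injection and business verification can be sketched as follows. This is a simplified illustration, not the team's actual tool: `FaultInjector`, `query_coupon`, and `pay_with_degradation` are hypothetical names, and the marketing lookup stands in for any non-core external dependency.

```python
import random
import time


class InjectedFault(Exception):
    """Raised when the injector simulates a dependency error."""


class FaultInjector:
    """Wraps a dependency call and injects delay or exceptions (hypothetical tool)."""

    def __init__(self, delay_s=0.0, error_rate=0.0, seed=None):
        self.delay_s = delay_s
        self.error_rate = error_rate
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        if self.delay_s > 0:
            time.sleep(self.delay_s)  # simulate a slow dependency
        if self._rng.random() < self.error_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)


def query_coupon(order_id):
    """Stand-in for a call to an external marketing service."""
    return {"order_id": order_id, "discount": 5}


def pay_with_degradation(order_id, injector):
    """Core payment flow: a marketing failure must not block the payment."""
    try:
        coupon = injector.call(query_coupon, order_id)
    except InjectedFault:
        coupon = {"order_id": order_id, "discount": 0}  # degrade: pay without discount
    return {"order_id": order_id, "paid": True, "discount": coupon["discount"]}


# Business verification: payment still succeeds while the dependency fails every time.
print(pay_with_degradation("order-1", FaultInjector(error_rate=1.0)))
```

In a real drill the injected delay would be paired with the caller's own timeout setting, so that the drill also shows whether that timeout is reasonable.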
Fault Drill Effects
Drills uncovered hidden issues such as database master‑slave replication lag affecting transactions, lack of degradation for infrastructure failures, and unreasonable timeout or rate‑limit settings for dependent services.
Online Load Testing – Origin
Exponential business growth required precise capacity assessment. QA needed a method that reflects the full‑link complexity and bridges the gap between offline and online environments, leading to the adoption of online full‑link load testing.
Online Load Testing Overall Plan
The workflow includes:
Scenario modeling – recreates realistic online operation conditions.
Base data construction – ensures correct data types and volumes, avoiding hotspots.
Traffic generation – builds read/write traffic or replays traffic, with tagging and desensitization.
Test execution – collects node‑level business status and resource usage.
Report generation – produces detailed analysis.
The approach supports single‑link, layered, and full‑link testing, and enables online fault drills to validate rate‑limit and circuit‑break protection.
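The tagging and desensitization step in traffic generation might look like the sketch below. The header name `X-Load-Test`, the field names, and the masking rule are all assumptions made for illustration; the article does not describe the actual mechanism.

```python
import copy

SHADOW_HEADER = "X-Load-Test"  # hypothetical tag that marks load-test traffic
SENSITIVE_FIELDS = ("card_no", "phone")  # hypothetical field names


def mask(value: str, keep_last: int = 4) -> str:
    """Mask all but the last `keep_last` characters of a value."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def prepare_replay_request(recorded: dict) -> dict:
    """Copy a recorded request, tag it as test traffic, and mask sensitive fields."""
    request = copy.deepcopy(recorded)  # never mutate the recorded sample
    request.setdefault("headers", {})[SHADOW_HEADER] = "1"
    body = request.get("body", {})
    for field in SENSITIVE_FIELDS:
        if field in body:
            body[field] = mask(str(body[field]))
    return request


recorded = {"headers": {}, "body": {"card_no": "6222021234567890", "amount": 100}}
print(prepare_replay_request(recorded))
```

Downstream services can then use the tag to route writes away from production data, which is what makes it safe to run load tests against the online environment.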
Online Load Testing Effects
Full‑link testing revealed capacity limits and high‑risk issues, including infrastructure imbalances, severe DB replication lag, thread‑pool misconfigurations, and overly low rate‑limit thresholds.
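To illustrate why an overly low rate-limit threshold matters, the sketch below simulates a token-bucket limiter under a steady offered load. A token bucket is one common rate-limiting technique, used here as an assumption; the article does not say which algorithm the team uses. All names and numbers are illustrative.

```python
class TokenBucket:
    """Minimal token bucket driven by an explicit clock, for simulation only."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def rejected_ratio(limit_qps: float, offered_qps: int, seconds: int = 10) -> float:
    """Fraction of evenly spaced requests rejected at the given limit."""
    bucket = TokenBucket(rate_per_s=limit_qps, capacity=limit_qps)
    total = offered_qps * seconds
    rejected = sum(1 for i in range(total) if not bucket.allow(i / offered_qps))
    return rejected / total


# A limit below the real traffic level rejects requests the service could have served.
print(rejected_ratio(limit_qps=100, offered_qps=200))
print(rejected_ratio(limit_qps=200, offered_qps=100))
```

If load testing shows the service can sustain more than the configured limit, the threshold is discarding healthy traffic and should be raised toward the measured capacity.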
Continuous Operation System – Origin
After a three‑month stability project for smart payment, the approach was extended to the entire financial services platform. A virtual project team continued the effort for another three months, highlighting the need for a long‑term, systematic operation model. QA led SRE, DBA, and RD to build an initial continuous operation framework.
Continuous Operation System – Overall Plan
Three strategic pillars:
Process standardization & tooling – treat configuration changes like code releases (PR workflow), enforce coding standards via automated tools.
Quality metric visualization – extract stability‑related KPIs (e.g., DB slow‑query count, core service latency) and monitor them in real time to drive PDCA cycles.
Routine drills & load tests – automate the triggering of drills and verify SOP effectiveness across teams.
These pillars form a closed‑loop operation system: metric collection → analysis → solution → tool/knowledge consolidation.
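The metric collection and analysis steps of the loop can be sketched as a simple threshold check. The KPI names, thresholds, and the nearest-rank percentile are assumptions chosen for illustration, not the team's actual metrics or tooling.

```python
import math
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    value: float
    threshold: float  # hypothetical alert threshold


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def find_breaches(kpis):
    """Return the names of KPIs whose value exceeds the threshold."""
    return [kpi.name for kpi in kpis if kpi.value > kpi.threshold]


latencies_ms = list(range(1, 101))  # stand-in for collected latency samples
kpis = [
    Kpi("db_slow_query_count", value=12, threshold=10),
    Kpi("core_service_p99_latency_ms", value=percentile(latencies_ms, 99), threshold=200),
]
print(find_breaches(kpis))
```

Each breach would feed the analysis and solution steps, and the resulting fix would be consolidated into a tool or a documented practice.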
Continuous Operation System – Effects
The system now provides risk assessment dashboards, quality overviews, issue tracking, and best‑practice documentation.
Future Plans
Three directions: 1) Enhance test effectiveness by expanding fault sample libraries, improving drill tools, and refining load‑test schemes; 2) Continue platformization (operation platform, data platform); 3) Move toward intelligent operation—progressing from manual to automated to AI‑assisted processes.
Author Introduction
Junwei Xun, Senior Test Development Engineer at Meituan, leader of smart payment QA for the financial services platform, joined Meituan in 2015.
