Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes
This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.
WeChat Pay, a critical national information infrastructure serving millions of merchants and billions of users, requires availability higher than five nines. To enhance reliability, the team explored chaos engineering, with a focus on minimizing the blast radius of fault injection so that experiments have negligible impact on production.
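To make the "five nines" target concrete, the arithmetic below converts an availability percentage into an annual downtime budget; the article only states the target is higher than five nines, so the exact SLO is not known.

```python
# Quick arithmetic: how much downtime per year an availability target allows.
# Illustrative only; WeChat Pay's exact SLO is not stated beyond "higher than
# five nines".
MINUTES_PER_YEAR = 365.25 * 24 * 60

def max_downtime_minutes(availability: float) -> float:
    """Maximum allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability)

print(f"{max_downtime_minutes(0.99999):.2f}")  # five nines -> about 5.26 min/year
```

At that budget, roughly 5 minutes of unavailability per year, even a single mishandled chaos experiment in production could consume the entire allowance, which is why radius control dominates the rest of the article.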
The practice began by analyzing historical failures from 2018 to 2021, identifying hardware and software issues as major contributors. Traditional drills and chaos engineering were compared, highlighting that chaos engineering aims to discover unknown risks by injecting faults in production-like environments.
Key challenges included balancing fault realism with business safety, containing the blast radius, and injecting faults realistically without disrupting live services. A multi‑partition architecture, introduced in 2021, allows isolated chaos partitions that mirror the production environment while confining fault impact.
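The core idea of a chaos partition can be sketched as partition-aware routing: only traffic explicitly tagged for chaos ever reaches the faulted partition, so injected failures cannot leak into normal production shards. The setup below is a hypothetical illustration; the article does not describe WeChat Pay's actual routing rules or partition names.

```python
# Minimal sketch of partition-aware routing for blast-radius containment.
# All partition names and the opt-in mechanism are assumptions, not WeChat
# Pay's actual design.
from dataclasses import dataclass

CHAOS_PARTITION = "chaos-1"
PRODUCTION_PARTITIONS = ["prod-1", "prod-2", "prod-3"]

@dataclass
class Request:
    user_id: str
    chaos_opt_in: bool = False  # e.g. internal test accounts only

def route(req: Request) -> str:
    """Opted-in requests go to the chaos partition; everything else is
    sharded across normal production partitions and never sees a fault."""
    if req.chaos_opt_in:
        return CHAOS_PARTITION
    return PRODUCTION_PARTITIONS[hash(req.user_id) % len(PRODUCTION_PARTITIONS)]

print(route(Request("tester-001", chaos_opt_in=True)))  # chaos-1
```

Because the chaos partition runs the same binaries and configuration as production, faults observed there are realistic, yet the routing layer guarantees ordinary users are never exposed to them.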
The team evaluated various fault injection strategies, from testing environments to fully independent online deployments, and concluded that independent chaos partitions provide the best trade‑off between safety and effectiveness.
Risk mitigation involved strict approval processes for destructive fault types, especially those that could corrupt data, and careful handling of external dependencies, for example by rerouting traffic away from experiments and discarding or quarantining data touched by injected faults.
Automation was pursued across experiment design, execution, and analysis. Experiments are defined by business assets, system resources, fault types, and fault severity, with templates generated based on high‑availability principles. The workflow includes automatic fault injection, steady‑state detection, and alert correlation to reduce manual effort.
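The automated workflow described above can be sketched as a simple loop: inject the fault, monitor a steady-state metric while the experiment runs, abort early if the steady-state hypothesis is violated, and always recover the fault. All names here (`inject`, `recover`, `fetch_success_rate`) and the threshold are hypothetical stand-ins, not WeChat Pay's actual APIs or SLOs.

```python
# Minimal sketch of an automated chaos experiment with steady-state detection.
# Function names and the success-rate threshold are illustrative assumptions.
import time

STEADY_STATE_MIN_SUCCESS_RATE = 0.999  # assumed threshold for illustration

def fetch_success_rate() -> float:
    """Placeholder for a real metrics query (e.g. payment success rate)."""
    return 0.9995

def run_experiment(inject, recover, duration_s: int = 60, interval_s: int = 5) -> bool:
    """Returns True if the steady state held for the whole experiment."""
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if fetch_success_rate() < STEADY_STATE_MIN_SUCCESS_RATE:
                return False  # steady-state hypothesis violated; abort early
            time.sleep(interval_s)
        return True
    finally:
        recover()  # always remove the fault, even on an early abort
```

In the article's workflow, the steady-state check is paired with alert correlation, so a violation is automatically matched against fired alerts instead of being triaged by hand.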
Tooling evolved from manual fault injection to batch execution using YAML orchestration, and finally to scheduled, automated experiments integrated with automatic analysis. The platform now supports over 30 fault atoms, drag‑and‑drop experiment composition, and real‑time alert integration.
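A batch orchestration file for such a platform might look like the sketch below; the schema and field names are invented for illustration, since the article does not show WeChat Pay's actual YAML format.

```yaml
# Hypothetical experiment orchestration file; all field names are
# illustrative, not WeChat Pay's actual schema.
experiment:
  name: payment-core-disk-latency
  target:
    service: payment-core
    partition: chaos-1          # confine the blast radius to the chaos partition
  faults:
    - atom: disk-io-latency     # one of the platform's 30+ fault atoms
      severity: high
      duration: 5m
  steady_state:
    metric: payment_success_rate
    threshold: ">= 99.9%"
  on_violation: auto-rollback   # stop the experiment and remove the fault
```

Declaring experiments this way is what enables the later stages the article describes: files can be scheduled, executed in batches, and fed into automatic analysis without manual operation.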
Results include zero production incidents since 2021, over 60 experiment plans, 500+ experiment runs covering core services, components, and frameworks, and the discovery and remediation of numerous high‑risk issues.
Future work aims to expand fault atom coverage, improve automated steady‑state detection with advanced AI, and support a broader range of blast radii for non‑core services while maintaining safety.
FunTester