Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes
This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.
WeChat Pay, a critical national information infrastructure serving millions of merchants and billions of users, requires availability higher than five nines. To enhance reliability, the team explored chaos engineering, with a focus on minimizing the blast radius of fault injection so that experiments have negligible impact on production.
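To make the "five nines" target concrete, the arithmetic below converts an availability percentage into an annual downtime budget; the article only states the target is higher than five nines, so the exact SLO is not known.

```python
# Quick arithmetic: how much downtime per year an availability target allows.
# Illustrative only; WeChat Pay's exact SLO is not stated beyond "higher than
# five nines".
MINUTES_PER_YEAR = 365.25 * 24 * 60

def max_downtime_minutes(availability: float) -> float:
    """Maximum allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability)

print(f"{max_downtime_minutes(0.99999):.2f}")  # five nines -> about 5.26 min/year
```

At that budget, roughly 5 minutes of unavailability per year, even a single mishandled chaos experiment in production could consume the entire allowance, which is why radius control dominates the rest of the article.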
The practice began by analyzing historical failures from 2018 to 2021, identifying hardware and software issues as major contributors. Traditional drills and chaos engineering were compared, highlighting that chaos engineering aims to discover unknown risks by injecting faults in production-like environments.
Key challenges included balancing fault realism with business safety, containing the blast radius, and injecting faults realistically without disrupting live services. A multi‑partition architecture, introduced in 2021, allows isolated chaos partitions that mirror the production environment while confining fault impact.
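The core idea of a chaos partition can be sketched as partition-aware routing: only traffic explicitly tagged for chaos ever reaches the faulted partition, so injected failures cannot leak into normal production shards. The setup below is a hypothetical illustration; the article does not describe WeChat Pay's actual routing rules or partition names.

```python
# Minimal sketch of partition-aware routing for blast-radius containment.
# All partition names and the opt-in mechanism are assumptions, not WeChat
# Pay's actual design.
from dataclasses import dataclass

CHAOS_PARTITION = "chaos-1"
PRODUCTION_PARTITIONS = ["prod-1", "prod-2", "prod-3"]

@dataclass
class Request:
    user_id: str
    chaos_opt_in: bool = False  # e.g. internal test accounts only

def route(req: Request) -> str:
    """Opted-in requests go to the chaos partition; everything else is
    sharded across normal production partitions and never sees a fault."""
    if req.chaos_opt_in:
        return CHAOS_PARTITION
    return PRODUCTION_PARTITIONS[hash(req.user_id) % len(PRODUCTION_PARTITIONS)]

print(route(Request("tester-001", chaos_opt_in=True)))  # chaos-1
```

Because the chaos partition runs the same binaries and configuration as production, faults observed there are realistic, yet the routing layer guarantees ordinary users are never exposed to them.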
The team evaluated various fault injection strategies, from testing environments to fully independent online deployments, and concluded that independent chaos partitions provide the best trade‑off between safety and effectiveness.
Risk mitigation involved strict approval processes for destructive fault types, especially those that could corrupt data, and careful handling of external dependencies, for example by rerouting traffic away from experiments and discarding or quarantining data touched by injected faults.
Automation was pursued across experiment design, execution, and analysis. Experiments are defined by business assets, system resources, fault types, and fault severity, with templates generated based on high‑availability principles. The workflow includes automatic fault injection, steady‑state detection, and alert correlation to reduce manual effort.
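The automated workflow described above can be sketched as a simple loop: inject the fault, monitor a steady-state metric while the experiment runs, abort early if the steady-state hypothesis is violated, and always recover the fault. All names here (`inject`, `recover`, `fetch_success_rate`) and the threshold are hypothetical stand-ins, not WeChat Pay's actual APIs or SLOs.

```python
# Minimal sketch of an automated chaos experiment with steady-state detection.
# Function names and the success-rate threshold are illustrative assumptions.
import time

STEADY_STATE_MIN_SUCCESS_RATE = 0.999  # assumed threshold for illustration

def fetch_success_rate() -> float:
    """Placeholder for a real metrics query (e.g. payment success rate)."""
    return 0.9995

def run_experiment(inject, recover, duration_s: int = 60, interval_s: int = 5) -> bool:
    """Returns True if the steady state held for the whole experiment."""
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if fetch_success_rate() < STEADY_STATE_MIN_SUCCESS_RATE:
                return False  # steady-state hypothesis violated; abort early
            time.sleep(interval_s)
        return True
    finally:
        recover()  # always remove the fault, even on an early abort
```

In the article's workflow, the steady-state check is paired with alert correlation, so a violation is automatically matched against fired alerts instead of being triaged by hand.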
Tooling evolved from manual fault injection to batch execution using YAML orchestration, and finally to scheduled, automated experiments integrated with automatic analysis. The platform now supports over 30 fault atoms, drag‑and‑drop experiment composition, and real‑time alert integration.
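A batch orchestration file for such a platform might look like the sketch below; the schema and field names are invented for illustration, since the article does not show WeChat Pay's actual YAML format.

```yaml
# Hypothetical experiment orchestration file; all field names are
# illustrative, not WeChat Pay's actual schema.
experiment:
  name: payment-core-disk-latency
  target:
    service: payment-core
    partition: chaos-1          # confine the blast radius to the chaos partition
  faults:
    - atom: disk-io-latency     # one of the platform's 30+ fault atoms
      severity: high
      duration: 5m
  steady_state:
    metric: payment_success_rate
    threshold: ">= 99.9%"
  on_violation: auto-rollback   # stop the experiment and remove the fault
```

Declaring experiments this way is what enables the later stages the article describes: files can be scheduled, executed in batches, and fed into automatic analysis without manual operation.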
Results include zero production incidents since 2021, over 60 experiment plans, 500+ experiment runs covering core services, components, and frameworks, and the discovery and remediation of numerous high‑risk issues.
Future work aims to expand fault atom coverage, improve automated steady‑state detection with advanced AI, and support a broader range of blast radii for non‑core services while maintaining safety.
FunTester