
2021 China Chaos Engineering Survey Report: Findings and Recommendations

Based on 1,016 valid questionnaire responses and 17 enterprise interviews, the 2021 China Chaos Engineering Survey Report finds that software system stability is low, that adoption of chaos engineering is limited, and that chaos engineering improves availability where it is used. It offers data‑driven recommendations for improving stability through mature tools, metrics, and cultural shifts.

The report, commissioned by the China Academy of Information and Communications Technology (CAICT) and the Chaos Engineering Lab, presents the first nationwide survey on chaos engineering in China. It draws on 1,016 valid questionnaire responses and 17 in‑depth enterprise interviews.

Key Findings

Software system stability in Chinese enterprises has significant room for improvement: nearly 20% of products have availability below 99% (2‑nines) and over 40% fall below 99.9% (3‑nines). Mean time to detect (MTTD) is under 1 hour in fewer than half of cases, and mean time to repair (MTTR) exceeds 1 hour in more than 60% of cases.

Chaos engineering adoption is still in its early stage: more than 30% of companies have never used it, only 8.68% apply it at large scale, and most experiments are confined to development or testing environments rather than production.

When used, chaos engineering markedly improves availability: 65% of respondents say it boosts service availability, and 49.85% report reduced MTTR.

Primary obstacles are lack of experience (46.32%) and concerns about production‑environment risk (45.29%).
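To put the 2‑nines and 3‑nines thresholds from the first finding in concrete terms, the short sketch below converts an availability percentage into the maximum downtime it allows per year. The function name and the 365‑day year are assumptions made for illustration; the report itself does not give this calculation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the maximum yearly downtime (in hours) that an
    availability target expressed as a percentage still permits."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# The two thresholds quoted in the survey findings.
for target in (99.0, 99.9):
    print(f"{target}% availability -> {annual_downtime_hours(target):.2f} hours/year")
```

Under these assumptions, 99% availability allows about 87.6 hours of downtime per year, while 99.9% allows about 8.76 hours.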

Usage Patterns

Fault injection focuses on basic resource failures (network, compute); application‑level and container‑level faults receive less attention.

Experiment targets are mainly hosts/virtual machines rather than services.

Recommendations

Introduce mature, trustworthy chaos‑engineering products or consulting services to lower adoption barriers.

Establish a measurable stability assurance system, including monitoring, resilience metrics, and fault‑grading mechanisms.

Promote a “Stability First” culture and leverage cloud‑native, AI, and big‑data technologies to enhance system resilience.
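As one possible starting point for the measurable stability system in the second recommendation, the sketch below computes MTTD and MTTR from incident records. The `Incident` type, its field names, and the sample timestamps are hypothetical. It also assumes MTTR is measured from detection to resolution; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd_hours(incidents: list[Incident]) -> float:
    """Mean time to detect: average of (detected - started), in hours."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 3600


def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time to repair: average of (resolved - detected), in hours.
    Assumption: repair time is counted from detection."""
    return mean((i.resolved - i.detected).total_seconds() for i in incidents) / 3600


# Made-up sample data for illustration only.
t0 = datetime(2021, 1, 1, 0, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(hours=2)),
    Incident(t0, t0 + timedelta(minutes=90), t0 + timedelta(hours=3)),
]

print(f"MTTD: {mttd_hours(incidents):.2f} h, MTTR: {mttr_hours(incidents):.2f} h")
```

Tracking these two numbers before and after a round of chaos experiments is one simple way to check whether the experiments are improving detection and recovery.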

The report provides detailed charts on adoption rates, fault‑type distribution, availability metrics, and the correlation between chaos‑engineering frequency and product reliability, offering a data‑backed roadmap for enterprises seeking to strengthen system stability.

Written by DevOps
