
Xiaomi's Practice of Chaos Engineering and Fault Injection Platform

This article details Xiaomi's implementation of chaos engineering, describing the principles, platform construction using ChaosBlade, a comprehensive fault‑injection workflow, case study results, operational insights, and future plans to enhance system reliability and observability.

DevOps

Background: As internet services grow in scale, reliability and user experience become critical, and services are expected to remain continuously available.

Chaos Engineering Introduction: Chaos engineering, pioneered by Netflix, involves deliberately injecting faults into production systems to discover weaknesses before they cause outages. Unlike traditional fault testing, which checks a known failure against an expected result, it explores many scenarios around a defined steady state.

Principles: The article outlines five principles: build a hypothesis around steady‑state behavior, vary real‑world events, run experiments in production, automate experiments to run continuously, and minimize the blast radius.

Fault‑Injection Platform Construction: Xiaomi built a platform based on the open‑source ChaosBlade tool, providing automated, visual fault injection for system‑level and network‑level scenarios such as CPU, memory, latency, and packet loss. The platform defines goals, functions, and a modular architecture.
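As a rough illustration of the four scenarios named above, the sketch below builds ChaosBlade command lines without running them. It is not Xiaomi's platform code. The `Fault` class, the `SCENARIOS` table, and all parameter values are hypothetical. The subcommands and flags follow the ChaosBlade CLI documentation as commonly published, and should be checked against the installed version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """One fault scenario: the arguments that follow `blade create`."""
    target: str   # e.g. "cpu", "mem", "network"
    action: str   # e.g. "load", "delay", "loss"
    flags: dict   # flag name -> value (values here are illustrative only)

    def create_argv(self, timeout_s: int) -> list[str]:
        """Build the argv for `blade create ...` without executing it."""
        argv = ["blade", "create", self.target, self.action]
        for name, value in self.flags.items():
            argv += [f"--{name}", str(value)]
        # --timeout asks ChaosBlade to end the experiment after N seconds.
        argv += ["--timeout", str(timeout_s)]
        return argv


def destroy_argv(uid: str) -> list[str]:
    """Build the argv that reverts an experiment by the uid `create` returned."""
    return ["blade", "destroy", uid]


# Hypothetical scenario table; percentages, delay, and interface are assumptions.
SCENARIOS = {
    "cpu": Fault("cpu", "load", {"cpu-percent": 80}),
    "memory": Fault("mem", "load", {"mode": "ram", "mem-percent": 70}),
    "latency": Fault("network", "delay", {"time": 3000, "interface": "eth0"}),
    "packet_loss": Fault("network", "loss", {"percent": 50, "interface": "eth0"}),
}
```

Building argument lists rather than shell strings keeps the commands easy to audit, and they can be passed directly to `subprocess.run` on a host where `blade` is installed.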

Practice: A case study on a multi‑node task‑scheduling service (Business A) demonstrates the workflow: define steady‑state metrics, create hypotheses, replicate traffic, inject random faults, observe impacts, and validate or refute hypotheses.
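The workflow above can be sketched as a small experiment loop. This is a minimal sketch under stated assumptions, not the procedure used for Business A. The `SteadyState` class, the `inject`, `recover`, and `observe` callables, and every metric name and threshold are hypothetical. Traffic replication is assumed to happen outside this code.

```python
import random
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Hypothesis: `metric` stays within a relative drop of its baseline."""
    metric: str
    baseline: float
    max_relative_drop: float  # e.g. 0.05 tolerates a 5% drop

    def holds(self, observed: float) -> bool:
        floor = self.baseline * (1 - self.max_relative_drop)
        return observed >= floor


def run_experiment(steady_states, faults, inject, recover, observe, rng=None):
    """Inject one randomly chosen fault, observe, and judge each hypothesis.

    inject/recover/observe are caller-supplied callables (hypothetical here).
    Returns the chosen fault and a {metric: hypothesis_held} mapping.
    """
    rng = rng or random.Random()
    fault = rng.choice(faults)
    inject(fault)
    try:
        observed = observe()  # dict of metric name -> value under fault
    finally:
        recover(fault)        # always revert, even if observation fails
    verdicts = {s.metric: s.holds(observed[s.metric]) for s in steady_states}
    return fault, verdicts
```

A hypothesis that comes back `False` is refuted: the service did not hold its steady state under that fault, which is the signal to investigate.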

Observations: Experiments revealed CPS drops, increased task failure rates, latency spikes, and process crashes, highlighting monitoring gaps.

Q&A: Access control, safety mechanisms, and automatic termination of injections are discussed.
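One common way to get automatic termination is to tie the injection's lifetime to a guard that always reverts it, and to abort early when a watched metric falls below a floor. The sketch below illustrates that idea only. The summary does not describe how Xiaomi's platform implements it, and `guarded_injection`, `watch`, and their parameters are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def guarded_injection(inject, destroy):
    """Start a fault and guarantee it is reverted when the block exits.

    inject() must return an experiment id; destroy(uid) must revert it.
    Both are caller-supplied (hypothetical) callables.
    """
    uid = inject()
    try:
        yield uid
    finally:
        destroy(uid)  # runs on normal exit, abort, or exception


def watch(read_metric, floor, max_seconds, poll_seconds=1.0,
          clock=time.monotonic, sleep=time.sleep):
    """Poll a metric until the time budget ends or it drops below `floor`."""
    deadline = clock() + max_seconds
    while clock() < deadline:
        if read_metric() < floor:
            return "aborted"    # safety threshold breached, stop early
        sleep(poll_seconds)
    return "completed"          # ran for the full budget without a breach
```

Used together, `with guarded_injection(...) as uid: status = watch(...)` ends the fault whether the watch completes, aborts, or raises.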

Future Plans: The roadmap includes expanding fault scenarios across IaaS/PaaS/SaaS layers, integrating SLO‑driven degradation, and automated topology discovery.

Tags: observability, Chaos Engineering, SRE, reliability, Fault Injection
Written by

DevOps
