
Chaos Engineering: Principles, Practices, and Lessons from Ctrip

The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.

Ctrip Technology

Nov 14, 2019


During the 2019 Ctrip Technology Summit in Shanghai, operations director Fang Ju shared a candid summary of how Chaos Engineering is applied at Ctrip.

What is Chaos Engineering? Chaos Engineering is the practice of running experiments on distributed systems to build confidence in their ability to withstand turbulent conditions in production, revealing unknown weaknesses before they cause major incidents.

Its most important tenet is "fail continuously to avoid failure": because failures are inevitable and often unpredictable, it is better to trigger them deliberately and learn from them than to meet them unprepared.

Why adopt Chaos Engineering? As businesses grow and architectures evolve, maintaining stable user experiences requires proactively exposing risks. Instead of reacting after large outages, controlled experiments act like a vaccine, exposing vulnerabilities early so they can be mitigated.

Chaos Engineering also trains development teams, improving both technical skills and on‑the‑spot decision‑making.

The Five Principles of Chaos Engineering

Assume a steady state – Define the metrics that describe normal system behavior before injecting faults such as server crashes or network partitions, so that any deviation can be measured against that baseline.

Run experiments in production – Production environments provide realistic conditions that test environments cannot fully replicate.

Automate and run continuously – Manual experiments are costly; automation enables sustainable testing.

Minimize blast radius – Carefully assess risk and design experiments to limit impact, with mechanisms to abort when thresholds are exceeded.

Vary fault scenarios – Analyze historical incidents and abstract them into reusable fault injection templates.
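The last two principles can be sketched in a few lines of Python. This is a minimal illustration, not Ctrip's tooling: the FaultTemplate class, its fields, and the example values are all hypothetical. It shows a fault template that carries its own blast-radius limit and abort threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultTemplate:
    """Hypothetical schema for a reusable fault, abstracted from a past incident."""
    name: str
    kind: str                # e.g. "latency", "crash", "partition"
    target_fraction: float   # share of instances to affect (the blast radius)
    abort_error_rate: float  # abort the experiment above this error rate

    def __post_init__(self) -> None:
        if not 0 < self.target_fraction <= 1:
            raise ValueError("target_fraction must be in (0, 1]")

    def pick_targets(self, instances: List[str]) -> List[str]:
        """Select a bounded subset of instances, always at least one."""
        count = max(1, int(len(instances) * self.target_fraction))
        return instances[:count]

    def must_abort(self, observed_error_rate: float) -> bool:
        """True when the observed error rate exceeds the abort threshold."""
        return observed_error_rate > self.abort_error_rate
```

A template such as FaultTemplate("review-latency", "latency", 0.25, 0.05) would touch a quarter of the instances and call for an abort once more than 5% of requests fail. Keeping the limits inside the template means every reuse of the scenario inherits the same safety bounds.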

Fang illustrated these principles with a real Ctrip case: a chaos experiment on the product‑detail service that injected latency into the review service it depends on. The steady state was defined as 1,000 QPS with a 300 ms response time, and the experiment then observed how the product‑detail service behaved while the review service was throttled.
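To make the reasoning behind such an experiment concrete, here is a toy model in Python. Only the 300 ms steady-state response time comes from the case above. The function names, the latency figures, and the idea of a timeout on the review call are assumptions for illustration, since the talk summary does not say how the product‑detail service calls its dependency.

```python
STEADY_STATE_RT_MS = 300  # steady-state response time from the Ctrip case


def product_detail_latency_ms(base_ms, review_ms, review_timeout_ms=None):
    """Modelled response time of the product-detail service.

    base_ms: time spent on the service's own work (hypothetical figure).
    review_ms: time the review service takes to answer, including injected delay.
    review_timeout_ms: optional cap on how long the caller waits for reviews;
        when the cap is hit, the page is assumed to render without reviews.
    """
    if review_timeout_ms is None:
        waited = review_ms
    else:
        waited = min(review_ms, review_timeout_ms)
    return base_ms + waited


def within_steady_state(rt_ms):
    """True when the response time stays inside the steady-state bound."""
    return rt_ms <= STEADY_STATE_RT_MS
```

With an assumed 120 ms of own work and an 80 ms review call, the model gives 200 ms, inside the bound. Adding 500 ms of injected delay pushes it to 700 ms unless a timeout caps the wait: with a 150 ms timeout the result is 270 ms. An experiment like this one is designed to reveal whether that kind of protection actually exists.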

Implementation involves planning, defining steady state, assessing risk, executing the fault, monitoring the transition from steady to new state, and finally evaluating results and recording fixes.
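The steps above can be sketched as a small experiment runner. This is a hedged outline under stated assumptions, not the tool Ctrip uses: the probe, inject, and rollback callables, the Metrics and Report types, and the verdict strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Metrics:
    qps: float
    rt_ms: float


@dataclass
class Report:
    verdict: str
    samples: List[Metrics] = field(default_factory=list)


def is_steady(m: Metrics, min_qps: float, max_rt_ms: float) -> bool:
    """Steady state: throughput holds and response time stays within bounds."""
    return m.qps >= min_qps and m.rt_ms <= max_rt_ms


def run_experiment(probe: Callable[[], Metrics],
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   min_qps: float, max_rt_ms: float,
                   abort_rt_ms: float, rounds: int = 5) -> Report:
    report = Report(verdict="hypothesis held")
    # Risk check: never inject a fault into a system that is already unhealthy.
    if not is_steady(probe(), min_qps, max_rt_ms):
        report.verdict = "skipped: not steady before injection"
        return report
    inject()  # execute the fault
    try:
        for _ in range(rounds):  # monitor the move from steady to new state
            m = probe()
            report.samples.append(m)
            if m.rt_ms > abort_rt_ms:
                report.verdict = "aborted: threshold exceeded"
                break
            if not is_steady(m, min_qps, max_rt_ms):
                report.verdict = "weakness found"
    finally:
        rollback()  # always remove the fault, even on abort or error
    return report  # evaluate the result and record fixes from here
```

The rollback sits in a finally clause so the fault is removed whether the run completes, aborts, or raises. The returned Report keeps every sample, which gives the evaluation step the data it needs.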

Future plans at Ctrip include expanding tool coverage to more fault types, increasing automation for both production and test environments, and fostering a "design for failure" culture where teams regularly challenge system resilience.

In summary, Chaos Engineering is not about creating chaos for its own sake but about exposing hidden risks within a controlled, bounded environment to strengthen system robustness.

System: "What cannot destroy me makes me stronger." Have you tried chaos experiments in your team?
