Chaos Engineering: Principles, Practices, and Lessons from Ctrip
The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.
During the 2019 Ctrip Technology Summit in Shanghai, operations director Fang Ju shared a candid summary of how Chaos Engineering is applied at Ctrip.
What is Chaos Engineering?
Chaos Engineering is the practice of running experiments on distributed systems to build confidence in their ability to withstand uncontrolled conditions in production, revealing unknown weaknesses before they cause major incidents.
The most important tenet is to "fail continuously to avoid failure" because failures are inevitable and often unpredictable.
Why adopt Chaos Engineering?
As businesses grow and architectures evolve, maintaining stable user experiences requires proactively exposing risks. Instead of reacting after large outages, controlled experiments act like a vaccine, exposing vulnerabilities early so they can be mitigated.
Chaos Engineering also trains development teams, improving both technical skills and on‑the‑spot decision‑making.
The Five Principles of Chaos Engineering
Assume a steady state – Define the metrics that describe normal system behavior before injecting faults such as server crashes or network partitions, so that any deviation can be measured against a baseline.
Run experiments in production – Production environments provide realistic conditions that test environments cannot fully replicate.
Automate and run continuously – Manual experiments are costly; automation enables sustainable testing.
Minimize blast radius – Carefully assess risk and design experiments to limit impact, with mechanisms to abort when thresholds are exceeded.
Diverse fault scenarios – Analyze historical incidents and abstract them into reusable fault injection templates.
Fang illustrated these principles with a real Ctrip case: a chaos experiment on the product‑detail service that injected latency into the review service it depends on. The team defined the steady state as 1,000 QPS with a 300 ms response time, then observed how the system behaved while the review service was throttled.
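As a rough illustration of what "defining a steady state" can look like in code, here is a minimal Python sketch. The baseline figures (1000 QPS, 300 ms) come from the example above; the class name, the `tolerance` parameter, and its 10% default are assumptions made for illustration and are not taken from Ctrip's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Baseline metrics that describe 'normal' for the service under test."""

    qps: float               # expected requests per second
    response_time_ms: float  # expected response time in milliseconds
    tolerance: float = 0.10  # assumption: allowed relative deviation (10%)

    def holds(self, observed_qps: float, observed_rt_ms: float) -> bool:
        """Return True if the observed metrics stay within tolerance."""
        # Throughput may drop by at most `tolerance` below the baseline.
        qps_ok = observed_qps >= self.qps * (1 - self.tolerance)
        # Response time may rise by at most `tolerance` above the baseline.
        rt_ok = observed_rt_ms <= self.response_time_ms * (1 + self.tolerance)
        return qps_ok and rt_ok


# Baseline from the product-detail example: 1000 QPS, 300 ms response time.
PRODUCT_DETAIL = SteadyState(qps=1000, response_time_ms=300)

if __name__ == "__main__":
    # Hypothetical observations taken while the review service is slowed down.
    print(PRODUCT_DETAIL.holds(observed_qps=980, observed_rt_ms=310))
    print(PRODUCT_DETAIL.holds(observed_qps=1000, observed_rt_ms=450))
```

With a check like this, the hypothesis of an experiment becomes a yes/no question: does the steady state still hold while the fault is active?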
Implementation involves planning, defining steady state, assessing risk, executing the fault, monitoring the transition from steady to new state, and finally evaluating results and recording fixes.
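The execute, monitor, and evaluate steps above can be sketched as a small loop that also respects the "minimize blast radius" principle by aborting once a threshold is crossed. This is a simplified sketch under stated assumptions: `run_experiment`, the `inject_fault` and `remove_fault` callbacks, and the abort threshold are hypothetical names and values, and the metric samples are passed in as plain data rather than read from a real monitoring system.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Tuple[float, float]  # (qps, response_time_ms)


def run_experiment(
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    samples: Iterable[Sample],
    abort_rt_ms: float,
) -> Dict[str, object]:
    """Inject a fault, watch the metrics, and abort if the threshold is hit."""
    observed: List[Sample] = []
    aborted = False
    inject_fault()  # execute the fault
    try:
        for qps, rt_ms in samples:  # monitor the move away from steady state
            observed.append((qps, rt_ms))
            if rt_ms > abort_rt_ms:
                # Threshold exceeded: stop early to limit the blast radius.
                aborted = True
                break
    finally:
        remove_fault()  # always roll the fault back, even on errors
    # Returned data feeds the evaluation step and the record of fixes.
    return {"aborted": aborted, "observed": observed}


if __name__ == "__main__":
    events: List[str] = []
    result = run_experiment(
        inject_fault=lambda: events.append("inject"),
        remove_fault=lambda: events.append("remove"),
        samples=[(1000, 320), (950, 400), (700, 900), (600, 1200)],
        abort_rt_ms=800,  # assumption: abort above 800 ms
    )
    print(result["aborted"], len(result["observed"]), events)
```

Putting the rollback in a `finally` block is a deliberate choice: the fault is removed whether the experiment finishes, aborts, or raises an exception.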
Future plans at Ctrip include expanding tool coverage to more fault types, increasing automation for both production and test environments, and fostering a "design for failure" culture where teams regularly challenge system resilience.
In summary, Chaos Engineering is not about creating chaos for its own sake but about exposing hidden risks within a controlled, bounded environment to strengthen system robustness.
System: "What cannot destroy me makes me stronger." Have you tried chaos experiments in your team?
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
