Applying Chaos Engineering at Ctrip: Practices, Experiments, and Platform Evolution
This article traces the Ctrip SRE team's adoption of chaos engineering, covering the motivations, roadmap, concrete experiments, platform maturity, and planned automation aimed at improving system resilience and operational reliability in a large-scale microservice environment.
Author Bio The author works on the Ctrip SRE team, is responsible for the reliability of Ctrip's website systems, and explores and implements high-availability operational architectures such as multi-active disaster recovery, end-to-end stress testing, chaos engineering, and AIOps.
Why Chaos Engineering? Rapid evolution of Ctrip's business and technical architecture has raised the cost of downtime and reduced user tolerance for failures. Proactive fault injection through chaos engineering helps expose system fragilities early, reduces the impact of real incidents, and trains engineers in resilience practices.
Chaos Engineering Planning With tens of thousands of services and thousands of weekly releases, complete fault elimination is impossible. A good system should gracefully retry, throttle, or circuit‑break when failures occur. Ctrip adopted the five principles from "Chaos Engineering: Netflix's System Resilience" to build a roadmap for chaos practice.
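The graceful degradation described above, failing fast rather than hanging when a dependency misbehaves, can be sketched as a minimal circuit breaker. All names here (SimpleCircuitBreaker, failureThreshold, openMillis) are illustrative, not Ctrip's actual implementation:

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: after enough consecutive failures it
// "opens" and serves the fallback immediately instead of calling the
// failing dependency; a later success closes it again.
public class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before retrying
    private int failures = 0;
    private long openedAt = -1;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the call, or returns the fallback when the breaker is open or the call fails. */
    public <T> T call(Supplier<T> action, T fallback) {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis) {
            return fallback;              // open: fail fast, protect the caller
        }
        try {
            T result = action.get();
            failures = 0;                 // success closes the breaker
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++failures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip open
            }
            return fallback;
        }
    }
}
```

Production systems would add a half-open state and per-dependency metrics; the point here is only the fail-fast contract a chaos experiment verifies.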
Practice Initially, a fault-injection platform was built to support common failure types. Fault scenarios were abstracted into five categories: entry-point, application, data, system, and network. They are implemented with Linux tc for network latency and jitter, Java bytecode instrumentation (ASM, Javassist) for application-level faults, and native Kubernetes mechanisms for system-level faults. Open-source projects such as Netflix's Chaos Monkey, Alibaba's ChaosBlade, and PingCAP's Chaos Mesh served as references.
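Ctrip injects application-level faults through bytecode instrumentation (ASM/Javassist); a simplified, hypothetical wrapper can show the same effect at the call site, adding latency or replacing the result with an exception:

```java
import java.util.function.Supplier;

// Illustrative application-level fault injector. FaultInjector and its
// fields are hypothetical names, not the platform's real API.
public class FaultInjector {
    private final long delayMs;       // injected latency before the real call
    private final boolean throwError; // whether to replace the result with an exception

    public FaultInjector(long delayMs, boolean throwError) {
        this.delayMs = delayMs;
        this.throwError = throwError;
    }

    /** Invokes the target after injecting the configured fault. */
    public <T> T invoke(Supplier<T> target) {
        try {
            Thread.sleep(delayMs);    // latency fault
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (throwError) {
            throw new RuntimeException("injected fault"); // exception fault
        }
        return target.get();
    }
}
```

Bytecode instrumentation achieves the same interception without touching business code, which is why it suits a shared platform.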
Experiment 1: Verifying Weak Dependency on Review Service Goal: Ensure the product-detail service is not crippled when the downstream review service experiences latency, by applying proper circuit breakers. The experiment injected latency into the review service, scoped so that only calls from the product-detail service were affected, leaving the review service's other upstream callers untouched.
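The degraded path this experiment verifies can be sketched as a call to the review service with a hard timeout and an empty-list fallback. ReviewClient and fetchReviews are illustrative names, not Ctrip's actual service code:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Sketch of a weak dependency: the product-detail side bounds its wait on
// the review service and degrades to an empty review list instead of hanging.
public class ReviewClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Fetch reviews with a hard timeout; degrade to an empty list on latency or error. */
    public List<String> fetchReviews(Callable<List<String>> remoteCall, long timeoutMs) {
        Future<List<String>> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);            // give up on the slow call
            return Collections.emptyList(); // weak dependency: degrade, don't fail
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

The chaos experiment then checks that, with latency injected, the product-detail page still renders with reviews omitted rather than timing out end to end.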
Experiment 2: Set‑Based Drill of Core Tools Goal: Validate that core tools (monitoring, disaster‑recovery, release) remain functional in a different data center when an entire data center fails. The drill performed unannounced instance shutdowns of core tools in IDC‑2 and observed continued service from IDC‑1.
Key Success Factors 1) Acceptance of chaos engineering as a cultural shift that balances functional and non‑functional requirements. 2) Maturity of the fault‑injection platform, which must be easy to use, support automated CI/CD integration, provide fine‑grained control, and offer observability dashboards.
Current Stage Automation of large‑scale experiments is the next focus. Planned work includes: mapping strong/weak service dependencies via APM and manual labeling; improving monitoring accuracy with LSTM‑based anomaly detection and AIOps; and leveraging experiment data to train intelligent fault‑diagnosis models.
Final Thoughts Chaos engineering is a methodology, not a tool; it requires a robust platform, comprehensive monitoring, intelligent alerts, tracing, architecture awareness, and rapid fault localization. Success depends on embracing failure, imagining worst‑case scenarios, and continuously validating resilience through controlled experiments.
Ctrip Technology
The official Ctrip Technology account, for sharing, exchange, and growth.