Applying Chaos Engineering at Ctrip: Practices, Experiments, and Platform Evolution
This article traces the Ctrip SRE team's adoption of chaos engineering, covering the motivations, roadmap, concrete experiments, platform maturity, and planned automation aimed at improving system resilience and operational reliability in a large-scale microservice environment.
Author Bio The author works on the Ctrip SRE team, is responsible for the reliability of Ctrip's website systems, and explores and implements high-availability operational architectures such as multi-active disaster recovery, end-to-end stress testing, chaos engineering, and AIOps.
Why Chaos Engineering? Rapid evolution of Ctrip's business and technical architecture has raised the cost of downtime and reduced user tolerance for failures. Proactive fault injection through chaos engineering helps expose system fragilities early, reduces the impact of real incidents, and trains engineers in resilience practices.
Chaos Engineering Planning With tens of thousands of services and thousands of weekly releases, complete fault elimination is impossible. A good system should gracefully retry, throttle, or circuit‑break when failures occur. Ctrip adopted the five principles from "Chaos Engineering: Netflix's System Resilience" to build a roadmap for chaos practice.
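The graceful degradation described above, failing fast rather than hanging when a dependency misbehaves, can be sketched as a minimal circuit breaker. All names here (SimpleCircuitBreaker, failureThreshold, openMillis) are illustrative, not Ctrip's actual implementation:

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: after enough consecutive failures it
// "opens" and serves the fallback immediately instead of calling the
// failing dependency; a later success closes it again.
public class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before retrying
    private int failures = 0;
    private long openedAt = -1;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the call, or returns the fallback when the breaker is open or the call fails. */
    public <T> T call(Supplier<T> action, T fallback) {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis) {
            return fallback;              // open: fail fast, protect the caller
        }
        try {
            T result = action.get();
            failures = 0;                 // success closes the breaker
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++failures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip open
            }
            return fallback;
        }
    }
}
```

Production systems would add a half-open state and per-dependency metrics; the point here is only the fail-fast contract a chaos experiment verifies.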
Practice Initially, a fault-injection platform was built to support common failure types. Fault scenarios were abstracted into five categories: entry-point, application, data, system, and network. They are implemented with Linux tc for network latency and jitter, Java bytecode instrumentation (ASM, Javassist) for application-level faults, and native Kubernetes mechanisms for system-level faults. Open-source projects such as Netflix's Chaos Monkey, Alibaba's ChaosBlade, and PingCAP's Chaos Mesh served as references.
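Ctrip injects application-level faults through bytecode instrumentation (ASM/Javassist); a simplified, hypothetical wrapper can show the same effect at the call site, adding latency or replacing the result with an exception:

```java
import java.util.function.Supplier;

// Illustrative application-level fault injector. FaultInjector and its
// fields are hypothetical names, not the platform's real API.
public class FaultInjector {
    private final long delayMs;       // injected latency before the real call
    private final boolean throwError; // whether to replace the result with an exception

    public FaultInjector(long delayMs, boolean throwError) {
        this.delayMs = delayMs;
        this.throwError = throwError;
    }

    /** Invokes the target after injecting the configured fault. */
    public <T> T invoke(Supplier<T> target) {
        try {
            Thread.sleep(delayMs);    // latency fault
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (throwError) {
            throw new RuntimeException("injected fault"); // exception fault
        }
        return target.get();
    }
}
```

Bytecode instrumentation achieves the same interception without touching business code, which is why it suits a shared platform.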
Experiment 1: Verifying Weak Dependency on Review Service Goal: Ensure the product-detail service is not crippled when the downstream review service experiences latency, by applying proper circuit breakers. The experiment injected latency into the review service, scoped so that only calls from the product-detail service were affected, leaving the review service's other upstream callers untouched.
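The degraded path this experiment verifies can be sketched as a call to the review service with a hard timeout and an empty-list fallback. ReviewClient and fetchReviews are illustrative names, not Ctrip's actual service code:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Sketch of a weak dependency: the product-detail side bounds its wait on
// the review service and degrades to an empty review list instead of hanging.
public class ReviewClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Fetch reviews with a hard timeout; degrade to an empty list on latency or error. */
    public List<String> fetchReviews(Callable<List<String>> remoteCall, long timeoutMs) {
        Future<List<String>> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);            // give up on the slow call
            return Collections.emptyList(); // weak dependency: degrade, don't fail
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

The chaos experiment then checks that, with latency injected, the product-detail page still renders with reviews omitted rather than timing out end to end.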
Experiment 2: Set‑Based Drill of Core Tools Goal: Validate that core tools (monitoring, disaster‑recovery, release) remain functional in a different data center when an entire data center fails. The drill performed unannounced instance shutdowns of core tools in IDC‑2 and observed continued service from IDC‑1.
Key Success Factors 1) Acceptance of chaos engineering as a cultural shift that balances functional and non‑functional requirements. 2) Maturity of the fault‑injection platform, which must be easy to use, support automated CI/CD integration, provide fine‑grained control, and offer observability dashboards.
Current Stage Automation of large‑scale experiments is the next focus. Planned work includes: mapping strong/weak service dependencies via APM and manual labeling; improving monitoring accuracy with LSTM‑based anomaly detection and AIOps; and leveraging experiment data to train intelligent fault‑diagnosis models.
Final Thoughts Chaos engineering is a methodology, not a tool; it requires a robust platform, comprehensive monitoring, intelligent alerts, tracing, architecture awareness, and rapid fault localization. Success depends on embracing failure, imagining worst‑case scenarios, and continuously validating resilience through controlled experiments.
Ctrip Technology
The official Ctrip Technology account, for sharing, exchange, and growth.