How Qunar Turned Chaos Engineering into Reliable Operations: A Deep Dive
A look at how Qunar put chaos engineering into practice: the value it delivers, the four strategic directions, shutdown and application drills, strong/weak dependency handling, container support, and automated closed‑loop testing, which together improve system resilience, process robustness, and user experience.
This article is based on Zhu Shizhi's talk at the GOPS Global Operations Conference 2021 (Shanghai) and explains how Qunar applied chaos engineering to turn probabilistic problems into deterministic ones.
The discussion follows four main directions:
Evaluating the value of chaos engineering for a company and the benefits it can bring.
Describing the architecture of the chaos engineering platform built by Qunar, a medium‑size internet company.
Implementing automated closed‑loop drills, which are critical.
Integrating chaos engineering with other service domains to create greater value.
1. Value of Chaos Engineering
Recent major outages (KT network failure, Facebook outage, Ctrip ticketing issue) highlight the need for proactive reliability. Qunar runs over 3,000 active applications, 18,000 service interfaces, 3,500 HTTP domains, and more than 13,000 message‑queue topics, making complete reliability impossible without systematic testing.
Failures are categorized by type and impact: IDC data‑center, middleware, machine resources, application issues, and upstream/downstream dependencies.
Chaos engineering, which originated at Netflix, proactively injects failures to observe real system behavior and build confidence that the system can withstand uncontrolled conditions.
It has two main goals: to build confidence in system reliability that rests on verification rather than on chance, and to convert probabilistic issues into deterministic ones.
Benefits are three‑fold: People – users experience more stable services; Process – fault handling shifts from passive to active detection and validation; System – overall resilience improves.
2. Qunar's Chaos Engineering Practice
Since 2019 Qunar has rolled out a chaos engineering platform in multiple stages.
The platform covers five layers: data center, middleware, servers, applications, and service dependencies. Early work focused on the three lower layers (data center, middleware, and servers).
2.1 Shutdown Drills
Shutdown drills simulate the complete loss of a data center or service line, affecting over a thousand machines, to verify the impact and the recovery time. Running them safely requires the following:
Maintain a comprehensive CMDB or asset platform for grouping resources.
Embed communication mechanisms into the drill workflow.
Perform real shutdowns, not mock ones.
Integrate core business metric alarms to abort drills if needed.
Automate service restoration after shutdown.
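The requirements above can be sketched as a small drill runner. This is only an illustration under stated assumptions: the class `ShutdownDrill` and its callbacks (`stop_host`, `start_host`, `alarm_firing`, `notify`) are hypothetical names, not part of Qunar's platform, and a real drill would also hold the shutdown for an observation window before restoring.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ShutdownDrill:
    """Hypothetical drill runner; all names here are illustrative."""
    hosts: List[str]                    # resource group resolved from the CMDB
    stop_host: Callable[[str], None]    # performs a real shutdown, not a mock
    start_host: Callable[[str], None]   # brings a host back
    alarm_firing: Callable[[], bool]    # core business metric alarm
    notify: Callable[[str], None]       # communication hook in the workflow
    stopped: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Return True if every host was shut down, False if aborted."""
        self.notify(f"drill starting on {len(self.hosts)} hosts")
        completed = True
        try:
            for host in self.hosts:
                # Check the alarm before each step so the drill can abort early.
                if self.alarm_firing():
                    self.notify("core metric alarm fired, aborting drill")
                    completed = False
                    break
                self.stop_host(host)
                self.stopped.append(host)
        finally:
            # Restoration always runs, whether the drill finished or aborted.
            for host in reversed(self.stopped):
                self.start_host(host)
            self.notify("all stopped hosts restored")
        return completed
```

Putting the restore step in `finally` means an aborted drill or an unexpected exception still triggers automatic restoration of every host that was actually stopped.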
2.2 Application Drills
Once the infrastructure layers are covered, conduct application‑level drills, such as inducing continuous Full GC, log bottlenecks, or degraded response rates, to test resilience under partial failure.
2.3 Strong/Weak Dependency Tests
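Identify and test strong versus weak dependencies by asking: if a downstream service fails, can the upstream survive? If it cannot, the dependency is strong; if it keeps working in a degraded form, the dependency is weak. Verifying this classification helps prevent cascading failures.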
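A minimal sketch of that question as a test loop follows. The function and callback names (`classify_dependencies`, `inject_failure`, `recover`, `upstream_healthy`) are assumptions made for illustration; the source does not describe the platform's actual interfaces.

```python
from typing import Callable, Dict, List


def classify_dependencies(
    dependencies: List[str],
    inject_failure: Callable[[str], None],
    recover: Callable[[str], None],
    upstream_healthy: Callable[[], bool],
) -> Dict[str, str]:
    """Fail each downstream dependency in turn and observe the upstream."""
    result: Dict[str, str] = {}
    for dep in dependencies:
        inject_failure(dep)
        try:
            # If the upstream still passes its health assertion while the
            # dependency is down, the dependency is weak; otherwise strong.
            result[dep] = "weak" if upstream_healthy() else "strong"
        finally:
            recover(dep)  # always undo the fault before the next case
    return result
```

Failing one dependency at a time keeps the result attributable: if the upstream breaks, there is exactly one injected fault that can explain it.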
2.4 Container Support
Adopt container‑native execution using ChaosBlade, supporting both VM and Kubernetes environments.
ChaosBlade provides an execution layer with command‑line and HTTP interfaces, covering OS, language stack, and Kubernetes failures.
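As an illustration of the command‑line interface, the sketch below only builds argument lists for the `blade` CLI and does not execute them. It assumes the `blade create cpu load` experiment with the `--cpu-percent` and `--timeout` flags and the `blade destroy <uid>` command as described in the ChaosBlade documentation; check them against the installed version. The helper names `blade_cpu_load` and `blade_destroy` are hypothetical.

```python
from typing import List, Optional


def blade_cpu_load(cpu_percent: int, timeout_s: Optional[int] = None) -> List[str]:
    """Build the argv for a CPU load experiment (assumed ChaosBlade flags)."""
    if not 0 < cpu_percent <= 100:
        raise ValueError("cpu_percent must be in (0, 100]")
    argv = ["blade", "create", "cpu", "load", "--cpu-percent", str(cpu_percent)]
    if timeout_s is not None:
        # A timeout lets the experiment end on its own if cleanup is missed.
        argv += ["--timeout", str(timeout_s)]
    return argv


def blade_destroy(uid: str) -> List[str]:
    """Build the argv that ends the experiment identified by `uid`."""
    return ["blade", "destroy", uid]
```

A drill platform could pass these lists to `subprocess.run` on the target host, keep the experiment UID returned by the create call, and always issue the destroy call during cleanup.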
3. Automated Closed‑Loop Drill Strategy
Automation requires complete application metadata, traffic injection, and assertion mechanisms.
Metadata includes protocols, interfaces, data types, and trace topology, enabling automatic generation of dependency graphs.
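One way such a graph could be derived from trace data is sketched below. The `Span` record and its fields are assumptions for illustration; real trace formats carry far more detail.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Span(NamedTuple):
    """Hypothetical, simplified trace record: one call between two apps."""
    caller: str
    callee: str
    protocol: str


def build_dependency_graph(spans: Iterable[Span]) -> Dict[str, Set[str]]:
    """Map each application to the set of applications it calls."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for span in spans:
        if span.caller != span.callee:  # ignore calls within one application
            graph[span.caller].add(span.callee)
    return dict(graph)
```

For example, spans for `order -> payment` and `order -> coupon` produce `{"order": {"payment", "coupon"}}`, which is the kind of list a drill planner needs to decide which downstream services to fail.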
Traffic can be real production traffic or generated via internal automated testing platforms, with assertions derived from metric alarms and AI‑driven diagnostics.
Two execution modes are used:
Incremental drills for newly added dependencies.
Full‑scale drills to catch regressions in existing dependencies.
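The difference between the two modes comes down to which dependency edges are selected for a run. A minimal sketch, assuming dependencies are stored as `(upstream, downstream)` pairs and that the platform records which pairs have already been drilled (both are assumptions, as is the name `plan_drill`):

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (upstream application, downstream dependency)


def plan_drill(current: Set[Edge], already_drilled: Set[Edge], mode: str) -> List[Edge]:
    """Select the dependency edges to exercise in this run."""
    if mode == "incremental":
        # Only edges that appeared since the last drill.
        return sorted(current - already_drilled)
    if mode == "full":
        # Every known edge, to catch regressions in existing dependencies.
        return sorted(current)
    raise ValueError(f"unknown mode: {mode!r}")
```

Incremental runs stay cheap enough to run often, while periodic full runs re‑verify edges whose behavior may have changed since they were last tested.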
Results showed that while 50% of interfaces were initially thought to be weak, 73% actually behaved as strong dependencies under failure.
4. Combining Chaos Engineering with Service Governance
Chaos engineering validates service‑governance configurations such as timeout, circuit‑breaker, and rate‑limit settings by pre‑testing them under failure scenarios.
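To make the idea concrete, the sketch below drills a deliberately simplified circuit breaker with an injected, always‑failing downstream and counts how many calls still reach it. The `CircuitBreaker` class is a toy written for this example (no half‑open state, no time window) and is not the governance component Qunar uses.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures and stays open."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()  # short-circuit: the downstream is not called
        try:
            result = fn()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # a success resets the consecutive-failure count
        return result


def drill_breaker(breaker: CircuitBreaker, attempts: int) -> int:
    """Inject a failing downstream; return how many calls reached it."""
    hits = 0

    def failing_downstream() -> str:
        nonlocal hits
        hits += 1
        raise ConnectionError("injected failure")

    for _ in range(attempts):
        breaker.call(failing_downstream, fallback=lambda: "degraded")
    return hits
```

With a threshold of 3 and 10 attempts, only the first 3 calls reach the failing downstream; a misconfigured threshold would show up in the drill as a much larger hit count.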
Effective blast‑radius control is achieved by analyzing dependency graphs and selecting drill targets whose failure will not propagate to critical systems.
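A sketch of that selection follows, assuming the dependency graph is a mapping from each caller to the services it calls. It conservatively treats every dependency as strong, so it may exclude more targets than necessary. The function names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set


def blast_radius(graph: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return every service that calls `target` directly or transitively."""
    callers: Dict[str, Set[str]] = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].add(caller)  # reverse the edges

    affected: Set[str] = set()
    queue = deque([target])
    while queue:  # breadth-first walk up the reversed graph
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller != target and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


def safe_targets(
    graph: Dict[str, Set[str]], candidates: Iterable[str], critical: Set[str]
) -> List[str]:
    """Keep candidates that are not critical and cannot reach a critical service."""
    return [
        t for t in candidates
        if t not in critical and not (blast_radius(graph, t) & critical)
    ]
```

Combining this with measured strong/weak results would narrow the radius further, since a failure does not propagate through a dependency that has been verified as weak.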
Overall, the integrated approach enhances reliability, reduces manual incident response, and provides data‑driven confidence in service‑governance policies.