Practical Chaos Engineering Practices at Qunar Travel: Architecture, Scenarios, and Automation

This article details Qunar Travel's mature chaos engineering platform built on chaosblade, covering value analysis, system architecture, shutdown and dependency drills, automated closed‑loop testing, attack‑defense exercises, and the measurable reliability improvements achieved across thousands of services.

Qunar Tech Salon

Author Introduction

Yu Haiying joined Qunar Travel in 2014 as a test development engineer, responsible for ticket service backend testing, and began promoting chaos engineering in 2021 to detect quality gaps and build resilience.

1. Introduction

Qunar's chaos engineering practice, built on the fault‑injection tool chaosblade, has matured over more than two years of use. Its core highlights include lossless drills, automated stop‑loss, full‑stack detection, and visual report output. This article covers four aspects of the practice: value analysis, platform architecture, automated closed‑loop drills, and attack‑defense exercises.

2. Value Analysis

2.1 Background

Major historical outages, such as Facebook's 7‑hour server failure and a nationwide Korean telecom outage, illustrate the severe impact of infrastructure failures and the need for proactive measures.

2.1.2 Complex Clusters

Qunar runs over 3,000 active applications, 18,000 Dubbo interfaces, 3,500 gateway domains, and 13,000 QMQ topics across five programming languages, which makes complete reliability difficult to guarantee.

2.1.4 Common Fault Types

Faults fall into five categories: data‑center issues; middleware failures (ZK, MQ, DB, cache); machine problems (CPU, disk, I/O); application issues (Full GC, service downtime, log slowdown, thread‑pool exhaustion); and dependency problems (downstream latency or exceptions).

2.2 Chaos Engineering Concept

Chaos engineering is a discipline that deliberately injects failures into distributed systems to observe behavior, enabling pre‑emptive mitigation and improved emergency response.

2.3 Goals

The goals are to build confidence that systems can withstand unpredictable production issues and to turn probabilistic problems into deterministic ones.

2.4 Benefits

Benefits are three‑fold: (1) users enjoy a more stable experience; (2) fault handling becomes proactive, with alerts verified in advance; (3) system resilience improves, which raises overall reliability.

3. Chaos Engineering Platform

The platform's architecture spans five layers: data center, middleware, server, application, and service dependency. Qunar adopts a bottom‑up rollout, starting with data‑center and middleware failures.

3.1 Shutdown Drills

These drills simulate complete power loss of a data center or machine pool (over 1,000 nodes per run). Key points include an application profile platform, notification mechanisms, real shutdown, automatic alarm‑driven circuit breaking, and auto‑restart after power‑on.
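The interplay of notification, real shutdown, alarm‑driven circuit breaking, and recovery can be illustrated with a minimal sketch. It is not Qunar's implementation: the function `run_shutdown_drill` and the hooks `notify`, `power_off`, `power_on`, and `critical_alarm_firing` are hypothetical names that stand in for the notification system, the power controller, and the alerting system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DrillLog:
    """Ordered record of what the drill did, useful for the post-drill report."""
    events: List[str] = field(default_factory=list)


def run_shutdown_drill(
    hosts: List[str],
    notify: Callable[[str], None],
    power_off: Callable[[str], None],
    power_on: Callable[[str], None],
    critical_alarm_firing: Callable[[], bool],
) -> DrillLog:
    """Power hosts off one by one and stop as soon as a critical alarm fires.

    All four callables are hypothetical hooks (assumptions of this sketch).
    """
    log = DrillLog()
    notify(f"shutdown drill starting on {len(hosts)} hosts")

    powered_off: List[str] = []
    aborted = False
    for host in hosts:
        power_off(host)
        powered_off.append(host)
        log.events.append(f"off:{host}")
        # Circuit breaking: check alarms after every step, abort on the first hit.
        if critical_alarm_firing():
            aborted = True
            log.events.append("abort")
            break

    # Recovery: power every affected host back on, whether or not we aborted.
    for host in powered_off:
        power_on(host)
        log.events.append(f"on:{host}")

    notify("drill aborted by alarm" if aborted else "drill finished")
    return log
```

Passing the hooks in as callables keeps the stop‑loss logic testable without touching real machines; a production version would also need timeouts and batching, which this sketch leaves out.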

3.1.4 Results

49 shutdown drills involved 4,000+ machines and 500+ applications, uncovering more than 10 issues per drill; 71 shutdown drills covered 3,000+ machines and 250+ applications.

3.2 Application Dependency Drills

These drills focus on testing strong versus weak dependencies. After evaluating three tools (the VM‑based ChAP, Chaosblade, and Chaos Mesh), Chaosblade was selected for its VM/K8s support and open‑source nature.

| Component | Supported Platforms | Scenario Richness | Open‑Source | Overall | Intrusiveness | Features |
| --- | --- | --- | --- | --- | --- | --- |
| ChAP | VM | Rich | No | Good | High | Experimental comparison |
| Chaosblade | VM/K8S | Rich | Yes | Poor (only agent at the time) | Low | Simple, extensible, active community |
| Chaos Mesh | K8S | Rich | Yes | Good | None | Cloud‑native, active community |

Scenarios that Chaosblade lacked, such as HTTP timeout, Full GC, log congestion, call‑point distinction, and link matching, were added and contributed back to the open‑source community.
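For readers who have not used Chaosblade, the sketch below shows roughly how a platform might drive its `blade` command‑line tool. The command shapes (`blade create <target> <action>`, `blade destroy <uid>`, the `--timeout` flag, and a JSON reply with `success` and `result` fields) follow the project's public documentation as best understood here and should be checked against the installed version. The helper names are hypothetical, and the scenarios Qunar added are deliberately not shown because the article does not give their flags.

```python
import json
from typing import Dict, List, Optional


def build_create_command(
    target: str,
    action: str,
    flags: Dict[str, object],
    timeout_s: Optional[int] = None,
) -> List[str]:
    """Build the argv for `blade create`, e.g. target="network", action="delay"."""
    argv = ["blade", "create", target, action]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    if timeout_s is not None:
        # Assumed: --timeout makes the experiment expire on its own (seconds).
        argv += ["--timeout", str(timeout_s)]
    return argv


def parse_experiment_uid(stdout: str) -> str:
    """Extract the experiment UID from the JSON reply printed by `blade create`."""
    reply = json.loads(stdout)
    if not reply.get("success"):
        raise RuntimeError(f"fault injection failed: {reply}")
    return reply["result"]


def build_destroy_command(uid: str) -> List[str]:
    """Build the argv that removes a previously injected fault."""
    return ["blade", "destroy", uid]
```

A caller would pass these argument lists to `subprocess.run(..., capture_output=True, text=True)` on the target host, keep the returned UID, and always issue the destroy command when the drill ends or an alarm fires.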

3.2.2 Goal

Verify that weak dependencies can fail without affecting the main flow.

3.2.3 Key Points

Dependencies are collected from access logs and ZK registration data, then stored in a DB where each relationship is marked as strong or weak.

3.2.4 Process

The process has four stages: collect dependencies, manually label each one as strong or weak, execute real drills, and fix and re‑label after the drill, which closes the loop.
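The last stage, comparing labels against what the drill actually observed, can be sketched as follows. This is an illustration under assumptions, not the platform's data model: `Dependency`, `Strength`, and `review_drill` are hypothetical names, and the rule that a "strong" dependency which did not break the main flow is relabeled as weak is one possible policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class Dependency:
    caller: str
    callee: str


def review_drill(
    labels: Dict[Dependency, Strength],
    main_flow_ok: Dict[Dependency, bool],
) -> Tuple[Dict[Dependency, Strength], List[Dependency]]:
    """Reconcile manual labels with drill observations.

    main_flow_ok[dep] is True if the caller's main flow still worked while
    a fault was injected into dep. Returns the updated labels and the list
    of dependencies that need a code fix.
    """
    updated = dict(labels)
    to_fix: List[Dependency] = []
    for dep, ok in main_flow_ok.items():
        label = labels.get(dep)
        if label is Strength.WEAK and not ok:
            # Labeled weak, yet its failure broke the main flow: a real issue.
            to_fix.append(dep)
        elif label is Strength.STRONG and ok:
            # Labeled strong, yet the main flow survived: relabel (assumed policy).
            updated[dep] = Strength.WEAK
    return updated, to_fix
```

Dependencies in `to_fix` would go back to the owning team, be drilled again after the fix, and only then keep their weak label.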

3.2.5 Effect

68 systems were exercised, consuming 69.5 person‑days and uncovering 136 issues.

3.3 Automated Closed‑Loop Drills

Goal: maintain continuous reliability with minimal manual effort and maximal coverage.

Challenges include obtaining comprehensive application metadata (interfaces, protocols, DBs, traces) to auto‑generate fault scenarios, and using load‑testing and automation platforms to drive traffic without impacting real users.
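To make the idea of auto‑generating fault scenarios from metadata concrete, here is a small sketch. It assumes the metadata has already been reduced to a map from dependency name to protocol; the `FAULTS_BY_PROTOCOL` table, the `Scenario` type, and the sample names are hypothetical. The two fault kinds mirror the dependency problems named earlier (downstream latency and exceptions).

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from a dependency's protocol to the faults worth injecting.
FAULTS_BY_PROTOCOL: Dict[str, List[str]] = {
    "dubbo": ["delay", "throw_exception"],
    "http": ["delay", "throw_exception"],
    "db": ["delay", "throw_exception"],
}


@dataclass(frozen=True)
class Scenario:
    app: str
    dependency: str
    protocol: str
    fault: str


def generate_scenarios(app: str, dependencies: Dict[str, str]) -> List[Scenario]:
    """Expand each known dependency of `app` into one scenario per fault type.

    `dependencies` maps dependency name -> protocol. Dependencies with an
    unrecognized protocol are skipped rather than guessed at.
    """
    scenarios: List[Scenario] = []
    for dep, protocol in sorted(dependencies.items()):
        for fault in FAULTS_BY_PROTOCOL.get(protocol, []):
            scenarios.append(Scenario(app, dep, protocol, fault))
    return scenarios
```

Skipping unknown protocols keeps the generated set conservative; such dependencies can be reported separately so that coverage gaps stay visible.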

Key points: case selection strategies to ensure >90% dependency hit rate, and circuit‑breaker logic driven by monitored alerts.
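The article does not describe the case selection strategy itself. One plausible approach, shown here purely as an assumption, is a greedy set‑cover pass: keep picking the test case that hits the most not‑yet‑covered dependencies until the hit rate reaches the target. The function name `select_cases` and the input shape are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def select_cases(
    case_hits: Dict[str, Set[str]],
    all_dependencies: Set[str],
    target: float = 0.9,
) -> Tuple[List[str], float]:
    """Greedily choose cases until the dependency hit rate reaches `target`.

    case_hits maps a case id to the set of dependencies its traffic touches.
    Returns the chosen case ids (in pick order) and the achieved hit rate.
    """
    if not all_dependencies:
        return [], 1.0

    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(case_hits)

    while len(covered) / len(all_dependencies) < target and remaining:
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda c: len((remaining[c] & all_dependencies) - covered),
        )
        gain = (remaining.pop(best) & all_dependencies) - covered
        if not gain:
            break  # No remaining case adds coverage; the target is unreachable.
        covered |= gain
        chosen.append(best)

    return chosen, len(covered) / len(all_dependencies)
```

If the returned rate is below the target, the uncovered dependencies indicate where new test cases or recorded traffic are needed.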

Result: Completed automated closed‑loop drills for 10 entry points covering 3,820 dependencies.

3.4 Attack‑Defense Drills

Background: Even with preventive drills, production failures can occur; attack‑defense drills train engineers to quickly locate and resolve faults.

Goal: Improve engineers' fault‑handling skills and solidify emergency response plans.

Approach: Attackers inject faults; defenders diagnose and report; attackers verify correctness for scoring. The same automation infrastructure as closed‑loop drills is reused.

Process: (1) Design attack scenarios from historical high‑frequency faults; (2) Defenders locate and report; (3) Automatic termination on success or timeout; (4) Scoring based on response time and difficulty; (5) Post‑mortem fixes.
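The article states that scoring depends on response time and difficulty but gives no formula. The sketch below is one hypothetical scheme, included only to show how the two factors could be combined: a correct answer earns at least half of the difficulty‑based points, and the rest scales with how quickly the defenders responded. The function name, the 10‑point multiplier, and the 50/50 split are all assumptions.

```python
def score_defense(
    difficulty: int,
    response_s: float,
    timeout_s: float,
    correct: bool,
) -> float:
    """Score one attack-defense round (hypothetical formula).

    difficulty: scenario difficulty level, e.g. 1 to 5.
    response_s: seconds the defenders took to report the fault.
    timeout_s: seconds after which the attack is terminated automatically.
    correct: whether the attackers confirmed the reported root cause.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    if not correct or response_s >= timeout_s:
        return 0.0
    speed_factor = 1.0 - response_s / timeout_s  # 1.0 = instant, near 0 = at timeout
    return round(difficulty * 10 * (0.5 + 0.5 * speed_factor), 2)
```

For example, a difficulty‑3 fault located in 300 seconds with a 600‑second timeout would score 22.5 under this scheme.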

Future Plan: Expand to regular, company‑wide attack‑defense events, establishing a “chaos culture”.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
