Scaling Chaos Engineering at Qunar: Lessons from Thousands of Microservices

Qunar shares how it built a large‑scale chaos engineering platform for thousands of microservices, detailing tool selection, architecture, evolution stages, fault‑injection scenarios, strong/weak dependency automation, open‑source contributions, and future plans for automated random drills.

Alibaba Cloud Native

Tool Selection

To avoid reinventing the wheel, Qunar evaluated open‑source chaos engineering tools that could run on both KVM and container platforms and fit its Java‑centric stack. After considering community activity and compatibility, the team chose ChaosBlade as the fault‑injection engine and built a custom chaos engineering console to orchestrate experiments.

Architecture Overview

Vertically, the system has four layers. At the top, a service‑governance portal provides the application topology. Below it, the chaos console defines and controls fault‑injection tasks. SaltStack and chaosblade‑operator install and uninstall the ChaosBlade agents, and at the bottom sit the target resources, hosted on both KVM VMs and Kubernetes containers. The orchestration layer talks to the ChaosBlade agents over RESTful APIs.
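As a rough illustration of the REST hop between the console and an agent, the sketch below builds the URL for one blade command. It assumes the agent was started with ChaosBlade's `blade server start` and listens on port 9526; the article does not describe Qunar's actual endpoint layout, and the helper name `build_blade_url` is hypothetical.

```python
from urllib.parse import quote

# Assumption: the agent runs `blade server start` on port 9526.
# The article does not say how Qunar exposes its agents.
DEFAULT_BLADE_PORT = 9526


def build_blade_url(host: str, command: str, port: int = DEFAULT_BLADE_PORT) -> str:
    """Return the URL that asks a ChaosBlade agent to run one blade command."""
    # The command is percent-encoded so spaces and flags survive the query string.
    return f"http://{host}:{port}/chaosblade?cmd={quote(command)}"


if __name__ == "__main__":
    # Prints: http://10.0.0.1:9526/chaosblade?cmd=create%20cpu%20fullload
    print(build_blade_url("10.0.0.1", "create cpu fullload"))
```

A console would send a GET request to this URL and keep the experiment identifier from the response so it can destroy the experiment later.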

Horizontally, an automated testing platform supplies regression cases and marks strong/weak dependencies, while the console watches core metrics and alerts so that it can abort a drill, or resume it, when necessary.
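The abort decision can be pictured as a small guard that runs on every monitoring tick. The sketch below is an assumption about how such a check might look, not Qunar's implementation; the names `Threshold`, `Action`, and `guard`, and the example metric names, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Action(Enum):
    CONTINUE = "continue"
    ABORT = "abort"


@dataclass(frozen=True)
class Threshold:
    metric: str       # hypothetical metric name, e.g. "error_rate"
    max_value: float  # the drill aborts once the sample exceeds this value


def guard(samples: Dict[str, float], thresholds: List[Threshold], firing_alerts: int = 0) -> Action:
    """Decide whether a running drill may continue."""
    # Any firing alert stops the drill immediately.
    if firing_alerts > 0:
        return Action.ABORT
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        # A missing sample is treated as unsafe: without data, stop the drill.
        if value is None or value > threshold.max_value:
            return Action.ABORT
    return Action.CONTINUE
```

Treating a missing metric as a reason to abort is a deliberately conservative choice: a drill that cannot be observed should not keep running.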

System Evolution

The practice evolved in two phases: (1) building fault‑injection capabilities that let users manually create drills with various fault strategies, and (2) adding strong/weak dependency marking, verification, and an automated closed‑loop to improve microservice governance.

4.1 Fault Drills

Three primary fault‑injection scenarios were implemented: machine shutdown, OS‑level faults, and Java‑application faults, supplemented by scenario‑specific features. A typical drill workflow is illustrated below.

The main challenges were that the open‑source fault strategies did not cover everything Qunar needed and that containerized workloads had to be supported. To fill the gap, Qunar’s middleware team extended chaosblade‑exec‑jvm with plugins for AsyncHttpClient and QRedis and with call‑point fault injection for HTTP and Dubbo.

4.2 Strong/Weak Dependency Automation

The console periodically fetches dependency graphs from the service‑governance platform, generates exception‑based fault drills, injects faults into test environments, runs automated test cases, and compares results to determine whether a downstream dependency is strong or weak.
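The comparison step in this loop can be sketched as follows, assuming each test run yields a mapping from test case name to its core output. The function names and the `run_with_fault` callback are hypothetical; the article does not describe the data structures Qunar uses.

```python
from typing import Callable, Dict, Iterable

CaseResults = Dict[str, object]  # test case name -> core output of that case

_MISSING = object()  # marks a case that produced no result under fault


def classify_dependency(baseline: CaseResults, faulted: CaseResults) -> str:
    """Return "weak" if every baseline case gives the same output under fault."""
    if not baseline:
        raise ValueError("no baseline cases, so there is no evidence to classify")
    unchanged = all(
        faulted.get(name, _MISSING) == expected
        for name, expected in baseline.items()
    )
    return "weak" if unchanged else "strong"


def classify_all(
    dependencies: Iterable[str],
    baseline: CaseResults,
    run_with_fault: Callable[[str], CaseResults],
) -> Dict[str, str]:
    """Inject a fault into each dependency in turn and classify it."""
    # run_with_fault stands in for "inject an exception, run the cases, collect outputs".
    return {dep: classify_dependency(baseline, run_with_fault(dep)) for dep in dependencies}
```

A dependency is marked weak only when the caller's core outputs are unaffected by its failure; any changed or missing result marks it strong.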

Two difficulties stood out. The first was Java agent compatibility: the two agents involved, ChaosBlade and the recording‑playback agent, needed distinct namespaces. The second was making sure test assertions check the correctness of core data rather than just status codes. The namespace conflict was resolved by configuring a custom namespace in ChaosBlade, and blacklist rules were added to the recording‑playback agent to avoid interference.
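To illustrate why status codes alone are not enough, the sketch below compares only a chosen set of core fields between a baseline response and a response captured under fault. The dotted-path convention and the names `get_path` and `core_data_matches` are assumptions made for this example.

```python
from collections.abc import Mapping
from typing import Any, Iterable

_MISSING = object()  # marks a path that does not exist in a payload


def get_path(payload: Any, path: str) -> Any:
    """Follow a dotted path such as "order.price" through nested mappings."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return _MISSING
        current = current[key]
    return current


def core_data_matches(expected: Mapping, actual: Mapping, core_paths: Iterable[str]) -> bool:
    """Return True only if every core field is present and equal in both payloads."""
    for path in core_paths:
        want = get_path(expected, path)
        got = get_path(actual, path)
        # A core field missing from the baseline is a test-design error,
        # so it fails the check instead of passing silently.
        if want is _MISSING or got != want:
            return False
    return True
```

With this check, a degraded response that still returns status 200 but drops the price would fail, while differences in non-core fields such as a request ID would be ignored.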

Open‑Source Contributions

Along the way, Qunar contributed back to the ChaosBlade ecosystem. It submitted bug fixes and enhancements to chaosblade, chaosblade‑exec‑jvm, and chaosblade‑operator, many of which have been merged upstream, and worked with the community on joint development.

Future Plans

To date, the platform has run over 80 simulated data‑center power‑outage drills and more than 500 day‑to‑day drills, covering 50+ core applications and 4,000+ machines. The next goal is fully automated random drills in the online environment that keep the blast radius small and rely on steady‑state assertions. Eventually, periodic random testing should cover all core service links, and the team is also exploring broader chaos‑engineering use cases for service governance and stability.
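One way to picture random target selection with a bounded blast radius is to cap the share of each application's hosts that a drill may touch. This is a sketch of the idea only: the article gives no details of the planned design, and `pick_targets` and `max_fraction` are hypothetical names.

```python
import math
import random
from typing import Dict, List


def pick_targets(
    hosts_by_app: Dict[str, List[str]],
    max_fraction: float,
    rng: random.Random,
) -> Dict[str, List[str]]:
    """Randomly pick at most max_fraction of each application's hosts."""
    if not 0 < max_fraction < 1:
        raise ValueError("max_fraction must be strictly between 0 and 1")
    targets: Dict[str, List[str]] = {}
    for app in sorted(hosts_by_app):
        hosts = sorted(hosts_by_app[app])  # sorted so a seeded rng is reproducible
        cap = math.floor(len(hosts) * max_fraction)
        # Applications too small to stay under the cap are skipped entirely.
        if cap > 0:
            targets[app] = rng.sample(hosts, cap)
    return targets
```

Passing a seeded `random.Random` makes a drill plan reproducible, which helps when a random drill uncovers a problem that needs to be replayed.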

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
