
Chaos Engineering Practice at Qunar: Architecture, Implementation, and Future Plans

This article describes Qunar's multi‑year adoption of chaos engineering for microservice stability, covering tool selection, system architecture, fault‑injection workflows, challenges in container migration, strong/weak dependency automation, open‑source contributions, and plans for automated random chaos experiments.

Qunar Tech Salon

Qunar has been operating a microservice architecture for years, with thousands of services whose increasing call‑graph complexity caused frequent failures and significant economic loss, prompting a focus on stability engineering.

Since Netflix introduced chaos engineering in 2010, the practice has proven effective for exposing system weaknesses; Qunar began exploring chaos engineering in late 2019, selecting ChaosBlade as the fault‑injection tool and building a custom chaos‑engineer console.

Two internal requirements drove the choice of ChaosBlade: the tool had to support both KVM and container platforms, and it had to fit Qunar's Java‑centric tech stack. The console was developed in‑house to orchestrate experiments on top of it.

The overall architecture includes a service‑governance portal providing application metadata, a chaos‑engineer console for orchestrating fault‑injection scenarios, SaltStack and chaosblade‑operator for installing/uninstalling agents, and RESTful communication with agents running on KVM or Kubernetes workloads.
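The control flow described above can be sketched roughly as follows. This is a minimal illustration, not Qunar's implementation: the `Target` type, the `install_channel` and `agent_url` helpers, the port `9526`, and the `/chaosblade?cmd=...` path are all assumptions made for the example (the path is modeled on ChaosBlade's server mode and should be checked against the version in use).

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Target:
    """A machine or pod that should receive a fault (hypothetical type)."""
    host: str
    platform: str  # "kvm" or "k8s"


def install_channel(target: Target) -> str:
    """Pick the agent installer the article names for each platform."""
    channels = {"kvm": "saltstack", "k8s": "chaosblade-operator"}
    if target.platform not in channels:
        raise ValueError(f"unsupported platform: {target.platform}")
    return channels[target.platform]


def agent_url(target: Target, command: str, port: int = 9526) -> str:
    """Build the REST URL the console would call on the agent.

    Port and path are assumptions for illustration only.
    """
    query = urlencode({"cmd": command})
    return f"http://{target.host}:{port}/chaosblade?{query}"


if __name__ == "__main__":
    pod = Target(host="10.0.0.1", platform="k8s")
    print(install_channel(pod))
    print(agent_url(pod, "create cpu fullload"))
```

Keeping the installation channel separate from the REST call mirrors the split in the architecture: installation differs per platform, while the console talks to every agent the same way.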

System evolution occurred in two stages: first, building manual fault‑injection capabilities for validating system behavior; second, adding strong/weak dependency marking and automated verification to improve microservice governance.

Fault‑injection scenarios cover machine shutdown, OS‑level faults, and Java‑level faults. Custom plugins extend coverage to AsyncHttpClient, QRedis, and Dubbo.
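One way to picture a scenario catalog is as a small data model that renders each scenario into a ChaosBlade command line. The sketch below is hypothetical: the `FaultScenario` class, the three level names, and the class `com.example.OrderService` are invented for illustration, and the flag names should be verified against the ChaosBlade version in use.

```python
import shlex
from dataclasses import dataclass, field

# The three fault levels named in the article (labels are our own).
LEVELS = {"machine", "os", "java"}


@dataclass
class FaultScenario:
    level: str    # one of LEVELS
    target: str   # ChaosBlade target, e.g. "cpu" or "jvm"
    action: str   # ChaosBlade action, e.g. "fullload" or "delay"
    flags: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown fault level: {self.level}")

    def to_command(self) -> str:
        """Render the scenario as a `blade create ...` command string."""
        parts = ["blade", "create", self.target, self.action]
        for name, value in self.flags.items():
            parts += [f"--{name}", str(value)]
        return shlex.join(parts)


if __name__ == "__main__":
    scenario = FaultScenario(
        level="java",
        target="jvm",
        action="delay",
        flags={
            "time": 3000,                             # delay in milliseconds
            "classname": "com.example.OrderService",  # hypothetical class
            "methodname": "query",                    # hypothetical method
        },
    )
    print(scenario.to_command())
```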

During the container migration in 2021, three implementation options were evaluated: the pure open‑source chaosblade‑operator, a sidecar approach, and a hybrid of chaosblade‑operator plus blade server. The hybrid option was selected because it required the least rework of the existing system.

Strong/weak dependency automation involves the chaos console periodically fetching dependency graphs, generating exception‑based fault scenarios, injecting faults, running automated test cases, and using test assertions combined with fault‑strategy logs to determine dependency strength.
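The final classification step might look like the sketch below. It encodes one plausible reading of the workflow, not a confirmed rule from the source: the fault‑strategy log tells us whether the injected exception actually fired, and the test assertions tell us whether the caller survived it. The names `Strength`, `RunResult`, and `classify` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Strength(Enum):
    STRONG = "strong"    # caller breaks when the dependency fails
    WEAK = "weak"        # caller still passes its tests
    UNKNOWN = "unknown"  # fault never fired, so the run proves nothing


@dataclass(frozen=True)
class RunResult:
    dependency: str
    fault_hit: bool          # taken from the fault-strategy log
    assertions_passed: bool  # taken from the automated test cases


def classify(result: RunResult) -> Strength:
    """Derive dependency strength from one fault-injection test run."""
    if not result.fault_hit:
        # Passing tests mean nothing if the fault was never exercised.
        return Strength.UNKNOWN
    return Strength.WEAK if result.assertions_passed else Strength.STRONG


if __name__ == "__main__":
    runs = [
        RunResult("pricing-service", fault_hit=True, assertions_passed=False),
        RunResult("recommend-service", fault_hit=True, assertions_passed=True),
        RunResult("audit-service", fault_hit=False, assertions_passed=True),
    ]
    for run in runs:
        print(run.dependency, classify(run).value)
```

Checking the fault log first guards against false "weak" labels: a test that passes only because the faulted code path was never reached should not count as evidence.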

Key challenges include Java agent compatibility (resolving namespace conflicts between jvm‑sandbox agents) and differing assertion requirements for fault‑driven tests versus regular regression tests.

Qunar contributed back to the ChaosBlade open‑source project with bug fixes and enhancements, and engaged with the community for collaborative development.

Future plans aim to automate random online chaos experiments, minimize blast radius using service‑dependency analysis, establish steady‑state assertions, and eventually conduct regular random chaos drills across all core service links to further strengthen system reliability.
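As a rough illustration of how dependency analysis could bound the blast radius of a random experiment, the sketch below counts the transitive upstream callers of each candidate service and only picks targets under a threshold. Everything here is an assumption for the sake of example: the graph shape, the function names, the threshold, and the 5% steady‑state tolerance are not from the source.

```python
import random
from collections import deque


def blast_radius(callers: dict[str, set[str]], service: str) -> set[str]:
    """Return every service that transitively calls `service`."""
    seen: set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for upstream in callers.get(current, ()):
            if upstream != service and upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen


def pick_target(callers, candidates, max_radius, rng):
    """Randomly pick a candidate whose blast radius stays within bounds."""
    eligible = sorted(
        s for s in candidates if len(blast_radius(callers, s)) <= max_radius
    )
    return rng.choice(eligible) if eligible else None


def steady_state_ok(baseline: float, observed: float, tolerance: float = 0.05) -> bool:
    """Steady-state assertion: metric stays within a relative tolerance."""
    return abs(observed - baseline) <= tolerance * baseline


if __name__ == "__main__":
    # callers[x] = services that call x directly (hypothetical graph)
    callers = {
        "pricing": {"order"},
        "order": {"gateway"},
        "cache": {"pricing", "search"},
        "search": {"gateway"},
    }
    target = pick_target(callers, ["pricing", "cache", "order"], 2, random.Random(7))
    print(target, blast_radius(callers, target))
```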

