How Fault Injection Transforms Cloud‑Native Ops: Lessons from Qunar’s Chaos Engineering Platform

This article details Qunar's journey building a fault‑injection platform—covering background, three evolution stages, shutdown and dependency drills, tooling choices, operational workflows, challenges, and future roadmaps—demonstrating how systematic chaos engineering improves reliability in cloud‑native environments.

dbaplus Community

Feb 22, 2022

Background

Large‑scale incidents (e.g., a 7‑hour Facebook outage) show how failures can cascade through complex service call graphs. To build confidence that its systems can withstand such faults, Qunar built a systematic fault‑injection platform that runs regular drills, exposes system weaknesses, and refines emergency response procedures.

Fault‑Injection Platform Evolution

The platform has progressed through three stages:

Stage 1 – Shutdown Drill: Resource‑centric drills that integrate internal IM, permission systems, and monitoring alerts.

Stage 2 – Strong/Weak Dependency Drill: Collection of service dependency data, tagging of strong vs. weak dependencies, and integration with online regression checks.

Stage 3 – Container‑Aware Injection: Extension to Kubernetes using Chaosblade‑operator and container‑level fault injection.

1. Shutdown Drill

The team compared Netflix ChAP, Alibaba Chaosblade, and PingCAP Chaos Mesh. Chaosblade was chosen because the environment consisted primarily of virtual machines running Java services, and Chaosblade supports both VM‑level and Kubernetes‑level fault injection.

Key Chaosblade components used:

Command‑line executor supporting Go, Java, C++, Linux and container fault injection.

chaosblade-box (control plane).

chaosblade-operator for Kubernetes.

Primary goals:

Normal traffic switchover.

Core service capacity assurance.

Weak‑dependency isolation.

Middleware/storage high availability.

The platform must be able to shut down up to 1,000 nodes in a single data center.

Design considerations before execution:

Aggregate data‑center information for service adaptation.

Automated group notifications.

Virtual‑machine shutdown and physical‑host process kill.

Alert integration and correlation.

Automatic service recovery after VM restart.

Shutdown workflow:

Virtual‑machine shutdown via Ops‑provided API with retry logic.

Physical‑host shutdown using Chaosblade Agent installed via SaltStack.

Post‑shutdown health checks and service restart.
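The retry logic around the VM shutdown call can be sketched as below. This is a minimal illustration, not Qunar's implementation: `shutdown_vm` is a hypothetical wrapper around the Ops‑provided API, and the attempt count and backoff values are assumptions.

```python
import time
from typing import Callable


def shutdown_with_retry(
    shutdown_vm: Callable[[str], bool],
    vm_id: str,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call shutdown_vm(vm_id) until it reports success or attempts run out.

    shutdown_vm is a hypothetical wrapper around the Ops API: it returns
    True on success and may raise on transport errors.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if shutdown_vm(vm_id):
                return True
        except Exception as exc:  # treat any API error as retryable
            print(f"attempt {attempt} for {vm_id} failed: {exc}")
        if attempt < max_attempts:
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A host that still fails after the last attempt returns `False`, so the caller can route it to a manual follow-up list.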

Key lessons:

Asynchronous tasks need sensible timeouts; manual fallback is required for long‑running shutdowns.

Chaosblade’s ~100 MB package may need bandwidth throttling in constrained networks.
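The first lesson, bounding asynchronous tasks with a timeout and handing stragglers to a human, might look like the following sketch. The `is_down` probe, the default timeout, and the poll interval are illustrative assumptions.

```python
import time
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"
    NEEDS_MANUAL = "needs_manual"


def wait_for_shutdown(
    is_down: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Outcome:
    """Poll is_down() until it returns True or the deadline passes.

    is_down is a hypothetical health probe for the target host. On timeout
    the task is not retried forever; it is flagged for manual handling.
    """
    deadline = clock() + timeout_s
    while True:
        if is_down():
            return Outcome.DONE
        if clock() >= deadline:
            return Outcome.NEEDS_MANUAL
        sleep(poll_interval_s)
```

Using a monotonic clock avoids surprises if the wall clock is adjusted while a drill is running.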

2. Strong/Weak Dependency Drill

Definitions:

Strong dependency: when the called service fails, the calling service's core flow fails with it.

Weak dependency: the called service can fail without breaking the caller's core flow.

Dependency data sources:

HTTP access logs.

Service registration information (Zookeeper).

Metadata is aggregated daily, stored in a DB, and presented for manual tagging. The platform provides a module to mark strong/weak links and expose them to other systems.
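As a rough sketch of that aggregation step, the snippet below merges caller/callee pairs from the two sources into one record per edge and leaves the strong/weak tag empty until someone sets it. The record fields and function names are assumptions made for illustration; the article does not describe the actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Edge = Tuple[str, str]  # (caller, callee)


@dataclass
class DependencyRecord:
    caller: str
    callee: str
    call_count: int                 # calls seen in access logs (0 if registry-only)
    in_registry: bool               # edge also present in service registration data
    strength: Optional[str] = None  # "strong" or "weak"; None until tagged manually


def aggregate(
    log_edges: Iterable[Edge], registry_edges: Iterable[Edge]
) -> Dict[Edge, DependencyRecord]:
    """Merge edges from both sources into one record per (caller, callee)."""
    counts = Counter(log_edges)
    registry = set(registry_edges)
    records: Dict[Edge, DependencyRecord] = {}
    for edge in sorted(set(counts) | registry):
        caller, callee = edge
        records[edge] = DependencyRecord(
            caller, callee, counts.get(edge, 0), edge in registry
        )
    return records


def tag(records: Dict[Edge, DependencyRecord], caller: str, callee: str, strength: str) -> None:
    """Record a manual strong/weak decision for one edge."""
    if strength not in ("strong", "weak"):
        raise ValueError("strength must be 'strong' or 'weak'")
    records[(caller, callee)].strength = strength
```

Keeping the source flags on each record makes it visible when an edge appears in only one source, which is a useful hint for whoever does the tagging.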

Fault‑injection orchestration steps:

Install Chaosblade and attach to Java processes.

Inject faults (e.g., latency, exceptions) at the client side.

Recover services and detach the agent.
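The attach, inject, and recover steps map onto Chaosblade's `blade` CLI roughly as sketched below. The helpers only build argument lists and do not run anything. The subcommands and flags reflect the Chaosblade CLI as commonly documented and should be checked against the installed version; the JSON output shape assumed by `parse_uid` is likewise an assumption, and installation of Chaosblade itself is left out.

```python
import json
from typing import List


def prepare_jvm(process: str) -> List[str]:
    """Attach the Java agent to the JVM whose process name matches."""
    return ["blade", "prepare", "jvm", "--process", process]


def create_jvm_delay(process: str, classname: str, methodname: str, delay_ms: int) -> List[str]:
    """Inject latency into one method of the attached JVM."""
    return [
        "blade", "create", "jvm", "delay",
        "--time", str(delay_ms),
        "--classname", classname,
        "--methodname", methodname,
        "--process", process,
    ]


def destroy(experiment_uid: str) -> List[str]:
    """Stop a running experiment and restore normal behaviour."""
    return ["blade", "destroy", experiment_uid]


def revoke(prepare_uid: str) -> List[str]:
    """Detach the Java agent that an earlier prepare step attached."""
    return ["blade", "revoke", prepare_uid]


def parse_uid(blade_output: str) -> str:
    """Extract the UID from blade's output.

    Assumes JSON such as {"code": 200, "success": true, "result": "<uid>"}.
    """
    payload = json.loads(blade_output)
    if not payload.get("success"):
        raise RuntimeError(f"blade reported failure: {payload.get('error')}")
    return payload["result"]
```

Each list could be passed to `subprocess.run(..., capture_output=True, text=True)` on the target host. The UID parsed from the prepare output feeds `revoke`, and the UID from the create output feeds `destroy`.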

Injection strategies:

Parallel: inject all faults simultaneously (useful for validating weak dependencies).

Serial: inject faults one host at a time (easier to troubleshoot).

Manual control: select specific machines and fault types.
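The difference between the parallel and serial strategies can be illustrated with a short sketch. Here `inject` stands in for whatever call performs the injection on one host; it is a hypothetical callable, and the stop‑on‑failure behaviour of the serial variant is an assumption rather than something the article states.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical: injects a fault on one host and returns True on success.
Inject = Callable[[str], bool]


def inject_parallel(hosts: List[str], inject: Inject, max_workers: int = 8) -> Dict[str, bool]:
    """Inject on all hosts concurrently and collect per-host results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(inject, hosts))  # map preserves input order
    return dict(zip(hosts, results))


def inject_serial(hosts: List[str], inject: Inject, stop_on_failure: bool = True) -> Dict[str, bool]:
    """Inject one host at a time, optionally stopping at the first failure."""
    results: Dict[str, bool] = {}
    for host in hosts:
        ok = inject(host)
        results[host] = ok
        if not ok and stop_on_failure:
            break  # remaining hosts stay untouched, which narrows the search
    return results
```

Manual control then amounts to calling either function with a hand-picked `hosts` list and a fault-specific `inject` callable.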

Challenges and mitigations:

Missing plugins for internal middleware – custom plugins were developed.

Incomplete delay support for async httpclient – contributed patches upstream.

Agent namespace conflicts with other JVM‑sandbox agents – resolved by using jvm-sandbox ≥ 3.0 with distinct namespaces.

CPU/Load spikes during Java‑agent attach – mitigated by temporarily diverting traffic.

3. Container‑Aware Fault Injection

To support Kubernetes, the Chaosblade‑operator was extended with agent install/uninstall and Java‑process attach capabilities. Existing SaltStack logic was reused to reduce migration effort.

Architecture (top‑down):

Portal: application management and profiling.

Dependency metadata service.

Fault‑injection orchestration engine.

Execution channels (Salt, Ops APIs, Chaosblade‑operator) that perform the actual fault injection via HTTP.
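One way to picture the bottom layer is a routing table from target type and action to execution channel. The sketch below is an assumption built from the stages described earlier (Ops API for VM shutdown, Salt for physical hosts and VM agents, Chaosblade‑operator for containers); the key names are illustrative and the real engine may route differently.

```python
from enum import Enum
from typing import Dict, Tuple


class Channel(Enum):
    OPS_API = "ops_api"
    SALT = "salt"
    CHAOSBLADE_OPERATOR = "chaosblade_operator"


# (target kind, action) -> channel. Kinds and actions are illustrative names.
ROUTES: Dict[Tuple[str, str], Channel] = {
    ("vm", "shutdown"): Channel.OPS_API,
    ("physical_host", "shutdown"): Channel.SALT,
    ("vm", "jvm_fault"): Channel.SALT,
    ("container", "jvm_fault"): Channel.CHAOSBLADE_OPERATOR,
}


def select_channel(target_kind: str, action: str) -> Channel:
    """Pick the execution channel for a drill step, or fail loudly."""
    try:
        return ROUTES[(target_kind, action)]
    except KeyError:
        raise ValueError(
            f"no execution channel for {action!r} on {target_kind!r}"
        ) from None
```

Keeping the routing in data rather than in branching code means a new channel can be added without touching the orchestration logic.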

Benefits and Future Roadmap

Fault‑injection drills uncovered risks such as improper dependency handling, timeout misconfigurations, and alert deficiencies, leading to measurable improvements in observability and incident response.

Planned enhancements:

Automated strong‑dependency detection by comparing baseline and fault‑scenario test results.

Online automated random drills that compute blast radius from tracing data.

Continued open‑source contributions to Chaosblade (two Qunar committers).

Improved user experience: richer observability, reduced preparation steps, and streamlined parameterization.
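The first two planned enhancements lend themselves to a small sketch. `classify_dependencies` compares baseline test results with results obtained while each dependency is faulted, and `blast_radius` walks caller edges derived from tracing data to find every service that could be affected. Both functions and their input shapes are assumptions for illustration; the article does not describe how Qunar intends to implement them.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def classify_dependencies(
    baseline: Dict[str, bool],
    under_fault: Dict[str, Dict[str, bool]],
) -> Dict[str, str]:
    """Label each dependency 'strong' or 'weak'.

    baseline maps test case -> passed. under_fault maps dependency name ->
    (test case -> passed while that dependency is faulted). A dependency is
    strong if any case that passes at baseline fails under its fault.
    A case missing from the fault run is treated as a failure.
    """
    healthy = {case for case, ok in baseline.items() if ok}
    verdicts: Dict[str, str] = {}
    for dep, results in under_fault.items():
        broken = {case for case in healthy if not results.get(case, False)}
        verdicts[dep] = "strong" if broken else "weak"
    return verdicts


def blast_radius(edges: Iterable[Tuple[str, str]], target: str) -> Set[str]:
    """Return all direct and transitive callers of target.

    edges are (caller, callee) pairs extracted from traces.
    """
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    affected: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller != target and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

Cases that already fail at baseline are ignored on purpose, so a flaky test does not cause a dependency to be mislabelled as strong.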

Selected Q&A (Technical Highlights)

How to obtain strong/weak dependency graphs? Combine log‑based analysis with service‑registry data; manual tagging is still required.

Transition from offline to production chaos experiments? Start in test environments, move to simulation, then to production with progressive traffic exposure.

Precise full‑stack fault injection? Use Chaosblade’s Java Agent with tracing identifiers to match traffic and inject faults at exact points.

Detecting steady‑state changes after injection? Rely on monitoring, alerting, and anomaly‑detection platforms; a triggered alert indicates deviation.

Starting point for Kubernetes services? Begin with stateless services for simpler orchestration and quicker feedback.

Difference between monitoring and observability? Monitoring is a subset (metrics, logs); observability also includes tracing and holistic system insight.

Developer deployment of chaos experiments? Use Chaosblade’s lightweight agent capabilities for automation; full‑scale chaos requires organization‑wide support.
