
How Chaos Engineering Strengthens System Resilience: Building a Fault‑Injection Platform

This article explains why modern agile and DevOps environments need chaos engineering, describes the design and goals of a fault‑injection platform, outlines tool selection, details a five‑step exercise workflow, and shares a real‑world case study that demonstrates improved stability and SRE capabilities.

Alibaba Cloud Native

Aug 27, 2021

Background

The rapid expansion of microservice-based systems, agile development, and cloud-native architectures increases delivery speed but also makes service governance far more complex. Traditional disaster-recovery practice cannot keep up, so failures must be treated as a normal condition and exercised proactively.

Chaos engineering, which originated with Netflix’s Chaos Monkey, injects controlled faults into a distributed system to expose hidden weaknesses before they cause production outages.

Tool Selection

After evaluating several open‑source chaos frameworks, the team selected ChaosBlade because it provides a rich set of fault types (process, network, CPU, disk, JVM, container, Kubernetes, etc.), has an active community, and supports both Linux and container environments. ChaosBlade is used as the core injection engine.

Platform Objectives and Functional Goals

Provide automated, visual, orchestrated fault injection without modifying application code.

Serve as a unified entry point for high‑availability drills.

Collect reusable test cases and evaluate stability with quantitative metrics.

Support fault scenarios for JVM, C++, container, and Kubernetes workloads.

Manage the full fault lifecycle (inject → monitor → recover) with controllable blast radius.

Allow easy extension of new fault types.
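The lifecycle and blast-radius goals above can be illustrated with a minimal Python sketch. It is not the platform's implementation: `check_blast_radius`, `injected_fault`, and the `inject`/`recover` callables are hypothetical names introduced here, and the blast radius is assumed to be a simple fraction of the fleet.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Sequence


class BlastRadiusExceeded(Exception):
    """The experiment would touch a larger share of the fleet than allowed."""


def check_blast_radius(targets: Sequence[str], fleet_size: int,
                       max_fraction: float) -> None:
    """Refuse to start if the targeted share of the fleet exceeds the limit."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    share = len(targets) / fleet_size
    if share > max_fraction:
        raise BlastRadiusExceeded(
            f"{len(targets)}/{fleet_size} hosts exceeds limit {max_fraction:.0%}")


@contextmanager
def injected_fault(targets: Sequence[str],
                   inject: Callable[[str], None],
                   recover: Callable[[str], None]) -> Iterator[List[str]]:
    """Inject on entry and always recover on exit, even if monitoring raises."""
    done: List[str] = []
    try:
        for host in targets:
            inject(host)
            done.append(host)
        yield done
    finally:
        # Recover in reverse order, and only the hosts that were actually hit.
        for host in reversed(done):
            recover(host)


if __name__ == "__main__":
    events: List[str] = []
    check_blast_radius(["broker-1"], fleet_size=4, max_fraction=0.25)
    with injected_fault(["broker-1"],
                        inject=lambda h: events.append(f"inject {h}"),
                        recover=lambda h: events.append(f"recover {h}")):
        events.append("monitor")  # stand-in for the monitoring phase
    print(events)
```

Putting recovery in a `finally` block means a failed check or a crashed monitor cannot leave a fault active on a target.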

Architecture Overview

The platform consists of four layers:

Orchestration Layer : Uses the internal HATT (High‑Availability Test Tool) to compose and schedule experiment tasks.

Injection Engine : ChaosBlade executes the actual fault actions on target nodes.

Monitoring & Alerting Layer : Collects system metrics, logs, and alerts in real time during experiments.

Reporting Layer : Generates post‑mortem reports and feeds remediation tickets.
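One way to picture how the four layers interact is the sketch below. The interfaces (`InjectionEngine`, `Monitor`, `Report`, `run_task`) are assumptions made for illustration and do not reflect HATT's or ChaosBlade's real APIs. The fake engine and monitor only simulate a TPS drop.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Set


class InjectionEngine(Protocol):
    """Injection layer: starts a fault and returns a handle for recovery."""
    def inject(self, target: str, fault: str) -> str: ...
    def recover(self, handle: str) -> None: ...


class Monitor(Protocol):
    """Monitoring layer: returns the current metrics for a target."""
    def sample(self, target: str) -> Dict[str, float]: ...


@dataclass
class Report:
    """Reporting layer: the raw material for a post-mortem report."""
    target: str
    fault: str
    before: Dict[str, float]
    during: Dict[str, float]
    after: Dict[str, float]


def run_task(target: str, fault: str,
             engine: InjectionEngine, monitor: Monitor) -> Report:
    """Orchestration layer: sequence inject, observe, recover, and report."""
    before = monitor.sample(target)
    handle = engine.inject(target, fault)
    try:
        during = monitor.sample(target)
    finally:
        engine.recover(handle)
    after = monitor.sample(target)
    return Report(target, fault, before, during, after)


class FakeEngine:
    """In-memory stand-in for the real injection engine."""
    def __init__(self) -> None:
        self.active: Set[str] = set()

    def inject(self, target: str, fault: str) -> str:
        handle = f"{target}:{fault}"
        self.active.add(handle)
        return handle

    def recover(self, handle: str) -> None:
        self.active.discard(handle)


class FakeMonitor:
    """Reports a lower TPS while any fault is active on the target."""
    def __init__(self, engine: FakeEngine) -> None:
        self.engine = engine

    def sample(self, target: str) -> Dict[str, float]:
        degraded = any(h.startswith(target + ":") for h in self.engine.active)
        return {"tps": 200.0 if degraded else 1000.0}


if __name__ == "__main__":
    engine = FakeEngine()
    print(run_task("broker-1", "cpu-load", engine, FakeMonitor(engine)))
```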

Five‑Step Chaos Experiment Process

Scope Definition : Identify target services, define steady‑state metrics (e.g., latency, TPS), set success criteria and blast‑radius limits.

Task Composition : Use HATT to create a declarative experiment definition (YAML/JSON) that describes which ChaosBlade commands to run, their parameters, and sequencing.

Execution : Launch the experiment; ChaosBlade injects faults while the monitoring layer records metric deviations and alerts.

Result Collection : Aggregate logs, metric traces, and alert histories; automatically generate a detailed post‑mortem report.

Remediation : Analyze the report, implement code or configuration changes (e.g., circuit breakers, routing updates), and track the fixes in the platform’s case repository.
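The first two steps, scope definition and task composition, can be sketched as a declarative definition plus a steady-state check. HATT's actual job format is internal and not shown in this article, so every field name below (`steady_state`, `steps`, `fault`, `params`, and so on) is a hypothetical stand-in.

```python
import json
from typing import Any, Dict, List

# Hypothetical experiment definition; the schema is an assumption.
EXPERIMENT_JSON = """
{
  "name": "broker-cpu-pressure",
  "targets": ["broker-1"],
  "steady_state": {
    "p99_latency_ms": {"max": 200},
    "tps": {"min": 800}
  },
  "steps": [
    {"action": "inject", "fault": "cpu fullload",
     "params": {"cpu-percent": 80, "timeout": 60}},
    {"action": "observe", "seconds": 60}
  ]
}
"""

REQUIRED_FIELDS = {"name", "targets", "steady_state", "steps"}


def load_experiment(text: str) -> Dict[str, Any]:
    """Parse a definition and reject it if a required field is missing."""
    experiment = json.loads(text)
    missing = REQUIRED_FIELDS.difference(experiment)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return experiment


def steady_state_violations(steady_state: Dict[str, Dict[str, float]],
                            observed: Dict[str, float]) -> List[str]:
    """Compare observed metrics with the declared min/max bounds."""
    violations: List[str] = []
    for metric, bounds in steady_state.items():
        value = observed.get(metric)
        if value is None:
            violations.append(f"{metric}: no data")
            continue
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{metric}: {value} < min {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{metric}: {value} > max {bounds['max']}")
    return violations


if __name__ == "__main__":
    experiment = load_experiment(EXPERIMENT_JSON)
    observed = {"p99_latency_ms": 150, "tps": 0}
    print(steady_state_violations(experiment["steady_state"], observed))
```

An empty list means the success criteria held during the fault. Any entry is a finding for the post-mortem report.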

Detailed Experiment Walkthrough

Plan Confirmation : Document the experiment schedule, stakeholders, steady‑state thresholds, and execution order.

Case Composition : Write a HATT job file that invokes specific ChaosBlade commands (e.g., blade create cpu fullload --cpu-percent 80 --timeout 60) and defines rollback actions.

Execution & Monitoring : Run the HATT job; the platform streams metrics (CPU, memory, network latency, TPS) to a dashboard and captures any generated alerts.

Post‑Execution Review : The platform auto‑produces a report containing before/after metric graphs, fault impact analysis, and a list of observed anomalies.

Stability Improvement : Based on the report, developers add resilience patterns (e.g., retries, circuit breakers) and close the remediation ticket.
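As a rough illustration of the case-composition step, the sketch below builds the argument list for an injection and for its rollback. It does not execute anything. The `fullload` action, the `--timeout` flag, `blade destroy <uid>`, and the JSON reply shape are based on ChaosBlade's public CLI as generally documented, but should be checked against the installed version's built-in help. The helper names are hypothetical.

```python
import json
import shlex
from typing import Dict, List


def build_create_command(target: str, action: str,
                         flags: Dict[str, object]) -> List[str]:
    """Build the argv for `blade create <target> <action> --flag value ...`."""
    argv = ["blade", "create", target, action]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


def parse_uid(stdout: str) -> str:
    """Extract the experiment UID from the CLI's JSON reply.

    Assumes a reply shaped like {"code": 200, "success": true, "result": "<uid>"}.
    """
    reply = json.loads(stdout)
    if not reply.get("success"):
        raise RuntimeError(f"injection failed: {reply}")
    return str(reply["result"])


def build_destroy_command(uid: str) -> List[str]:
    """Build the rollback argv that ends the experiment with this UID."""
    return ["blade", "destroy", uid]


if __name__ == "__main__":
    create = build_create_command("cpu", "fullload",
                                  {"cpu-percent": 80, "timeout": 60})
    print(shlex.join(create))
    # Illustrative reply; a real UID comes from running the command.
    uid = parse_uid('{"code": 200, "success": true, "result": "example-uid"}')
    print(shlex.join(build_destroy_command(uid)))
```

Keeping the UID returned at injection time is what makes an explicit rollback possible, in addition to the timeout that ends the fault on its own.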

Typical Case: Message‑Queue Broker Hang

Scenario: Simulate a single broker node in a distributed message‑queue cluster becoming unresponsive (hang). Expected behavior is that other brokers continue processing, the faulty broker is excluded from routing, and overall TPS experiences only a brief dip before recovering.

Observed result: TPS dropped to zero, indicating that client requests were still being directed to the hung broker.
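The gap between the expected "brief dip" and the observed drop to zero can be made measurable. The sketch below is an assumption about how such a check might look, not the platform's actual evaluation logic. The function name, the 10% tolerance, and the sample traces are all illustrative.

```python
from typing import Optional, Sequence


def degraded_window(tps: Sequence[float], baseline: float,
                    tolerance: float = 0.1) -> Optional[int]:
    """Length, in samples, of the window during which TPS was degraded.

    Returns 0 if TPS never fell below baseline * (1 - tolerance), and
    None if TPS was still degraded at the last sample (no recovery seen).
    """
    floor = baseline * (1.0 - tolerance)
    degraded = [i for i, value in enumerate(tps) if value < floor]
    if not degraded:
        return 0
    if degraded[-1] == len(tps) - 1:
        return None
    return degraded[-1] - degraded[0] + 1


if __name__ == "__main__":
    expected = [1000, 1000, 600, 950, 1000, 1000]  # brief dip, then recovery
    observed = [1000, 1000, 0, 0, 0, 0]            # broker hang, no recovery
    print(degraded_window(expected, baseline=1000))
    print(degraded_window(observed, baseline=1000))
```

A success criterion could then be phrased as "the degraded window is at most N samples", which the observed trace fails because it never recovers.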

Remediation implemented:

Added a client‑side circuit breaker that stops retrying after a configurable failure threshold, preventing endless blocking calls.

Enhanced the nameserver (routing service) to push broker failure status to clients in real time, allowing immediate failover.
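A client-side circuit breaker of the kind described in the first remediation item might look like the sketch below. It is a generic illustration, not the code that was shipped. The class name, the default threshold, and the timed half-open retry (`reset_after`) are assumptions added to make the example complete.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling a broker that is currently marked as failed."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: fail fast so the caller can pick another broker.
                raise CircuitOpenError("broker marked unavailable")
            # Half-open: the wait has elapsed, so allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    def hung_broker() -> None:
        raise TimeoutError("broker did not respond")

    breaker = CircuitBreaker(failure_threshold=2)
    for attempt in range(3):
        try:
            breaker.call(hung_broker)
        except (TimeoutError, CircuitOpenError) as exc:
            print(attempt, type(exc).__name__)
```

Once the threshold is reached, further sends fail immediately instead of blocking on the hung broker. That gives the client a chance to route to a healthy one.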

Scale of Practice at HaoJing Technology

Since 2019, the internal “IT Blue‑Team” has integrated the platform into more than 30 product lines. Over 200 distinct fault‑injection scenarios are executed on a monthly or quarterly cadence, forming a continuous PDCA (Plan‑Do‑Check‑Act) loop for SRE teams. The platform supports both pre‑production and production environments, and contributions have been made back to the ChaosBlade project to add richer fault types and more flexible injection APIs.

