Boost Microservice Resilience with ChaosBlade and SkyWalking: A Hands‑On Guide
This article explains how to use ChaosBlade for fault injection and SkyWalking for monitoring to improve the high availability of distributed microservice systems. It covers tool installation, experiment design, step‑by‑step execution, and real‑world case studies with detailed commands and metrics.
Chaos engineering injects controlled faults into distributed systems to reveal weaknesses and improve reliability. This guide demonstrates using the open‑source tools ChaosBlade (fault injection) and Apache SkyWalking (observability) in a microservice demo.
Tool Overview
ChaosBlade provides a unified blade CLI for injecting faults such as CPU load, memory pressure, network loss, disk I/O, process termination, Java/C++ method delays, Docker container actions, and Kubernetes node disruptions. It is lightweight, non‑intrusive, and extensible.
SkyWalking is an APM system that offers distributed tracing, metrics, service topology, root‑cause analysis, and alerting for cloud‑native architectures.
Installation
## Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
## Extract
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Add to PATH
export PATH="$PATH:$(pwd)/chaosblade-0.9.0"
## Verify
blade -h
Use blade -h to list commands, and append -h to any sub‑command (e.g., blade create cpu fullload -h) to see its flags and examples.
Chaos Experiment Workflow
Define a chaos experiment plan.
Specify steady‑state metrics (e.g., average response time, P99 latency) in SkyWalking.
Formulate fault‑tolerance hypotheses (e.g., timeout settings, circuit‑breaker policies).
Execute the experiment with ChaosBlade.
Validate metrics after fault injection.
Record results, restore the system, and fix identified issues.
Automate continuous verification.
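The execute and restore steps above can be scripted so that every injected fault is tracked and later reverted. The following is a minimal sketch, assuming that blade create prints a compact JSON response whose result field holds the experiment UID and that blade destroy <uid> reverts the experiment. The helper names extract_uid and run_experiment are hypothetical and not part of ChaosBlade.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of ChaosBlade.

# Pull the experiment UID out of a blade JSON response such as
# {"code":200,"success":true,"result":"<uid>"} (response shape assumed,
# with no whitespace around the colon).
extract_uid() {
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Run one experiment: inject the fault, hold it, then destroy it.
# Usage: run_experiment <hold-seconds> <blade create args...>
run_experiment() {
  local hold="$1"; shift
  local out uid
  out="$(blade create "$@")" || { echo "create failed: $out" >&2; return 1; }
  uid="$(extract_uid "$out")"
  [ -n "$uid" ] || { echo "no UID in response: $out" >&2; return 1; }
  echo "experiment $uid started"
  sleep "$hold"            # observe the metrics in SkyWalking during this window
  blade destroy "$uid"     # restore the system
}

# Example (not executed here):
# run_experiment 60 cpu fullload
```

Keeping the UID around is the important part: it is what makes the restore step repeatable when several experiments run back to back.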
Case Study 1 – Dubbo Cart Service Delay
The microservice demo includes frontend, cart, product, order, and other services, built with Spring Boot, Nacos, MySQL, Redis, Lettuce, and Dubbo.
Generate load:
ab -n 10000 -c 2 http://127.0.0.1:8083/cart
Steady‑state: average RT ≈ 15 ms, P99 ≤ 20 ms (observed in SkyWalking).
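SkyWalking is the source of the steady-state numbers here, but the same thresholds can be cross-checked from the client side. The sketch below assumes the usual ab report layout (a "Time per request: ... [ms] (mean)" line and a percentile table with a 99% row). The function names are hypothetical, and client-side timings will not match SkyWalking's server-side figures exactly.

```shell
# Mean request time in ms: value on the first "Time per request" line.
ab_mean_ms() {
  awk '/^Time per request:/ { print $4; exit }'
}

# 99th percentile in ms from the percentile table of the report.
ab_p99_ms() {
  awk '$1 == "99%" { print $2; exit }'
}

# Succeeds when value <= limit (numeric compare done in awk).
within_limit() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 <= l + 0) }'
}

# Example (not executed here):
# report="$(ab -n 10000 -c 2 http://127.0.0.1:8083/cart)"
# echo "mean RT: $(printf '%s\n' "$report" | ab_mean_ms) ms"
# within_limit "$(printf '%s\n' "$report" | ab_p99_ms)" 20 || echo "P99 above steady state"
```

Running the same check before and during fault injection gives a quick pass/fail signal alongside the SkyWalking dashboards.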
Hypothesis: a 2 s client timeout and a circuit‑breaker should prevent long‑lasting blocks.
Inject a 30 s delay into Dubbo method viewCart:
blade create dubbo delay --time 30000 \
--service com.alibabacloud.hipstershop.cartserviceapi.service.CartService \
--methodname viewCart --process frontend --consumer
SkyWalking shows RT spikes to ~2000 ms, P99 rises similarly, and the /cart endpoint returns timeout errors.
Conclusion: the timeout works, but no circuit breaker is configured, which violates the hypothesis.
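The timeout part of the hypothesis can also be probed from outside while the delay is active. This is a sketch under stated assumptions: the function name classify_probe is hypothetical, and the 2.5 s default budget is an illustrative margin over the 2 s client timeout, not a value from the experiment.

```shell
# Classify one probe result against the 2 s timeout hypothesis.
# Args: <http_code> <time_total_seconds> [budget_seconds, default 2.5]
classify_probe() {
  local code="$1" elapsed="$2" budget="${3:-2.5}"
  if awk -v e="$elapsed" -v b="$budget" 'BEGIN { exit !(e + 0 > b + 0) }'; then
    echo "blocked"        # caller waited past the timeout budget
  elif [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "ok"             # normal response within budget
  else
    echo "failed-fast"    # error returned within budget: the timeout fired
  fi
}

# Example (not executed here); --max-time bounds the probe itself:
# read -r code elapsed < <(curl -s -o /dev/null \
#   -w '%{http_code} %{time_total}\n' --max-time 35 http://127.0.0.1:8083/cart)
# classify_probe "$code" "$elapsed"
```

With only a timeout in place, every probe would be expected to take about 2 s before failing. If a circuit breaker were added, repeated probes should start failing well before the timeout once the breaker opens.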
Case Study 2 – Network Loss on Nacos Registry
Simulate a registration‑center failure by injecting 100 % packet loss on Nacos port 8848:
blade create network loss --interface eth0 --percent 100 --local-port 8848
Metrics show the cart service remains functional because it caches data locally and has only a weak dependency on the registry, confirming the hypothesis.
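Network faults are easy to leave behind by accident, so it can help to bound them in time. The sketch below assumes the --timeout flag of blade create (duration in seconds, after which the experiment is destroyed automatically). The wrapper name inject_port_loss and the BLADE override variable are hypothetical and exist only to allow a dry run.

```shell
BLADE="${BLADE:-blade}"   # set BLADE=echo to print the command instead of running it

# Drop all packets on a local port for a bounded time.
# Args: <interface> <port> <seconds>
inject_port_loss() {
  "$BLADE" create network loss --interface "$1" --percent 100 \
    --local-port "$2" --timeout "$3"
}

# Example (not executed here): 100 % loss on the Nacos port for two minutes.
# inject_port_loss eth0 8848 120
```

A dry run with BLADE=echo shows the exact command line before anything is injected on the registry host.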
Mini‑Case – MySQL Slow‑SQL Injection
Delay SELECT statements on MySQL to test slow‑SQL alerts:
blade create mysql delay --time 10000 --sqltype select --port 3306
This adds a 10 s delay to SELECT queries on port 3306, allowing verification of alerting behavior.
Repository
ChaosBlade source code: https://github.com/chaosblade-io/chaosblade
Alibaba Cloud Native
