
Injecting Real-World Failures into Ali Kubernetes with Open‑Source Monkey Tools

This article explains how chaos engineering principles are applied to Ali Kubernetes by reviewing open‑source Kubernetes monkey tools, analyzing complex failure scenarios, and presenting a custom fault‑injection suite built on the internal MonkeyKing platform to enable flexible, scenario‑driven chaos experiments.

Alibaba Cloud Native

Chaos Engineering vs. Traditional Testing

Traditional testing validates a system against fixed assertions. Chaos engineering deliberately injects uncertainty—such as resource pressure, latency, or security faults—to verify that the system continues to serve users without noticeable impact.

Open‑Source Kubernetes Chaos Tools

kube‑monkey – https://github.com/asobti/kube-monkey – Runs as a dedicated pod that terminates victim pods which have opted in through labels. It currently supports only pod termination, and its random kill schedule is configurable.

powerfulseal – https://github.com/bloomberg/powerfulseal – Can kill pods and stop or start nodes. Its interactive, autonomous, label, and demo modes define the scope and frequency of the faults.

Chaos Toolkit‑kubernetes – https://github.com/chaostoolkit/chaostoolkit-kubernetes – Part of the Chaos Toolkit suite; runs experiments defined in JSON to randomly delete pods based on namespace or regex matching.
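As an illustration of the JSON experiment format mentioned for Chaos Toolkit, the sketch below builds a minimal experiment in Python and serializes it. The field names and the terminate_pods action reflect our understanding of the Chaos Toolkit experiment format and the chaosk8s extension and should be checked against the installed versions; the label selector and namespace are hypothetical.

```python
import json

# Minimal experiment: terminate one random pod matching a label selector.
# "app=demo" and "default" are hypothetical placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Service survives the loss of one pod",
    "description": "Terminate one random pod selected by label.",
    "method": [
        {
            "type": "action",
            "name": "terminate-random-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "app=demo",
                    "ns": "default",
                    "rand": True,
                },
            },
        }
    ],
}


def render_experiment(exp: dict) -> str:
    """Serialize the experiment so it can be saved as a .json file."""
    return json.dumps(exp, indent=2)


if __name__ == "__main__":
    print(render_experiment(experiment))
```

The rendered file would typically be passed to the chaos command-line runner; a steady-state hypothesis and rollbacks are omitted here for brevity.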

Fault‑Scenario Analysis for Ali Kubernetes

Ali Kubernetes manages large‑scale clusters and must handle not only typical pod deletions but also OS, kernel, network, and misconfiguration disasters. A Failure Mode and Effects Analysis (FMEA) identified three major fault categories:

General faults – network hangs, latency, host reboots, high load.

Ali Kubernetes business faults – pod deletions/patches, pod migration, mixed‑node deployments, etcd issues.

Chaos faults – random fault injection that can combine any of the above.

Existing open‑source tools mainly support simple pod‑kill scenarios. They cannot reproduce more complex cases, such as pods being deleted because of a misconfigured kubelet, or a network outage that hits the master components.

Custom Fault‑Injection Suite Built on MonkeyKing

Generic Fault Programs

Small programs that trigger OS‑level failures, e.g., cutting network connectivity on a host or rebooting a host.
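The article does not show how these generic fault programs are implemented. The following is a minimal sketch, assuming the network cut is done with iptables DROP rules and that every fault carries a matching recovery command. The HostFault type and the run_commands helper are hypothetical names, and the runner defaults to a dry run because dropping all traffic also severs remote shells to the host.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostFault:
    """An OS-level fault described by its inject and recover commands."""
    name: str
    inject: List[List[str]]
    recover: List[List[str]]


# Assumption: the host uses iptables; rules are inserted on inject
# and deleted again on recover.
NETWORK_CUT = HostFault(
    name="network-cut",
    inject=[
        ["iptables", "-I", "INPUT", "-j", "DROP"],
        ["iptables", "-I", "OUTPUT", "-j", "DROP"],
    ],
    recover=[
        ["iptables", "-D", "INPUT", "-j", "DROP"],
        ["iptables", "-D", "OUTPUT", "-j", "DROP"],
    ],
)


def run_commands(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Run each command in order; with dry_run=True only report them."""
    executed = []
    for cmd in commands:
        executed.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return executed


if __name__ == "__main__":
    for line in run_commands(NETWORK_CUT.inject):
        print("would run:", line)
```

In practice the recovery step has to be triggered from the host itself (for example by a timer), since the fault removes the network path an operator would otherwise use.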

Kubernetes Suite Programs

Wrapper programs around kubectl that accept parameters for label, namespace, and operation. The suite bundles the kubectl binary, downloads cluster certificates (with MD5 verification), and validates them before execution. Supported operations: apply, create, delete, patch, get, with optional -o json output.
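A wrapper of this kind might look roughly like the sketch below. It is not the MonkeyKing implementation: the function names are hypothetical, and only the standard kubectl flags -n, -l, -f, and -o json are used. The MD5 helper mirrors the certificate check described above.

```python
import hashlib
from typing import List, Optional

# The operations the article lists as supported by the suite.
SUPPORTED_OPERATIONS = {"apply", "create", "delete", "patch", "get"}


def md5_matches(path: str, expected_md5: str) -> bool:
    """Return True if the file at `path` has the expected MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


def build_kubectl_command(
    operation: str,
    namespace: str,
    resource: Optional[str] = None,
    label: Optional[str] = None,
    filename: Optional[str] = None,
    json_output: bool = False,
) -> List[str]:
    """Translate suite parameters into a kubectl argument list."""
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    cmd = ["kubectl", operation]
    if resource:
        cmd.append(resource)
    if filename:
        cmd += ["-f", filename]
    cmd += ["-n", namespace]
    if label:
        cmd += ["-l", label]
    if json_output:
        cmd += ["-o", "json"]
    return cmd


if __name__ == "__main__":
    print(build_kubectl_command("get", "default", resource="pods",
                                label="app=demo", json_output=True))
```

Returning an argument list rather than a shell string avoids quoting problems when the result is handed to a process runner.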

Open‑Source Tool Integration

Integration of kube‑monkey via the Kubernetes suite follows these steps:

Lock the test environment to prevent interference.

Deploy km-config.yaml using the suite’s apply command to create the kube‑monkey deployment.

Label victim pods with a specific label using the suite’s patch command.

Validate that the label was applied correctly.

Start kube‑monkey, which randomly terminates labeled pods.

After the experiment, remove the label, delete the kube‑monkey deployment, and unlock the environment.
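The steps above can be sketched as a single function whose cleanup always runs, even when an earlier step fails. This is an illustrative outline, not the actual suite: kubectl, lock, unlock, and observe are hypothetical callables supplied by the caller, the victim is assumed to be a deployment, and only the kube-monkey/enabled label is shown although kube‑monkey expects further opt‑in labels (see its README). Deploying km-config.yaml is assumed to start kube‑monkey, so observe stands for the waiting period.

```python
import json
from typing import Callable, List

LABEL_KEY = "kube-monkey/enabled"
LABEL_VALUE = "enabled"


def run_kube_monkey_experiment(
    kubectl: Callable[[List[str]], str],  # runs kubectl args, returns stdout
    lock: Callable[[], None],
    unlock: Callable[[], None],
    observe: Callable[[], None],          # wait while kube-monkey kills pods
    victim: str,
    namespace: str,
) -> None:
    target = ["deployment", victim, "-n", namespace]
    lock()
    try:
        # Deploy kube-monkey, then opt the victim in via a label patch.
        kubectl(["apply", "-f", "km-config.yaml"])
        add = json.dumps({"metadata": {"labels": {LABEL_KEY: LABEL_VALUE}}})
        kubectl(["patch"] + target + ["-p", add])

        # Validate that the label is present before the experiment runs.
        live = json.loads(kubectl(["get"] + target + ["-o", "json"]))
        labels = live.get("metadata", {}).get("labels") or {}
        if labels.get(LABEL_KEY) != LABEL_VALUE:
            raise RuntimeError("opt-in label missing on victim")

        observe()
    finally:
        try:
            # A null value in a merge patch removes the label.
            remove = json.dumps({"metadata": {"labels": {LABEL_KEY: None}}})
            kubectl(["patch"] + target + ["-p", remove])
            kubectl(["delete", "-f", "km-config.yaml"])
        finally:
            unlock()
```

The nested finally keeps the environment from staying locked if label removal or deployment deletion itself fails.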

Business‑Specific Programs

Additional small programs address Ali‑specific needs:

Environment lock/unlock to ensure exclusive use during a chaos run.

Service availability checks using curl and evaluating HTTP response codes.

VIP validation programs that query a VIP service to verify the number and health of IPs attached to a virtual IP.
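The availability and VIP checks could be approximated as follows. The article's programs call curl; this sketch uses Python's standard urllib instead, and the VIP check assumes a hypothetical mapping of backend IP to health flag as the response of the internal VIP service, whose real API is not described in the source.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except OSError:
        # Covers URLError, connection errors, and timeouts.
        return None


def is_available(status: Optional[int]) -> bool:
    """Treat 2xx and 3xx responses as 'service available'."""
    return status is not None and 200 <= status < 400


def vip_is_healthy(backends: Dict[str, bool], expected_count: int) -> bool:
    """Check that the VIP has the expected number of IPs, all healthy."""
    return len(backends) == expected_count and all(backends.values())
```

A probe then reduces to is_available(fetch_status(url)), which gives the same pass or fail signal as inspecting the response code returned by curl.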

Execution Example: Master Component Network Outage

A multi‑layer fault is injected by combining a generic fault program that disables network on a master host with Kubernetes suite commands that query service health. The workflow demonstrates how the suite can simulate complex failures that open‑source tools alone cannot achieve.
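One way to picture this composition is a small runner that injects the host fault, probes service health while the fault is active, recovers, and probes again. The function and parameter names below are hypothetical, and the inject, recover, and probe callables stand in for the generic fault program and the suite's health queries.

```python
import time
from typing import Callable, List, Tuple


def run_outage_scenario(
    inject: Callable[[], None],
    recover: Callable[[], None],
    probe: Callable[[], bool],
    probes_during: int = 3,
    probes_after: int = 3,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, bool]]:
    """Return a timeline of (phase, healthy) observations."""
    timeline: List[Tuple[str, bool]] = []
    inject()
    try:
        for _ in range(probes_during):
            timeline.append(("during", probe()))
            sleep(interval)
    finally:
        # Always restore the network, even if a probe raises.
        recover()
    for _ in range(probes_after):
        timeline.append(("after", probe()))
        sleep(interval)
    return timeline
```

A run would be judged by the timeline: the service should stay healthy during the outage if the masters are redundant, and must be healthy again after recovery.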

Outcome

By treating failures as modular scenarios and composing OS‑level, Kubernetes‑level, and open‑source monkey tools within MonkeyKing, numerous stability issues were uncovered in Ali Kubernetes, leading to concrete improvements in system robustness. Future work will expand the fault library with additional business‑specific scenarios to further strengthen high‑availability guarantees.
