
How Alibaba’s ‘MonkeyKing’ Uses Chaos Engineering to Strengthen System Reliability

Alibaba’s MonkeyKing, inspired by Netflix’s Chaos Monkey, employs intentional fault injection—from random node kills to simulated network outages—to test and improve system robustness across IaaS, PaaS, and SaaS layers, offering a comprehensive model for reliability engineering in complex distributed environments.

Java Backend Technology

Mar 2, 2019


1. What is the monkey used for?

From a developer's perspective, there are three ways to improve system stability. The first is to strengthen robustness with technologies such as containers, scheduling, microservices, messaging, software load balancing, and configuration centers. The second is to broaden and deepen monitoring so that issues can be located quickly. The third is to run fault-injection drills that simulate failures such as random node termination, delayed responses, or even a data-center outage.

The "monkey" exists to cause these disruptions on purpose, acting as a sparring partner in fault drills.
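As a rough illustration of the third approach, the sketch below wraps an ordinary function so that a drill can add latency or fail a share of calls. Every name in it (`inject_faults`, `InjectedFault`, `query_inventory`) is hypothetical and chosen for this example; it does not reflect MonkeyKing's actual implementation.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised in place of a real failure during a drill (hypothetical name)."""


def inject_faults(delay_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so a drill can delay it or fail a fraction of calls.

    delay_s    -- seconds of latency added before every call
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng, sleep -- injectable for deterministic tests
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if delay_s > 0:
                sleep(delay_s)  # simulate a delayed response
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Example: roughly 10% of calls fail and every call is 200 ms slower.
@inject_faults(delay_s=0.2, error_rate=0.1)
def query_inventory(item_id):
    return {"item_id": item_id, "stock": 42}
```

Keeping the fault logic in a wrapper means the business function stays untouched, and the drill can be switched off by removing the decorator or setting both parameters to zero.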

2. Where did the monkey originate?

Netflix suffered a major outage in August 2008, when corruption in its primary database halted DVD shipments for three days. This prompted engineers to migrate from a monolithic architecture to a distributed cloud architecture, which introduced many new complexities.

The key lesson was: "The best way to avoid failure is to fail constantly." In 2010, Netflix's Eng Tools team created Chaos Monkey to randomly kill instances, throttle requests, or shut down an entire data center, thereby validating business continuity and recovery capabilities.
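The core idea of random instance termination can be sketched in a few lines. This is a simplified illustration, not Chaos Monkey's code: the function names, the `protected` list, and the `terminate` callback are assumptions made for the example, and a real setup would call a cloud provider's API inside that callback.

```python
import random


def pick_victim(instances, protected=(), rng=None):
    """Return one instance ID to terminate, or None if nothing is eligible."""
    rng = rng or random.Random()
    candidates = sorted(set(instances) - set(protected))
    if not candidates:
        return None
    return rng.choice(candidates)


def run_drill(groups, terminate, probability=1.0, protected=(), rng=None):
    """Terminate at most one randomly chosen instance per group.

    groups      -- dict mapping group name to a list of instance IDs
    terminate   -- caller-supplied callback that actually stops an instance
    probability -- chance that a given group is hit in this round
    """
    rng = rng or random.Random()
    killed = {}
    for name, instances in groups.items():
        if rng.random() >= probability:
            continue  # this group is spared in this round
        victim = pick_victim(instances, protected, rng)
        if victim is not None:
            terminate(victim)
            killed[name] = victim
    return killed
```

Limiting each round to one instance per group keeps the blast radius small while still proving that the group survives the loss of any single member.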

3. What is Alibaba’s version of the monkey?

2011: Alibaba began managing strong and weak dependencies (project EOS) to detect dependency-induced failures early.

2012: After achieving same-city active-active transactions, Alibaba launched same-city disaster-recovery drills to verify that systems keep running when an entire data center goes down.

2015: Following a major outage, the "Tiger Tiger Tiger" project was started to validate the quality of multi-site active-active deployment.

2016: The fault-drill project (GOC + middleware) was formally established, with a redesigned architecture and process. The product was named MonkeyKing after Sun Wukong, the "Great Sage" of Chinese legend, to suggest a powerful, rebellious force put to work for stability.

4. What can Alibaba’s monkey do?

Alibaba’s diverse business scenarios and complex technical architecture generate many failure types, which can be categorized into IaaS, PaaS, and SaaS layers. Each layer may exhibit numerous fault causes and symptoms.

The fault model includes:

Hardware (IaaS) failures that manifest as software (PaaS/SaaS) symptoms.

Faults belonging to either single‑machine or distributed systems (distributed faults include single‑machine faults).

Single‑machine faults observed from the system perspective, such as in‑process issues (e.g., Full GC, CPU spikes) or out‑of‑process issues (e.g., another process stealing memory).

Human errors or improper processes.
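One way to make such a model concrete is to encode each fault with the dimensions listed above: the layer where it originates, the layer where it shows up, its scope, and its origin. The sketch below is an assumption-laden illustration; the class names, the catalog entries, and the layer assigned to each fault are invented for this example and are not Alibaba's actual taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    IAAS = "IaaS"
    PAAS = "PaaS"
    SAAS = "SaaS"


class Scope(Enum):
    SINGLE_MACHINE = "single-machine"
    DISTRIBUTED = "distributed"


class Origin(Enum):
    IN_PROCESS = "in-process"
    OUT_OF_PROCESS = "out-of-process"
    HUMAN = "human or process error"


@dataclass(frozen=True)
class Fault:
    name: str
    cause_layer: Layer    # where the fault originates
    symptom_layer: Layer  # where it is observed
    scope: Scope
    origin: Origin


# Illustrative entries only; the layer assignments are assumptions.
CATALOG = [
    Fault("full_gc", Layer.PAAS, Layer.SAAS, Scope.SINGLE_MACHINE, Origin.IN_PROCESS),
    Fault("cpu_spike", Layer.PAAS, Layer.SAAS, Scope.SINGLE_MACHINE, Origin.IN_PROCESS),
    Fault("memory_taken_by_other_process", Layer.IAAS, Layer.PAAS,
          Scope.SINGLE_MACHINE, Origin.OUT_OF_PROCESS),
    Fault("network_partition", Layer.IAAS, Layer.SAAS,
          Scope.DISTRIBUTED, Origin.OUT_OF_PROCESS),
    Fault("bad_config_push", Layer.SAAS, Layer.SAAS, Scope.DISTRIBUTED, Origin.HUMAN),
]


def faults_for(scope):
    """Return the faults that apply to a system of the given scope.

    A distributed system can suffer every single-machine fault as well,
    so the distributed view is the whole catalog.
    """
    if scope is Scope.DISTRIBUTED:
        return list(CATALOG)
    return [f for f in CATALOG if f.scope is Scope.SINGLE_MACHINE]
```

Separating `cause_layer` from `symptom_layer` captures the first point in the model: a hardware problem is often noticed only through its software symptoms.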

Using this model, Alibaba designed a fault-injection system made up of the following components:

OS‑level fault plugins deployed on client machines to simulate hardware and out‑of‑process failures.

Application‑process fault plugins (plug‑and‑play) that can also be custom‑implemented via the provided fault API.

Distributed‑fault control on the server side, targeting specific IP ranges.

Standardized third‑party fault implementations for components that cannot be directly accessed, such as databases.

This approach aims to cover the full spectrum of technical faults.
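The split between pluggable fault implementations and a server-side controller that targets IP ranges might look roughly like the sketch below. The `FaultPlugin` interface, the plugin names, and the `Controller` class are all hypothetical; the article does not describe MonkeyKing's real fault API, and the plugins here only return a message instead of causing a fault.

```python
import ipaddress
from abc import ABC, abstractmethod


class FaultPlugin(ABC):
    """Hypothetical plugin interface; not MonkeyKing's actual fault API."""

    name: str

    @abstractmethod
    def inject(self, host: str) -> str:
        """Apply the fault to one host and return a short status message."""


class CpuBurnPlugin(FaultPlugin):
    """Stand-in for an OS-level fault plugin."""

    name = "os.cpu_burn"

    def inject(self, host: str) -> str:
        return f"{self.name} started on {host}"


class SlowCallPlugin(FaultPlugin):
    """Stand-in for an application-process fault plugin."""

    name = "app.slow_call"

    def inject(self, host: str) -> str:
        return f"{self.name} started on {host}"


class Controller:
    """Server-side control: pick a plugin and a target IP range."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin: FaultPlugin) -> None:
        self._plugins[plugin.name] = plugin

    def run(self, plugin_name: str, hosts, target_cidr: str):
        """Inject the named fault on every host inside target_cidr."""
        network = ipaddress.ip_network(target_cidr)
        plugin = self._plugins[plugin_name]
        return [
            plugin.inject(host)
            for host in hosts
            if ipaddress.ip_address(host) in network
        ]
```

A drill would then register the plugins it needs and call, for example, `controller.run("os.cpu_burn", hosts, "10.0.1.0/24")`, so that only machines inside the chosen range are affected.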

5. How to obtain the monkey?

Method 1: Chaos Monkey (Netflix) was open-sourced in 2012; the third version was released in November 2016. Repository: https://github.com/Netflix/chaosmonkey

Method 2: In September 2018, Alibaba released MonkeyKing as a free service to Alibaba Cloud customers under the product name Application High‑Availability Service (AHAS), currently supporting K8s clusters.

References: https://www.gremlin.com, http://jm.taobao.org/2017/06/22/20170622

