How Agricultural Bank Built a Chaos Engineering Platform for Resilience

The article outlines the Agricultural Bank of China's initiative to adopt chaos engineering, describing the challenges of modern distributed systems, the design and capabilities of their in‑house chaos platform, product research, industry comparisons, practical use cases across development, operations and disaster recovery, and future development directions.

dbaplus Community

Jun 20, 2023

Background

Traditional testing of distributed systems often fails to validate stability under real‑world load. Chaos engineering introduces controlled fault‑injection experiments to expose unknown failure modes, verify operational and emergency procedures, and improve system resilience.

Product Research

Most chaos‑engineering platforms are built on open‑source engines (e.g., Litmus, Chaos Mesh). These tools provide rich fault‑injection primitives, but they lack the integrated monitoring, scenario management, visualization, and automation that enterprise adoption requires.

Industry Landscape

Major Chinese banks have launched fault‑exercise platforms that cover system, application, and container scenarios, integrate automated fault execution, monitoring, and recovery, and support multi‑cloud environments.

Bank Practice

Pilot projects at the Agricultural Bank used chaos engineering to uncover hidden risks, validate risk‑management policies, and define platform requirements for an enterprise‑grade solution.

Platform Architecture & Capabilities

The platform is designed for multi‑cluster, multi‑environment fault exercises, supporting both Kubernetes and host‑level targets. Core capabilities include:

Environment Management : Unified control of Kubernetes clusters and hosts, multi‑cloud support, and agent deployment (DaemonSet on clusters, lightweight agents on hosts) for fault injection and log collection.

Fault Scenario Management : Library of atomic infrastructure and application faults; user‑defined scripts in Shell, Python, or YAML; lifecycle management of scenarios.

Fault Exercise : Manual, scheduled, or automated execution with one‑click injection, real‑time metric collection, logging, retry, and automatic termination on steady‑state breach.

Experiment Management : Approval workflow, scheduling, and risk‑control policies.

Experiment Observation : Collection of resource, business, performance, and steady‑state metrics to support root‑cause analysis.

Report Management : Template‑driven report generation summarizing metric changes, load‑test results, and experiment outcomes; permission‑controlled download.

Traffic Simulation : Integration with internal load‑testing platforms for production‑traffic replay.

Experiment Protection : Automatic or manual termination when predefined thresholds are exceeded, preventing out‑of‑control runs.

Permission Management : Tenant isolation, role‑based access control, and audit logging.

Knowledge Base : Embedded documentation of open‑source tools and internal best‑practice guides.
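
The automatic termination described under Fault Exercise and Experiment Protection can be illustrated with a minimal sketch. This is not the bank's implementation: the names `SteadyStateGuard`, `run_experiment`, and the `p99_latency_ms` metric are hypothetical, and the metric readings are simulated. The sketch only shows the control flow of injecting a fault, sampling a steady‑state metric, aborting when a threshold is exceeded, and always running recovery.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SteadyStateGuard:
    """Hypothetical guard: the steady state is breached above max_value."""
    metric_name: str
    max_value: float

    def breached(self, value: float) -> bool:
        return value > self.max_value


@dataclass
class ExperimentResult:
    aborted: bool
    samples: List[float] = field(default_factory=list)


def run_experiment(
    inject: Callable[[], None],
    recover: Callable[[], None],
    read_metric: Callable[[], float],
    guard: SteadyStateGuard,
    max_steps: int,
) -> ExperimentResult:
    """Inject a fault, sample the metric, and stop early on a breach."""
    samples: List[float] = []
    aborted = False
    inject()
    try:
        for _ in range(max_steps):
            value = read_metric()
            samples.append(value)
            if guard.breached(value):
                aborted = True  # stop sampling; recovery runs in finally
                break
    finally:
        recover()  # recovery runs whether the run finished, aborted, or raised
    return ExperimentResult(aborted=aborted, samples=samples)


# Simulated run: the third reading exceeds the 500 ms threshold.
events: List[str] = []
readings = iter([120.0, 180.0, 650.0, 90.0])
result = run_experiment(
    inject=lambda: events.append("inject"),
    recover=lambda: events.append("recover"),
    read_metric=lambda: next(readings),
    guard=SteadyStateGuard("p99_latency_ms", 500.0),
    max_steps=4,
)
print(result.aborted, result.samples, events)
# prints: True [120.0, 180.0, 650.0] ['inject', 'recover']
```

Placing `recover()` in a `finally` block is one way to keep a run from getting out of control: the fault is rolled back even if metric collection itself fails.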

Application Scenarios

Periodic Red‑Blue Drills : Teams are split into a “blue” side that designs and injects faults and a “red” side that detects and recovers from them. Repeated drills reduce mean time to detection (MTTD) and mean time to recovery (MTTR).

Lifecycle Integration : Fault injection during development testing, operational monitoring, and disaster‑recovery rehearsals provides continuous risk mitigation.
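
To make the MTTD and MTTR figures from the drills concrete, the following sketch computes both from a list of drill records. The `DrillRecord` type and the sample timestamps are hypothetical, and the sketch assumes both intervals are measured from the moment of fault injection; some teams measure recovery time from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass(frozen=True)
class DrillRecord:
    """Hypothetical record of one drill: when the fault was injected,
    when the red team detected it, and when service was restored."""
    injected_at: datetime
    detected_at: datetime
    recovered_at: datetime


def mttd_seconds(records: List[DrillRecord]) -> float:
    """Mean time from injection to detection, in seconds."""
    return mean([(r.detected_at - r.injected_at).total_seconds() for r in records])


def mttr_seconds(records: List[DrillRecord]) -> float:
    """Mean time from injection to recovery, in seconds (assumed definition)."""
    return mean([(r.recovered_at - r.injected_at).total_seconds() for r in records])


# Illustrative data only: two drills starting from the same reference time.
t0 = datetime(2023, 6, 1, 10, 0, 0)
records = [
    DrillRecord(t0, t0 + timedelta(seconds=120), t0 + timedelta(seconds=600)),
    DrillRecord(t0, t0 + timedelta(seconds=60), t0 + timedelta(seconds=300)),
]
print(mttd_seconds(records), mttr_seconds(records))
# prints: 90.0 450.0
```

Tracking these two numbers across successive drills gives a simple way to check whether detection and recovery are actually getting faster.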

Outlook

Future work focuses on enabling users to author custom fault scenarios, quantifying steady‑state metrics, and automating the balance between fault intensity and system stability. The roadmap progresses from basic fault injection to advanced, industry‑leading capabilities.
