Chaos Engineering Framework and Practices in iQIYI FinTech Team

The iQIYI FinTech team implemented a Chaos Engineering framework, using a purpose‑driven Chaos Monkey to inject controlled failures, validate high‑availability, isolation, and self‑healing of payment services, derive architectural improvements, build a fault‑case library, and transition from fault detection to proactive system robustness.

iQIYI Technical Product Team

Sep 11, 2020

1 Chaos System

1.1 Chaos Engineering

Chaos engineering is a discipline that conducts controlled experiments on distributed systems to reveal how various events—whether natural or human‑induced—can gradually lead to overall system unavailability. The goal is to build confidence and capability to resist uncontrolled conditions in production environments. In the past two years, major Chinese internet companies have begun to adopt these practices to improve service quality.

1.2 Understanding Framework Principles

The iQIYI FinTech team focuses on handling unpredictable scenarios through architectural and personnel resilience, emphasizing isolation, alerting, and self‑healing. Their approach spans engineering, architecture, development processes, and disaster recovery.

1.3 Chaos Monkey

The ideal Chaos Monkey acts as the executor of the Chaos system. It is scenario‑driven, purpose‑specific, and capable of performing controlled, intentional disruptions to expose blind spots in the system.
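As an illustration of what "scenario-driven and purpose-specific" can mean in practice, the sketch below models one experiment as a hypothesis plus a reversible fault. It is a minimal sketch, not the team's tool: the `ChaosExperiment` class, the `run` function, and the in-memory `nodes` dictionary are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """One purpose-driven experiment: a hypothesis plus a reversible fault."""
    name: str
    hypothesis: str                    # what is expected to stay true
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # undoes the fault
    steady_state: Callable[[], bool]   # True while the system looks healthy


def run(experiment: ChaosExperiment) -> dict:
    """Run one experiment and report whether the hypothesis held."""
    if not experiment.steady_state():
        # Do not attack a system that is already unhealthy.
        return {"name": experiment.name, "status": "skipped"}
    experiment.inject()
    try:
        held = experiment.steady_state()
    finally:
        experiment.rollback()          # always restore, even if the check raises
    return {"name": experiment.name, "status": "passed" if held else "failed"}


# Toy in-memory "service" standing in for a real target (hypothetical).
nodes = {"pay-1": True, "pay-2": True}

demo = ChaosExperiment(
    name="kill-one-payment-node",
    hypothesis="payments stay available with one node down",
    inject=lambda: nodes.update({"pay-1": False}),
    rollback=lambda: nodes.update({"pay-1": True}),
    steady_state=lambda: any(nodes.values()),
)
result = run(demo)
print(result)
```

Keeping the rollback in a `finally` clause is the design choice that makes the disruption controlled: the fault is removed whether or not the hypothesis holds.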

2 Objectives and Design of the Attack‑Defense Exercise

The team faces challenges of high security, concurrency, and availability, with strict requirements on privacy, fund safety, and sensitive data. Their objectives are:

Establish Chaos Monkey attack capabilities and execution processes to validate service architecture and guide evolution.

Build a productive relationship between architecture and business, so that each reinforces the other's stability and robustness.

Enhance technical staff’s control over the system, improving service quality for the business.

3 Chaos Attack‑Defense Questions and Design Principles

Key questions include:

Are there architectural issues in payment/financial systems? Does the current design meet expectations?

What are the service isolation granularity and baseline for payment/financial services?

What is the system’s high‑availability level? How does node failover behave?

Design principles:

Target design flaws and code defects in production, ensuring problems are both discovered and controllable.

Do not restrict the means: any method (open‑source tools, network cuts, process termination, memory tampering) is acceptable if it serves the purpose.

Use monitoring and alerting to assess risk as thoroughly as possible, down to controlling how much traffic is lost.
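The principle that problems must be both discovered and controllable can be sketched as a guard that aborts an attack once observed traffic loss exceeds a budget. This is only an illustration under stated assumptions: `run_with_guard`, the 1% default budget, and the list of error-rate samples (standing in for a real monitoring feed) are hypothetical.

```python
from typing import Callable, Iterable


def run_with_guard(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    max_error_rate: float = 0.01,
) -> str:
    """Inject a fault and abort as soon as traffic loss exceeds the budget.

    error_rates stands in for a stream of monitoring samples, each being the
    fraction of failed requests in one sampling window.
    """
    inject()
    try:
        for rate in error_rates:
            if rate > max_error_rate:
                return "aborted"       # budget exceeded: stop the attack
        return "completed"             # every sample stayed within budget
    finally:
        rollback()                     # the fault is removed in both cases


state = {"fault": False}
start = lambda: state.update(fault=True)
stop = lambda: state.update(fault=False)

noisy = run_with_guard(start, stop, [0.0, 0.004, 0.03, 0.2])
quiet = run_with_guard(start, stop, [0.0, 0.002])
print(noisy, quiet)
```

In a real drill the samples would come from the monitoring system continuously rather than from a fixed list, and the budget would be agreed with the business owner beforehand.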

4 Attack‑Defense Achievements

4.1 Execution Distribution

(Images omitted for brevity)

4.2 Executed Attack‑Defense Cases

(Images omitted for brevity)

4.3 Real‑World Case Examples

Example 1: Verify High‑Availability of a Spring Cloud‑based Payment Microservice

Involves Eureka Server/Client, Ribbon load balancer, and configuration management.

Result: The architecture absorbs an unannounced downstream node failure within 30 seconds.

Optimization: Adjust LB probe intervals, Eureka client/cache times, heartbeat renewal, and server data sync to improve fault tolerance.
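To see why these intervals matter, the sketch below adds up the stages a crashed node passes through before registry-based discovery stops routing to it. It is a simplified additive model, not a measurement: the function name is hypothetical, the "illustrative" values are intervals commonly cited for Eureka and Ribbon and should be checked against the version in use, and the "tuned" values are invented for the example rather than taken from the team's configuration. The model also ignores load-balancer retries, which can mask a dead node much sooner.

```python
def worst_case_removal_seconds(
    lease_expiration: float,        # server waits this long without a heartbeat
    eviction_interval: float,       # how often the server evicts expired leases
    server_cache_update: float,     # how often the server refreshes its response cache
    client_registry_fetch: float,   # how often a client pulls the registry
    lb_server_list_refresh: float,  # how often the load balancer refreshes its server list
) -> float:
    """Upper bound on the time before callers stop seeing a crashed node.

    Each stage can lag by up to one full interval, so the worst case is
    simply the sum of all intervals.
    """
    return (
        lease_expiration
        + eviction_interval
        + server_cache_update
        + client_registry_fetch
        + lb_server_list_refresh
    )


# Illustrative values (assumption: verify against your own setup).
baseline = worst_case_removal_seconds(90, 60, 30, 30, 30)

# Hypothetical tuned values, chosen only to show the effect of shorter intervals.
tuned = worst_case_removal_seconds(10, 5, 3, 5, 3)

print(baseline, tuned)
```

Shorter intervals reduce the exposure window at the cost of more heartbeat and registry traffic, so each value is a trade-off rather than something to minimize blindly.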

Example 2: Attack Non‑Core Middleware in Financial Business System

Exposes robustness issues in system design.

Result: When non‑core middleware experiences jitter or failure, the system lacks a degradation strategy, causing complete service outage.

Optimization: Add degradation and failover strategies for non‑core dependencies and tune connection pools to reduce performance loss.
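A degradation strategy for a non-core dependency can be sketched as a wrapper that returns a fallback value on failure and stops calling the dependency for a while after repeated errors. This is a minimal sketch under assumptions: `DegradableDependency`, its thresholds, and the failing `flaky_lookup` function are hypothetical, and a production system would more likely use an established circuit-breaker library.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class DegradableDependency(Generic[T]):
    """Wraps a non-core call with a fallback and a simple circuit breaker."""

    def __init__(
        self,
        call: Callable[[], T],
        fallback: Callable[[], T],
        max_failures: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._call = call
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def __call__(self) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                return self._fallback()              # circuit open: skip the call
            # Cooldown elapsed: allow one trial call; one more failure reopens.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = self._call()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()      # open the circuit
            return self._fallback()                  # degrade instead of failing
        self._failures = 0
        return result


calls = {"n": 0}


def flaky_lookup() -> str:
    """Hypothetical non-core middleware call that is currently failing."""
    calls["n"] += 1
    raise ConnectionError("middleware unavailable")


lookup = DegradableDependency(
    flaky_lookup,
    fallback=lambda: "default",
    max_failures=2,
    cooldown_seconds=60.0,
    clock=lambda: 0.0,   # frozen clock so the example is deterministic
)
results = [lookup() for _ in range(5)]
print(results, calls["n"])
```

The caller always receives a usable value, and after two failures the failing middleware is no longer contacted until the cooldown passes, which also protects the connection pool from piling up blocked requests.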

5 Reflections and Summary

5.1 Summary

As the Chaos system matures, the momentum that came from exposing existing issues tapers off, and the practice enters a normalized phase in which the focus shifts to proactively strengthening future architecture, robustness, and availability. Key points:

Automated security inspections based on accumulated case libraries.

Monitoring of core infrastructure high‑availability.

Validation of new technologies and middleware using Chaos Monkey.

Building an online fault case library for periodic replay.
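The idea of a fault case library with periodic replay can be sketched as a small data structure that records when each case was last replayed and reports which ones are due. The class names, case identifiers, replay intervals, and dates below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class FaultCase:
    case_id: str
    description: str
    replay_every_days: int
    last_replayed: Optional[date] = None

    def is_due(self, today: date) -> bool:
        """A case is due if it was never replayed or its interval has elapsed."""
        if self.last_replayed is None:
            return True
        return today - self.last_replayed >= timedelta(days=self.replay_every_days)


@dataclass
class CaseLibrary:
    cases: Dict[str, FaultCase] = field(default_factory=dict)

    def add(self, case: FaultCase) -> None:
        self.cases[case.case_id] = case

    def due(self, today: date) -> List[str]:
        """Return the ids of all cases that should be replayed today, sorted."""
        return sorted(c.case_id for c in self.cases.values() if c.is_due(today))

    def mark_replayed(self, case_id: str, today: date) -> None:
        self.cases[case_id].last_replayed = today


library = CaseLibrary()
library.add(FaultCase("eureka-node-down", "Kill one downstream node",
                      replay_every_days=30, last_replayed=date(2020, 8, 1)))
library.add(FaultCase("non-core-middleware-jitter", "Delay a non-core middleware",
                      replay_every_days=7, last_replayed=date(2020, 9, 8)))

today = date(2020, 9, 11)
due_before = library.due(today)
library.mark_replayed("eureka-node-down", today)
due_after = library.due(today)
print(due_before, due_after)
```

A scheduler could call `due` once a day and hand the resulting cases to the attack executor, which is one way to turn past incidents into regression drills.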

5.2 Thoughts

Make the value of protection more tangible to business owners.

Increase coverage with targeted adjustments based on real scenarios.

Develop visualized, templated capabilities; enrich the case library and support repeatable attack drills and tracking.
