Chaos Engineering Framework and Practices in iQIYI FinTech Team
The iQIYI FinTech team built a Chaos Engineering framework around a purpose‑driven Chaos Monkey. The team uses it to inject controlled failures, validate the high availability, isolation, and self‑healing of payment services, derive architectural improvements, and build a fault‑case library, shifting the focus from detecting faults to proactively strengthening system robustness.
1 Chaos System
1.1 Chaos Engineering
Chaos engineering is a discipline that runs controlled experiments on distributed systems to reveal how various events, whether natural or human‑induced, can gradually lead to overall system unavailability. The goal is to build confidence in the system's ability to withstand uncontrolled conditions in production. In the past two years, major Chinese internet companies have begun to adopt these practices to improve service quality.
1.2 Understanding Framework Principles
The iQIYI FinTech team focuses on handling unpredictable scenarios through architectural and personnel resilience, emphasizing isolation, alerting, and self‑healing. Their approach spans engineering, architecture, development processes, and disaster recovery.
1.3 Chaos Monkey
The ideal Chaos Monkey acts as the executor of the Chaos system. It is scenario‑driven, purpose‑specific, and capable of performing controlled, intentional disruptions to expose blind spots in the system.
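The idea of a controlled, purpose‑specific disruption can be sketched as a small experiment runner. This is a minimal illustration, not the team's implementation: the names `Experiment`, `steady_state`, `inject`, `rollback`, and `run_experiment` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One scenario-driven attack (all field names are hypothetical)."""
    name: str
    steady_state: Callable[[], bool]  # returns True while the system is healthy
    inject: Callable[[], None]        # starts the fault
    rollback: Callable[[], None]      # removes the fault


def run_experiment(exp: Experiment) -> str:
    """Run one experiment and report 'skipped', 'passed', or 'failed'."""
    if not exp.steady_state():
        return "skipped"              # do not attack a system that is already unhealthy
    exp.inject()
    try:
        healthy = exp.steady_state()  # did the system tolerate the fault?
    finally:
        exp.rollback()                # always undo the fault, even if the check raises
    return "passed" if healthy else "failed"
```

A "failed" result here means the experiment exposed a blind spot. The rollback in the `finally` clause is what keeps the disruption controlled.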
2 Objectives and Design of the Attack‑Defense Drills
The team faces challenges of high security, concurrency, and availability, with strict requirements on privacy, fund safety, and sensitive data. Their objectives are:
Establish Chaos Monkey attack capabilities and execution processes to validate service architecture and guide evolution.
Create a productive relationship between architecture and business, in which each promotes the other's stability and robustness.
Enhance technical staff’s control over the system, improving service quality for the business.
3 Chaos Attack‑Defense Questions and Design Principles
Key questions include:
Are there architectural issues in payment/financial systems? Does the current design meet expectations?
What are the service isolation granularity and baseline for payment/financial services?
What is the system’s high‑availability level? How does node failover behave?
Design principles:
Target design flaws and code defects in production, ensuring problems are both discovered and controllable.
Do not restrict the means: any method (open‑source tools, network cuts, process termination, memory tampering) is acceptable if it serves the purpose.
Use monitoring and alerts to assess risk as thoroughly as possible, down to controlling how much traffic is lost during an attack.
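The last principle, keeping traffic loss under control, could be enforced by a guard that is polled during an attack. The sketch below is an assumption about how such a guard might look. The function name, the thresholds, and the source of the request counters are hypothetical and would come from the monitoring system in practice.

```python
def should_abort(total_requests: int, failed_requests: int,
                 max_loss_ratio: float = 0.01, max_failed: int = 50) -> bool:
    """Decide whether an ongoing attack has exceeded its traffic-loss budget.

    Thresholds are illustrative: abort once an absolute number of requests
    has failed, or once the failure ratio exceeds the allowed loss ratio.
    """
    if failed_requests >= max_failed:
        return True                   # absolute budget exhausted
    if total_requests == 0:
        return False                  # no traffic observed yet
    return failed_requests / total_requests > max_loss_ratio
```

The guard is deliberately conservative: with little traffic, a single failure can already exceed the ratio and stop the attack.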
4 Attack‑Defense Achievements
4.1 Execution Distribution
(Images omitted for brevity)
4.2 Executed Attack‑Defense Cases
(Images omitted for brevity)
4.3 Real‑World Case Examples
Example 1: Verify High‑Availability of a Spring Cloud‑based Payment Microservice
Involves Eureka Server/Client, Ribbon load balancer, and configuration management.
Result: The architecture recovers from an unannounced downstream node failure within 30 seconds.
Optimization: Adjust LB probe intervals, Eureka client/cache times, heartbeat renewal, and server data sync to improve fault tolerance.
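To see why these intervals matter, consider a simplified additive model of how long a dead node can remain visible to callers in a typical Eureka plus Ribbon setup: the lease must expire, the eviction task must run, the server's response cache must refresh, the client must fetch the registry, and the load balancer must refresh its server list. The field names and the numbers below are illustrative assumptions, not the team's settings.

```python
from dataclasses import dataclass, astuple


@dataclass
class DiscoveryTimings:
    """Intervals in seconds (hypothetical names, illustrative values)."""
    lease_expiration_s: float       # time without heartbeat before the lease expires
    eviction_interval_s: float      # how often the server evicts expired leases
    server_cache_refresh_s: float   # how often the server refreshes its response cache
    client_fetch_interval_s: float  # how often the client pulls the registry
    lb_refresh_interval_s: float    # how often the load balancer refreshes its server list


def worst_case_removal_delay(t: DiscoveryTimings) -> float:
    """Upper bound: every stage waits one full interval before acting."""
    return sum(astuple(t))


relaxed = DiscoveryTimings(90, 60, 30, 30, 30)
tuned = DiscoveryTimings(15, 5, 5, 5, 5)
print(worst_case_removal_delay(relaxed))  # 240
print(worst_case_removal_delay(tuned))    # 35
```

Under this model, shortening any one interval helps only as much as its share of the sum, which is why the optimization touches every stage. Retries and health probes at the load balancer can mask a dead node much earlier than the registry does, so the model is a pessimistic bound.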
Example 2: Attack Non‑Core Middleware in Financial Business System
Exposes robustness issues in system design.
Result: When non‑core middleware experiences jitter or failure, the system lacks a degradation strategy, causing complete service outage.
Optimization: Add degradation and failover strategies for non‑core dependencies and tune connection pools to reduce performance loss.
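A degradation strategy for a non‑core dependency can be as simple as a wrapper that returns a fallback value when the dependency fails and stops calling it for a while after repeated failures. The following is a minimal circuit‑breaker sketch under stated assumptions. The class name `Degradable` and its parameters are hypothetical, and a production system would more likely use an established resilience library.

```python
import time
from typing import Any, Callable


class Degradable:
    """Wrap a non-core call so that failures degrade instead of propagating."""

    def __init__(self, call: Callable[..., Any], fallback: Callable[..., Any],
                 max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._call = call
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0          # consecutive failures seen so far
        self._open_until = None     # while set and in the future, skip the dependency

    def __call__(self, *args: Any) -> Any:
        if self._open_until is not None and self._clock() < self._open_until:
            return self._fallback(*args)      # breaker open: do not touch the dependency
        try:
            result = self._call(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._open_until = self._clock() + self._cooldown_s
                self._failures = 0
            return self._fallback(*args)      # degrade instead of failing the request
        self._failures = 0
        return result
```

With a wrapper like this, jitter in the middleware costs the core flow a default value rather than a full outage. A real deployment would also need a timeout on the wrapped call itself, which this sketch omits.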
5 Reflections and Summary
5.1 Summary
As the Chaos system matures, its own momentum diminishes and it enters a phase of routine operation, in which the focus shifts from exposing existing issues to proactively strengthening future architecture, robustness, and availability. Key points:
Automated security inspections based on accumulated case libraries.
Monitoring of core infrastructure high‑availability.
Validation of new technologies and middleware using Chaos Monkey.
Building an online fault case library for periodic replay.
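A fault case library that supports periodic replay might, as one possible shape, store each case with a replay interval and the date it was last run. The record layout and the helper `due_for_replay` below are hypothetical and only illustrate the scheduling idea.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class FaultCase:
    """One entry in the case library (hypothetical fields)."""
    case_id: str
    description: str
    replay_every_days: int
    last_replayed: Optional[date] = None   # None means the case was never replayed


def due_for_replay(cases: List[FaultCase], today: date) -> List[FaultCase]:
    """Return the cases that were never replayed or whose interval has elapsed."""
    return [
        c for c in cases
        if c.last_replayed is None
        or today - c.last_replayed >= timedelta(days=c.replay_every_days)
    ]
```

A scheduler could call `due_for_replay` daily and hand the result to whatever executes the attacks, updating `last_replayed` after each run.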
5.2 Thoughts
Make the value of protection more tangible to business owners.
Increase coverage with targeted adjustments based on real scenarios.
Develop visualized, templated capabilities; enrich the case library and support repeatable attack drills and tracking.
iQIYI Technical Product Team