Chaos Engineering and Fault Injection System Design: Principles, Implementation, and Practice
Chaos Engineering and fault injection system design combine steady-state hypotheses, blast-radius-controlled experiments, and a lightweight interceptor layer built on gRPC and protobuf to inject and report faults in microservice architectures, enabling continuous experimentation, real-time analysis, faster recovery (lower MTTR), and more resilient services.
Background : With the shift to micro‑service architectures, system extensibility has improved while the uncertainty caused by service dependencies has grown exponentially. Traditional testing can no longer cover all possible system behaviors, prompting Netflix to introduce Chaos Engineering – deliberately injecting failures so that systems can learn from each outage and evolve.
Practice Principles : The approach starts from a steady-state hypothesis (measurable system metrics stay within their normal range under load), controls the "blast radius" of each experiment, and runs experiments automatically and continuously so that new changes do not introduce blind spots.
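As a concrete illustration, a steady-state hypothesis can be expressed as a simple threshold check over observed metrics. The type and threshold values below are hypothetical stand-ins, not part of the platform described in this article:

```go
package main

import "fmt"

// SteadyState captures a hypothesis about normal system behavior.
// The thresholds are illustrative values, not from the article.
type SteadyState struct {
	MaxErrorRate    float64 // errors per request
	MaxP99LatencyMs float64 // 99th-percentile latency in ms
}

// Holds reports whether observed metrics stay within the hypothesized range.
// An experiment that pushes metrics outside this range falsifies the hypothesis.
func (s SteadyState) Holds(errorRate, p99Ms float64) bool {
	return errorRate <= s.MaxErrorRate && p99Ms <= s.MaxP99LatencyMs
}

func main() {
	h := SteadyState{MaxErrorRate: 0.01, MaxP99LatencyMs: 300}
	// During an experiment, metrics are sampled and compared to the hypothesis.
	fmt.Println(h.Holds(0.002, 180)) // within bounds: true
	fmt.Println(h.Holds(0.05, 180))  // error rate violates the hypothesis: false
}
```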
Value : Instead of measuring process metrics, Chaos Engineering evaluates business value through fault‑replay, monitoring coverage, and ultimately Mean Time To Repair (MTTR) – the combined time of fault discovery, diagnosis, and recovery.
Chaos Engineering vs. Fault Drills : Chaos Engineering actively seeks unknown failures, while fault drills inject known failures based on predefined scenarios. The former uncovers hidden failure modes that scripted drills may miss.
Fault Injection System Design (FIT) : FIT adds a lightweight interceptor layer to services. When a request arrives, the interceptor checks platform‑defined matching rules and injects faults without affecting unmatched traffic.
```go
// main.go
func main() {
	flag.Parse()
	log.Init(nil)
	defer log.Close()
	fault.Init(nil) // initialize the fault-injection module
	// ... other business logic
}
```

The platform uses bidirectional gRPC streaming so that clients can report experiment data in real time and receive configuration updates instantly.
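The service declaration for that stream might look along these lines; the service and message names here are hypothetical, since the article does not show the platform's actual API:

```protobuf
// Hypothetical service shape, for illustration only.
service FaultAdmin {
  // One long-lived bidirectional stream: the client reports experiment
  // data upstream while the server pushes configuration updates downstream.
  rpc Sync(stream Report) returns (stream Config);
}
```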
Standardized fault declarations are defined in protobuf:
```protobuf
message Fault {
  // target: redis/mc/mysql/bm/warden
  TARGET target = 1;
  // matchers: e.g., port, sql type
  map<string, string> matchers = 2;
  // action name, e.g., ecode, timeout
  string action = 3;
  // action arguments, e.g., error code, timeout value
  map<string, string> action_args = 4;
}
```

Fault Matching Flow : Incoming traffic is intercepted, matched against user-defined criteria, and the fault metadata is injected into the request context. Downstream components read this context to apply fine-grained fault behavior.
Data Reporting : After request processing, the SDK aggregates fault‑injection details, merges similar records, and streams compressed reports to the server for real‑time impact visualization.
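The merging step can be sketched as grouping records by fault kind before streaming, so the report carries one counted entry per kind rather than one entry per request. The `Record` fields are assumed for illustration, not the SDK's actual report schema:

```go
package main

import "fmt"

// Record is one fault-injection event; the fields are illustrative
// stand-ins, not the SDK's actual report schema.
type Record struct {
	Target, Action string
}

// Aggregate merges records that share a target and action into counts,
// shrinking the payload before it is compressed and streamed upstream.
func Aggregate(records []Record) map[Record]int {
	out := make(map[Record]int)
	for _, r := range records {
		out[r]++
	}
	return out
}

func main() {
	records := []Record{
		{"redis", "timeout"},
		{"redis", "timeout"},
		{"mysql", "ecode"},
	}
	for r, n := range Aggregate(records) {
		fmt.Printf("%s/%s x%d\n", r.Target, r.Action, n)
	}
}
```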
Fault Configuration Demo : The UI allows operators to define fault targets, actions, and parameters while automatically limiting the blast radius.
S12 Competition Fault‑Injection Practice : Several critical live‑streaming scenarios (home page, room entry, chat, revenue) were selected. Fault cases such as data‑center outage, DB connection exhaustion, pod crash, network latency, and downstream service failure were injected. The exercise revealed 22 issues, with three high‑impact problems (cache construction failure, gift‑service timeout, iOS/Android inconsistency) accounting for 14% of the total impact.
Red‑Blue Drill : A red‑team (business owners) and blue‑team (testers) conduct random fault injections. Success criteria: blue‑team injects a fault; red‑team receives an accurate alarm within 1 minute, diagnoses within 5 minutes, and resolves within 10 minutes.
Future Plans : Implement automatic strong/weak dependency analysis for micro‑services, automate fault‑injection testing, and embed fault injection as a regular testing practice rather than a pre‑event checklist.
Conclusion : Distributed systems contain countless interaction points that can fail. Systematically identifying and hardening these fragile points through continuous chaos experiments builds resilient services and reduces the risk of catastrophic outages.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.