Tag: fault injection

FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

This article surveys chaos engineering as a scientific method for improving system resilience. It covers leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, along with core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Chaos Engineering · Distributed Systems · fault injection
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering deliberately injects faults into distributed systems to test and improve their resilience. This article traces its evolution from Netflix's Chaos Monkey to modern platforms and outlines its operational workflow, benefits, and core principles for reliable system design.

Chaos Engineering · Distributed Systems · SRE
0 likes · 9 min read
FunTester
Apr 21, 2025 · Backend Development

Sentinel: Flow Control and Circuit Breaking for Microservice Stability

This article explains how Sentinel, an open‑source flow‑control component from Alibaba, provides fine‑grained rate limiting, circuit breaking, and system protection for microservices, detailing its core mechanisms, configuration options, and practical usage in performance and fault testing.

Backend · Microservices · Performance Testing
0 likes · 14 min read
JD Tech
Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Chaos Engineering · Distributed Systems
0 likes · 26 min read
FunTester
Mar 31, 2025 · Operations

Performance Testing and Fault Testing: Complementary Pillars for System Stability

The article explains how performance testing measures system efficiency under load while fault testing validates resilience under abnormal conditions, highlighting their shared goals, differences, overlapping toolchains, and how their combined use drives architecture optimization and improves service level agreements in modern complex software systems.

Performance Testing · fault injection · load testing
0 likes · 14 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes. The article also covers best‑practice guidelines and future trends for improving distributed system reliability.

Chaos Engineering · Cloud Native · Kubernetes
0 likes · 8 min read
Bilibili Tech
Mar 18, 2025 · Operations

Technical Practices for Ensuring Stability of Bilibili’s 2025 Spring Festival Gala Live Stream

Bilibili’s engineering team built a scenario‑metadata and one‑click fault‑drill platform, implemented multi‑tier degradation, dynamic capacity planning, and extensive automated fault‑injection testing to guarantee zero‑severity incidents during the high‑traffic 2025 Spring Festival Gala live stream.

Capacity Planning · High Concurrency · Live Streaming
0 likes · 16 min read
FunTester
Mar 12, 2025 · Operations

Fault Injection Testing: Concepts, Scenarios, Process, and Best Practices

Fault injection testing deliberately introduces failures into a system to assess its resilience, helping identify weak points, improve retry and timeout mechanisms, and ensure robust operation across software, protocol, and infrastructure layers, with practical guidance on processes, tools, and Kubernetes-specific practices.

Chaos Engineering · Kubernetes · fault injection
0 likes · 8 min read
FunTester
Sep 18, 2024 · Operations

Overview and Practice of Chaos Engineering

Chaos engineering introduces controlled failures to test system resilience. This article covers its history, practical benefits, and experiment design, and compares popular open‑source and commercial tools for improving reliability in distributed and cloud‑native environments.

Distributed Systems · fault injection · operations
0 likes · 13 min read
DevOps
Aug 22, 2024 · Operations

Synthetic Monitoring and Fault Drills: Practices for Ensuring Service Stability

This article explains why service stability is critical, outlines the importance and key factors of synthetic monitoring, provides practical guidelines for implementing it, and then describes fault‑drill concepts, benefits, processes, and common cloud‑native tools to proactively discover and mitigate failures in micro‑service environments.

Cloud Native · DevOps · fault injection
0 likes · 11 min read
Aikesheng Open Source Community
Jun 27, 2024 · Databases

Evaluation of OceanBase Arbitration Service in a 2F1A Deployment: Fault Injection Experiments and Recovery Procedures

This article presents a detailed experimental study of OceanBase's Arbitration Service in a 2F1A (two full‑function replicas plus one arbitration node) configuration, examining how the system behaves when one or both full‑function replicas fail, how log‑stream degradation and permanent offline mechanisms work, and how normal service is restored after node recovery.

Arbitration Service · Distributed Database · High Availability
0 likes · 17 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

SRE · availability · fault injection
0 likes · 7 min read
FunTester
Mar 29, 2024 · Operations

Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes

This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.

Chaos Engineering · High Availability · WeChat Pay
0 likes · 18 min read
High Availability Architecture
Mar 21, 2024 · Operations

Applying Chaos Engineering to WeChat Pay: Design, Implementation, and Outcomes

To improve the ultra‑high availability of WeChat Pay, the team introduced chaos engineering using multi‑partition isolation, controlled blast radius, automated fault injection, and systematic risk discovery, detailing the design, execution, automation, and results of this reliability‑focused initiative.

Chaos Engineering · High Availability · WeChat Pay
0 likes · 18 min read
Tencent Cloud Developer
Mar 19, 2024 · Operations

Chaos Engineering in WeChat Pay: Design, Implementation, and Results

WeChat Pay’s team adopted Netflix‑style chaos engineering and built an automated, YAML‑driven fault‑injection platform that isolates experiments in multi‑zone partitions. The platform enabled over 500 safe experiments in 2021‑2022 and uncovered critical bugs across core services, while maintaining five‑nines availability and zero production incidents.

Chaos Engineering · High Availability · Observability
0 likes · 18 min read
Bilibili Tech
Nov 28, 2023 · Operations

Technical Assurance Practices for the 13th League of Legends World Championship Live Stream

For the 13th League of Legends World Championship live stream on Bilibili, a comprehensive technical‑assurance framework—covering pre‑event traffic buildup, in‑event experience, and post‑event replay—mapped over 60 business functions, applied a traffic‑estimation model, executed fault‑injection drills, load tests, strict SOPs and change control, and real‑time monitoring, enabling 120 million viewers and a peak of 460 million concurrent users.

Live Streaming · Performance Testing · Traffic Engineering
0 likes · 19 min read
AntTech
Nov 7, 2023 · Operations

ChaosMeta V0.6.0 Release: New Features, Lossless Injection, Automated Experiments, and Future Directions

ChaosMeta V0.6.0 introduces DNS and log injection capabilities, lossless fault injection concepts, automated experiment orchestration with atomic tasks, and a roadmap for multi‑cloud support and advanced metrics, aiming to solve the last‑mile challenge of continuous automated chaos experiments in production environments.

Chaos Engineering · Cloud Native · Observability
0 likes · 9 min read
Bilibili Tech
Aug 1, 2023 · Operations

Fault Injection Platform Design and Implementation Practice

The article details the design and deployment of a Java‑focused fault‑injection platform that uses Attach agents and JVM‑Sandbox to inject controllable pod‑ and request‑level faults—such as latency, exceptions, and return‑value errors—through dynamic templates, enabling fine‑grained, production‑safe chaos testing for e‑commerce services.

Java application · Java development · agent-based
0 likes · 15 min read
DevOps
May 12, 2023 · Operations

Evolution of Chaos Engineering at Netflix: From Chaos Monkey to ChAP

This article examines how Netflix has progressively refined its chaos engineering practices—from the early Chaos Monkey tool to the sophisticated Chaos Automation Platform (ChAP)—to improve system resilience, automate experiments, and safely validate changes in large‑scale microservice environments.

Chaos Engineering · Cloud Native · Microservices
0 likes · 26 min read
JD Tech
Mar 14, 2023 · Operations

Introduction to Chaos Engineering and Its Practical Exercise Workflow

This article offers a comprehensive overview of chaos engineering, explaining its definition, why it is needed, the value it brings, a detailed step‑by‑step practice workflow—including preparation, execution, recovery and review phases—typical drill scenarios, key assessment metrics, and risk‑control measures to improve system reliability and high‑availability.

Chaos Engineering · fault injection · operations
0 likes · 11 min read