Tagged articles
139 articles
Woodpecker Software Testing
Apr 10, 2026 · Operations

How Adversarial Testing Drives Hidden Performance Gains

Adversarial testing transforms performance optimization by injecting extreme, realistic failures—such as cache avalanches, CDN outages, or slow SQL—to expose fragile boundaries, tighten observability, and create a rapid, evidence‑driven feedback loop that prevents costly production incidents.

Microservices · Observability · Performance Optimization
0 likes · 8 min read
Woodpecker Software Testing
Mar 22, 2026 · Artificial Intelligence

How to Test Retrieval‑Augmented Generation Systems: Practical Strategies for 2024

This article explains why traditional API, assertion, and UI testing fail for Retrieval‑Augmented Generation (RAG) systems, and presents a four‑step, evidence‑driven testing framework—including golden test sets, dual‑track validation, chaos engineering, and continuous trust dashboards—to ensure factual reliability and operational robustness in real‑world deployments.

Fact Checking · LLM · OpenTelemetry
0 likes · 8 min read
Woodpecker Software Testing
Mar 17, 2026 · R&D Management

Shift‑Left Testing in Practice: How to Catch Defects Early in the Requirements Phase

The article examines how a fintech loan‑risk system applied shift‑left testing—embedding quality checks such as Gherkin‑based living requirements, contract testing, static analysis, and chaos engineering—to intercept defects during requirements, design, and development, achieving 87% early defect detection, a 40% reduction in UAT time, and zero P0 incidents, while warning against common pitfalls.

FinTech · Shift-Left Testing · chaos engineering
0 likes · 8 min read
Woodpecker Software Testing
Mar 17, 2026 · Artificial Intelligence

5 Proven Strategies to Boost Large Language Model Performance

The article presents five actionable strategies—defining a three‑dimensional performance baseline, applying layered injection load tests, co‑optimizing dynamic quantization with cache, employing SLO‑driven chaos engineering, and shifting testing left to compilation—to reliably measure and improve LLM throughput, latency, and resource efficiency in production.

LLM optimization · Load Testing · Performance Testing
0 likes · 7 min read
Woodpecker Software Testing
Mar 4, 2026 · Artificial Intelligence

Practical Testing of AI Agents: From ChatOps Assistants to Autonomous Driving Bots

The article examines the 2024 shift to dynamic AI agents, outlines why traditional testing falls short, and presents three real‑world case studies—ChatOps IT assistant, multi‑agent e‑commerce risk platform, and embodied inspection robot—detailing novel testing frameworks and measurable improvements.

AI Agents · ChatOps · Hybrid Testing
0 likes · 8 min read
Woodpecker Software Testing
Mar 3, 2026 · Fundamentals

5 Major Test Coverage Pitfalls That Undermine Software Quality

The article reveals five common misconceptions in test coverage optimization—confusing coverage with verification, chasing 100% branch coverage, over‑counting non‑business code, ignoring distributed‑system interactions, and treating coverage as a KPI—showing how they lead to defects despite high coverage percentages.

Microservices · Software Testing · chaos engineering
0 likes · 8 min read
Tech Freedom Circle
Jan 18, 2026 · Interview Experience

How to Achieve Zero P4 Incidents for a Year – A Complete Interview Framework

The article presents a systematic BAR (Background‑Action‑Result) framework for answering the interview question about maintaining a full year of zero P4‑level faults, covering fault‑grade definitions, a three‑layer protection strategy, concrete tooling (Sentinel, SkyWalking, ChaosBlade, etc.), quantitative results, and a set of high‑frequency follow‑up questions to showcase deep technical expertise.

Kubernetes · Microservices · Reliability
0 likes · 23 min read
Woodpecker Software Testing
Jan 5, 2026 · Operations

Three Core Dimensions of Performance Testing: Time Behavior, Resource Utilization, and Capacity

This article breaks down performance testing into three essential dimensions—time behavior, resource utilization, and capacity—explains their key metrics, demonstrates a detailed e‑commerce flash‑sale case study, and shows how systematic testing and optimization can dramatically improve response times, throughput, and scalability.

JMeter · Load Testing · Performance Testing
0 likes · 12 min read
DevOps Coach
Dec 29, 2025 · Operations

Mastering System Reliability: Lessons from Google, Netflix, and Meta

Learn how Google, Netflix, and Meta pioneered modern reliability practices—SRE’s data‑driven metrics, Netflix’s chaos engineering, and Meta’s self‑healing automation—and get a step‑by‑step handbook to apply these concepts, avoid common traps, and build resilient systems at any scale.

Automation · Site Reliability Engineering · chaos engineering
0 likes · 11 min read
Xiao Liu Lab
Dec 10, 2025 · Operations

Why Do Your Services Disappear After Reboot? Master systemd Auto‑Start and Chaos Testing

This guide reveals why critical services often fail to start after a server reboot, presents essential systemd unit file parameters, provides ready‑to‑copy configurations for Nginx, Java, and Flask, outlines a four‑step troubleshooting workflow, and introduces a lightweight chaos‑engineering playbook to verify auto‑start resilience.

Linux operations · chaos engineering · service auto-start
0 likes · 15 min read
FunTester
Jul 11, 2025 · Operations

Why Chaos Engineering Is Essential for Building Resilient Systems

This article explains how chaos engineering deliberately injects failures to reveal hidden weaknesses, helping organizations test and improve infrastructure resilience, handle traffic spikes, recover from disasters, and maintain continuous service in today’s always‑on digital environment.

Fault Injection · Resilience Testing · chaos engineering
0 likes · 7 min read
FunTester
Jul 2, 2025 · Operations

How Leading Chinese Companies Harness Chaos Engineering to Boost System Resilience

Chinese enterprises such as Alibaba, JD Cloud, and Xiaomi are increasingly adopting chaos engineering tools like ChaosBlade and Chaos Mesh to simulate failures in production-like environments, overcoming challenges of awareness, risk control, talent gaps, and platform integration, while AI and cloud‑native technologies drive smarter, automated resilience testing.

AI · Cloud Native · chaos engineering
0 likes · 3 min read
TAL Education Technology
Jun 23, 2025 · Operations

How Chaos Engineering Boosts System Resilience: A Practical Guide

This article explains what Chaos Engineering is, why it matters for modern distributed systems, outlines a step‑by‑step approach to designing and running effective chaos experiments, describes platform features, and shares a real‑world case study of a pre‑launch blind test.

Distributed Systems · Reliability · Resilience Testing
0 likes · 9 min read
FunTester
May 28, 2025 · Cloud Native

Extending Automated Thread Dumps: Log Collection, Resource Monitoring, Chaos Engineering, Performance Analysis, and Environment Cleanup

The article explores how automated thread dumps can be expanded into multiple testing scenarios—including log collection, resource monitoring, fault injection, performance result analysis, and environment cleanup—by leveraging Kubernetes APIs, Prometheus, Chaos Mesh, and scripting tools to improve efficiency, observability, and system resilience.

Automation · Kubernetes · Performance Testing
0 likes · 9 min read
FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

Chaos engineering, a scientific method for improving system resilience, is explored through an overview of leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, alongside core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Distributed Systems · Fault Injection · Reliability
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Distributed Systems · Fault Injection · Operations
0 likes · 9 min read
FunTester
May 8, 2025 · Backend Development

Mastering HTTP Timeouts: Types, Causes, and Chaos Mesh Simulations

Understanding the three HTTP timeout types—connect, write, and read—helps engineers pinpoint failures, while detailed examples of causes and observable symptoms guide troubleshooting, and step-by-step Chaos Mesh simulations demonstrate how to inject and monitor these faults to validate system resilience.

Backend · Fault Injection · HTTP
0 likes · 17 min read
JD Tech
Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Distributed Systems · Fault Injection
0 likes · 26 min read
FunTester
Apr 12, 2025 · Operations

How to Design Effective Fault‑Testing Cases for Resilient Distributed Systems

This article explains why fault testing is essential for modern distributed and cloud environments, outlines core goals, design principles, common fault categories, practical implementation strategies such as chaos engineering and gray releases, and shows how to analyze results to continuously improve system reliability.

Distributed Systems · chaos engineering · fault testing
0 likes · 18 min read
FunTester
Mar 25, 2025 · Operations

Integrating Chaos Engineering into Service Dependency Governance for Resilient Cloud‑Native Systems

This article explores how to embed chaos engineering practices into service dependency governance, detailing dynamic validation versus static analysis, fault injection techniques, multi‑point failure simulations, and data‑driven optimizations to build robust, self‑healing microservice architectures in cloud‑native environments.

Cloud Native · Microservices · Operations
0 likes · 18 min read
FunTester
Mar 18, 2025 · Operations

How to Build a Fault‑Isolation Shield for High‑Traffic Distributed Systems

The article explains how to construct a comprehensive fault‑isolation and protection system for modern distributed applications, covering entry‑side rate limiting, exit‑side circuit breaking, internal resource isolation, monitoring, chaos‑engineering validation, and automatic self‑healing mechanisms using tools such as Sentinel, Nginx, Hystrix, SkyWalking, Prometheus and Kubernetes.

Circuit Breaking · Distributed Systems · Microservices
0 likes · 7 min read
FunTester
Mar 14, 2025 · Operations

Fault Testing: Enhancing System Resilience through Controlled Failure Simulations

The article explains how fault testing—by deliberately injecting failures in a controlled environment—helps identify system weaknesses, validates post‑mortem improvements, and drives architectural optimization, thereby increasing high‑availability and resilience of modern internet services.

Operations · chaos engineering · fault testing
0 likes · 8 min read
FunTester
Mar 12, 2025 · Operations

Fault Injection Testing: Concepts, Scenarios, Process, and Best Practices

Fault injection testing deliberately introduces failures into a system to assess its resilience, helping identify weak points, improve retry and timeout mechanisms, and ensure robust operation across software, protocol, and infrastructure layers, with practical guidance on processes, tools, and Kubernetes-specific practices.

Fault Injection · Kubernetes · Operations
0 likes · 8 min read
FunTester
Mar 7, 2025 · Operations

Fault Testing: Proactive Resilience Engineering for Distributed Systems

Fault testing deliberately injects failures into distributed and cloud‑native systems to expose weak points, verify recovery mechanisms, and improve overall reliability, so that the business keeps running through unexpected disruptions.

Operations · Resilience · chaos engineering
0 likes · 11 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

chaos engineering · circuit breaker · fault tolerance
0 likes · 11 min read
FunTester
Feb 26, 2025 · Industry Insights

8 Software Testing Trends Shaping 2025: AI, Low‑Code, Shift‑Left/Right & More

The article outlines eight major software testing trends for 2025—including AI‑driven test automation, low‑code tools, shift‑left/right practices, chaos engineering, DevSecOps security testing, performance engineering, and autonomous testing—while advising engineers on skill upgrades and cross‑functional collaboration.

AI testing · DevSecOps · Shift-Left
0 likes · 16 min read
FunTester
Feb 13, 2025 · Operations

Why Fault Testing Is Critical for Modern Online Systems

In today's digital era, online services face increasing fault risks, and systematic fault testing—through chaos engineering, fault injection, stress testing, and disaster recovery drills—helps teams anticipate, evaluate, and improve system resilience, ultimately reducing downtime and protecting business continuity.

Automation · Cloud Native · Operations
0 likes · 9 min read
FunTester
Jan 27, 2025 · Operations

Mastering Chaos Engineering: Build Resilient Systems with Proven Practices

In today's always‑on digital era, this article explains chaos engineering concepts, step‑by‑step experimental methods, best‑practice guidelines, and a comparison of leading fault‑injection tools to help organizations proactively strengthen system resilience and reduce downtime risk.

Cloud Native · DevOps · Fault Injection
0 likes · 11 min read
FunTester
Jan 15, 2025 · Operations

How to Combine Performance Testing and Chaos Engineering for Rock‑Solid Systems

Drawing lessons from the 2021 AWS outage, this article explains how integrating performance testing with fault‑injection (chaos engineering) in microservice and Kubernetes environments can identify bottlenecks, validate resilience, and build a continuous stability strategy that balances speed and reliability.

Kubernetes · Microservices · Operations
0 likes · 13 min read
FunTester
Nov 4, 2024 · Backend Development

Mastering Java Fault Injection with Byteman: A Hands‑On Guide

Byteman is a dynamic Java fault‑injection tool that lets developers simulate network delays, service crashes, and resource exhaustion without altering source code, offering seamless integration with JUnit/TestNG, detailed rule definitions, and convenient shell scripts for installing, submitting, and removing fault‑injection rules.

Byteman · Fault Injection · JVM
0 likes · 12 min read
dbaplus Community
Oct 3, 2024 · Operations

How Netflix Uses Chaos Engineering to Build Resilient Distributed Systems

This article explains Netflix's chaos engineering practice, detailing the challenges of microservice reliability, the implementation of the Chaos Monkey tool, the step‑by‑step methodology, guiding principles, and real‑world outcomes that demonstrate improved system availability.

Chaos Monkey · Distributed Systems · Netflix
0 likes · 6 min read
FunTester
Sep 20, 2024 · Operations

Chaos Engineering vs Fault Testing: Methods, Challenges, and Future Trends

This article compares chaos engineering and fault testing, outlines fault injection techniques, implementation layers, testing strategies, challenges, and future trends such as automation, AI-driven diagnostics, and cloud‑native integration, providing a comprehensive guide for improving system resilience and reliability.

Cloud Native · Operations · chaos engineering
0 likes · 17 min read
FunTester
Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

antifragility · chaos engineering · fault tolerance
0 likes · 13 min read
Architecture and Beyond
Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines the seven key pillars—operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering—that together form a systematic approach to building and maintaining a reliable, 24‑hour backend system.

Operations · backend stability · capacity planning
0 likes · 40 min read
Tencent Cloud Developer
Jul 17, 2024 · Operations

Combining FMEA and Chaos Engineering to Improve Software Architecture Availability

By integrating the proactive, static risk assessment of Failure Mode and Effects Analysis with the dynamic fault‑injection validation of chaos engineering, the article demonstrates how cloud‑native architectures—illustrated through a Tencent‑based e‑commerce case—can systematically identify, quantify, and mitigate availability risks, leading to continuous, measurable resilience improvements.

Availability · FMEA · Risk analysis
0 likes · 16 min read
DataFunSummit
May 19, 2024 · Cloud Native

Design and Implementation of a Cloud‑Native Recommendation System Architecture

This article explains how to design and implement a recommendation system by leveraging a four‑layer cloud‑native stack, covering virtualization, micro‑service migration, service governance, elasticity, cloud‑native business capabilities, and chaos‑engineering‑based stability practices to achieve cost‑effective, high‑performance, and reliable recommendation services.

Cloud Native · Microservices · Virtualization
0 likes · 10 min read
Bilibili Tech
Apr 9, 2024 · Operations

BCM – Building and Deploying Bilibili’s Chaos Engineering Platform

At the 2024 GOPS Global Operations Conference, Bilibili senior R&D engineer Gu Lintao will present BCM—Bilibili’s Chaos Engineering Platform—showcasing how its design and capabilities let developers, testers, and SREs safely inject faults, uncover hidden architectural risks, and improve service stability through real‑world drills and systematic reliability engineering.

Bilibili · DevOps · Reliability
0 likes · 3 min read
FunTester
Mar 29, 2024 · Operations

Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes

This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.

Fault Injection · Operations · WeChat Pay
0 likes · 18 min read
High Availability Architecture
Mar 21, 2024 · Operations

Applying Chaos Engineering to WeChat Pay: Design, Implementation, and Outcomes

To improve the ultra‑high availability of WeChat Pay, the team introduced chaos engineering using multi‑partition isolation, controlled blast radius, automated fault injection, and systematic risk discovery, detailing the design, execution, automation, and results of this reliability‑focused initiative.

Fault Injection · Reliability · WeChat Pay
0 likes · 18 min read
Tencent Cloud Developer
Mar 19, 2024 · Operations

Chaos Engineering in WeChat Pay: Design, Implementation, and Results

WeChat Pay’s team adopted Netflix‑style chaos engineering, building an automated, YAML‑driven fault‑injection platform that isolates experiments in multi‑zone partitions. The platform enabled over 500 safe experiments in 2021‑2022 and uncovered critical bugs across core services while maintaining five‑nines availability and zero production incidents.

Automation · Fault Injection · Reliability
0 likes · 18 min read
DataFunTalk
Feb 20, 2024 · Cloud Native

Design and Implementation of a Cloud‑Native Recommendation System Architecture

This article presents a comprehensive overview of how to design and implement a recommendation system using cloud‑native technologies, covering the cloud‑native stack, system architecture, key design considerations such as virtualization, micro‑service migration, service governance, resilience, and stability through chaos engineering.

Microservices · Virtualization · architecture
0 likes · 10 min read
DeWu Technology
Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on C‑end link reliability, Alibaba’s system stability operations, Tencent Game’s cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE
0 likes · 7 min read
Bilibili Tech
Nov 24, 2023 · Cloud Native

Chaos Engineering and Fault Injection Practices at Bilibili: Architecture, Implementation, and Automation

Bilibili built a middleware‑based chaos engineering platform that injects faults into Golang microservices via AOP, supporting server‑ and client‑side, database, cache, and queue components, with fine‑grained instance, request, target, and user controls, automated dependency collection, experiment orchestration, and CI integration to boost system reliability.

Go · Microservices · Reliability
0 likes · 18 min read
AntTech
Nov 7, 2023 · Operations

ChaosMeta V0.6.0 Release: New Features, Lossless Injection, Automated Experiments, and Future Directions

ChaosMeta V0.6.0 introduces DNS and log injection capabilities, lossless fault injection concepts, automated experiment orchestration with atomic tasks, and a roadmap for multi‑cloud support and advanced metrics, aiming to solve the last‑mile challenge of continuous automated chaos experiments in production environments.

Fault Injection · Observability · automated experiments
0 likes · 9 min read
Architects Research Society
Oct 3, 2023 · Cloud Native

Chaos Engineering: Concepts, History, Benefits, Challenges, and Getting Started

Chaos engineering is a disciplined approach to testing distributed systems by intentionally injecting failures to verify resilience, covering its definition, origins at Netflix, operational workflow, benefits, challenges, and practical steps for organizations to adopt resilient cloud‑native applications.

Observability · Resilience · chaos engineering
0 likes · 18 min read
DeWu Technology
Aug 28, 2023 · Operations

Real-time Data Warehouse Business-Side Chaos Engineering Practice

The article describes how a real‑time data warehouse supporting ad‑delivery metrics adopts both technical and business‑side chaos‑engineering, using red‑blue team drills to inject faults, monitor indicator anomalies, and refine response procedures, thereby enhancing early risk detection, system resilience, and overall data stability for the advertising platform.

Backend Development · Data Quality · Data Warehousing
0 likes · 16 min read
Huolala Tech
Aug 22, 2023 · Operations

How HuoLala Built a Resilient Fault‑Drill Platform to Boost System Reliability

Facing growing microservice complexity, HuoLala designed a comprehensive fault‑drill system—covering management, tooling, and operations—to simulate failures, control blast radius, automate scenarios, and continuously improve resilience, ultimately reducing downtime and enhancing system stability across more than ten business units.

Automation · Fault Injection · Microservices
0 likes · 12 min read
Tech Architecture Stories
Aug 15, 2023 · Cloud Native

Unlocking Microservice Success: The Interplay of Metrics, Governance, and Validation

This article explains how measurement (SLI/SLO), governance (architecture refactoring, MTTx), and validation (chaos engineering, disaster drills) interrelate in microservice systems, illustrating how observability drives governance actions, governance improves metrics, and validation reinforces both through continuous testing.

Microservices · Observability · SLI
0 likes · 4 min read
dbaplus Community
Jun 20, 2023 · Operations

How Agricultural Bank Built a Chaos Engineering Platform for Resilience

The article outlines the Agricultural Bank of China's initiative to adopt chaos engineering, describing the challenges of modern distributed systems, the design and capabilities of their in‑house chaos platform, product research, industry comparisons, practical use cases across development, operations and disaster recovery, and future development directions.

Cloud Native · Distributed Systems · Platform Development
0 likes · 14 min read
Meituan Technology Team
May 25, 2023 · Databases

Meituan Database Fault‑Injection and Chaos Engineering Practice

The article details Meituan's large‑scale database fault‑injection platform, explaining its architecture, capabilities, workflow, blast‑radius controls, random unnotified drills, operational metrics, and future plans aligned with a chaos‑engineering maturity model.

Database Fault Injection · Large‑Scale Databases · Maturity Model
0 likes · 23 min read
Efficient Ops
May 12, 2023 · Operations

Designing an Intelligent Performance Testing Platform: From Vision to Implementation

This article describes how a bank’s IT team transformed its performance testing by defining intelligent platform capabilities, designing a modular architecture, and implementing features such as automated risk identification, smart test case generation, data synthesis, multi‑protocol support, chaos injection, and automated result analysis using JMeter, Prometheus, and custom plugins.

JMeter · Performance Testing · chaos engineering
0 likes · 11 min read
DevOps
May 12, 2023 · Operations

Evolution of Chaos Engineering at Netflix: From Chaos Monkey to ChAP

This article examines how Netflix has progressively refined its chaos engineering practices—from the early Chaos Monkey tool to the sophisticated Chaos Automation Platform (ChAP)—to improve system resilience, automate experiments, and safely validate changes in large‑scale microservice environments.

Fault Injection · Netflix · Reliability
0 likes · 26 min read
Efficient Ops
Apr 26, 2023 · Operations

Building a Chaos Engineering Platform for Financial Services: Key Lessons

This talk outlines the challenges of maintaining system stability in fast‑moving, cloud‑native financial services, describes a risk‑identification model, high‑fidelity fault simulation, and a comprehensive stability engineering platform, and shares future plans for automated, data‑driven risk mitigation.

Financial Services · Operations · SRE
0 likes · 15 min read
JD Tech
Mar 14, 2023 · Operations

Introduction to Chaos Engineering and Its Practical Exercise Workflow

This article offers a comprehensive overview of chaos engineering, explaining its definition, why it is needed, the value it brings, a detailed step‑by‑step practice workflow—including preparation, execution, recovery and review phases—typical drill scenarios, key assessment metrics, and risk‑control measures to improve system reliability and high‑availability.

Fault Injection · chaos engineering · risk management
0 likes · 11 min read
FunTester
Mar 13, 2023 · Operations

How Chaos Engineering Can Strengthen System Reliability: A Practical Guide

This article explains the origins and principles of chaos engineering, illustrates how fault‑injection scenarios expose system weaknesses, outlines step‑by‑step implementation—from tool selection and metric definition to execution and post‑mortem—and highlights its role in achieving high‑availability service level agreements.

DevOps · Distributed Systems · Fault Injection
0 likes · 10 min read
ByteDance SYS Tech
Feb 28, 2023 · Cloud Native

How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering

This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.

Fault Injection · Kubernetes · Microservices
0 likes · 17 min read
Tencent Cloud Developer
Jan 5, 2023 · Cloud Native

QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

Distributed Systems · Microservices · Observability
0 likes · 24 min read
dbaplus Community
Nov 28, 2022 · Operations

How Bilibili Guaranteed Seamless Live Streaming for the League of Legends S12 Finals

Bilibili’s S12 technical guarantee team coordinated dozens of engineering groups, performed resource estimation, built a shared resource pool, applied chaos engineering, high‑availability architecture, and systematic performance testing to ensure the League of Legends World Championship livestream remained stable and responsive under peak traffic.

Performance Testing · Resource Management · SRE
0 likes · 19 min read
Bilibili Tech
Nov 15, 2022 · Operations

Technical Assurance for Bilibili S12 Live Streaming Event: Architecture, Resource Management, and High Availability

To ensure “tea‑time” reliability for Bilibili’s 2022 S12 League of Legends championship, a cross‑functional technical‑assurance project introduced shared resource pools, CPUSET removal, multi‑instance HA architecture, adaptive throttling, chaos‑engineered fault injection, a new Golang gateway, extensive load testing, and coordinated on‑site duty, delivering uninterrupted live streaming without forced throttling.

SRE · chaos engineering · high availability
0 likes · 20 min read
Architects Research Society
Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability
0 likes · 18 min read
DevOps
Sep 1, 2022 · Operations

Designing Quantifiable Steady‑State Hypotheses to Reduce Chaos Engineering Experiment Costs

The article examines why chaos‑engineering experiments often seem to offer poor value for their cost, argues that unclear and unquantified steady‑state hypotheses hinder business value and automation, and proposes concrete, user‑centric, measurable hypotheses and equivalence‑class reasoning to streamline experiments and lower costs.

Cost reduction · DevOps · chaos engineering
0 likes · 9 min read
ByteDance Cloud Native
Aug 4, 2022 · Operations

Chaos Engineering Boosts Cloud‑Native Stability: Key Findings from China’s 2022 Survey

As cloud computing becomes essential infrastructure, cloud‑native systems gain flexibility but face stability challenges, prompting China’s Academy of Information and Communications Technology to launch a 2022 chaos engineering survey that uncovers vulnerabilities and promotes practical adoption of reliability techniques across the industry.

China · Cloud Native · chaos engineering
0 likes · 3 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve high‑availability, and build confidence in production by defining stable states, injecting realistic failures, and measuring impact through observability metrics, with practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Observability
0 likes · 11 min read
Efficient Ops
Jun 19, 2022 · Operations

How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons

This article details Bilibili's SRE approach to large‑scale live events, covering background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards (DCDN, SLB, WAF, PaaS, cache, and DB), pre‑plan capabilities, post‑mortem analysis, and future outlook. It illustrates how systematic capacity management and automated resilience practices enable stable operation for events with tens of millions of concurrent users.

Performance Testing · SRE · capacity planning
0 likes · 22 min read
Qunar Tech Salon
Jun 16, 2022 · Operations

Practical Chaos Engineering Practices at Qunar Travel: Architecture, Scenarios, and Automation

This article details Qunar Travel's mature chaos engineering platform built on chaosblade, covering value analysis, system architecture, shutdown and dependency drills, automated closed‑loop testing, attack‑defense exercises, and the measurable reliability improvements achieved across thousands of services.

Distributed Systems · Fault Injection · Operations
0 likes · 18 min read
Bilibili Tech
Jun 14, 2022 · Operations

SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili’s SRE team ensures flawless large‑scale online events by meticulously gathering activity details, provisioning DNS, CDN, networking and compute resources, conducting multi‑stage performance tests and chaos‑engineering drills, applying layered traffic controls, maintaining historical checklists, executing predefined contingency responses, and iterating post‑mortems to drive continuous automation and reliability.

Cloud Native · Event Reliability · Operations
0 likes · 20 min read
Bilibili Tech
May 13, 2022 · Cloud Native

Chaos Engineering Practices for Bilibili Distributed KV Storage

Peng Liangyou describes how Bilibili’s large‑scale distributed KV storage adopts Netflix‑style chaos engineering: defining steady‑state hypotheses, replicating production environments, injecting CPU, memory, network and replica faults via automated “monkey” experiments, and monitoring latency and durability with Prometheus/Grafana. Over 1.5 years, the practice has prevented critical incidents, cut testing costs, and enabled incremental, standards‑based reliability improvements.

Bilibili · Fault Injection · KV Store
0 likes · 15 min read
dbaplus Community
May 4, 2022 · Operations

How Tencent Game Teams Use Chaos Engineering to Boost Reliability and Reduce Outages

This article explains the concept of chaos engineering, its six key benefits, the design of a full‑lifecycle chaos platform, fault‑atom categories, experiment orchestration, risk control, automation, red‑blue war games, and practical experiments that helped Tencent Games improve system reliability while cutting operational costs.

DevOps · Gaming · Operations
0 likes · 21 min read
JD Retail Technology
Apr 27, 2022 · Industry Insights

How JD Achieves Seamless Stability During Massive Sales Events

The article reviews the Global Information System Stability Summit and JD's technical architect Li Junliang's detailed case study on the engineering practices, observability, chaos engineering, and resource‑scheduling innovations that enable JD’s e‑commerce platform to handle peak sales traffic that spikes to hundreds of times the normal load.

Observability · chaos engineering · e‑commerce
0 likes · 7 min read
Open Source Linux
Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Kubernetes · Observability · Operations
0 likes · 8 min read
DeWu Technology
Feb 28, 2022 · Operations

DeWu Tech Salon – Quality Assurance Sessions Summary

The DeWu Tech Salon, co‑hosted by DeWu App Quality Platform and TesterHome, brought senior engineers from Alibaba Cloud, ByteDance, Lagou and DeWu together to share practical QA insights on end‑side monitoring, traffic replay, full‑link stress testing, and industry‑scale chaos engineering, while announcing a PPT collection, a testing‑expert recruitment drive, and a preview of the next wireless‑technology salon.

Performance Monitoring · chaos engineering · software reliability
0 likes · 6 min read
Efficient Ops
Jan 24, 2022 · Operations

How Qunar Turned Chaos Engineering into Reliable Operations: A Deep Dive

This article explores Qunar's practical implementation of chaos engineering, detailing its value, the four strategic directions, shutdown and application drills, strong‑weak dependency handling, container support, and automated closed‑loop testing that together boost system resilience, process robustness, and user experience.

Automation · chaos engineering · site reliability
0 likes · 20 min read
AntTech
Jan 24, 2022 · Operations

Ant Group's Chaos Engineering System: Evolution, Business Features, Key Technologies, and Future Directions

This article outlines Ant Group's six‑year journey in chaos engineering, describing its three generational evolutions, business‑oriented fault injection, risk‑mining, full‑lifecycle coverage, massive scale, root‑data protection, core technologies such as Awatch, simulation environments, and plans for intelligent, open‑source future development.

Ant Group · Awatch · Fault Injection
0 likes · 23 min read
Alibaba Cloud Native
Jan 3, 2022 · Cloud Native

How to Build Minute‑Level Hybrid Cloud Disaster Recovery with MSHA Multi‑Active Architecture

This guide walks through a hybrid cloud disaster‑recovery demo for an e‑commerce platform, detailing business background, requirements, challenges, the MSHA‑based active‑active solution, step‑by‑step implementation, and verification of sub‑minute RPO/RTO using Alibaba Cloud services and chaos engineering.

Active-Active · Alibaba Cloud · MSHA
0 likes · 13 min read
Alibaba Cloud Native
Dec 30, 2021 · Operations

How to Implement Chaos Engineering for Cloud‑Native Applications: A Step‑by‑Step Guide

This article explains how cloud‑native teams can adopt chaos engineering—defining its concepts, outlining its unique characteristics, and detailing a four‑stage implementation process from manual drills to production‑level raids, with practical steps, environment setups, and real‑world results.

Cloud Native · Fault Injection · Kubernetes
0 likes · 14 min read
DevOps
Dec 27, 2021 · Operations

2021 China Chaos Engineering Survey Report: Findings and Recommendations

Based on 1,016 valid questionnaire responses and 17 enterprise interviews, the 2021 China Chaos Engineering Survey Report reveals low software system stability and limited adoption of chaos engineering, notes its positive impact on availability, and provides data‑driven recommendations for improving stability through mature tools, metrics, and cultural shifts.

Cloud Native · Operations · chaos engineering
0 likes · 15 min read
dbaplus Community
Dec 15, 2021 · Operations

How Chaos Engineering Guarantees Stability for Distributed Data Systems

This article examines the stability challenges of selecting distributed data products, introduces chaos‑engineering‑based testing methods, outlines practical test scenarios, fault injection techniques, toolchains, and quantitative analysis metrics, and presents a capability assessment standard for ensuring system reliability.

Data Platforms · Reliability · chaos engineering
0 likes · 11 min read
GrowingIO Tech Team
Dec 2, 2021 · Cloud Native

Mastering Chaos Mesh: A Hands‑On Guide to Cloud‑Native Chaos Engineering

Chaos Mesh is an open‑source cloud‑native chaos engineering platform that lets you experiment with fault injection across Kubernetes environments, offering visual dashboards, extensive fault types, and step‑by‑step installation and experiment creation guides to help teams uncover system weaknesses and improve resilience.

Chaos Mesh · Fault Injection · Kubernetes
0 likes · 12 min read
HelloTech
Sep 27, 2021 · Operations

Fault Drills and Chaos Engineering Practices for Enhancing System Stability

The initiative introduces fault‑drill and chaos‑engineering practices—defining steady‑state metrics, injecting real‑world failures in controlled experiments, automating continuous production tests, and limiting blast radius—to detect weaknesses early, accelerate fault location and recovery, boost emergency response metrics, and foster a resilient engineering culture.

Automation · Reliability · chaos engineering
0 likes · 11 min read
DevOps
Aug 31, 2021 · Backend Development

Designing an Uber‑Like Microservice System with DDD, OpenTelemetry Observability, and Reinforced Chaos Engineering

This article describes how to model a complex Uber‑style ride‑hailing system using Domain‑Driven Design, implement it with Java Spring Boot microservices, instrument it with OpenTelemetry for full observability, and validate the observability pipeline through a gamified chaos‑engineering approach that reduces MTTR.

DDD · Java · Microservices
0 likes · 13 min read
Code Ape Tech Column
Aug 28, 2021 · Backend Development

A Curated List of Alibaba Open‑Source Developer Tools for Backend Development

This article introduces a collection of Alibaba‑released open‑source tools—including Arthas, Cloud Toolkit, ChaosBlade, PTS, Druid, and more—detailing their usage scenarios, tutorials, and acquisition methods to help backend developers improve efficiency, debugging, monitoring, and reliability of their services.

Alibaba · Java · Performance Monitoring
0 likes · 14 min read
dbaplus Community
Aug 21, 2021 · Operations

How ByteDance Scales Chaos Engineering with Scenario‑Driven Proactive Experiments

This article explains ByteDance's journey from basic fault‑injection testing to a production‑grade, scenario‑driven proactive chaos engineering platform that automates experiments, defines stability metrics, controls blast radius, and continuously validates service dependencies to improve system resilience.

Microservices · Scenario Testing · chaos engineering
0 likes · 21 min read
TAL Education Technology
Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, safety hardening, chaos engineering, incident response, and post‑mortem practices.

Load Testing · SRE · capacity planning
0 likes · 17 min read
Alibaba Cloud Native
Aug 12, 2021 · Cloud Native

How ChaosBlade’s Unified Experiment Model Boosts Cloud‑Native Resilience

This article explains the design, model, and practical usage of Alibaba's open‑source ChaosBlade and its platform chaosblade‑box, detailing how a unified chaos experiment model enables scalable, multi‑environment fault injection for cloud‑native systems and improves high‑availability testing.

ChaosBlade · Cloud Native · Kubernetes
0 likes · 15 min read
DevOps
Aug 11, 2021 · Operations

Introduction to Chaos Engineering – Part 2: Four Steps for Disrupting Complex Systems

This article explains that chaos engineering is not a magic cure but a disciplined practice for testing distributed systems by designing and running controlled experiments, outlining four essential steps—observability, defining steady state, hypothesizing events, and executing experiments—to gain confidence in system resilience.

Observability · Operations · chaos engineering
0 likes · 11 min read
Alibaba Cloud Native
Aug 6, 2021 · Operations

Scaling Chaos Engineering at Qunar: Lessons from Thousands of Microservices

Qunar shares how it built a large‑scale chaos engineering platform for thousands of microservices, detailing tool selection, architecture, evolution stages, fault‑injection scenarios, strong/weak dependency automation, open‑source contributions, and future plans for automated random drills.

Cloud Native · Fault Injection · Operations
0 likes · 9 min read
DevOps
Jul 12, 2021 · Operations

The First Four Chaos Experiments to Run on Apache Kafka

This article explains how to use chaos engineering with Gremlin to design, execute, and analyze four experiments that test Kafka broker load, message loss, split‑brain scenarios, and ZooKeeper outages, helping improve the reliability and resilience of Kafka deployments.

Distributed Systems · Gremlin · Kafka
0 likes · 18 min read
Baidu Geek Talk
Jul 5, 2021 · Operations

Automated and Intelligent Analysis of Baidu Search Stability Issues

The team automated Baidu Search fault diagnosis by building a side‑index for instant log lookup, streaming incremental analysis, exhaustive rule templates, feature‑engineering pipelines, query‑scene reconstruction, entropy‑based ranking, per‑second timeline views, and chaos‑engineered fault injection, achieving near‑99% accuracy and second‑level, module‑granular stability tracing.

Observability · Search Stability · chaos engineering
0 likes · 15 min read
iQIYI Technical Product Team
Jul 2, 2021 · Cloud Native

Chaos Engineering Practices at iQIYI: Building Resilient Cloud‑Native Systems

iQIYI’s Little Deer Chaos Platform injects faults and runs red‑blue attacks across production services, enabling teams to validate alerts, circuit‑breakers, and fail‑over mechanisms—demonstrated by video playback and membership service case studies—thereby fostering zero‑trust design, faster skill growth, and resilient cloud‑native operations.

DevOps · Fault Injection · Reliability
0 likes · 10 min read
DevOps
Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection · Operations · Reliability
0 likes · 11 min read