Tagged articles
16 articles
Baidu Intelligent Cloud Tech Hub
Oct 29, 2025 · Operations

How to Prevent Avalanche Failures in Large‑Scale Microservice Systems

This article explains how Baidu's SRE team identified the root causes of avalanche failures in massive microservice architectures, modeled system limits with Little’s Law, and implemented engineering practices such as retry budgets, queue throttling, and global TTL controls to achieve self‑healing and eliminate avalanche incidents.

Microservices · SRE · avalanche failure
9 min read
Architect-Kip
Oct 28, 2025 · Operations

Mastering Failure Recovery: Fast‑Fail, Auto‑Retry, and Resilience Patterns for Distributed Systems

This guide outlines core principles and practical solutions for building resilient backend systems, covering fast‑failure handling, automatic retries with exponential back‑off, circuit‑breaker usage, idempotency, batch job strategies, online transaction patterns, and robust message‑queue processing.

Batch Processing · Idempotency · Message Queue
17 min read
FunTester
Jul 2, 2025 · Operations

How Leading Chinese Companies Harness Chaos Engineering to Boost System Resilience

Chinese enterprises such as Alibaba, JD Cloud, and Xiaomi are increasingly adopting chaos engineering tools like ChaosBlade and Chaos Mesh to simulate failures in production-like environments, overcoming challenges of awareness, risk control, talent gaps, and platform integration, while AI and cloud‑native technologies drive smarter, automated resilience testing.

AI · Cloud Native · chaos engineering
3 min read
FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

Chaos engineering, a scientific method for improving system resilience, is explored through an overview of leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, alongside core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Distributed Systems · Fault Injection · Reliability
12 min read
JD Tech
Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Distributed Systems · Fault Injection
26 min read
FunTester
Mar 14, 2025 · Operations

Fault Testing: Enhancing System Resilience through Controlled Failure Simulations

The article explains how fault testing, which deliberately injects failures in a controlled environment, helps identify system weaknesses, validates post‑mortem improvements, and drives architectural optimization, thereby improving the availability and resilience of modern internet services.

Operations · chaos engineering · fault testing
8 min read
FunTester
Mar 12, 2025 · Operations

Fault Injection Testing: Concepts, Scenarios, Process, and Best Practices

Fault injection testing deliberately introduces failures into a system to assess its resilience, helping identify weak points, improve retry and timeout mechanisms, and ensure robust operation across software, protocol, and infrastructure layers, with practical guidance on processes, tools, and Kubernetes-specific practices.

Fault Injection · Kubernetes · Operations
8 min read
FunTester
Jan 27, 2025 · Operations

Mastering Chaos Engineering: Build Resilient Systems with Proven Practices

This article explains chaos engineering concepts, step‑by‑step experimental methods, and best‑practice guidelines, and compares leading fault‑injection tools, to help organizations running always‑on services proactively strengthen system resilience and reduce downtime risk.

Cloud Native · DevOps · Fault Injection
11 min read
FunTester
Sep 20, 2024 · Operations

Chaos Engineering vs Fault Testing: Methods, Challenges, and Future Trends

This article compares chaos engineering and fault testing, outlines fault injection techniques, implementation layers, testing strategies, challenges, and future trends such as automation, AI-driven diagnostics, and cloud‑native integration, providing a comprehensive guide for improving system resilience and reliability.

Cloud Native · Operations · chaos engineering
17 min read
Huolala Tech
Aug 22, 2023 · Operations

How HuoLala Built a Resilient Fault‑Drill Platform to Boost System Reliability

Facing growing microservice complexity, HuoLala designed a comprehensive fault‑drill system—covering management, tooling, and operations—to simulate failures, control blast radius, automate scenarios, and continuously improve resilience, ultimately reducing downtime and enhancing system stability across more than ten business units.

Fault Injection · Microservices · Operations
12 min read
dbaplus Community
May 21, 2023 · Operations

Mastering Rate Limiting, Degradation, and Circuit Breaking for Resilient Microservices

This article explains the concepts of rate limiting, degradation, and circuit breaking in microservice architectures, illustrating passive and active throttling strategies, practical examples of async conversion, various degradation techniques, and circuit‑breaker mechanisms with real‑world tools like Sentinel and Hystrix.

Circuit Breaking · Microservices · degradation
11 min read
Full-Stack Internet Architecture
Sep 14, 2021 · Operations

Understanding Rate Limiting, Degradation, and Circuit Breaking in Distributed Systems

This article explains the concepts of rate limiting, service degradation, and circuit breaking, illustrating passive and active throttling strategies, asynchronous processing, and practical examples such as Alibaba Sentinel, token‑based controls, and Hystrix, to help engineers design resilient, high‑availability systems.

Circuit Breaking · rate limiting · service degradation
11 min read
iQIYI Technical Product Team
Sep 11, 2020 · Cloud Native

Chaos Engineering Framework and Practices in iQIYI FinTech Team

The iQIYI FinTech team implemented a Chaos Engineering framework, using a purpose‑driven Chaos Monkey to inject controlled failures, validate high‑availability, isolation, and self‑healing of payment services, derive architectural improvements, build a fault‑case library, and transition from fault detection to proactive system robustness.

Chaos Monkey · Distributed Systems · FinTech
9 min read
Programmer DD
Mar 23, 2020 · Operations

Mastering Chaos Engineering: Boost Confidence in Distributed Systems

This article explains chaos engineering as a systematic approach to experiment on distributed systems, identifies common failure modes, outlines a four‑step experimentation process, and presents advanced principles to help teams increase reliability and confidence in production environments.

Distributed Systems · Reliability · chaos engineering
7 min read
Alibaba Cloud Native
May 23, 2019 · Operations

Why Chaos Engineering Is Essential for Modern Distributed Systems

This article explains the meaning, benefits, and practical implementation of chaos engineering, compares it with traditional fault injection, discusses when it’s needed, and details Alibaba’s multi‑year experience and its open‑source ChaosBlade tool for building resilient cloud‑native systems.

ChaosBlade · Cloud Native · system resilience
12 min read
Youzan Coder
Jun 22, 2018 · Operations

Chaos Engineering: Definition, Principles, and Implementation Steps

Chaos engineering is a disciplined practice that injects controlled faults into distributed systems—often in production—to validate steady-state hypotheses, uncover hidden reliability weaknesses, and continuously improve resilience, as illustrated by the staged implementations and fault-injection techniques used by companies such as JD.com, Youzan, and Netflix.

Fault Injection · Reliability · chaos engineering
11 min read