Tagged articles
139 articles
Efficient Ops
Jun 1, 2021 · Operations

Mastering System Stability: Building a Chaos‑Driven Platform for Financial Ops

This article details how a major securities firm analyzed business stability, built a comprehensive stability engineering platform using chaos engineering, and practiced extensive fault‑injection drills; it also outlines future directions such as random‑scenario exercises, red‑blue battles, and AI‑driven risk detection.

Operations · chaos engineering · financial systems
11 min read

Volcano Engine Developer Services
May 24, 2021 · Operations

How ByteDance Scales High Availability with Chaos Engineering: From Platform 1.0 to 2.0

This article details ByteDance's evolution of chaos engineering platforms and high‑availability practices, covering service types, architectural upgrades, fault‑center design, blast‑radius control, steady‑state algorithms, automated experiments, and future plans for resilient infrastructure.

Kubernetes · automation · chaos engineering
17 min read

DevOps
May 17, 2021 · Cloud Native

Challenges of Testing Cloud‑Native Applications and the Need for New Approaches

Amid accelerating Agile and DevOps adoption, the rapid delivery of cloud‑native microservices introduces cascading risks and makes traditional monolithic testing inadequate, prompting a shift toward observability‑driven “shift‑right” testing, exploratory methods, and chaos engineering to embrace failure as the new normal.

DevOps · Microservices · chaos engineering
8 min read

DevOps
May 6, 2021 · Cloud Native

Testing Strategies for Cloud‑Native Applications

The article explains how traditional testing falls short for cloud‑native, microservice‑based applications and outlines modern strategies—including unit, integration, contract, non‑functional, chaos engineering, and observability techniques—to ensure quality, resilience, and rapid delivery in dynamic cloud environments.

Microservices · chaos engineering · cloud-native
11 min read

Alibaba Cloud Native
May 4, 2021 · Cloud Native

Exploring ChaosBlade: Alibaba’s Open‑Source Chaos Engineering Platform for Cloud‑Native Environments

ChaosBlade, Alibaba’s open‑source chaos engineering project now advancing through CNCF Sandbox, offers a comprehensive suite—including the chaosblade experiment tool and chaosblade‑box platform—to simulate over 200 scenarios across hosts, Kubernetes, and multi‑language applications, with automated deployment, extensible architecture, and enterprise adoption examples.

CNCF · Cloud Native · Kubernetes
6 min read

Top Architect
Mar 26, 2021 · Operations

Top Open‑Source Projects for SREs and DevOps

This article presents a curated list of popular open‑source tools for monitoring, deployment, chaos testing, and reliability engineering, explaining their main features and how they help SREs and DevOps engineers build scalable, highly available cloud‑native systems.

Cloud Native · DevOps · SRE
10 min read

Alibaba Cloud Native
Mar 25, 2021 · Cloud Native

How ChaosBlade‑Box Empowers Cloud‑Native High Availability with Chaos Engineering

The article introduces ChaosBlade‑Box, an open‑source cloud‑native chaos‑engineering console that builds on Alibaba’s ChaosBlade tool, explains the high‑availability challenges of cloud‑native systems, details the platform’s design, features, multi‑language support, deployment workflow, example experiments, and future roadmap for resilient architectures.

Cloud Native · Kubernetes · chaos engineering
12 min read

Baidu Geek Talk
Feb 8, 2021 · Cloud Native

Baidu Testing Middleware: Architecture, Design Principles, and Application Scenarios

Baidu Testing Middleware is an Envoy‑based sidecar proxy that combines a data‑plane and control‑plane to intercept, inspect, modify, and route traffic, providing recording, replay, fault injection and rate‑limiting capabilities that support functional, system, integration, sandbox and chaos testing at massive scale.

Baidu · Control Plane · Data Plane
20 min read

Baidu Intelligent Testing
Jan 20, 2021 · Operations

Baidu Test Middleware: Architecture, Design Principles, and Application Scenarios

The article introduces Baidu's self‑developed test middleware, explains its data‑plane and control‑plane architecture inspired by Istio, describes the challenges of building test environments, and details the system’s components, working modes, and diverse testing use cases such as functional, system, integration, sandbox, and chaos engineering.

Performance Testing · Service Mesh · chaos engineering
20 min read

Alibaba Terminal Technology
Jan 7, 2021 · Frontend Development

How Frontend Chaos Engineering Boosts Reliability: Lessons from Alibaba

This article explores the challenges of frontend availability, introduces chaos engineering concepts, and details Alibaba's practical approach to frontend fault injection—including static resource hijacking, a safe isolated environment, monitoring integration, and a real‑world drill that demonstrates how to measure and improve detection and response capabilities.

Fault Injection · Reliability · chaos engineering
18 min read

Tencent Cloud Developer
Dec 25, 2020 · Operations

Tencent Cloud Network Operations Platform: Architecture, Chaos Engineering, Change Health Check, and Monitoring

Tencent Cloud’s network operations platform combines a layered underlay‑overlay architecture, rapid fault detection within seconds and recovery in minutes, chaos‑engineering experiments, rigorous change health checks, high‑frequency multi‑path monitoring, and plans for predictive self‑healing to ensure reliable service across millions of servers.

Network Monitoring · Tencent Cloud · change management
14 min read

Alibaba Cloud Native
Dec 21, 2020 · Operations

How to Build Multi‑Site High Availability with AHAS‑MSHA: Real‑World E‑Commerce Cases

This article explains the challenges of achieving high availability in unreliable environments, introduces disaster‑tolerance concepts and RPO/RTO metrics, describes Alibaba Cloud's AHAS‑MSHA multi‑site solution and its key features, and walks through two e‑commerce case studies that demonstrate implementation steps, fault‑injection drills, and recovery verification.

AHAS · MSHA · Multi‑Site
14 min read

DevOps
Oct 20, 2020 · Cloud Computing

Chaos Monkey and the Simian Army: Building Resilient Cloud Systems

The article explains how Netflix uses Chaos Monkey and a suite of related tools, collectively called the Simian Army, to deliberately inject failures into its cloud infrastructure, continuously test fault tolerance, and ensure high availability and reliability for its streaming service.

Netflix · Operations · Simian Army
7 min read

Alibaba Cloud Native
Sep 21, 2020 · Operations

Why Chaos Engineering Is Essential for Cloud‑Native High Availability

This article explains the need for chaos engineering in modern distributed and cloud‑native systems, outlines the challenges faced by architects, developers, testers and product teams, and provides step‑by‑step guidance on using ChaosBlade and Alibaba's AHAS platform for effective fault‑injection experiments.

Cloud Native · Operations · chaos engineering
9 min read

iQIYI Technical Product Team
Sep 11, 2020 · Cloud Native

Chaos Engineering Framework and Practices in iQIYI FinTech Team

The iQIYI FinTech team implemented a Chaos Engineering framework, using a purpose‑driven Chaos Monkey to inject controlled failures, validate high‑availability, isolation, and self‑healing of payment services, derive architectural improvements, build a fault‑case library, and transition from fault detection to proactive system robustness.

Chaos Monkey · Distributed Systems · FinTech
9 min read

Xueersi Online School Tech Team
Aug 28, 2020 · Cloud Native

Understanding Cloud Native: Service Mesh, Chaos Engineering, and User‑Space Container Networking with eBPF/XDP

This article explains the fundamentals of cloud native computing, introduces service mesh architectures such as Istio and Envoy, explores chaos engineering with Chaos Mesh, and details how eBPF/XDP‑based user‑space container networking can accelerate data‑plane performance in modern microservice environments.

Cloud Native · Envoy · Istio
12 min read

DevOps
Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Distributed Systems · Fault Injection · Reliability
21 min read

DataFunTalk
Apr 27, 2020 · Operations

ByteDance’s Chaos Engineering Practice and Platform Evolution

This article describes ByteDance’s multi‑generation chaos engineering practice, covering industry background, fault‑injection models, the design of a declarative fault‑center, experiment selection principles, detailed experiment processes, metric classifications, red‑blue war‑game workflows, strong/weak dependency analysis, and future directions for infrastructure‑level chaos engineering.

Fault Injection · Reliability · chaos engineering
21 min read

Programmer DD
Mar 23, 2020 · Operations

Mastering Chaos Engineering: Boost Confidence in Distributed Systems

This article explains chaos engineering as a systematic approach to experiment on distributed systems, identifies common failure modes, outlines a four‑step experimentation process, and presents advanced principles to help teams increase reliability and confidence in production environments.

Distributed Systems · Reliability · chaos engineering
7 min read

Alibaba Cloud Developer
Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud · capacity planning · chaos engineering
12 min read

Ctrip Technology
Nov 14, 2019 · Operations

Chaos Engineering: Principles, Practices, and Lessons from Ctrip

The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.

Distributed Systems · Fault Injection · Operations
9 min read

Alibaba Cloud Native
Nov 5, 2019 · Cloud Native

Master Cloud‑Native Chaos Testing with Alibaba’s ChaosBlade: A Hands‑On Guide

This article introduces Alibaba's open‑source ChaosBlade tool, explains its experiment model and supported scenarios, shows how to install the ChaosBlade Operator on Kubernetes, and provides step‑by‑step instructions for creating, modifying, and cleaning up cloud‑native chaos experiments using both YAML resources and the blade CLI.

ChaosBlade · DevOps · Kubernetes
10 min read

DevOps
Sep 16, 2019 · Operations

Netflix Chaos Engineering: Background, Evolution, Tools, Principles, and Practice

This article presents a comprehensive overview of Netflix's chaos engineering journey, detailing its origins, the development of the Simian Army tools, core principles, practical steps, and applications in Kubernetes environments, offering valuable insights for reliable DevOps practices.

DevOps · Kubernetes · Netflix
10 min read

Programmer DD
Aug 4, 2019 · Operations

Simulating CPU and I/O Failures with Bash Scripts for Chaos Engineering

This article demonstrates how to create Bash scripts that fully saturate CPU and I/O resources, explains their role in fault injection within the Simian Army framework, and introduces the broader concepts and benefits of chaos engineering for building resilient distributed systems.

Distributed Systems · Fault Injection · bash scripts
9 min read

High Availability Architecture
Jul 5, 2019 · Operations

Practices of Chaos Engineering in Distributed Service Architecture

This article presents a comprehensive overview of chaos engineering, covering its definition, value, principles, implementation steps, enterprise adoption strategies, the open‑source ChaosBlade tool and AHAS Chaos platform, and two detailed case studies demonstrating fault injection experiments in a distributed service environment.

AHAS · Alibaba · Fault Injection
15 min read

Youzan Coder
Jun 21, 2019 · Operations

YouZan Middleware Testing Team: Quality Assurance System and Testing Efficiency Practices

YouZan’s middleware testing team, divided into six specialized groups, employs a shift‑left quality assurance system spanning requirements, development, testing, release, and go‑live phases—leveraging over 10,000 automated test cases, a bus‑based release framework, chaos engineering, continuous delivery, and comprehensive monitoring tools to ensure resilient, high‑quality services.

Continuous Delivery · Middleware Testing · chaos engineering
10 min read

Java Captain
May 14, 2019 · Backend Development

A Curated List of Alibaba Open‑Source Developer Tools for Backend Engineers

This article introduces a curated selection of Alibaba’s open‑source and cloud‑based developer tools—including Arthas, Cloud Toolkit, ChaosBlade, ARMS, Docsite, Freeline, EasyExcel, Druid, Dragonwell and more—detailing their use cases, tutorials, and acquisition methods to help developers improve efficiency and code quality.

Alibaba · Java · Performance Monitoring
14 min read

Alibaba Cloud Developer
Mar 28, 2019 · Operations

How ChaosBlade Empowers You to Build Resilient Cloud‑Native Systems

ChaosBlade is an open‑source chaos engineering tool from Alibaba that lets you repeatedly inject failures into distributed systems, helping you measure fault tolerance, validate orchestration, test platform robustness, verify monitoring alerts, and improve emergency response capabilities for more reliable cloud‑native applications.

DevOps · Distributed Systems · Resilience
9 min read

Alibaba Cloud Native
Mar 12, 2019 · Cloud Native

Injecting Real-World Failures into Ali Kubernetes with Open‑Source Monkey Tools

This article explains how chaos engineering principles are applied to Ali Kubernetes by reviewing open‑source Kubernetes monkey tools, analyzing complex failure scenarios, and presenting a custom fault‑injection suite built on the internal MonkeyKing platform to enable flexible, scenario‑driven chaos experiments.

Fault Injection · Kubernetes · Monkey Tools
10 min read

Java Backend Technology
Mar 2, 2019 · Operations

How Alibaba’s ‘MonkeyKing’ Uses Chaos Engineering to Strengthen System Reliability

Alibaba’s MonkeyKing, inspired by Netflix’s Chaos Monkey, employs intentional fault injection—from random node kills to simulated network outages—to test and improve system robustness across IaaS, PaaS, and SaaS layers, offering a comprehensive model for reliability engineering in complex distributed environments.

Alibaba · Distributed Systems · Fault Injection
8 min read

Youzan Coder
Jun 22, 2018 · Operations

Chaos Engineering: Definition, Principles, and Implementation Steps

Chaos engineering is a disciplined practice that injects controlled faults into distributed systems—often in production—to validate steady-state hypotheses, uncover hidden reliability weaknesses, and continuously improve resilience, as illustrated by the staged implementations and fault-injection techniques used by companies such as JD.com, Youzan, and Netflix.

Fault Injection · Reliability · chaos engineering
11 min read

DevOps
May 13, 2018 · Operations

Anti‑Fragility in Software Development: Insights from Jez Humble, Phoenix Server, GameDays and Organizational Culture

The article explores how anti‑fragility principles—drawn from Nassim Taleb’s theory, Jez Humble’s training, Phoenix Server, Netflix’s Chaos Monkey, Amazon GameDays, and Etsy’s blameless post‑mortems—can be applied to software engineering to turn system failures into opportunities for growth and stronger organizational culture.

anti-fragility · chaos engineering · game day
11 min read

DevOpsClub
May 11, 2018 · Operations

How Anti‑Fragility and GameDays Turn System Failures into Growth

This article explores anti‑fragility theory and real‑world DevOps practices such as Phoenix Server, Chaos Monkey, GameDays, and blameless post‑mortems, showing how organizations can transform inevitable failures into opportunities for resilience and continuous improvement.

Blameless Culture · Operations · anti-fragility
11 min read

Architecture Digest
Jul 16, 2017 · Operations

Fault Governance in Distributed Systems: Dependency Failures, Strong/Weak Dependency, and Fault‑Injection Practices

This article presents a comprehensive overview of fault governance in large‑scale distributed systems, covering classic dependency failures, the concept of strong and weak dependencies, experimental observations, the evolution of fault‑injection techniques, and best practices for building reliable fault‑drill platforms.

Distributed Systems · Operations · chaos engineering
20 min read

Alibaba Cloud Developer
May 12, 2017 · Operations

How Alibaba Engineers Fault Governance and Chaos Engineering for E‑commerce

This article recounts Alibaba's middleware team's QCon Beijing 2017 presentation on fault governance and fault‑drill practices, covering distributed‑system dependency failures, strong/weak dependency concepts, multi‑stage technical evolution, and the design of their chaos‑engineering platform for large‑scale e‑commerce.

Alibaba · Operations · chaos engineering
21 min read

dbaplus Community
Aug 5, 2016 · Backend Development

How to Build, Test, and Scale Microservices – From Service Discovery to Chaos Engineering

This article walks through designing a microservice application with Go, covering service registration and discovery, various testing strategies—including unit, integration, contract, and chaos testing—illustrates typical architectural patterns, and shares Qiniu's real‑world microservice implementation and practical Q&A.

Go · chaos engineering · service discovery
21 min read