How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering
This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.
Chaos Engineering Introduction
This section introduces chaos engineering as ByteDance practices it in cloud‑native scenarios: what it is, how it differs from traditional testing and fault injection, how the internal platforms developed, and how the Application Resilience Enhancement Service (ARES) product works.
What Is Chaos Engineering
Chaos engineering is a methodology that conducts experiments on system infrastructure to proactively discover fragile points, enabling developers to fix issues and build resilient architectures.
Chaos Engineering vs. Traditional Testing
Traditional testing (unit, integration, system) covers application‑level concerns but cannot address complex fault scenarios such as network latency or server failures. Chaos engineering complements this by injecting faults to reveal hidden risks.
Chaos Engineering vs. Fault Injection
Fault Injection: Validates predefined faults one by one, so coverage is limited to the faults that someone has enumerated in advance.
Chaos Engineering: An exploratory approach that discovers unknown failures by deliberately disrupting services.
ByteDance’s Chaos Engineering Journey
Disaster Recovery Platform (2016): Built an internal platform for fault injection and basic metric analysis.
Chaos Engineering Platform 2.0 (2019): Simplified fault injection, added extensible fault models, automated metric analysis, and supported strong/weak dependency analysis.
Cloud‑Native Product, ARES: Delivered a business‑facing (ToB) high‑availability service for cloud‑native environments.
Application Resilience Enhancement Service (ARES)
ARES follows chaos engineering principles, offering rich fault scenarios to improve fault tolerance and recoverability of distributed systems.
Exercise Workflow
Prepare experiment (plan, goals, scenarios).
Orchestrate experiment (services, tasks, schedule).
Start experiment.
Execute fault injection and collect metrics.
Analyze results.
Optimize system to achieve resilience.
Core Features
Key capabilities include experiment configuration, workflow orchestration, fault observation, reporting, high‑availability drills, personal workbench, multi‑cluster execution, multi‑scenario support, WebShell management, extensive fault types, flexible scope selection, process orchestration, fault monitoring, causal analysis using Bayesian time‑series models, observability integration, steady‑state hypothesis, permission management, fault plugins, and dashboards.
Multi‑Cluster and Multi‑Scenario Support
ARES can run experiments simultaneously across multiple Kubernetes clusters, as well as on physical or virtual machines. The blast radius of each fault is controlled precisely through the target selection mode: all, random, fixed count, or percentage.
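The four selection modes can be illustrated with a short sketch. This is not ARES code: the function name `select_targets`, the mode strings, and the choice to read "random" as "one randomly chosen target" are all assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def select_targets(
    candidates: Sequence[str],
    mode: str,
    value: Optional[float] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick fault targets from a candidate list (hypothetical helper, not an ARES API).

    mode "all":     every candidate
    mode "random":  one candidate chosen at random (assumed meaning)
    mode "count":   a fixed number of candidates, capped at the pool size
    mode "percent": a share of the candidates, rounded up
    """
    rng = random.Random(seed)  # a seed makes a drill reproducible
    pool = list(candidates)
    if mode == "all":
        return pool
    if mode == "random":
        return [rng.choice(pool)] if pool else []
    if mode == "count":
        if value is None or value < 0:
            raise ValueError("count mode needs a non-negative value")
        return rng.sample(pool, min(int(value), len(pool)))
    if mode == "percent":
        if value is None or not 0 <= value <= 100:
            raise ValueError("percent mode needs a value in [0, 100]")
        return rng.sample(pool, math.ceil(len(pool) * value / 100))
    raise ValueError(f"unknown mode: {mode}")
```

For example, `select_targets(pods, "percent", 50)` on ten pods returns five of them, which keeps half of the replicas untouched while the fault runs.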
WebShell Management
Provides a web‑based shell to access pods or nodes, view logs, and verify fault effectiveness.
Fault Types
Supports network, pod, system, host, DNS, Kubernetes, process, API, Java, Python, Golang, middleware, and custom faults, with continuous expansion.
Observability and Causal Analysis
Faults are verified through dedicated metrics; AI‑driven causal inference compares observed data with predicted counterfactuals to determine fault impact.
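The idea of comparing observed data with a predicted counterfactual can be sketched in a few lines. The example below deliberately swaps in a naive stand-in for the Bayesian time-series model: it assumes the metric would have stayed at its pre-fault mean, and it flags an impact with a rough three-sigma heuristic. The function name `estimate_impact` and the threshold are assumptions for illustration only.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Union


def estimate_impact(
    pre: Sequence[float], during: Sequence[float]
) -> Dict[str, Union[float, bool]]:
    """Compare a metric during the fault with a naive counterfactual.

    Counterfactual (assumption): without the fault, the metric would have
    stayed at its pre-fault mean. A real platform would forecast it with a
    time-series model instead.
    """
    if len(pre) < 2 or not during:
        raise ValueError("need at least two pre-fault samples and one fault sample")
    baseline = mean(pre)              # predicted value had no fault been injected
    noise = stdev(pre)                # normal fluctuation before the fault
    effect = mean(during) - baseline  # observed minus counterfactual
    return {
        "baseline": baseline,
        "effect": effect,
        # Rough heuristic: the shift is larger than three pre-fault standard deviations.
        "impacted": abs(effect) > 3 * noise,
    }
```

With latency samples of roughly 100 ms before the fault and roughly 145 ms during it, the estimated effect is about +45 ms and the fault is flagged as having an impact.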
Steady‑State Hypothesis
Defines measurable steady‑state metrics (e.g., QPS, CPU) and uses operators to evaluate deviations during experiments.
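A steady-state hypothesis of this kind reduces to a metric name, a comparison operator, and a threshold. The sketch below shows one possible shape for such a check; the class `SteadyStateCheck`, the function `evaluate`, and the metric names are hypothetical and not taken from ARES.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Mapping

# Comparison operators a check may use.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class SteadyStateCheck:
    """One steady-state condition, e.g. qps >= 900 (hypothetical structure)."""

    metric: str
    op: str
    threshold: float

    def holds(self, observed: float) -> bool:
        # True while the observed value still satisfies the hypothesis.
        return OPERATORS[self.op](observed, self.threshold)


def evaluate(
    checks: Iterable[SteadyStateCheck], samples: Mapping[str, float]
) -> Dict[str, bool]:
    """Map each metric to whether its steady-state condition holds."""
    return {c.metric: c.holds(samples[c.metric]) for c in checks}
```

Evaluating `qps >= 900` and `cpu_percent < 80` against a sample of 950 QPS and 85% CPU reports that the QPS condition holds and the CPU condition is violated, which is the deviation an experiment would surface.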
Permission Management
Implements role‑based access control for resources, permissions, and roles to isolate experiments across accounts.
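A minimal sketch of role-based access control with per-account isolation follows. It assumes that permissions are (resource, action) pairs attached to roles, and that roles are assigned to a user within one account; the class `AccessControl` and all names in it are illustrative, not the ARES data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Permission = Tuple[str, str]  # (resource, action), e.g. ("experiment", "run")


@dataclass
class AccessControl:
    """Roles hold permissions; users hold roles per account (hypothetical model)."""

    role_permissions: Dict[str, Set[Permission]] = field(default_factory=dict)
    # Keyed by (account, user) so an assignment in one account never leaks into another.
    user_roles: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def grant(self, role: str, resource: str, action: str) -> None:
        self.role_permissions.setdefault(role, set()).add((resource, action))

    def assign(self, account: str, user: str, role: str) -> None:
        self.user_roles.setdefault((account, user), set()).add(role)

    def is_allowed(self, account: str, user: str, resource: str, action: str) -> bool:
        # Allowed if any role held in this account carries the permission.
        roles = self.user_roles.get((account, user), set())
        return any(
            (resource, action) in self.role_permissions.get(role, set())
            for role in roles
        )
```

Because the lookup is keyed by account as well as user, a user who may run experiments in one account gets no access to experiments in another unless a role is assigned there too.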
Future Outlook
Plans include enhanced observability with AI‑driven topology awareness, intelligent resilience analysis, deeper eBPF‑based fault injection, and broader fault capabilities at kernel and hardware levels.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.