
How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering

This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.

ByteDance SYS Tech

Chaos Engineering Introduction

This section presents ByteDance’s chaos engineering practice in cloud‑native scenarios, covering its background, development history, and the Application Resilience Enhancement Service (ARES) product.

What Is Chaos Engineering

Chaos engineering is a methodology that conducts experiments on system infrastructure to proactively discover fragile points, enabling developers to fix issues and build resilient architectures.

Chaos Engineering vs. Traditional Testing

Traditional testing (unit, integration, system) covers application‑level concerns but cannot address complex fault scenarios such as network latency or server failures. Chaos engineering complements this by injecting faults to reveal hidden risks.

Chaos Engineering vs. Fault Injection

Fault Injection: Validates predefined faults one by one, and is limited by the need to enumerate every scenario in advance.

Chaos Engineering: An exploratory approach that discovers unknown failure modes by deliberately disrupting services.

ByteDance’s Chaos Engineering Journey

Disaster Recovery Platform (2016): Built an internal platform for fault injection and basic metric analysis.

Chaos Engineering Platform 2.0 (2019): Simplified fault injection, added extensible fault models, automated metric analysis, and supported strong/weak dependency analysis.

Cloud‑Native Product, ARES: Delivered a to‑business (ToB) high‑availability service for cloud‑native environments.

Application Resilience Enhancement Service (ARES)

ARES follows chaos engineering principles, offering rich fault scenarios to improve fault tolerance and recoverability of distributed systems.

Exercise Workflow

1. Prepare the experiment (plan, goals, scenarios).

2. Orchestrate the experiment (services, tasks, schedule).

3. Start the experiment.

4. Execute fault injection and collect metrics.

5. Analyze the results.

6. Optimize the system to achieve resilience.
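The six steps above can be sketched as a minimal driver loop. This is an illustration only: the `Experiment` and `run_exercise` names, and the single-threshold analysis, are hypothetical simplifications, not ARES’s actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Experiment:
    goal: str
    scenarios: List[str]                      # ordered fault-injection steps
    results: Dict[str, float] = field(default_factory=dict)

def run_exercise(exp: Experiment, inject: Callable[[str], float],
                 threshold: float) -> List[str]:
    # Steps 1-2: prepare and orchestrate (here: run scenarios in declared order).
    # Steps 3-4: start the experiment, inject each fault, and collect one metric.
    for fault in exp.scenarios:
        exp.results[fault] = inject(fault)
    # Step 5: analyze results against the goal's threshold.
    breaches = [f for f, v in exp.results.items() if v > threshold]
    # Step 6: an empty list means resilient; otherwise these need optimizing.
    return breaches

# Usage with a simulated injector (p99 latency per fault, in seconds):
sim = {"net-delay": 0.2, "pod-kill": 0.9}
exp = Experiment("p99 latency under 0.5s", ["net-delay", "pod-kill"])
to_fix = run_exercise(exp, lambda f: sim[f], threshold=0.5)  # ["pod-kill"]
```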

Core Features

Key capabilities include: experiment configuration and workflow orchestration; fault observation, monitoring, and reporting; high‑availability drills and a personal workbench; multi‑cluster execution and multi‑scenario support; WebShell management; extensive fault types with flexible scope selection; process orchestration; causal analysis using Bayesian time‑series models; observability integration and steady‑state hypotheses; permission management; fault plugins; and dashboards.

Multi‑Cluster and Multi‑Scenario Support

Supports simultaneous experiments across multiple Kubernetes clusters, as well as on physical and virtual machines, with precise control of the fault blast radius through target‑selection modes (all, random, fixed count, percentage).
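The four selection modes can be sketched as a single target-picking function. The function and mode names here are hypothetical illustrations of the blast-radius controls described above, not ARES’s API:

```python
import math
import random

def select_targets(pods, mode, value=None, seed=None):
    """Pick fault-injection targets to bound the blast radius.
    Modes mirror the options above: all / random / fixed count / percentage."""
    rng = random.Random(seed)  # seeded so a drill is reproducible
    if mode == "all":
        return list(pods)
    if mode == "random":
        return [rng.choice(pods)]         # a single randomly chosen target
    if mode == "fixed":
        return rng.sample(pods, min(value, len(pods)))
    if mode == "percent":
        k = max(1, math.ceil(len(pods) * value / 100))
        return rng.sample(pods, k)
    raise ValueError(f"unknown selection mode: {mode}")

# e.g. select_targets(["p1", "p2", "p3", "p4"], "percent", 50) picks 2 pods
```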

WebShell Management

Provides a web‑based shell to access pods or nodes, view logs, and verify fault effectiveness.

Fault Types

Supports network, pod, system, host, DNS, Kubernetes, process, API, Java, Python, Golang, middleware, and custom faults, with continuous expansion.
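Support for custom faults suggests a plugin-style design, where each fault type exposes inject and recover hooks. The registry below is a hypothetical sketch of that pattern (ARES’s plugin interface is not public); the `tc`/netem command shown is a standard Linux way to add network delay.

```python
# Hypothetical plugin registry: each fault class registers under a name
# and exposes inject()/recover() hooks that return the action to perform.
FAULTS = {}

def fault(name):
    def register(cls):
        FAULTS[name] = cls
        return cls
    return register

@fault("network-delay")
class NetworkDelay:
    def __init__(self, ms):
        self.ms = ms
    def inject(self):
        # Standard Linux netem command for added latency (illustrative).
        return f"tc qdisc add dev eth0 root netem delay {self.ms}ms"
    def recover(self):
        return "tc qdisc del dev eth0 root netem"

@fault("pod-kill")
class PodKill:
    def __init__(self, pod):
        self.pod = pod
    def inject(self):
        return f"delete pod {self.pod}"
    def recover(self):
        return "pod is rescheduled by its controller"
```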

Observability and Causal Analysis

Faults are verified through dedicated metrics; AI‑driven causal inference compares observed data with predicted counterfactuals to determine fault impact.
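The counterfactual comparison can be illustrated in miniature: given the metric observed during the experiment and the model’s prediction of what the metric would have been with no fault, the impact is the deviation between the two. In practice a Bayesian time-series model produces the prediction; here it is supplied directly, and the function name and tolerance are hypothetical.

```python
def fault_impact(observed, predicted, tolerance=0.05):
    """Estimate fault impact by comparing observed metric samples with a
    counterfactual prediction of the no-fault world. Returns the mean
    relative deviation and whether it exceeds the tolerance."""
    diffs = [(o - p) / p for o, p in zip(observed, predicted)]
    mean_dev = sum(diffs) / len(diffs)
    return mean_dev, abs(mean_dev) > tolerance

# e.g. observed latency rose ~25% over the counterfactual => fault had impact
dev, significant = fault_impact([120, 130], [100, 100])
```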

Steady‑State Hypothesis

Defines measurable steady‑state metrics (e.g., QPS, CPU) and uses operators to evaluate deviations during experiments.
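A steady-state check of this kind reduces to aggregating metric samples and applying a comparison operator against a threshold. The sketch below assumes mean aggregation and a small operator table; these specifics are illustrative, not ARES’s configuration schema.

```python
import statistics

# Comparison operators a hypothesis may use against its threshold.
OPERATORS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
}

def steady_state_ok(samples, op, threshold):
    """Evaluate a steady-state hypothesis: does the aggregated metric
    (mean, here) still satisfy `op threshold` during the experiment?"""
    return OPERATORS[op](statistics.mean(samples), threshold)

# e.g. hypothesis "QPS stays >= 900" while faults are injected:
# steady_state_ok([950, 980, 1010], ">=", 900) is True
```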

Permission Management

Implements role‑based access control for resources, permissions, and roles to isolate experiments across accounts.
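The RBAC model can be sketched as roles mapping to permission sets, with a user allowed to act if any assigned role grants the permission. Role and permission names below are invented for illustration:

```python
# Hypothetical role -> permission mapping for experiment isolation.
ROLES = {
    "drill-operator": {"experiment:run", "experiment:read"},
    "viewer": {"experiment:read"},
}

def allowed(user_roles, permission):
    """A user may perform an action if any of their roles grants it."""
    return any(permission in ROLES.get(r, set()) for r in user_roles)

# e.g. a viewer can read experiments but cannot run them
```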

Future Outlook

Plans include enhanced observability with AI‑driven topology awareness, intelligent resilience analysis, deeper eBPF‑based fault injection, and broader fault capabilities at kernel and hardware levels.

Tags: cloud native, microservices, observability, Kubernetes, chaos engineering, resilience, fault injection
Written by

ByteDance SYS Tech

Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.
