
Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.


Slack, one of the fastest‑growing SaaS products, has built a large, complex service that requires rigorous reliability testing. To avoid relying on luck, the company adopted a scientific, hypothesis‑driven chaos engineering process that treats fault‑injection as a regular, safe practice.

The process rests on four guiding principles: make the development environment a confidence‑building space for fault‑tolerance testing; test all systems, not just new ones; ensure tests never impact real users; and conduct production experiments only under controlled conditions.

Since January 2018, Slack has followed a strict three‑step workflow: (1) identify possible failures, (2) verify that the service can tolerate those failures, and (3) deliberately inject the failures, first in development and then in production. This “Disasterpiece Theater” practice is a deliberately approachable first step into chaos engineering, distinct from the more extensive Netflix‑style practices.
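The three‑step loop can be sketched as a gated runner. This is a minimal illustration of the workflow's shape only; all function names here are hypothetical and are not Slack's actual tooling:

```python
# Hypothetical sketch of the three-step loop; not Slack's tooling.

def disasterpiece_theater(identify, verify, inject):
    """Run one exercise: identify a failure mode, verify tolerance
    in development, then inject the fault in production."""
    failure = identify()                  # 1. pick a plausible failure
    if not verify(failure, env="dev"):    # 2. confirm tolerance in dev first
        return "abort: dev did not tolerate the fault"
    return inject(failure, env="prod")    # 3. inject in production

result = disasterpiece_theater(
    identify=lambda: "kill one memcached node",
    verify=lambda failure, env: True,     # stand-in: dev behaved as expected
    inject=lambda failure, env: f"injected '{failure}' in {env}",
)
print(result)  # injected 'kill one memcached node' in prod
```

The point of the gate is that production injection is earned, never assumed: a failed development run short‑circuits the exercise entirely.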

Preparation for each exercise involves drafting a detailed, shared plan built around a clear hypothesis describing how the failure will propagate through upstream and downstream systems. The plan lists the exact commands to run, the affected EC2 instances, the logs, metrics, and alerts to watch, and a runbook for verification.
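A plan of this shape is naturally captured as structured data. The fields below mirror the elements listed above, but the schema itself is illustrative, not Slack's actual template, and the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class ExercisePlan:
    """Illustrative schema for a chaos-exercise plan (not Slack's)."""
    hypothesis: str            # expected failure propagation, up/downstream
    commands: list             # exact fault-injection commands to run
    instances: list            # affected EC2 instances
    dashboards: list           # Grafana dashboards to watch during the run
    log_searches: list         # Kibana queries for relevant logs
    alerts: list               # alerts expected (or expected not) to fire
    runbook: str = ""          # steps to verify the system recovered

plan = ExercisePlan(
    hypothesis="Killing one cache node degrades hit rate, not availability",
    commands=["sudo systemctl stop memcached"],
    instances=["cache-1a.example.internal"],
    dashboards=["cache-overview"],
    log_searches=["tags:memcached AND level:error"],
    alerts=["CacheHitRateLow"],
)
print(plan.hypothesis)
```

Writing the hypothesis down before the run is what makes the exercise falsifiable: afterwards, observed behavior either matched the plan or it didn't.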

During the run, the team announces the start in a dedicated ops channel, injects the fault first in the development environment, monitors Grafana dashboards and Kibana searches, and proceeds to production only if the observed behavior matches expectations. If the fault causes excessive impact or does not behave as hypothesized, the exercise is halted and the abort is announced.
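The dev‑then‑prod gate reduces to a simple abort rule: proceed only while observations match the hypothesis. A hedged sketch of that rule, with hypothetical helper names standing in for real announcement, injection, and monitoring tooling:

```python
def run_exercise(inject, observe, expected, environments=("dev", "prod")):
    """Inject the fault in each environment in order, halting the
    moment observed behavior diverges from the hypothesis."""
    for env in environments:
        print(f"announcing start of exercise in {env}")   # ops-channel notice
        inject(env)                                       # run the fault commands
        if observe(env) != expected:                      # dashboards/log searches
            print(f"halting: {env} did not match the hypothesis")
            return False                                  # abort is announced
    return True                                           # matched in dev and prod

ok = run_exercise(
    inject=lambda env: None,
    observe=lambda env: "degraded-but-available",
    expected="degraded-but-available",
)
print(ok)  # True
```

A mismatch in development stops the exercise before production is ever touched, which is what keeps the practice safe for real users.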

After each run, the team conducts a thorough post‑mortem, documenting root‑cause findings, user impact, need for manual intervention, severity, plan accuracy, and dashboard relevance. These findings feed back into improving the system and the chaos‑engineering process.

Slack has performed dozens of Disasterpiece Theater drills. Three stand out: (1) a cache‑inconsistency test that revealed a flaw in memcached lease handling, (2) a series of network‑partition experiments on channel servers that validated AWS regional fault tolerance and Consul‑based routing, and (3) an exercise against the custom configuration‑deployment tool (Confabulator) that exposed missing timeouts and led to extensive retry‑strategy improvements.
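The Confabulator finding (missing timeouts leading to retry problems) maps onto a general pattern: bound every call with a deadline and cap retries with jittered backoff. A generic sketch of that pattern under those assumptions; this is not Slack's code:

```python
import random
import time

def call_with_retries(fn, timeout_s=2.0, max_attempts=3, base_backoff_s=0.1):
    """Bound each attempt with a timeout and retry with jittered
    exponential backoff, so a slow dependency fails fast instead of
    hanging a deploy indefinitely."""
    for attempt in range(max_attempts):
        try:
            return fn(timeout=timeout_s)      # every call carries a deadline
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise                         # give up after the final attempt
            sleep = base_backoff_s * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(sleep)                 # jitter avoids retry stampedes

def flaky(timeout):
    """Toy dependency that times out twice, then succeeds."""
    flaky.calls += 1
    if flaky.calls < 3:
        raise TimeoutError("dependency too slow")
    return "config deployed"

flaky.calls = 0
print(call_with_retries(flaky))  # config deployed
```

Bounded attempts matter as much as the timeout itself: unbounded retries against a struggling dependency can turn a partial failure into a self‑inflicted outage.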

Looking forward, the resilience team plans to expand and iterate on this process, running more regular drills to maintain and grow customer trust as Slack’s product suite evolves.

Tags: operations, Chaos Engineering, Incident Response, reliability, fault injection, Slack
Written by DevOps