
Slack's Deployment Process: Balancing Speed and Reliability

This article explains how Slack's engineering team runs a multi-stage deployment pipeline (release branches, staging, dogfood, canary, and percentage rollouts) and how fast, atomic deployment mechanisms support rapid iteration, visibility, and reliability.

High Availability Architecture

Apr 8, 2020

Slack's engineering culture values rapid iteration, quick feedback loops, and responsive handling of customer input, which requires a deployment system that balances speed with reliability. To support a growing workload while keeping these values, Slack continuously improves the visibility and robustness of its deployment workflow.

Deployment Process

Every pull request must pass code review and all tests before being merged into master. Merges are deployed only during North American business hours to ensure sufficient staff are available for any unexpected issues. Approximately twelve planned deployments occur each day, each led by a designated deployment commander who oversees the multi‑step rollout, monitors for errors, and can roll back or hot‑fix as needed.

Creating a Release Branch

Each release starts with a new release branch. The branch marks a point in Git history for the release and provides a place to fix issues discovered during the release cycle.

Deploying to Staging

The build is first deployed to a staging environment where automated smoke tests run. Staging is a production‑like environment that does not receive public traffic, allowing additional manual testing to verify that code changes work correctly before proceeding.

Dogfood and Canary Environments

Deployment to production begins with the "dogfood" environment, which serves Slack's internal workspaces and helps surface many issues. Once core functionality is verified, the build is promoted to a canary environment that receives roughly 2% of production traffic.
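The source does not say how requests are assigned to the canary. As a minimal sketch, assuming a hash-based split, the function below deterministically places roughly 2% of routing keys in the canary. The names in_canary, routing_key, and CANARY_BASIS_POINTS are hypothetical and are not taken from Slack's tooling.

```python
import hashlib

# Assumed canary share: 200 basis points = 2% of traffic.
CANARY_BASIS_POINTS = 200


def in_canary(routing_key: str, basis_points: int = CANARY_BASIS_POINTS) -> bool:
    """Return True if this key falls into the canary slice.

    The key is hashed so the same key always lands in the same bucket,
    which keeps a given caller on one side of the split.
    """
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < basis_points
```

Hashing rather than random sampling is a design choice in this sketch: it makes the assignment repeatable, so an error seen on the canary can be tied back to a stable set of keys.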

Percentage‑Based Rollout to Production

If metrics remain stable and no alerts fire, the rollout proceeds in increasing percentages of production traffic: 10%, 25%, 50%, 75%, and finally 100%. This gradual exposure gives the team time to investigate any spikes or anomalies before all traffic is shifted to the new build.
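The gating logic of a percentage rollout can be sketched as a loop that advances one stage at a time and stops at the first failed health check. This is an illustrative sketch only: run_rollout, set_traffic_percent, and is_healthy are hypothetical names, and the health check stands in for whatever metrics and alerts the deployment commander watches.

```python
from typing import Callable, Iterable

# Stage percentages for the rollout: 10, 25, 50, 75, 100.
STAGES = (10, 25, 50, 75, 100)


def run_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    stages: Iterable[int] = STAGES,
) -> int:
    """Advance through the stages, stopping at the first unhealthy check.

    Returns the last percentage that passed its health check (0 if none did).
    """
    last_good = 0
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            # Hold at this stage so a human can investigate or roll back.
            return last_good
        last_good = percent
    return last_good
```

In this sketch a failed check only halts the rollout; deciding whether to revert a single change or roll back the whole build is left to the caller, as it is left to the deployment commander in the process described here.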

Handling Problems During Deployment

When issues arise, a trained deployment commander collaborates with the author of the offending PR to investigate, isolate, and revert the change if necessary. If the root cause cannot be quickly identified, the service is restored by rolling back to the previous version.

Fast Deployment

Initially, Slack's entire application ran on ten Amazon EC2 instances, and a deployment was a simple rsync to all servers after staging validation. As the number of servers grew, this push-based model became a bottleneck. Slack moved to a fully parallel, pull-based system: a Consul key change signals that a new build is available, and each server pulls the build on its own, which kept deployments fast as the fleet scaled.
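The pull model can be sketched as a loop that runs on every server: read the target build identifier, and fetch the build only when the identifier changes. This is a sketch under stated assumptions, not Slack's implementation. read_target_build is a hypothetical stand-in for reading the Consul key, fetch_build is a hypothetical stand-in for downloading the build, and polling is used here only for simplicity.

```python
import time
from typing import Callable, Optional


def pull_loop(
    read_target_build: Callable[[], str],
    fetch_build: Callable[[str], None],
    current_build: Optional[str] = None,
    poll_seconds: float = 5.0,
    max_polls: int = 10,
) -> Optional[str]:
    """Poll for the target build and pull it whenever it changes.

    Returns the build identifier this server ended up on.
    """
    for _ in range(max_polls):
        target = read_target_build()
        if target != current_build:
            # Each server fetches independently, so no central host has to
            # push to the fleet one machine at a time.
            fetch_build(target)
            current_build = target
        time.sleep(poll_seconds)
    return current_build
```

Because every server runs the same loop, the work of distributing a build is spread across the fleet instead of being serialized through the machine that starts the deployment.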

Atomic Deployment

Copying files onto a live server leaves a brief window in which the server holds a mix of old and new code. To eliminate that window, Slack introduced atomic deployments using hot and cold directories. New code is copied to the cold directory while the hot directory continues serving traffic; once the copy completes, the server switches to the new directory in a single step, preventing the API errors and page failures that a partial copy can cause.
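The source does not describe how the switch between directories is implemented. One common way to get the same effect on a POSIX system is to copy the build into its own directory and then atomically repoint a symlink at it. The sketch below assumes that approach; deploy_atomically, releases_dir, and live_link are hypothetical names.

```python
import os
import shutil
from pathlib import Path


def deploy_atomically(build_dir: Path, releases_dir: Path,
                      live_link: Path, build_id: str) -> Path:
    """Copy a build into a cold directory, then switch the live symlink to it."""
    target = releases_dir / build_id
    # Copy into the cold directory; the live directory is untouched meanwhile.
    shutil.copytree(build_dir, target)

    # Build the new symlink under a temporary name first.
    tmp_link = live_link.with_name(live_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target, tmp_link)

    # Renaming over the old link is atomic on POSIX: readers see either the
    # old target or the new one, never a half-copied directory.
    os.replace(tmp_link, live_link)
    return target
```

Keeping the previous build's directory on disk also makes a rollback cheap in this sketch, since it is only another symlink switch.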

Emphasizing Reliability

In 2018 Slack recognized that excessive deployment speed was harming product stability. A comprehensive redesign introduced percentage‑based rollouts, continued use of fast and atomic deployment techniques, layered deployment changes, improved monitoring, and tooling to detect and mitigate bugs before they affect all users. Ongoing improvements focus on better tools and automation.

Original article: https://slack.engineering/deploys-at-slack-cd0d28c61701
