
Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

This article examines Salesforce’s five‑hour global outage, triggered by a shortcut DNS deployment, and the recovery challenges that followed. It then looks at a viral experiment in which twenty smartphones faked a traffic jam on a navigation app, showing how heavily real‑time traffic relies on crowd‑sourced data and why operational safeguards matter for preventing large‑scale service disruptions.

We gather major industry incidents and discuss their significance.

Global Outage Caused by Improper Operations

Event Overview

At around 21:00 UTC on May 11, 2021, Salesforce suffered a five‑hour worldwide outage that affected roughly 150,000 customers. The incident began when an engineer, taking a shortcut to fix a bug, misconfigured Salesforce’s DNS servers and cut off access to multiple core SaaS products.

Event Tracking

Salesforce is a leading cloud‑based CRM provider and a pioneer of SaaS, serving millions of employees across about 150,000 organizations worldwide.

After the outage, Salesforce CTO Parker Harris tweeted, “Due to DNS issues our services were inaccessible. We recognize the significant impact on customers and are working tirelessly on a solution. While we remediate, customers may continue to experience problems. Resolving this remains our top priority.” After nearly four hours of effort, the service began to recover partially.

Cause of the Outage

Post‑mortem analysis revealed that an engineer needed to make a DNS configuration change. Under the standard process, this change required a staged, staggered rollout to production, deploying to a subset of data centers first to limit the blast radius of any mistake.

The engineer, however, bypassed the standard procedure and used the “Emergency Bug Fix (EBF)” process instead, deploying a script that had been stable for four years but contained a latent bug: under high load, the script could time out, leaving subsequent commands unexecuted.

The combination of the shortcut deployment and the latent bug caused the script to time out in multiple data centers at once. Because the change went out globally rather than in stages, nearly all servers were affected, and dependent tasks failed to start correctly.
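
A minimal sketch of what a staged rollout guard can look like, assuming hypothetical deploy_to and healthy helpers and an illustrative stage layout; the point is that each stage must bake and pass a health check before the next one starts:

```python
import time

# Illustrative stage layout (hypothetical data-center names).
STAGES = [
    ["canary-dc"],                    # stage 1: a single canary site
    ["us-east-1", "us-west-1"],       # stage 2: a regional slice
    ["eu-1", "ap-1", "us-east-2"],    # stage 3: everything else
]

BAKE_SECONDS = 300  # let each stage soak before judging it

def deploy_to(dc: str) -> None:
    """Placeholder for the actual DNS configuration push."""
    print(f"deploying DNS change to {dc}")

def healthy(dc: str) -> bool:
    """Placeholder health check, e.g. resolving a probe name via the DC."""
    return True

def staged_rollout() -> None:
    for stage in STAGES:
        for dc in stage:
            deploy_to(dc)
        time.sleep(BAKE_SECONDS)  # bake time before the next stage
        if not all(healthy(dc) for dc in stage):
            # A timeout or failure here halts the rollout at one stage
            # instead of taking down every data center at once.
            raise RuntimeError(f"stage {stage} unhealthy; halting rollout")

if __name__ == "__main__":
    staged_rollout()
```

The EBF shortcut in this incident effectively merged all stages into a single global push, so the timeout bug hit everywhere simultaneously.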

Emergency Mitigation Ran Into Another Issue

When the incident occurred, the Salesforce team reached for a recovery tool, only to discover that the tool itself required the DNS servers to be operational. This circular dependency further delayed full restoration.
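
One common way to break such a circular dependency is to let the recovery tool bootstrap from pinned addresses when DNS fails. A minimal sketch, assuming a hypothetical recovery.example.internal endpoint and illustrative fallback IPs:

```python
import socket

# Hypothetical last-known-good addresses for the recovery endpoint,
# shipped with the tool so it still works when DNS is down.
FALLBACK_IPS = ["192.0.2.10", "192.0.2.11"]
PORT = 443

def reachable(ip: str, port: int = PORT, timeout: float = 2.0) -> bool:
    """Cheap TCP probe; True if the endpoint accepts a connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolve_recovery_endpoint(hostname: str = "recovery.example.internal") -> str:
    """Prefer DNS, but never depend on it: fall back to pinned IPs."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        # DNS is the thing that is broken; try the pinned addresses.
        for ip in FALLBACK_IPS:
            if reachable(ip):
                return ip
    raise RuntimeError("DNS is down and no pinned address is reachable")
```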

Post‑mortem Reflections

The incident highlights two contributing factors: human error (bypassing the staged, incremental deployment process) and technical shortcomings (manual DNS script changes and insufficient testing of the deployment script). The key takeaways:

Avoid any manual global deployment operations.

Automate the entire change‑management workflow.

Increase test coverage for deployment scripts (see the test sketch after this list).

Ensure recovery tools can operate independently of DNS services.
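
As referenced above, a sketch of the kind of test coverage the timeout bug called for: wrap the deployment script so a hang raises loudly instead of silently skipping subsequent commands, and test that behavior directly. The script name and time budgets here are hypothetical:

```python
import subprocess
import pytest

def run_dns_update(args: list[str], timeout: float = 30.0) -> None:
    """Wrapper around a (hypothetical) DNS update script.

    Raises instead of stalling when the script exceeds its time budget --
    the silent-timeout failure mode at the heart of this incident.
    """
    subprocess.run(
        ["./update_dns.sh", *args],
        check=True,       # a non-zero exit becomes an exception
        timeout=timeout,  # a hang becomes TimeoutExpired, not a stall
    )

def test_update_times_out_loudly():
    # Simulate the high-load hang with a command that sleeps past the budget.
    with pytest.raises(subprocess.TimeoutExpired):
        subprocess.run(["sleep", "5"], check=True, timeout=0.1)
```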

20 Phones Create Traffic Congestion

Event Overview

A recent short video shows a creator using twenty smartphones to deliberately generate a traffic jam on a navigation app.

All twenty phones run the same map app, set an identical destination, and begin moving slowly.

The driver then stops near a parking spot while the road itself remains empty.

After about two minutes, every navigation route turns red, and even purple, and the app displays a “Severe Congestion” warning.

Repeating the test with another map app yields the same result.

Event Tracking

Navigation apps do not get real‑time traffic conditions from satellites; GPS only tells each phone where it is. Real‑time traffic is inferred from crowd‑sourced location data: the app aggregates the positions and speeds reported by its users’ devices. Twenty phones reporting the same slow‑moving positions therefore look, to the system, like a cluster of crawling vehicles, and the road is flagged as jammed. The experiment is a vivid demonstration of both the power and the fragility of big‑data traffic analytics.
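
To make the mechanism concrete, here is a toy version of crowd‑sourced congestion inference; the thresholds and field names are illustrative, not any vendor’s real parameters. Twenty slow‑moving probes on one segment are indistinguishable from a genuine jam:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Probe:
    device_id: str
    speed_kmh: float  # speed reported on this road segment

def classify_segment(probes: list[Probe],
                     min_probes: int = 10,
                     jam_speed_kmh: float = 8.0) -> str:
    """Toy classifier: enough slow probes => 'severe congestion'."""
    if len(probes) < min_probes:
        return "no data"
    avg = mean(p.speed_kmh for p in probes)
    if avg < jam_speed_kmh:
        return "severe congestion"
    return "free flow" if avg > 40 else "slow"

# Twenty phones crawling along at walking pace look exactly like a jam:
cart = [Probe(f"phone-{i}", 4.0) for i in range(20)]
print(classify_segment(cart))  # -> "severe congestion"
```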

Post‑mortem Reflections

The video prompts a question: if no one on a congested road used a navigation app, how would the system display traffic conditions? Without crowd‑sourced probes it would have no real‑time signal for that road, and could at best fall back on historical patterns or fixed road sensors.

Written by ByteDance SE Lab

Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.
