Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that database infrastructure changes, later rolled back, caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

21CTO

Aug 15, 2024

GitHub, the widely used code repository and developer platform, experienced a major outage. Its main site displayed an error message indicating that no servers were available, then briefly recovered while showing an angry unicorn image.

During the incident, core services such as Pull Requests, GitHub Pages, Copilot, and the GitHub API were severely impacted. The GitHub Status page posted its first update at 7:11 PM ET, followed by multiple service alerts.

More than 10,000 users reported the issue on Downdetector, and NetBlocks confirmed an international outage at 7:13 PM ET. Copilot also went down, prompting some users on Hacker News to joke about developers finally being able to "slack off".

GitHub did not immediately comment on the problem. Later, the status page announced that services had returned to normal after a major interruption.

Subsequent status updates linked the outage to recent changes in the database infrastructure, which GitHub then rolled back. The exact database component involved was not identified. The article notes that stateless services can recover easily, whereas stateful databases pose significant challenges when they fail.
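The stateless versus stateful distinction can be illustrated with a toy sketch. GitHub did not disclose which component failed, so this is not a model of its systems. The class names `StatelessService` and `StatefulDatabase` and the version labels are hypothetical, chosen only to show why reverting a database change can leave more to clean up than reverting a stateless deploy.

```python
from dataclasses import dataclass, field


@dataclass
class StatelessService:
    """Hypothetical service that holds no data of its own, only a version label."""
    version: str = "v1"

    def deploy(self, version: str) -> None:
        self.version = version

    def rollback(self, previous: str) -> None:
        # Switching the version back is the whole rollback; nothing else persists.
        self.version = previous


@dataclass
class StatefulDatabase:
    """Hypothetical database: a schema version plus the rows written under it."""
    schema: str = "v1"
    rows: list = field(default_factory=list)

    def write(self, value: str) -> None:
        # Each row remembers the schema it was written under.
        self.rows.append({"schema": self.schema, "value": value})

    def migrate(self, schema: str) -> None:
        self.schema = schema

    def rollback(self, previous: str) -> list:
        # Reverting the schema does not revert data written in the meantime.
        # Return the rows that no longer match the restored schema.
        self.schema = previous
        return [row for row in self.rows if row["schema"] != previous]


# Stateless case: deploy, then roll back. The service is exactly as before.
service = StatelessService()
service.deploy("v2")
service.rollback("v1")

# Stateful case: data written after the change outlives the rollback.
db = StatefulDatabase()
db.write("a")
db.migrate("v2")
db.write("b")
stranded = db.rollback("v1")
```

In this sketch the service ends up back at `v1` with nothing left over, while `stranded` holds the row written under the reverted schema, which still has to be reconciled after the rollback.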

By 8:26 PM ET, GitHub confirmed that the problematic database infrastructure changes had been reverted and that all services were fully operational.

The piece also mentions that since Microsoft’s $7.5 billion acquisition of GitHub in 2018, some users have perceived a decline in service stability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
