Operations 6 min read

Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

The July 13, 2020 GitHub outage, triggered by load‑balancer misconfiguration, a database connection error during partitioning, and a network‑config mistake, sparked worldwide developer panic, highlighted reliability concerns, and revealed challenges in scaling cloud infrastructure amid the pandemic.

21CTO
21CTO
21CTO
Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

On the afternoon of July 13, 2020, GitHub – the world’s largest code‑hosting platform – experienced a severe outage that quickly trended on Chinese social media.

Developers saw the typical 500 error page, then a whimsical horse illustration, leading many to wonder if the site had been blocked.

GitHub’s status page posted a series of updates:

2020‑07‑13 04:06 – “We are investigating reports of degraded performance and increased error rates.”

2020‑07‑13 05:53 – “We have identified the source of elevated errors and are working on recovery.”

2020‑07‑13 07:18 – “We continue working on the recovery of our services.”

2020‑07‑13 08:08 – “Work continues on the recovery of our services.”

Analysts questioned GitHub’s reliability after three separate incidents earlier that year.

GitHub later attributed the July outage to three root causes:

Misconfiguration of the software load‑balancer that broke internal routing between GitHub.com and its dependent services.

An erroneous database‑connection setting introduced during a data‑partitioning effort, which unintentionally reached production.

A network configuration that was inadvertently applied to the production network.

The company also acknowledged that its simulated test environment differed from production, limiting the ability to detect such configuration issues before deployment.

GitHub runs most of its platform on bare‑metal infrastructure with a Clos network topology, where each device shares routes via BGP.

Acquired by Microsoft in 2018 for $7.5 billion, GitHub supports over 50 million developers, making large‑scale outages highly impactful.

The COVID‑19 pandemic further strained data‑center capacity as remote work surged, causing supply‑chain challenges for hardware. AMD reported that a cloud provider added 10 000 servers in ten days to meet demand.

Ultimately, the outage highlighted the gap between simulated and production environments as a key vulnerability.

21CTO综合自外媒,非著名程序员等媒体,一并致谢。
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

cloud computingGitHubservice reliabilityInfrastructureOutage
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.