
How GitHub Upgraded 1,200 MySQL Servers to 8.0 Without Downtime

GitHub’s engineering team detailed a year‑long, multi‑team effort to upgrade over 1,200 MySQL hosts from 5.7 to 8.0, preserving high availability, SLO compliance, and rollback capability while introducing new features and performance improvements.


Over 15 years ago, GitHub started as a Ruby on Rails application with a single MySQL database. Today MySQL remains the core relational database for GitHub’s infrastructure.

This article shares how GitHub’s engineering team upgraded more than 1,200 MySQL hosts to version 8.0 without violating service‑level objectives (SLOs). The planning, testing, and execution spanned over a year and required coordination across multiple internal teams.

Why upgrade to MySQL 8.0?

MySQL 5.7 was approaching end of life, prompting GitHub to move to the next major release. Staying on a supported version keeps the fleet receiving security patches and bug fixes, and MySQL 8.0 adds performance enhancements and new features such as instant DDL, invisible indexes, and compressed binary logs.
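To make those features concrete, here is a small illustrative sketch; the table and index names are hypothetical, not GitHub’s schema:

```sql
-- Hypothetical table and index names, for illustration only.

-- Instant DDL: adding a column becomes a metadata-only change that does not
-- rebuild the table, so even very large tables can be altered quickly.
ALTER TABLE repositories
  ADD COLUMN archived_at DATETIME NULL,
  ALGORITHM = INSTANT;

-- Invisible indexes: hide an index from the optimizer to confirm it is unused
-- before dropping it, and make it visible again instantly if plans regress.
ALTER TABLE repositories ALTER INDEX idx_owner_id INVISIBLE;
ALTER TABLE repositories ALTER INDEX idx_owner_id VISIBLE;
```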

GitHub’s MySQL infrastructure

The fleet consists of over 1,200 machines, a mix of Azure VMs and bare‑metal servers.

More than 50 database clusters store over 300 TB of data and serve 5.5 million queries per second.

Each cluster uses a primary‑plus‑replica high‑availability configuration.

Data is sharded both horizontally and vertically; some clusters are dedicated to specific product domains, while Vitess clusters handle large‑scale sharding.

A rich tooling ecosystem (Percona Toolkit, gh‑ost, Orchestrator, Freno, and internal automation) supports operations.

Preparation

Key requirements included upgrading each MySQL database without breaching SLOs, maintaining the ability to roll back to 5.7, and performing atomic upgrades per cluster.

Infrastructure preparation involved defining appropriate MySQL 8.0 defaults, running baseline performance benchmarks, and ensuring tooling could handle mixed‑version environments. CI pipelines were updated to run both MySQL 5.7 and 8.0 in parallel, catching incompatibilities early. Developers could also use a pre‑built MySQL 8.0 container in GitHub Codespaces for testing.

Communication was managed via a rolling calendar in GitHub Projects, with issue templates to track application‑team and database‑team checklists.

Upgrade plan

A progressive upgrade strategy with checkpoints and rollback points was adopted.

Step 1: Replica rolling upgrade

Engineers upgraded a single replica while it was offline, verified basic functionality, then gradually brought it back online, monitoring query latency and system metrics. This process was repeated across data centers, always keeping enough 5.7 replicas online for potential rollback.
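A minimal sketch of the replication checks that bracket one such replica upgrade (illustrative only; the real rollout was automated and also watched query latency and system metrics):

```sql
-- Illustrative sketch only; the actual process was automated and monitored
-- far more than replication health alone.

-- Before taking the replica out of rotation for the in-place upgrade
-- (on 5.7 the statement is STOP SLAVE):
STOP REPLICA;

-- After upgrading the binaries and restarting mysqld, confirm replication has
-- resumed and lag is shrinking before the replica serves production reads:
START REPLICA;
SHOW REPLICA STATUS\G
-- Check Replica_IO_Running = Yes, Replica_SQL_Running = Yes,
-- and Seconds_Behind_Source trending toward 0.
```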

Step 2: Update replication topology

Positioned an upgraded 8.0 replica directly beneath the existing 5.7 primary as the candidate primary.

Created two downstream replication chains: one of 5.7 replicas (offline, ready for rollback) and one of 8.0 replicas (serving traffic).

The new topology was held only briefly before moving to the next step.
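A simplified sketch of how a replica is repointed beneath the 8.0 candidate primary, assuming GTID-based replication; the host name is a placeholder:

```sql
-- Placeholder host name; with GTID auto-positioning there is no need to track
-- binary-log file names and offsets by hand. On 5.7 replicas the equivalent
-- statement is CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1.
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'mysql-80-candidate.internal',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;
```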

Step 3: Promote 8.0 replica to primary

Using Orchestrator, the team performed a graceful failover, making the 8.0 replica the primary. The topology then consisted of an 8.0 primary, an offline 5.7 replica kept for rollback, and an 8.0 replica serving reads. Orchestrator was also configured to blacklist the old 5.7 primary as a failover candidate, preventing an accidental failover back to 5.7.
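Conceptually, a graceful takeover reduces to a few steps that Orchestrator automates; the hand-run sketch below only shows the moving parts and is not the command GitHub used:

```sql
-- Hand-run sketch of the moving parts in a graceful takeover. Orchestrator
-- automates this, including waiting for the candidate to catch up and
-- refusing blacklisted 5.7 hosts as promotion candidates afterwards.

-- 1. On the old 5.7 primary: stop accepting writes.
SET GLOBAL super_read_only = ON;

-- 2. On the 8.0 candidate: once it has applied everything from the old
--    primary, detach it and open it for writes.
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;   -- also clears super_read_only

-- 3. Application traffic is then repointed at the new primary (service
--    discovery or the proxy layer, outside the scope of this sketch).
```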

Step 4: Internal-facing instance upgrades

Auxiliary servers used for backups or non‑production workloads were upgraded to maintain consistency.

Step 5: Cleanup

After confirming a full 24‑hour traffic cycle without issues, the 5.7 servers were decommissioned.

Rollback capability

Maintaining enough online 5.7 replicas ensured the ability to roll back read traffic if 8.0 performance degraded. For the primary, bidirectional replication between 8.0 and 5.7 was required to enable safe rollback without data loss.
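A sketch of the rollback direction, again with a placeholder host name and assuming GTID-based replication; replicating from a newer major version back to an older one is not an officially supported configuration, which is part of why this path needed careful testing:

```sql
-- Run on the offline 5.7 replica kept as the rollback target, so it keeps
-- applying writes from the new 8.0 primary (5.7 replication syntax).
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'mysql-80-primary.internal',   -- placeholder host name
  MASTER_AUTO_POSITION = 1;
START SLAVE;

-- If rollback were ever needed, this 5.7 host could be promoted the same way
-- the 8.0 candidate was, with replication reversed once more.
```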

Challenges

Key technical challenges included:

Character-set and collation differences (MySQL 8.0 defaults to utf8mb4_0900_ai_ci, versus utf8mb4_unicode_520_ci on the 5.7 fleet), which forced a fallback to utf8 and utf8_unicode_ci; a configuration sketch follows this list.

MySQL 8.0 role‑based permissions not present in 5.7, causing replication breakage that was mitigated by temporarily adjusting user privileges.

Vitess-sharded clusters required updating VTGate to advertise the new MySQL version.

Replication-delay bugs (for example, one involving replica_preserve_commit_order) that were fixed in MySQL 8.0.28.

Queries with very large WHERE IN lists crashing MySQL, addressed by rewriting the queries and monitoring with SolarWinds DPM.
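As a rough illustration of the collation point above (not GitHub’s actual configuration), server defaults can be pinned to a character set and collation that both versions understand, so statements replicated toward the 5.7 rollback chain avoid 8.0-only collations:

```sql
-- Rough illustration only, not GitHub's actual settings. SET PERSIST exists
-- only on 8.0; on 5.7 the same defaults would go into my.cnf. Recent 8.0
-- releases may report these names as utf8mb3 / utf8mb3_unicode_ci.
SET PERSIST character_set_server = 'utf8';
SET PERSIST collation_server     = 'utf8_unicode_ci';

-- Verify the effective defaults (works on both 5.7 and 8.0):
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';
```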

Lessons learned

The upgrade highlighted the importance of observability, thorough testing, and robust rollback mechanisms. Incremental rollout allowed early detection of issues, reducing risk during the major upgrade. Consistent client‑connection configurations proved vital for maintaining backward replication.

Scaling the number of clusters from five to over fifty required investment in tooling, automation, and processes. Future upgrades will benefit from the automation and self‑healing capabilities built during this project.

Conclusion

MySQL upgrades are routine maintenance for large‑scale services, but they demand careful planning, automation, and observability. GitHub’s experience demonstrates that with disciplined processes and cross‑team collaboration, even massive, high‑availability database fleets can be upgraded safely and efficiently.

