How GitHub Upgraded 1,200 MySQL Servers from 5.7 to 8.0 Without Downtime
GitHub detailed a year‑long, multi‑team effort to upgrade over 1,200 MySQL hosts from version 5.7 to 8.0, describing the motivations, infrastructure scale, preparation steps, a staged rollout plan, rollback strategies, challenges faced, and key lessons learned for large‑scale database migrations.
Motivation
GitHub upgraded its MySQL fleet from 5.7 to 8.0 because 5.7 was approaching end‑of‑life. Staying on a supported release keeps security patches and bug fixes flowing, and 8.0 adds performance improvements and new features such as instant DDL, invisible indexes, and compressed binary logs.
Scale of the MySQL Deployment
~1,200 hosts (Azure VMs and bare‑metal) across multiple data centers.
>300 TB of data stored in more than 50 database clusters, handling ~5.5 million queries per second.
Each cluster is configured for high availability with a primary and multiple replicas.
Data is sharded both horizontally (Vitess) and vertically to isolate product domains.
Tooling ecosystem includes Percona Toolkit, gh‑ost, Orchestrator, Freno, and internal automation for cluster operations.
Preparation and Compatibility Checks
Define MySQL 8.0 default configuration (e.g., character_set_server=utf8, collation_server=utf8_unicode_ci) to remain compatible with existing 5.7 replicas.
Run benchmark suites on a representative subset of clusters to validate performance and identify version‑specific regressions.
Extend CI pipelines to run against MySQL 5.7 and 8.0 containers in parallel, surfacing removed features (e.g., the query cache) and newly reserved keywords.
Provide developers with a pre‑built MySQL 8.0 container for local testing and a dedicated pre‑production MySQL 8.0 cluster.
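A reserved-keyword check of the kind described above could look like the following minimal sketch. The keyword set is a partial, illustrative subset of the words that became reserved in MySQL 8.0, and the function name `find_reserved_identifiers` is hypothetical rather than part of GitHub's tooling. The scan is deliberately naive: it does not skip string literals or comments.

```python
import re

# Partial, illustrative list of words that became reserved in MySQL 8.0.
# Consult the MySQL "Keywords and Reserved Words" manual page for the full list.
NEW_RESERVED_IN_80 = {
    "CUME_DIST", "DENSE_RANK", "GROUPS", "LAG", "LEAD", "LATERAL",
    "OVER", "RANK", "RECURSIVE", "ROW_NUMBER", "SYSTEM", "WINDOW",
}

# Backtick-quoted identifiers are safe to use, so they are blanked out first.
_BACKTICKED = re.compile(r"`[^`]*`")
_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def find_reserved_identifiers(schema_sql: str) -> set[str]:
    """Return unquoted words in the SQL text that are reserved in 8.0."""
    unquoted = _BACKTICKED.sub(" ", schema_sql)
    return {word.upper() for word in _WORD.findall(unquoted)} & NEW_RESERVED_IN_80


if __name__ == "__main__":
    ddl = "CREATE TABLE t (id INT, rank INT, `groups` INT)"
    print(find_reserved_identifiers(ddl))
```

Run in CI against schema files and captured queries, a check like this flags unquoted column names such as `rank` before they fail on an 8.0 server.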
Upgrade Strategy
Step 1 – Rolling upgrade of read‑only replicas
For each cluster, take a single replica offline, upgrade it to 8.0, and run basic health checks (replication lag, query latency, system metrics). Once stable, route read traffic to the upgraded replica. Repeat until all replicas in a data center run 8.0, while keeping a sufficient number of 5.7 replicas as a rollback pool.
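The health gate in this step can be sketched as a simple predicate. The thresholds (one second of lag, a 20% latency budget over a 5.7 baseline) and all names here are assumptions for illustration; the source does not state the actual limits GitHub used.

```python
from dataclasses import dataclass


@dataclass
class ReplicaHealth:
    replication_lag_s: float  # seconds behind the primary
    p99_latency_ms: float     # observed query latency on the upgraded replica
    baseline_p99_ms: float    # latency of a comparable 5.7 replica


def ready_for_traffic(health: ReplicaHealth,
                      max_lag_s: float = 1.0,
                      max_latency_ratio: float = 1.2) -> bool:
    """True if the upgraded replica is within the (assumed) thresholds."""
    if health.replication_lag_s > max_lag_s:
        return False
    return health.p99_latency_ms <= health.baseline_p99_ms * max_latency_ratio
```

Only replicas that pass the gate would receive read traffic; a failing replica stays offline while the 5.7 pool continues to serve.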
Step 2 – Re‑configure replication topology
After all read traffic is served by 8.0 replicas, designate one 8.0 replica as the primary candidate, replicating directly from the existing 5.7 primary. Create two replication chains downstream of that candidate:
A standby chain of 5.7 replicas (offline, ready for rollback).
An active chain of 8.0 replicas (serving traffic).
Step 3 – Graceful failover to MySQL 8.0 primary
Use Orchestrator to perform a controlled failover, promoting the 8.0 primary candidate to primary. The final topology consists of one 8.0 primary, an offline 5.7 rollback chain, and an online 8.0 replica chain. Orchestrator is also configured to exclude 5.7 hosts as failover candidates, which prevents an accidental rollback during an unplanned failover.
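The rule that a failover should never land on a 5.7 host can be illustrated with a small filter. This is not Orchestrator's configuration or API; the function name and host names are hypothetical, and the sketch only shows the selection logic.

```python
def failover_candidates(replicas: dict[str, str]) -> list[str]:
    """Return hosts running MySQL 8.0 or newer, sorted by host name.

    `replicas` maps a host name to its MySQL version string, e.g. "8.0.32".
    """
    def major_minor(version: str) -> tuple[int, int]:
        major, minor = version.split(".")[:2]
        return int(major), int(minor)

    return sorted(host for host, version in replicas.items()
                  if major_minor(version) >= (8, 0))


if __name__ == "__main__":
    fleet = {"db-a": "5.7.44", "db-b": "8.0.32", "db-c": "8.0.28"}
    print(failover_candidates(fleet))
```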
Step 4 – Upgrade non‑production and backup instances
After the primary clusters are stable on 8.0, upgrade all backup, staging, and internal tooling instances to keep the environment consistent.
Step 5 – Cleanup
Validate each cluster against at least one complete 24‑hour production traffic cycle, so that peak load is covered. Once no regression is observed, decommission the remaining 5.7 instances.
Rollback Capability
The plan retains a full rollback path:
Read‑only traffic can be switched back to 5.7 replicas instantly if 8.0 performance degrades.
Primary rollback is possible because replication from 8.0 to 5.7 is forced to use compatible settings (utf8 charset, utf8_unicode_ci collation) and temporary role‑based privilege adjustments are applied during the upgrade window.
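A pre-flight check for the compatibility settings named above might look like this sketch. It assumes the server variables have already been fetched into a dictionary (for example from `SHOW GLOBAL VARIABLES`); the function name is hypothetical. Newer 8.0 releases may report `utf8` under the alias `utf8mb3`, so a real check would likely need to treat those as equivalent.

```python
# Settings the article names as required for 8.0 -> 5.7 replication.
REQUIRED_FOR_ROLLBACK = {
    "character_set_server": "utf8",
    "collation_server": "utf8_unicode_ci",
}


def rollback_blockers(variables: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each mismatched variable to an (expected, actual) pair."""
    return {
        name: (expected, variables.get(name, "<unset>"))
        for name, expected in REQUIRED_FOR_ROLLBACK.items()
        if variables.get(name) != expected
    }
```

An empty result means the 8.0 host is configured so that its writes can still replicate to the 5.7 standby chain.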
Key Technical Challenges
Vitess Sharding
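Vitess clusters required coordinating the MySQL upgrade with the VTGate proxy, which advertises a MySQL version to clients. Some clients (e.g., a Java client) adjusted their behavior based on that advertised version, including query cache settings that are valid only on 5.7; because the query cache was removed in 8.0, this produced errors. The VTGate configuration was therefore updated to advertise the new version as shards were upgraded.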
Replication Lag and Bugs
Early testing uncovered a replication bug that was fixed upstream in MySQL 8.0.28; the upgrade therefore targeted 8.0.28 or newer. Heavy write bursts also produced more replication lag on 8.0 than on 5.7, so Freno was used to throttle writes based on observed lag metrics.
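Both safeguards reduce to small checks, sketched here under stated assumptions. The one-second lag threshold and all function names are illustrative; this is not Freno's API, only the decision it embodies.

```python
MIN_SAFE_VERSION = (8, 0, 28)  # first release containing the replication fix


def parse_version(version: str) -> tuple[int, ...]:
    """Parse '8.0.28' or '8.0.28-19' style strings into a tuple of ints."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_safe_target(version: str) -> bool:
    """True if the version is at or above the minimum safe release."""
    return parse_version(version) >= MIN_SAFE_VERSION


def should_throttle(replica_lags_s: list[float], threshold_s: float = 1.0) -> bool:
    """Throttle writes when the worst replica lag exceeds the threshold."""
    return max(replica_lags_s, default=0.0) > threshold_s
```

Comparing versions as integer tuples avoids the string-comparison trap in which "8.0.9" would sort above "8.0.28".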
Production Query Failures
Queries with very large WHERE IN clauses, some containing tens of thousands of values, passed CI but crashed MySQL 8.0 in production. The offending queries were rewritten, and query sampling combined with SolarWinds DPM (VividCortex) was used to surface such patterns before further clusters were upgraded.
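The source does not say how the queries were rewritten. One common approach, shown here purely as an assumption, is to split the value list into bounded batches and issue one parameterized query per batch. The table `repos`, its columns, and the batch size of 1,000 are hypothetical.

```python
from typing import Iterator, Sequence


def chunked(values: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def in_clause_batches(ids: Sequence[int],
                      batch_size: int = 1000) -> list[tuple[str, tuple[int, ...]]]:
    """Build one parameterized query and its parameters per batch of ids."""
    queries = []
    for batch in chunked(ids, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = f"SELECT id, name FROM repos WHERE id IN ({placeholders})"
        queries.append((sql, tuple(batch)))
    return queries
```

Each `(sql, params)` pair can be passed to a DB-API cursor's `execute` method, with the results merged in the application.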
Lessons Learned
Extensive observability (metrics, query sampling, replication health) is essential for a safe, incremental upgrade.
Automated testing against both MySQL versions catches deprecations early.
Maintaining a mixed‑version environment during the rollout provides a safety net but requires careful configuration management (character set, collation, role privileges).
Sharding isolates risk; upgrading one Vitess shard at a time limits blast radius.
Tooling such as Orchestrator, Percona Toolkit, and Freno proved critical for topology changes and lag mitigation.
Conclusion
The year‑long, phased upgrade demonstrates that large‑scale MySQL migrations can be performed with zero SLO impact when backed by robust automation, observability, and a well‑tested rollback strategy. The experience establishes a repeatable process for future MySQL version upgrades across GitHub’s growing fleet.