
Design and Implementation of High‑Availability MySQL

The talk by Tencent’s senior MySQL engineer explains high‑availability concepts—99.95% uptime, RPO/RTO metrics, backup methods, and replication modes—while comparing single‑node, shared‑storage, and share‑nothing architectures, detailing failover tools (Keepalived, MMM, MHA), cluster solutions (PXC, MGC, Group Replication) and NewSQL examples such as Aurora, PolarDB and CynosDB.

Tencent Cloud Developer

Wang Jiakun, senior engineer at Tencent and head of Tencent Cloud relational database MySQL, has extensive experience in client development and database R&D, including iOS, MySQL, PostgreSQL and SQL Server.

The talk focuses on the concept, importance, and implementation of MySQL high availability.

What is high availability? Availability is the proportion of time during which a service is up and reliably reachable by users. In the cloud market it is usually expressed as a number of nines; Tencent Cloud MySQL currently achieves 99.95% (about 4.4 hours of downtime per year). Three nines (99.9%) is a common practical target; achieving four or five nines is extremely difficult.
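To make the "number of nines" concrete, the short sketch below converts an availability percentage into a yearly downtime budget. It assumes a 365-day year and treats availability as a simple time ratio; the function name is illustrative and not taken from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.95, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why the step from three nines to four or five is so costly.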

Why do we need high availability? Uncontrollable factors such as network outages, power failures, natural disasters, or human errors (e.g., accidental deletion of directories or tables) can cause service interruption. Ensuring data and user continuity is therefore essential.

Two key metrics are used to evaluate disaster recovery:

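RPO (Recovery Point Objective): the maximum amount of data that may be lost in a failure, usually measured as the window of time between the last recoverable write and the failure.

RTO (Recovery Time Objective): the maximum acceptable time between a failure and the restoration of service.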
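As a small worked example of the two metrics, the sketch below computes the RPO and RTO actually observed in one incident from three timestamps. The function and parameter names are hypothetical and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def observed_rpo_rto(last_recoverable_write: datetime,
                     failure_time: datetime,
                     service_restored: datetime):
    """Return (rpo, rto) observed in a single incident.

    rpo: window of writes lost (failure time minus last recoverable write).
    rto: how long the service was unavailable (restore time minus failure time).
    """
    if not last_recoverable_write <= failure_time <= service_restored:
        raise ValueError("timestamps must be in chronological order")
    return failure_time - last_recoverable_write, service_restored - failure_time


if __name__ == "__main__":
    rpo, rto = observed_rpo_rto(
        datetime(2024, 1, 1, 11, 59, 30),  # last write that reached a replica
        datetime(2024, 1, 1, 12, 0, 0),    # primary fails
        datetime(2024, 1, 1, 12, 2, 0),    # replica promoted, service back
    )
    print(f"RPO = {rpo.total_seconds():.0f} s, RTO = {rto.total_seconds():.0f} s")
```

An incident meets its objectives when both observed values are at or below the agreed targets.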

How to achieve high availability? Three common architectural patterns are presented:

Single‑node storage (used often in gaming): a dedicated compute node with three‑replica storage for data reliability.

Shared‑storage (share‑disk) architecture, similar to Oracle RAC, where multiple compute nodes access the same storage.

Data‑replication (share‑nothing) architecture, which is the focus of this talk. It relies on binary log replication between a primary and one or more replicas.

Infrastructure HA covers deployment patterns such as same-city active-active (two data centers in one city serving traffic together) and two-site three-center (three data centers spread across two cities), which provide redundancy at the network and server level.

Backup strategies are divided into logical backup (mysqldump, MyDumper) and physical backup (Percona XtraBackup). Snapshots are also used in Tencent Cloud.

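The binary log (binlog) records all data-modifying changes, either as SQL statements or as row events depending on the binlog format. Under InnoDB's two-phase commit, the redo log is written and marked as prepared first, the binlog is written next, and the transaction is then committed in the storage engine. Setting both sync_binlog=1 and innodb_flush_log_at_trx_commit=1 (the “double-1” setting) flushes both logs to disk at every commit and is recommended to guarantee consistency.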
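The “double-1” setting is commonly understood as sync_binlog=1 together with innodb_flush_log_at_trx_commit=1. A minimal sketch of a check for it follows; it assumes the server variables have already been collected into a dictionary (for example from the output of SHOW GLOBAL VARIABLES), and the helper name is hypothetical.

```python
# Expected values for the "double-1" durability setting.
DOUBLE_ONE = {
    "sync_binlog": 1,                     # fsync the binlog at every commit
    "innodb_flush_log_at_trx_commit": 1,  # fsync the redo log at every commit
}


def check_double_one(variables: dict) -> list:
    """Return a list of human-readable problems; an empty list means double-1 is on."""
    problems = []
    for name, expected in DOUBLE_ONE.items():
        actual = variables.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected})")
        elif int(actual) != expected:
            problems.append(f"{name}={actual} (expected {expected})")
    return problems


if __name__ == "__main__":
    # Values arrive as strings when read from the server, hence the int() above.
    print(check_double_one({"sync_binlog": "1", "innodb_flush_log_at_trx_commit": "2"}))
```

Relaxing either value trades durability for throughput, so a crash may lose the most recent transactions or leave the binlog and the storage engine out of step.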

Replication workflow:

Master writes binlog to disk.

Slave has an I/O thread that pulls the binlog and writes a relay log.

A SQL thread reads the relay log and replays the statements.
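The three steps above can be sketched as a toy in-memory model. This is only an illustration of the data flow, not how MySQL is implemented: the class names are hypothetical, events are simple (key, value) pairs, and the two "threads" are plain methods called in turn.

```python
from collections import deque


class Master:
    def __init__(self):
        self.binlog = []  # ordered list of committed events

    def commit(self, key, value):
        self.binlog.append((key, value))  # step 1: master writes the binlog


class Slave:
    def __init__(self):
        self.relay_log = deque()  # events fetched but not yet applied
        self.read_pos = 0         # how far into the master's binlog we have read
        self.data = {}            # the slave's copy of the data

    def io_thread_step(self, master):
        """Step 2: pull new binlog events and append them to the relay log."""
        new_events = master.binlog[self.read_pos:]
        self.relay_log.extend(new_events)
        self.read_pos += len(new_events)
        return len(new_events)

    def sql_thread_step(self):
        """Step 3: replay relay-log events, in order, against local data."""
        applied = 0
        while self.relay_log:
            key, value = self.relay_log.popleft()
            self.data[key] = value
            applied += 1
        return applied


if __name__ == "__main__":
    master, slave = Master(), Slave()
    master.commit("a", 1)
    master.commit("a", 2)
    slave.io_thread_step(master)
    slave.sql_thread_step()
    print(slave.data)
```

Separating the fetch step from the apply step is what lets the slave keep receiving events even when replay falls behind.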

Replication modes:

Asynchronous: Master returns to the client immediately after committing.

Semi-synchronous: Master returns after at least one slave has received the binlog events, stored them in its relay log, and acknowledged receipt.

Synchronous: Master waits until the slave has replayed the transaction.
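The difference between the three modes is the point at which the master may answer the client. The sketch below encodes that rule as a lookup; the stage names are assumptions made for illustration, and the model deliberately ignores details such as timeouts and the exact ordering of the engine commit relative to the slave's acknowledgement.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "asynchronous"
    SEMI_SYNC = "semi-synchronous"
    SYNC = "synchronous"


# Stages a transaction passes through, in order (illustrative names).
STAGES = ["master_commit", "slave_relay_log_written", "slave_applied"]

# The last stage that must be complete before the client gets its reply.
REQUIRED_STAGE = {
    Mode.ASYNC: "master_commit",
    Mode.SEMI_SYNC: "slave_relay_log_written",
    Mode.SYNC: "slave_applied",
}


def can_ack_client(mode: Mode, completed: set) -> bool:
    """True if every stage up to and including the required one has completed."""
    last = STAGES.index(REQUIRED_STAGE[mode])
    return all(stage in completed for stage in STAGES[: last + 1])


if __name__ == "__main__":
    done = {"master_commit", "slave_relay_log_written"}
    for mode in Mode:
        print(mode.value, "->", can_ack_client(mode, done))
```

The later the required stage, the smaller the window in which a master crash can lose acknowledged data, at the cost of higher commit latency.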

The bottleneck is usually the SQL thread, because it replays transactions serially while the master executes them concurrently. Parallel replication reduces this lag: per-database in MySQL 5.6, logical-clock (group-commit based) in 5.7, and write-set (row-level) based in 8.0.
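A simplified sketch of the logical-clock idea: each transaction in the binlog carries a sequence_number and a last_committed value, and transactions that share the same last_committed were part of the same group commit on the master, so they did not conflict and can be replayed together. The code below groups consecutive transactions on that basis. The real scheduler is more permissive than this, so treat it as an illustration only.

```python
from itertools import groupby


def parallel_batches(transactions):
    """Group binlog-ordered (sequence_number, last_committed) pairs into batches.

    Transactions in one batch share a last_committed value and may be
    applied in parallel; batches themselves are applied one after another.
    """
    return [
        [seq for seq, _ in group]
        for _, group in groupby(transactions, key=lambda t: t[1])
    ]


if __name__ == "__main__":
    binlog = [(1, 0), (2, 1), (3, 1), (4, 1), (5, 4)]
    print(parallel_batches(binlog))
```

In this example transactions 2, 3 and 4 form one batch, so three worker threads could apply them at the same time instead of one after another.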

Failover solutions include:

Keepalived – active‑passive monitoring, prone to split‑brain.

MMM – dual‑master with a standby slave, only one master writes at a time.

MHA – a management node decides failover based on health reports from agents.
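The failover tools above differ in detail, but two decisions recur: whether the master is really down, and which replica to promote. The sketch below shows one generic way to make both decisions. It is not the code of Keepalived, MMM or MHA; all names are hypothetical, and the majority-vote rule is an assumption chosen to illustrate how split-brain can be avoided.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool
    received_binlog_pos: int  # how much of the master's binlog it has received


def master_is_down(unreachable_votes: List[bool]) -> bool:
    """Declare the master down only if a strict majority of monitors agree."""
    return sum(unreachable_votes) * 2 > len(unreachable_votes)


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Promote the healthy replica that has received the most binlog data."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.received_binlog_pos)


if __name__ == "__main__":
    replicas = [
        Replica("r1", True, 1200),
        Replica("r2", True, 1500),
        Replica("r3", False, 1800),
    ]
    if master_is_down([True, True, False]):
        print("promote", pick_new_master(replicas).name)
```

Requiring a majority of monitors guards against a single monitor that has lost its own network link, which is the typical cause of split-brain in simple active-passive setups.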

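Cluster HA architectures discussed: Percona XtraDB Cluster (PXC), MariaDB Galera Cluster (MGC), and MySQL Group Replication (MGR). MGR uses a Paxos-based group communication protocol to agree on the order of transactions and supports multi-master writes; conflicts between concurrent writes are detected by comparing the rows each transaction changed, which is why every table needs a primary key.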
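Conflict detection keyed on primary keys can be illustrated with a toy certifier. It assumes every node processes transactions in the same agreed order, and it rejects a transaction if any row it wrote was changed by another transaction after the snapshot it started from. The class and the key format are hypothetical; this is a sketch of the general idea rather than the MGR implementation.

```python
class Certifier:
    def __init__(self):
        self.version = 0       # number of transactions certified so far
        self.last_writer = {}  # primary key -> version of the last certified write

    def certify(self, snapshot_version: int, write_set: set) -> bool:
        """Return True and record the write set if no conflicting write exists."""
        for key in write_set:
            if self.last_writer.get(key, 0) > snapshot_version:
                return False  # row changed after this transaction's snapshot
        self.version += 1
        for key in write_set:
            self.last_writer[key] = self.version
        return True


if __name__ == "__main__":
    c = Certifier()
    # Two nodes update the same row from the same snapshot: the first one wins.
    print(c.certify(0, {"orders:42"}))
    print(c.certify(0, {"orders:42"}))
    # A write to a different row does not conflict.
    print(c.certify(0, {"orders:43"}))
```

Without a primary key there is no stable identifier for a row, so a check like this could not tell whether two transactions touched the same data.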

NewSQL examples:

Aurora (AWS) – compute‑storage separation, six‑copy storage across three AZs, log‑as‑data design, heavy use of snapshots.

PolarDB (Alibaba Cloud) – RDMA‑based network, share‑disk architecture with ParallelRaft for data consistency.

CynosDB (Tencent Cloud) – upcoming NewSQL product combining the strengths of both approaches.
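Aurora's six-copy layout can be made concrete with a little quorum arithmetic. The sketch assumes the figures from Amazon's published Aurora design, which are not stated in the talk summary: two copies per availability zone, a write quorum of 4 out of 6 and a read quorum of 3 out of 6. The function names are illustrative.

```python
AZ_COUNT = 3
COPIES_PER_AZ = 2
TOTAL_COPIES = AZ_COUNT * COPIES_PER_AZ  # 6
WRITE_QUORUM = 4  # assumption: 4/6, per the published Aurora design
READ_QUORUM = 3   # assumption: 3/6, per the published Aurora design

# Quorum rules: reads must overlap writes, and two writes must overlap each other.
assert READ_QUORUM + WRITE_QUORUM > TOTAL_COPIES
assert WRITE_QUORUM * 2 > TOTAL_COPIES


def surviving_copies(failed_azs: int = 0, extra_failed_copies: int = 0) -> int:
    """Copies still available after whole-AZ failures plus individual copy failures."""
    return max(TOTAL_COPIES - failed_azs * COPIES_PER_AZ - extra_failed_copies, 0)


def can_write(available: int) -> bool:
    return available >= WRITE_QUORUM


def can_read(available: int) -> bool:
    return available >= READ_QUORUM


if __name__ == "__main__":
    for azs, extra in [(0, 0), (1, 0), (1, 1), (2, 0)]:
        n = surviving_copies(azs, extra)
        print(f"AZs down={azs}, extra copies down={extra}: "
              f"{n} copies, write={can_write(n)}, read={can_read(n)}")
```

Under these assumptions the volume stays writable after losing a whole availability zone, and stays readable after losing a zone plus one more copy.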

The presentation concludes with a Q&A session covering Tencent’s internal high‑availability architecture for gaming, strategies for low latency, data layering with Redis cache, and consistency guarantees.
