High Availability Solutions for MySQL and UDB: Techniques and Case Study
The article explains high‑availability concepts, compares typical MySQL HA architectures—including replication, clustering, and Paxos‑based solutions—and presents UDB’s dual‑master semi‑synchronous design with a Proxy layer that ensures automatic failover, data consistency, and operational resilience.
High availability (HA) means keeping a service continuously available. True 100% uptime is unrealistic because of software bugs, hardware failures, third-party dependencies, and natural disasters, so the industry states HA targets in SLAs as a number of “nines” of availability; for example, 99.99% availability is “four nines”.
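As a worked example of what the nines mean, the sketch below converts an N-nines availability target into the maximum downtime it allows per year. It assumes a 365-day year and counts all downtime equally; real SLAs define their own measurement windows and exclusions.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year, no leap seconds


def availability(nines: int) -> float:
    """Availability fraction for N nines, e.g. 4 -> 0.9999."""
    return 1 - 10 ** (-nines)


def max_downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year, in minutes, allowed by an N-nines availability target."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001 of the year may be down
    return unavailability * SECONDS_PER_YEAR / 60


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{availability(n):.3%} -> {max_downtime_minutes_per_year(n):.1f} min/year")
```

Under these assumptions, four nines (99.99%) allows about 52.6 minutes of downtime per year, while five nines leaves only about 5.3 minutes.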
In cloud computing, HA is critical for services such as cloud databases, which face many different failure scenarios. The article first outlines the key HA concepts for databases, redundancy through clustering and automatic failover, which together keep downtime to a minimum.
The article then surveys typical MySQL HA solutions: MySQL Replication (asynchronous, semi-synchronous, and synchronous), MySQL Fabric, DRBD block-level replication, Solaris Clustering, MySQL Cluster, and Paxos-based approaches such as Galera, Percona XtraDB Cluster, and MySQL Group Replication. Each offers a different level of availability.
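The replication modes listed above differ mainly in how many acknowledgements the primary waits for before it reports a commit to the client. The sketch below encodes that difference in simplified form. The names are hypothetical; wait_count mirrors what MySQL's rpl_semi_sync_master_wait_for_slave_count setting controls (default 1), and the quorum case is a plain majority rule standing in for the Paxos-style group protocols, whose real consensus and certification logic is more involved.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "async"          # commit returns without waiting for any replica
    SEMI_SYNC = "semi_sync"  # wait for `wait_count` replica ACKs
    SYNC = "sync"            # wait for every replica
    QUORUM = "quorum"        # majority of all members, primary included (simplified Paxos-style rule)


def acks_required(mode: Mode, replicas: int, wait_count: int = 1) -> int:
    """Replica acknowledgements needed before the primary reports the commit to the client."""
    if mode is Mode.ASYNC:
        return 0
    if mode is Mode.SEMI_SYNC:
        return wait_count
    if mode is Mode.SYNC:
        return replicas
    members = replicas + 1      # the group includes the primary itself
    majority = members // 2 + 1
    return majority - 1         # the primary's own vote counts toward the majority


def can_ack_client(mode: Mode, replicas: int, acks_received: int, wait_count: int = 1) -> bool:
    """True once enough replicas have acknowledged for the given mode."""
    return acks_received >= acks_required(mode, replicas, wait_count)
```

Under this rule a three-member group (two replicas) needs one replica acknowledgement and a five-member group needs two, so a minority of members can fail without blocking commits.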
The UDB (UCloud Database) case study demonstrates a dual‑master semi‑synchronous architecture with a Proxy layer that monitors backend health, performs automatic failover, and maintains data consistency during outages.
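The Proxy's health monitoring and automatic failover can be pictured with the minimal sketch below. The source does not describe how UDB detects a failed master, so the consecutive-failure threshold, the probe callback, and the class name are all assumptions; the threshold simply keeps a single dropped health check from triggering a switch. A real failover would also confirm that the standby holds every committed transaction before promoting it, which is omitted here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FailoverProxy:
    """Hypothetical proxy for a dual-master pair where only one master takes writes."""
    active: str                    # master currently receiving traffic
    standby: str                   # the other master, kept in sync by semi-sync replication
    probe: Callable[[str], bool]   # health check: True if the backend answered
    failure_threshold: int = 3     # assumed: consecutive failed probes before failover
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self) -> str:
        """Probe the active master once and return the backend that should take writes."""
        if self.probe(self.active):
            self.failures[self.active] = 0
            return self.active
        self.failures[self.active] = self.failures.get(self.active, 0) + 1
        # Fail over only after repeated failures, and only to a standby that is itself healthy.
        if self.failures[self.active] >= self.failure_threshold and self.probe(self.standby):
            self.active, self.standby = self.standby, self.active
            self.failures.clear()  # start counting afresh for the new active master
        return self.active
```

For example, with probe = lambda host: host != "db-a", the first two calls to check() still return "db-a" and the third switches writes to "db-b".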
To prevent transaction loss, UDB refines MySQL 5.7’s semi-synchronous commit process: it identifies the stages at which a crash is dangerous and adds kernel-level changes that roll back or truncate the binlog events affected by such a crash.
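The idea behind truncating binlog events can be illustrated as follows. In MySQL 5.7’s lossless semi-synchronous mode (rpl_semi_sync_master_wait_point = AFTER_SYNC), the master writes and syncs a transaction’s binlog events before waiting for the standby’s acknowledgement, so a crash in that window can leave transactions in the old master’s binlog that the standby never received. The sketch assumes a single binlog file and that the restarting master knows the last binlog position the standby acknowledged; the data structure and function are hypothetical and are not UDB’s actual kernel changes, which this summary does not detail.

```python
from typing import List, NamedTuple, Tuple


class BinlogTxn(NamedTuple):
    end_pos: int  # binlog offset just past this transaction's last event
    gtid: str     # transaction identifier, e.g. "uuid:42"


def plan_recovery(txns: List[BinlogTxn], last_acked_pos: int, base_pos: int) -> Tuple[int, List[str]]:
    """Decide where a restarting former master should truncate its binlog.

    txns must be in binlog order; base_pos is where transaction events start in the
    file (after the header events). Returns (truncate_to, gtids_to_roll_back): every
    transaction ending beyond the standby's last acknowledged position may be missing
    on the standby, so it is rolled back and the binlog is cut at the last
    acknowledged transaction boundary.
    """
    truncate_to = base_pos
    rollback: List[str] = []
    for txn in txns:
        if txn.end_pos <= last_acked_pos and not rollback:
            truncate_to = txn.end_pos  # fully acknowledged: keep it
        else:
            rollback.append(txn.gtid)  # possibly never reached the standby
    return truncate_to, rollback
```

If the standby acknowledged up to offset 350 and the binlog holds transactions ending at 200, 350, and 500, the sketch keeps the first two and rolls back the third.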
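Semi-synchronous replication degrades to asynchronous replication when the standby does not acknowledge a transaction within the configured timeout (rpl_semi_sync_master_timeout). To mitigate this, UDB adds dedicated communication channels between the Proxy and the MySQL instances so that the standby can verify it has received all transactions before a switchover, and it uses an auxiliary replication link to catch up on data that fell behind during network jitter.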
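A switchover check of this kind can be sketched with GTID sets: the standby is promoted only when every transaction the master committed is also present on the standby, and any gap is first pulled over a catch-up link. Everything here is an assumption for illustration, including how the master’s committed set is obtained, the fetch_missing callback standing in for the auxiliary replication link, and the retry limit.

```python
from typing import Callable, Set, Tuple


def promote_when_caught_up(
    master_committed: Set[str],
    standby_received: Set[str],
    fetch_missing: Callable[[Set[str]], Set[str]],
    max_rounds: int = 3,
) -> Tuple[bool, Set[str]]:
    """Return (safe_to_promote, transactions now on the standby).

    Promotion is safe only when master_committed is a subset of what the standby has.
    Missing transactions are requested via fetch_missing (a hypothetical stand-in for
    the auxiliary replication link), up to max_rounds times.
    """
    received = set(standby_received)
    for _ in range(max_rounds):
        missing = master_committed - received
        if not missing:
            return True, received
        received |= fetch_missing(missing)  # catch up over the auxiliary link
    return master_committed <= received, received
```

Deferring the switch until the gap is closed trades a slightly longer failover for the guarantee that no committed transaction is left behind on the old master.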
The Proxy layer is itself made highly available with a master-backup pair coordinated by ZooKeeper. When the primary Proxy fails, the virtual IP is reassigned to the backup, which reconnects to the master database so that service continues without interruption.
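The master-backup Proxy election can be modeled on ZooKeeper’s standard leader-election recipe: each Proxy registers an ephemeral, sequential znode, the Proxy with the lowest sequence number is leader and holds the virtual IP, and when the leader’s session ends ZooKeeper deletes its znode so the next Proxy takes over. The class below is an in-memory simulation of that recipe, not a ZooKeeper client; the class and method names are hypothetical, and the actual VIP move (for example via a gratuitous ARP) is reduced to updating a field.

```python
import itertools
from typing import Dict, Optional


class ProxyElectionSim:
    """In-memory simulation of leader election with ephemeral sequential znodes."""

    def __init__(self) -> None:
        self._next_seq = itertools.count()   # ZooKeeper assigns increasing sequence numbers
        self.znodes: Dict[str, int] = {}     # proxy name -> sequence number of its znode
        self.vip_owner: Optional[str] = None

    def register(self, proxy: str) -> None:
        """A Proxy starts and creates its ephemeral sequential znode."""
        self.znodes[proxy] = next(self._next_seq)
        self._elect()

    def session_expired(self, proxy: str) -> None:
        """ZooKeeper removes the ephemeral znode of a Proxy whose session is gone."""
        self.znodes.pop(proxy, None)
        self._elect()

    def _elect(self) -> None:
        """The lowest sequence number leads; the leader owns the virtual IP."""
        leader = min(self.znodes, key=lambda p: self.znodes[p]) if self.znodes else None
        if leader != self.vip_owner:
            # A real system would move the VIP here and have the new leader
            # reconnect to the master database.
            self.vip_owner = leader
```

After register("proxy-a") and register("proxy-b"), vip_owner is "proxy-a"; calling session_expired("proxy-a") hands the VIP to "proxy-b", and a restarted proxy-a rejoins as the backup because its new znode gets a higher sequence number.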
Additional operational measures include continuous monitoring of hardware, OS, database, and network, as well as a proprietary backup system for rapid data recovery.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
