Evolution of Meituan-Dianping MySQL High‑Availability Architecture: From MMM to MHA+Zebra and Beyond
This article reviews the evolution of Meituan‑Dianping's MySQL high‑availability architecture over recent years, detailing the transition from the MMM replication manager to MHA, the integration of Zebra and Proxy middleware, and future design considerations such as distributed agents, semi‑synchronous replication, and MySQL Group Replication.
Meituan‑Dianping has been operating MySQL databases at large scale for many years. The article introduces the evolution of its high‑availability (HA) architecture, starting with the MMM (Multi‑Master Replication Manager for MySQL) system used before 2015.
MMM provided one write VIP and multiple read VIPs per MySQL cluster. Each node ran an mmm‑agent that reported heartbeats to an mmm‑manager. When heartbeats stopped, the manager performed failover: for a failed slave it moved the read VIP to a healthy node; for a failed master it locked the dead master, selected a candidate slave, performed binlog catch‑up, and migrated the write VIP.
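The failover behavior described above can be sketched roughly as follows. This is a simplified illustration only; the `Node` structure, the VIP names, and the promotion logic are invented for the sketch and are not MMM's actual implementation (real MMM also performs binlog catch‑up and ARP announcements before moving a VIP):

```python
class Node:
    """A toy stand-in for a MySQL node tracked by the mmm-manager."""
    def __init__(self, name, role, vips):
        self.name = name
        self.role = role          # "master" or "slave"
        self.vips = set(vips)     # VIPs currently bound to this node


def failover(nodes, dead):
    """Mimic the manager's reaction when a node's heartbeats stop."""
    if dead.role == "slave":
        # A failed slave only loses its read VIPs, which move to a survivor.
        survivor = next(n for n in nodes if n is not dead)
        survivor.vips |= dead.vips
    else:
        # A failed master: lock it out, pick a candidate slave, and (after
        # binlog catch-up, elided here) hand over the write VIP.
        candidate = next(n for n in nodes if n.role == "slave")
        candidate.vips |= dead.vips
        candidate.role = "master"
    dead.vips.clear()
```

The point of the sketch is the asymmetry the article describes: a slave failure only moves read VIPs, while a master failure triggers promotion plus a write‑VIP migration.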
While MMM served the company well, it suffered from several drawbacks: an excessive number of VIPs that were hard to manage, overly sensitive agents that caused spurious VIP removal, a single‑point‑of‑failure manager, and reliance on ARP, which confined HA to a single datacenter. Moreover, MMM is an aging project (long hosted on Google Code) with little community activity.
Starting in 2015, Meituan‑Dianping replaced MMM with MHA (MySQL Master High Availability), originally developed by Yoshinori Matsunobu at DeNA. MHA focuses on master‑node HA: when the master fails, it selects the most up‑to‑date slave as the new master, performs binlog catch‑up across the remaining slaves, and moves the write VIP.
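The core of MHA's promotion decision — promote whichever slave has received the most of the dead master's binlog — can be illustrated with a small sketch. The tuple‑based binlog position `(file sequence, offset)` and the helper names are assumptions for illustration, not MHA's real data structures:

```python
def pick_new_master(slaves):
    """Choose the slave whose replicated binlog position is most advanced.

    Each slave is (name, binlog_file_seq, binlog_pos); comparing the tuple
    (file sequence, offset) orders slaves by how much of the dead master's
    binlog they have already received.
    """
    return max(slaves, key=lambda s: (s[1], s[2]))


def slaves_needing_catchup(new_master, others):
    """Slaves behind the new master need binlog catch-up before re-pointing."""
    _, f, p = new_master
    return [s for s in others if (s[1], s[2]) < (f, p)]
```

In real MHA the gap is closed by copying differential relay‑log events to the lagging slaves before replication is re‑pointed at the new master.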
To avoid split‑brain scenarios, the MHA manager was enhanced with rack‑aware probing, distinguishing network glitches from actual node failures.
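The idea behind rack‑aware probing — only declare a master dead when multiple independent vantage points agree — can be sketched like this. The "majority unreachable" rule and the rack names are assumptions; the article does not spell out the exact policy:

```python
def node_is_dead(probe_results):
    """Decide whether a node is truly down, given probes from several racks.

    probe_results maps a probing rack/host to True (reachable) or False.
    The node is declared dead only if a majority of vantage points cannot
    reach it; a single failed probe is treated as a local network glitch,
    avoiding a split-brain failover.
    """
    unreachable = sum(1 for ok in probe_results.values() if not ok)
    return unreachable > len(probe_results) // 2
```

The design choice is conservative on purpose: a false "dead" verdict can promote a second master, which is far more damaging than a slightly delayed failover.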
The team also built Zebra, a Java‑based database access middleware layered on the c3p0 connection pool, which provides read/write splitting, sharding, and SQL flow control. Zebra works together with MHA: after an MHA failover, Zebra's monitor updates ZooKeeper to take the old master's read traffic offline, and it continuously checks node health to adjust routing.
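The routing update Zebra's monitor performs after a failover can be sketched as follows. A plain dict stands in for the ZooKeeper configuration tree that Zebra watches, and every node and field name here is invented for the sketch:

```python
def on_mha_failover(registry, old_master, new_master):
    """Mirror what Zebra's monitor does via ZooKeeper after an MHA switch:
    take the old master's read traffic offline and route writes to the
    newly promoted node."""
    registry[old_master].update(read=False, write=False, role="failed")
    registry[new_master].update(write=True, role="master")


def readable_nodes(registry):
    """Nodes that application routing may still send read traffic to."""
    return sorted(n for n, cfg in registry.items() if cfg["read"])
```

In production the registry lives in ZooKeeper, so every Zebra client sees the new routing via a watch rather than by polling.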
In addition to Zebra, a Proxy middleware serves non‑Java applications. After an MHA switch, the Proxy receives a notification and reconfigures read/write traffic, offering more flexibility at the cost of an extra network hop.
Despite these improvements, the MHA architecture still has two main issues: a single‑point‑of‑failure manager and potential data loss due to asynchronous binlog replication. To mitigate data loss, the team employs semi‑synchronous replication for critical services, achieving >95% data safety, and adopts a distributed‑agent design where agents elect a new master without a central manager.
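The semi‑synchronous trade‑off mentioned above — the master waits for at least one slave ACK before acknowledging a commit, but degrades to asynchronous mode on timeout so availability is preserved — can be sketched like this. The timings and names are invented; real MySQL semi‑sync is configured through the `rpl_semi_sync_*` plugin variables:

```python
def commit_mode(semi_sync_enabled, timeout_ms, ack_arrives_after_ms):
    """Return how a commit is acknowledged under semi-sync replication rules.

    With semi-sync enabled, the master blocks until one slave ACKs the
    binlog event; if no ACK arrives within the timeout it falls back to
    asynchronous mode, trading durability for availability.
    """
    if not semi_sync_enabled:
        return "async"
    if ack_arrives_after_ms is not None and ack_arrives_after_ms <= timeout_ms:
        return "semi-sync"
    return "async-fallback"
```

The fallback path is exactly why the article reports >95% rather than 100% data safety: under slave lag or network trouble, the master quietly reverts to asynchronous commits.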
Looking ahead, the article discusses several advanced HA approaches:
Deploying a Binlog Server that receives and acknowledges binlog events, so a copy of every committed transaction survives a master failure.
Using a distributed‑agent election protocol to eliminate the MHA manager’s single point.
Adopting MySQL Group Replication (MGR) based on Paxos, which pushes consistency and failover logic into the database layer, though it introduces write‑latency due to majority ACKs and requires an odd number of nodes.
Each of these solutions balances trade‑offs between availability, consistency, latency, and resource consumption.
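The distributed‑agent idea in the list above — agents agreeing on a new master with no central manager — can be sketched as a toy majority vote. A real deployment would use a proper consensus protocol or coordination service; every name and structure here is invented for the sketch:

```python
def elect_master(agents):
    """Toy leaderless election among per-node agents.

    Each live agent votes for the slave it sees as most advanced in binlog
    position. A candidate wins only with a strict majority of ALL agents,
    so a partitioned minority can never promote a second master.
    """
    votes = {}
    for agent in agents:
        if agent["alive"]:
            choice = max(agent["view"], key=agent["view"].get)
            votes[choice] = votes.get(choice, 0) + 1
    if not votes:
        return None  # no live agents: no election possible
    winner, count = max(votes.items(), key=lambda kv: kv[1])
    return winner if count > len(agents) // 2 else None
```

The majority rule is the same arithmetic that drives MGR's requirement for an odd node count: a cluster of n nodes tolerates only (n − 1) // 2 failures, so 4 nodes survive no more failures than 3 do.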
In conclusion, the article summarizes the journey from MMM to MHA+Zebra and MHA+Proxy, highlights ongoing challenges in MySQL HA design, and emphasizes that continuous innovation is required to achieve more robust and scalable database systems.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.