
How Meituan‑Dianping Evolved MySQL HA: From MMM to MHA+Zebra and Beyond

This article traces Meituan‑Dianping's MySQL high‑availability journey, detailing the legacy MMM system, the transition to MHA, integrations with Zebra and Proxy middleware, current challenges, and future designs such as distributed agents, semi‑sync replication, and MySQL Group Replication.

dbaplus Community

Aug 14, 2017

MMM Architecture (pre‑2015)

Meituan‑Dianping originally used MMM (Master‑Master replication manager for MySQL) to provide high availability. Each MySQL node ran an mmm‑agent that reported heartbeats to an mmm‑manager. The cluster exposed one write VIP and multiple read VIPs. When the manager stopped receiving heartbeats, it performed failover by moving VIPs.

Failover scenarios:

If a replica fails, the manager removes its read VIP from the failed node and moves that VIP to another healthy node.

If the master fails, the manager may first lock the failed master, then selects a candidate replica, lets it catch up on the remaining binlog events, and finally moves the write VIP to the new master.
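The replica case above can be sketched as a heartbeat timeout check followed by a VIP reassignment. This is a minimal illustration, not MMM's actual code: the `Node` type, the 10-second `HEARTBEAT_TIMEOUT`, and the "fewest VIPs first" placement rule are all assumptions made for the example, and the master case (which also needs binlog catch‑up) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical threshold, not an MMM default


@dataclass
class Node:
    name: str
    role: str                 # "master" or "replica"
    last_heartbeat: float     # timestamp of the last heartbeat the manager saw
    vips: List[str] = field(default_factory=list)


def is_alive(node: Node, now: float, timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """A node counts as alive if its last heartbeat is recent enough."""
    return now - node.last_heartbeat <= timeout


def reassign_read_vips(nodes: Dict[str, Node], now: float) -> Dict[str, str]:
    """Move read VIPs off dead replicas; return a mapping of VIP -> new owner."""
    healthy = [n for n in nodes.values() if is_alive(n, now)]
    moves: Dict[str, str] = {}
    if not healthy:
        return moves  # nowhere to move VIPs to
    for node in nodes.values():
        if node.role != "replica" or is_alive(node, now):
            continue
        for vip in list(node.vips):
            # Assumed placement rule: the healthy node holding the fewest VIPs.
            target = min(healthy, key=lambda n: (len(n.vips), n.name))
            node.vips.remove(vip)
            target.vips.append(vip)
            moves[vip] = target.name
    return moves
```

Because every VIP has to land on some surviving node, a cluster with many read VIPs ends up concentrating them on fewer machines as nodes fail, which is one way to see the management problem described next.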

Problems with MMM:

The large number of VIPs is hard to manage, and several VIPs can be lost at the same time.

mmm‑agent is overly sensitive; a failure of the agent itself can trigger false alarms.

mmm‑manager is a single point of failure.

VIPs rely on ARP, limiting cross‑subnet or cross‑datacenter HA.

MMM is an old, sparsely maintained open‑source project (originally hosted on Google Code); many bugs required local patches (see https://github.com/cenalulu/mysql-mmm).

MHA Architecture (from 2015)

To address MMM’s shortcomings, the team migrated to MHA (MySQL Master High Availability), developed by Yoshinori Matsunobu, who later joined Facebook. MHA handles only master failover: when the master goes down, it selects the most up‑to‑date replica, performs binlog catch‑up, and moves the write VIP to the new master.
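As a rough sketch of what "most up‑to‑date replica" means, the snippet below compares replicas by the binlog coordinates they have received from the master. The field names mirror the `Master_Log_File` and `Read_Master_Log_Pos` columns of `SHOW SLAVE STATUS`; the `ReplicaStatus` type and the helper functions are hypothetical, and MHA's real selection logic takes more factors into account than this.

```python
from typing import NamedTuple, Sequence, Tuple


class ReplicaStatus(NamedTuple):
    host: str
    master_log_file: str       # e.g. "mysql-bin.000012"
    read_master_log_pos: int   # byte offset received within that file


def binlog_coords(replica: ReplicaStatus) -> Tuple[int, int]:
    """Turn (file, position) into a tuple that sorts in replication order."""
    file_seq = int(replica.master_log_file.rsplit(".", 1)[1])
    return (file_seq, replica.read_master_log_pos)


def pick_most_up_to_date(replicas: Sequence[ReplicaStatus]) -> ReplicaStatus:
    """Return the replica that has received the most binlog from the old master."""
    if not replicas:
        raise ValueError("no replicas available for promotion")
    return max(replicas, key=binlog_coords)
```

The file sequence number is compared before the offset because a replica at a small offset in a newer binlog file is still ahead of one at a large offset in an older file.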

Optimizations were added to avoid false failovers, and the split‑brain they can lead to, when the network glitches. Before declaring the master dead, MHA Manager now also probes other machines in the same rack to distinguish a network failure from an actual node failure.
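One plausible reading of the rack probe, sketched below, is that the manager checks the master's rack neighbors whenever the master stops responding. If the neighbors are unreachable too, the problem is probably the network rather than MySQL. The function name, the return labels, and the injected `reachable` callable are assumptions for illustration; the article does not describe the actual implementation.

```python
from typing import Callable, Iterable


def classify_failure(
    master: str,
    rack_neighbors: Iterable[str],
    reachable: Callable[[str], bool],
) -> str:
    """Classify an unresponsive master using probes of its rack neighbors.

    `reachable` is a hypothetical probe (ping, TCP connect, and so on)
    supplied by the caller; it returns True if the host answered.
    """
    if reachable(master):
        return "healthy"
    neighbors = list(rack_neighbors)
    if not neighbors:
        return "unknown"          # no second opinion, so do not fail over blindly
    if any(reachable(host) for host in neighbors):
        return "master_down"      # the rack is reachable, only the master is not
    return "network_issue"        # the whole rack is dark from the manager's view
```

Under this scheme only `"master_down"` would trigger a failover; the other outcomes would raise an alert and wait.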

MHA + Zebra (DAL) Integration

Zebra is an internal Java database access middleware built on c3p0, providing read/write splitting, sharding, and SQL flow control. When MHA completes a failover, it notifies the Zebra monitor, which updates ZooKeeper to mark the old master’s read traffic as offline.

Failover flow:

MHA finishes the master switch and sends a message to Zebra monitor.

Zebra monitor updates ZooKeeper; client connections automatically re‑establish using the new configuration.
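The flow above is a publish and watch pattern, which the following sketch imitates with an in‑memory stand‑in for ZooKeeper. Nothing here is Zebra's or ZooKeeper's real API: `ConfigStore`, `Client`, `on_mha_failover`, and the key name in the usage example are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Optional


class ConfigStore:
    """In-memory stand-in for ZooKeeper: holds values and notifies watchers."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str], None]]] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def watch(self, key: str, callback: Callable[[str], None]) -> None:
        self._watchers.setdefault(key, []).append(callback)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(value)


class Client:
    """Application-side data source that follows the published master."""

    def __init__(self, store: ConfigStore, key: str) -> None:
        self.write_host = store.get(key)
        store.watch(key, self._on_change)

    def _on_change(self, new_host: str) -> None:
        # A real client would drain its pool and reconnect at this point.
        self.write_host = new_host


def on_mha_failover(store: ConfigStore, key: str, new_master: str) -> None:
    """What the monitor does when MHA reports a completed switch."""
    store.set(key, new_master)
```

For example, a client created after `store.set("orders/master", "db1")` starts with `write_host == "db1"`, and a later `on_mha_failover(store, "orders/master", "db2")` switches it to `"db2"` without any change on the application side.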

MHA + Proxy Integration

In addition to Zebra, a Proxy‑based middleware is used for non‑Java applications. After MHA switches, Proxy is notified to adjust read/write traffic. This adds flexibility but introduces an extra network hop, increasing response time and potential failure rate. Documentation is available at https://github.com/Meituan-Dianping/DBProxy.

Future Architecture Considerations

Remaining issues with the current MHA‑based design:

Manager node remains a single point of failure.

Asynchronous binlog replication can cause data loss during failover.

Large master‑slave lag increases catch‑up time.

Proposed mitigations:

Enable semi‑synchronous replication for critical services, which avoids data loss in more than 95% of failover scenarios.

Deploy a distributed Agent on every MySQL node; agents participate in an election to select a new master, eliminating the manager as a single point of failure.

Introduce a Binlog Server that acknowledges writes before they are considered committed, ensuring no data loss on master failure.
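The idea shared by semi‑sync replication and a Binlog Server is that a write is only reported as committed once some other machine has a copy of its binlog event. The toy model below shows just that rule; `Receiver`, `commit_with_ack`, and `min_acks` are hypothetical names. It also ignores one real‑world detail: MySQL semi‑sync falls back to asynchronous replication when no acknowledgement arrives within `rpl_semi_sync_master_timeout`, which is one reason the protection is not absolute.

```python
from typing import List


class Receiver:
    """A replica or Binlog Server that stores events and acknowledges them."""

    def __init__(self, name: str, online: bool = True) -> None:
        self.name = name
        self.online = online
        self.log: List[str] = []

    def receive(self, event: str) -> bool:
        if not self.online:
            return False          # no acknowledgement from an unreachable node
        self.log.append(event)
        return True


def commit_with_ack(
    event: str,
    master_log: List[str],
    receivers: List[Receiver],
    min_acks: int = 1,
) -> bool:
    """Write to the master's binlog, then require acknowledgements.

    Returns True only if at least `min_acks` receivers stored the event,
    meaning the write would survive the loss of the master.
    """
    master_log.append(event)
    acks = sum(1 for receiver in receivers if receiver.receive(event))
    return acks >= min_acks
```

Any event for which `commit_with_ack` returned True exists in at least one receiver's log, so a failover to that receiver (or a catch‑up from it) would not lose the write.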

Distributed Agent HA Design

Each node runs an Agent; upon failure, agents vote to elect a suitable replica as the new master, removing reliance on a central manager.
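A minimal sketch of the voting step, assuming each surviving agent casts one vote and a candidate must win a strict majority of the full cluster (not just of the agents that responded). The article does not specify the election protocol, so the `elect_master` function and the majority rule are assumptions.

```python
from collections import Counter
from typing import Dict, Optional


def elect_master(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the winning candidate, or None if nobody has a strict majority.

    `votes` maps agent name -> candidate it votes for. Counting against
    `cluster_size` rather than len(votes) keeps a minority partition from
    electing its own master.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count > cluster_size // 2 else None
```

With five agents, two votes for the same replica are not enough, while three are. Counting against the full cluster size is what prevents two partitions from each promoting a master.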

MySQL Group Replication (MGR) with Paxos

Recent MySQL community work introduces Paxos‑based Group Replication (MGR). Consistency and failover are handled internally, hiding complexity from upper layers.

During a failure, the cluster performs an internal leader election and switches automatically, then pushes the new topology to Zebra monitor for traffic reconfiguration. Drawbacks include the need for majority ACKs on each write, which adds latency, and the need for at least three nodes, usually an odd number because an even count adds no extra fault tolerance, which increases resource usage.
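The node‑count trade‑off follows from majority arithmetic, which the short sketch below works through. It is generic quorum math rather than anything specific to MGR's implementation, and the function names are made up for the example.

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum(nodes)
```

A group of 3 needs 2 acknowledgements and survives 1 failure; a group of 4 needs 3 and still survives only 1; a group of 5 needs 3 and survives 2. Adding a fourth node therefore costs resources without improving fault tolerance, which is why odd sizes are the usual choice.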

Conclusion

The article outlines Meituan‑Dianping's progression from MMM to MHA‑based solutions, the integration of Zebra and Proxy middleware, and explores emerging industry practices such as distributed agents, semi‑sync replication, and Paxos‑driven Group Replication. While no perfect HA solution exists, continuous innovation drives more resilient MySQL deployments.
