How Meituan Dianping Evolved MySQL HA: From MMM to MHA+Zebra and Beyond
This article traces Meituan Dianping's MySQL high‑availability journey, detailing the legacy MMM system, its migration to MHA, integration with Zebra and Proxy middleware, and future architectural ideas such as distributed agents, semi‑sync replication, and MySQL Group Replication.
Meituan Dianping has continuously refined its MySQL high‑availability (HA) architecture over the past few years, moving from an early MMM (Multi‑Master Replication Manager for MySQL) solution to a more robust MHA‑based stack, and is exploring future designs to address the remaining challenges.
MMM Architecture and Limitations
Before 2015 the company relied on MMM, which exposed one write virtual IP (VIP) and multiple read VIPs across the MySQL nodes. Each node ran an mmm‑agent that reported heartbeats to a central mmm‑manager. The manager handled failures as follows:
If a slave failed, the manager removed its read VIP and migrated the VIP to another healthy node.
If the master failed, the manager would lock the dead master, select a candidate slave as the new master, perform binlog‑based data catch‑up, then move the write VIP to the new master and re‑attach other nodes.
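The two failure cases above can be sketched roughly as follows. This is an illustrative model only, not MMM's actual code: the `Node` class, the `handle_failure` function, and the "first healthy slave" candidate rule are all assumptions made for the example, and the binlog catch‑up step is reduced to a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str                      # "master" or "slave"
    healthy: bool = True
    vips: set = field(default_factory=set)


def handle_failure(nodes, failed_name):
    """Move the failed node's VIPs to a healthy node and return that node."""
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False
    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        raise RuntimeError("no healthy node left to take over VIPs")

    if failed.role == "master":
        # Hypothetical candidate rule: take the first healthy slave.
        # A real manager would first catch the candidate up from binlogs
        # and then re-attach the remaining slaves to it.
        target = next(n for n in survivors if n.role == "slave")
        target.role = "master"
    else:
        # Slave failure: its read VIP simply migrates to another healthy node.
        target = survivors[0]

    target.vips |= failed.vips
    failed.vips = set()
    return target
```

Even in this toy form, every decision runs through one function on one manager, which is the single point of failure listed among the problems below.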
Key problems of MMM included:
Proliferation of VIPs making management difficult.
Over‑sensitive agents causing false VIP loss and manager mis‑judgments.
Single‑point failure of the manager.
ARP‑based VIPs limited HA to the same LAN segment.
MMM, originally a Google project, is no longer maintained upstream; Meituan contributed its own patches to the open‑source community.
Adoption of MHA
Starting in 2015 the team switched to MHA (MySQL Master High Availability), a tool written by Yoshinori Matsunobu (at DeNA, before he joined Facebook) that focuses solely on master HA. When the master fails, MHA selects the most up‑to‑date slave, synchronises the missing binlog events, and moves the write VIP to the new master.
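One way to picture "most up‑to‑date slave" is to compare how far each slave has read the master's binlog. The sketch below is not MHA's code; `ReplicaStatus` and `pick_latest` are hypothetical names, and the fields are assumed to hold the `Master_Log_File` and `Read_Master_Log_Pos` values reported by `SHOW SLAVE STATUS`.

```python
from typing import NamedTuple


class ReplicaStatus(NamedTuple):
    host: str
    master_log_file: str      # e.g. "mysql-bin.000042"
    read_master_log_pos: int  # byte offset read within that file


def binlog_coordinates(replica):
    """Return (file sequence number, offset) so replicas can be ordered."""
    sequence = int(replica.master_log_file.rsplit(".", 1)[1])
    return (sequence, replica.read_master_log_pos)


def pick_latest(replicas):
    """Pick the replica that has read furthest into the master's binlog."""
    return max(replicas, key=binlog_coordinates)


replicas = [
    ReplicaStatus("db2", "mysql-bin.000041", 9000),
    ReplicaStatus("db3", "mysql-bin.000042", 120),
    ReplicaStatus("db4", "mysql-bin.000042", 75),
]
print(pick_latest(replicas).host)  # db3
```

The file sequence number is compared before the offset, so a slave that is already reading a newer binlog file wins even if its offset within that file is small.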
[Figure: basic one‑master‑one‑slave MHA architecture]
To avoid split‑brain scenarios caused by network glitches, the MHA manager now probes other machines in the same rack to distinguish network failures from node failures before triggering a switch.
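The rack‑probe idea amounts to a small decision rule. The following is a minimal sketch under the assumption that the manager already has reachability results for the master and for some other machines in the same rack; `classify_outage` and its return labels are invented for illustration.

```python
def classify_outage(master_reachable, rack_peers_reachable):
    """Decide what an unreachable master most likely means.

    master_reachable: bool result of probing the master itself.
    rack_peers_reachable: list of bool probe results for other machines
    in the master's rack.
    """
    if master_reachable:
        return "healthy"
    if not rack_peers_reachable:
        # Nothing to compare against, so no conclusion can be drawn.
        return "unknown"
    if any(rack_peers_reachable):
        # Neighbours answer but the master does not: the node itself is down.
        return "node_failure"
    # The whole rack is silent: more likely a network problem than a dead node.
    return "network_failure"
```

Under this rule a failover would only be triggered for `"node_failure"`; the other outcomes would wait for further probes.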
Integration with Zebra (DAL)
Zebra is an internal Java database‑access middleware that provides read/write splitting, sharding, and SQL flow control. Combined with MHA, Zebra becomes a critical component of the HA pipeline.
When MHA completes a failover, it notifies the Zebra monitor, which updates ZooKeeper configuration to mark the former master’s read traffic as offline. Zebra also periodically checks node health (every 10‑40 seconds) and removes unhealthy nodes from the routing table.
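The periodic health check can be thought of as filtering the read routing table on each pass. This is a hedged sketch, not Zebra's implementation (Zebra itself is written in Java): `prune_unhealthy` and the `probe` callback are hypothetical, and the sketch covers only removal, since the text does not say how recovered nodes are re‑added.

```python
def prune_unhealthy(read_nodes, probe):
    """Split nodes into those kept in the routing table and those removed.

    read_nodes: list of node identifiers currently receiving read traffic.
    probe: callable returning True when a node passes its health check.
    """
    healthy, removed = [], []
    for node in read_nodes:
        if probe(node):
            healthy.append(node)
        else:
            removed.append(node)
    return healthy, removed


# One pass of the check; in practice it would repeat on a timer
# (the article gives an interval of 10 to 40 seconds).
healthy, removed = prune_unhealthy(["r1", "r2", "r3"], lambda n: n != "r2")
print(healthy, removed)  # ['r1', 'r3'] ['r2']
```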
After a node change, client applications receive the new configuration, establish fresh connections, and gracefully close old ones.
[Figure: overall MHA and Zebra failover flow]
Proxy Middleware
In addition to Zebra, Meituan also uses a Proxy‑based middleware that works with MHA. After a failover, MHA notifies the Proxy to adjust read/write traffic. Proxy offers broader language support but adds an extra network hop, increasing response time and potential failure rate.
Future Architecture Considerations
Remaining issues with the current MHA design are the manager’s single point of failure and possible data loss due to asynchronous binlog replication. The team addresses these by:
Deploying semi‑synchronous replication for critical workloads, which is reported to prevent data loss in more than 95 % of failure scenarios.
Using a distributed set of agents that elect a new master when a node fails, eliminating the manager bottleneck.
Industry‑level alternatives explored include:
Binlog Server
A dedicated Binlog Server acts as a synthetic slave, acknowledging each write. In a failure, data can be recovered directly from the Binlog Server, preventing loss.
Distributed Agent HA
Each MySQL node runs an agent; upon failure, agents participate in an election to promote a suitable slave, removing reliance on a central manager.
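The article does not describe the election protocol, so the following is only a sketch of one plausible rule: each surviving agent votes for a slave, and a candidate is promoted only if it collects votes from a strict majority of the full cluster. The `elect_master` function and the majority rule are assumptions, not something the source specifies.

```python
from collections import Counter


def elect_master(votes, cluster_size):
    """Return the winning candidate, or None if nobody reaches a majority.

    votes: mapping of agent name -> candidate host that agent voted for.
    cluster_size: total number of agents, including any that have failed.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # Requiring more than half of ALL agents prevents a minority partition
    # from promoting its own master.
    return candidate if count > cluster_size // 2 else None
```

For a three‑node cluster whose master has failed, two agreeing votes are enough (`elect_master({"b": "b", "c": "b"}, 3)` returns `"b"`), while a split vote returns `None` and the election would have to be retried.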
MySQL Group Replication (MGR)
MGR provides built‑in HA and consistency on top of a Paxos‑based group communication layer. When a failure occurs, the cluster switches internally and updates Zebra via ZooKeeper. Drawbacks are that each write must be acknowledged by a majority of members (latency overhead) and that at least three nodes are needed to tolerate a single failure, with odd member counts preferred, which increases resource usage.
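The node‑count guidance follows directly from majority arithmetic, which the short calculation below makes concrete. The function names are chosen for the example only.

```python
def majority(n):
    """Smallest number of members that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many members can fail while a majority is still available."""
    return n - majority(n)


for n in (2, 3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Two nodes tolerate no failure at all, and a fourth node adds cost without adding tolerance over three, which is why three is the practical minimum and odd sizes are the usual choice.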
Conclusion
The article outlines Meituan Dianping’s progression from MMM to MHA‑plus‑Zebra and MHA‑plus‑Proxy, compares these with other HA approaches, and acknowledges that no single solution is perfect. Ongoing research focuses on eliminating single points of failure, reducing data‑loss risk, and achieving cross‑segment HA through distributed agents and advanced replication techniques.
ITPUB