How Meituan‑Dianping Evolved MySQL HA from MMM to MHA‑Zebra and Beyond

This article traces Meituan‑Dianping's MySQL high‑availability journey from the early MMM replication manager to the modern MHA‑Zebra and MHA‑Proxy solutions, compares each architecture, highlights their shortcomings, and outlines future directions such as distributed agents, semi‑sync replication, and Paxos‑based MySQL Group Replication.

Meituan Technology Team

Jun 30, 2017

MMM Architecture

Before 2015, Meituan‑Dianping used the Multi‑Master Replication Manager (MMM) for MySQL high availability. The cluster exposed one write VIP and multiple read VIPs. Each MySQL node ran an mmm‑agent that sent heartbeats to an mmm‑manager. When a node's heartbeats stopped, the manager performed a failover.

The manager handled two failure cases:

Read‑node failure – the manager removed the failed node’s read VIP and migrated it to another healthy node.

Master failure – after a timeout the manager placed a global lock, selected the most up‑to‑date slave, performed binlog catch‑up, and moved the write VIP to the new master.
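The "most up‑to‑date slave" step can be sketched as a comparison of binlog positions. This is a minimal illustration, not MMM's actual code: the `Replica` type, its field names, and the host names are hypothetical, and it assumes each replica reports the last master binlog position it received as a (file index, offset) pair.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    host: str
    alive: bool
    # Last master binlog position received by this replica (hypothetical fields).
    binlog_file_index: int
    binlog_offset: int


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Return the healthy replica that has received the most binlog, or None."""
    candidates = [r for r in replicas if r.alive]
    if not candidates:
        return None
    # Tuples compare lexicographically: later file first, then larger offset.
    return max(candidates, key=lambda r: (r.binlog_file_index, r.binlog_offset))


replicas = [
    Replica("db-2", True, 12, 4_000),
    Replica("db-3", True, 12, 9_500),
    Replica("db-4", False, 13, 100),  # furthest ahead, but unreachable
]
chosen = pick_new_master(replicas)
print(chosen.host)  # db-3
```

The remaining replicas would then be caught up to the chosen node's position before the write VIP moves.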

Problems with MMM

Large numbers of VIPs make management difficult and can cause several VIPs to be lost at once.

mmm‑agent is overly sensitive, and because the agent itself has no HA protection, an agent failure can trigger false alarms.

The manager is a single point of failure.

VIPs rely on ARP, limiting cross‑subnet or cross‑datacenter failover.

MMM is an old project (formerly hosted on Google Code) with little community activity; many bugs required custom patches (see https://github.com/cenalulu/mysql-mmm).

Transition to MHA

In 2015 the architecture was switched to MHA (MySQL Master High Availability), developed by Yoshinori Matsunobu (at DeNA at the time, later an engineer at Facebook). MHA focuses on master failover: when the master crashes, MHA selects the most up‑to‑date slave, synchronizes missing binlogs, and moves the write VIP to the new master.

MHA + Zebra (DAL)

Zebra is an internal Java database‑access middleware built on c3p0, providing read/write splitting, sharding, and SQL flow control. Integrated with MHA, Zebra updates ZooKeeper after a failover.

After MHA switches, it notifies the Zebra monitor, which marks the old master’s read traffic as offline in ZooKeeper.

Zebra monitor polls node health every 10–40 seconds and removes unhealthy nodes from ZooKeeper.

Clients that watch ZooKeeper are notified of the change and rebuild their connections to the new master.
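The polling and notification flow above can be sketched as follows. This is only an illustration under stated assumptions: `Registry` is an in‑memory stand‑in for the ZooKeeper node list rather than a real ZooKeeper client, `Monitor` and `probe` are hypothetical names, and the consecutive‑failure threshold is an assumption that the article does not mention.

```python
from typing import Callable, Dict, List, Set


class Registry:
    """In-memory stand-in for the ZooKeeper node list (not a real ZK client)."""

    def __init__(self, nodes: Set[str]) -> None:
        self.nodes: Set[str] = set(nodes)
        self._watchers: List[Callable[[Set[str]], None]] = []

    def watch(self, callback: Callable[[Set[str]], None]) -> None:
        self._watchers.append(callback)

    def remove(self, node: str) -> None:
        if node in self.nodes:
            self.nodes.discard(node)
            for callback in self._watchers:
                callback(set(self.nodes))  # push the new node list to clients


class Monitor:
    """Removes a node after `max_failures` consecutive failed health probes."""

    def __init__(self, registry: Registry, probe: Callable[[str], bool],
                 max_failures: int = 3) -> None:
        self.registry = registry
        self.probe = probe
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {}

    def poll_once(self) -> None:
        # sorted() copies the set, so removing nodes while looping is safe.
        for node in sorted(self.registry.nodes):
            if self.probe(node):
                self.failures[node] = 0
            else:
                self.failures[node] = self.failures.get(node, 0) + 1
                if self.failures[node] >= self.max_failures:
                    self.registry.remove(node)


registry = Registry({"db-1", "db-2"})
seen: List[Set[str]] = []          # what a watching client receives
registry.watch(seen.append)
monitor = Monitor(registry, probe=lambda n: n != "db-2", max_failures=2)
monitor.poll_once()                # first failure of db-2: still registered
monitor.poll_once()                # second failure: db-2 is removed
print(registry.nodes)              # {'db-1'}
```

In production the loop would run on the 10–40 second interval described above; the threshold trades detection speed against the risk of removing a node because of a single transient probe failure.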

Removing VIP Dependency

Because VIP‑based failover cannot cross subnets, the VIP handling was removed from MHA. After a switch, MHA informs the Zebra monitor, which rewrites ZooKeeper entries so that the new master’s real IP becomes the write endpoint and the dead master’s read traffic is removed. This “VIP‑free” approach enables cross‑subnet and cross‑datacenter failover.
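A minimal sketch of the ZooKeeper rewrite, assuming the routing entry can be modeled as one write endpoint plus a list of read endpoints. The `Route` type, the function name, and the IP addresses are hypothetical, and whether the new master keeps serving reads is an assumption not stated in the article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    write: str                # real IP of the current master
    reads: Tuple[str, ...]    # real IPs that serve read traffic


def apply_failover(route: Route, dead_master: str, new_master: str) -> Route:
    """Return the routing entry to publish after a master switch."""
    if route.write != dead_master:
        raise ValueError("dead_master is not the current write endpoint")
    # Drop the dead master from the read pool; the new master stays in it
    # (an assumption for this sketch).
    reads = tuple(ip for ip in route.reads if ip != dead_master)
    return Route(write=new_master, reads=reads)


# The addresses sit in different subnets, which a VIP could not span.
before = Route(write="10.1.0.10", reads=("10.1.0.10", "10.2.0.11", "10.3.0.12"))
after = apply_failover(before, dead_master="10.1.0.10", new_master="10.2.0.11")
print(after.write)   # 10.2.0.11
print(after.reads)   # ('10.2.0.11', '10.3.0.12')
```

Because clients connect to real IPs read from the published entry, nothing in this step depends on ARP or on the nodes sharing a subnet.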

Remaining Issues

The MHA manager remains a single point of failure.

Asynchronous binlog replication can cause data loss during master crashes.

Large master‑slave lag increases catch‑up time.

Future Architecture Ideas

To mitigate these issues Meituan‑Dianping experimented with:

Deploying semi‑synchronous replication so that more than 95% of failover scenarios lose no data.

Using a distributed set of agents that elect a new master via a consensus protocol, eliminating the MHA manager as a single point of failure.

Industry Practices for HA

Binlog Server for Zero‑Loss Master‑Slave Sync

A dedicated Binlog Server acts as a pseudo‑slave and ACKs every write. After a failure, the Binlog Server provides a reliable source for recovery, so that acknowledged writes are not lost.
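The idea can be illustrated with a toy model in which the master confirms a commit only after the Binlog Server has acknowledged the event. All class and function names here are hypothetical, and the sketch assumes a replica's log is always a prefix of the Binlog Server's log; a real deployment also has to decide what happens when an ACK times out.

```python
from typing import List


class BinlogServer:
    """Pseudo-replica that stores every event and acknowledges it."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def receive(self, event: str) -> bool:
        self.events.append(event)
        return True  # the ACK


class Master:
    def __init__(self, binlog_server: BinlogServer) -> None:
        self.binlog_server = binlog_server
        self.confirmed: List[str] = []

    def commit(self, event: str) -> None:
        # Confirm to the client only after the Binlog Server has the event.
        if not self.binlog_server.receive(event):
            raise RuntimeError("no ACK; commit not confirmed")
        self.confirmed.append(event)


def catch_up(replica_events: List[str], binlog_server: BinlogServer) -> List[str]:
    """Return the replica's log after applying the events it had not received."""
    missing = binlog_server.events[len(replica_events):]
    return replica_events + missing


binlog_server = BinlogServer()
master = Master(binlog_server)
for event in ["t1", "t2", "t3"]:
    master.commit(event)

lagging_replica = ["t1"]  # the master crashed before t2 and t3 were shipped
print(catch_up(lagging_replica, binlog_server))  # ['t1', 't2', 't3']
```

Every confirmed transaction exists on the Binlog Server by construction, so the promoted replica can recover what it missed even though the old master is gone.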

Distributed Agent HA

Each MySQL node runs an agent; on failure agents vote to elect a new master, removing reliance on a central manager.
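The voting step can be sketched as a strict‑majority count. This is not a full consensus protocol (there are no terms, retries, or handling of conflicting rounds), and the article does not name the protocol used; the function name and host names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def elect_master(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the candidate backed by a strict majority of the cluster.

    `votes` maps each voting agent to the candidate it chose. Agents that
    are down or unreachable simply have no entry.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # The majority is measured against the full cluster, not just the voters,
    # so a minority partition can never elect a master on its own.
    return candidate if count > cluster_size // 2 else None


# Five-node cluster; the old master db-1 is down and casts no vote.
votes = {"db-2": "db-2", "db-3": "db-2", "db-4": "db-2", "db-5": "db-3"}
print(elect_master(votes, cluster_size=5))  # db-2

split = {"db-2": "db-2", "db-3": "db-2", "db-4": "db-3", "db-5": "db-3"}
print(elect_master(split, cluster_size=5))  # None
```

A tied or minority result returns `None`; a real protocol would run another round rather than leave the cluster without a master.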

MySQL Group Replication (MGR) with Paxos

Recent MySQL releases include Paxos‑based Group Replication. The failover logic is built into the database, and a majority of nodes must acknowledge each write. This adds latency and works best with an odd number of nodes (at least three), but it eliminates external failover components.
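The point about node counts follows from majority arithmetic: an even‑sized group tolerates no more failures than the next smaller odd‑sized one. A short sketch (the function names are illustrative only):

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains available."""
    return n - majority(n)


for n in (3, 4, 5):
    print(f"nodes={n} majority={majority(n)} tolerated={tolerated_failures(n)}")
# nodes=3 majority=2 tolerated=1
# nodes=4 majority=3 tolerated=1
# nodes=5 majority=3 tolerated=2
```

Going from three nodes to four raises the required majority without raising the number of failures the group can survive, which is why odd group sizes are the usual choice.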
