Design and Implementation of WMHA: A Modified MySQL High‑Availability Solution
This article explains the need for high‑availability MySQL services, critiques the original in‑house HA approach, and details how the mature MHA framework was extended into WMHA with added VIP monitoring, enhanced failover procedures, richer notifications, and a reorganized deployment structure to improve reliability and reduce DBA intervention.
High availability (HA) is a baseline requirement for Internet services: three‑nines availability (99.9%) still permits roughly 8.8 hours of downtime per year, while five‑nines (99.999%) allows only about five minutes per year.
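For concreteness, the downtime budgets behind those figures can be computed directly. A quick back‑of‑the‑envelope helper (not from the article):

```python
# Downtime budget implied by an availability target, in minutes per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability):
    """e.g. availability=0.999 for three nines, 0.99999 for five nines."""
    return (1 - availability) * MINUTES_PER_YEAR
```

Three nines comes out to about 525.6 minutes (roughly 8.8 hours) per year; five nines to about 5.3 minutes.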
The original in‑house MySQL HA solution at 58 Group suffered from several drawbacks: no data catch‑up during failover, inefficient single‑process scanning driven by configuration files, no support for online switching, and health checks based on too few metrics, which risked false‑positive switches.
To address these issues with minimal cost, the team adopted the mature MHA (Master High Availability) framework and created a customized version called WMHA, adding features such as LVS VIP detection and a switch queue.
MHA consists of a Manager component that can run on a dedicated host or a slave, and Node components deployed on each MySQL server; the Manager periodically probes masters, promotes a suitable slave when a failure occurs, and strives to preserve data consistency by handling binary logs and supporting semi‑synchronous replication.
The MHA failover workflow includes saving binlog events from the crashed master, identifying the most up‑to‑date slave, applying relay‑log differences, promoting a slave to master, and re‑pointing other slaves.
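The sequence above can be sketched as a small orchestration function. This is an illustrative sketch, not MHA's actual Perl implementation; all helper callables and field names here are assumptions:

```python
# Illustrative sketch of the MHA failover sequence: rescue binlog events,
# pick the most up-to-date slave, apply relay-log diffs, promote, re-point.

def pick_candidate(slaves):
    """Choose the slave that has read furthest into the master's binlog."""
    return max(slaves, key=lambda s: (s["binlog_file"], s["binlog_pos"]))

def failover(slaves, save_binlog, apply_diff, promote, repoint):
    """Run the failover steps in order; the callables are injected stubs."""
    events = save_binlog()              # 1. save events from the crashed master
    candidate = pick_candidate(slaves)  # 2. most up-to-date slave wins
    for s in slaves:
        if s is not candidate:
            apply_diff(s, candidate)    # 3. catch each slave up to the candidate
    promote(candidate, events)          # 4. promote the candidate to master
    for s in slaves:
        if s is not candidate:
            repoint(s, candidate)       # 5. re-point the remaining slaves
    return candidate
```

Injecting the step functions keeps the ordering logic testable independently of any real MySQL hosts.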
Key MHA Manager tools include masterha_check_ssh, masterha_check_repl, masterha_manager, masterha_check_status, masterha_master_monitor, masterha_master_switch, and masterha_conf_host.
Key MHA Node tools include save_binary_logs, apply_diff_relay_logs, filter_mysqlbinlog, and purge_relay_logs.
WMHA modifies the detection flow by adding VIP connectivity checks alongside SSH and MySQL status checks, and updates VIP information during failover; diagrams illustrate the enhanced detection and switch processes.
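A VIP connectivity check of the kind WMHA adds can be as simple as a TCP connect with a timeout. The article does not show WMHA's actual probe, so the following is a hedged sketch:

```python
import socket

# Sketch of a VIP reachability probe: attempt a TCP connection to the
# VIP on the MySQL port within a short timeout. The real WMHA check may
# differ (e.g. it could also validate LVS real-server state).

def vip_reachable(vip, port=3306, timeout=1.0):
    """Return True if a TCP connection to vip:port succeeds within timeout."""
    try:
        with socket.create_connection((vip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Combined with the existing SSH and MySQL status checks, a failed VIP probe gives the Manager one more signal before it decides a switch is warranted.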
The WMHA switch procedure follows these steps: read and validate configuration, perform SSH connectivity tests, re‑verify master status, check VIP health, gather replication and lag information, apply diff logs, and finally execute the data‑catch‑up and switch.
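The pre‑switch validation can be modeled as an ordered gate that aborts on the first failing check. A minimal sketch, with check names mirroring the steps above and the check functions themselves assumed:

```python
# Sketch of WMHA's switch gate: preconditions run in order, and the
# switch is aborted on the first failure. Checks are injected callables
# returning True/False; names follow the procedure in the article.

def run_switch(checks, do_switch):
    """checks: list of (name, callable) pairs; abort on first failure."""
    for name, check in checks:
        if not check():
            return f"aborted at {name}"
    do_switch()  # data catch-up and the actual switch happen last
    return "switched"
```

Running the cheap checks (configuration, SSH) before the expensive ones (replication lag, diff logs) keeps a doomed switch from doing unnecessary work.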
Message notifications in WMHA are expanded beyond the single‑type alerts of native MHA, categorizing events such as start/end of online and fault failovers, and supporting SMS, email, and future IM alerts.
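One way to model the expanded event categories is an enum plus pluggable senders. The category names follow the article; everything else (function names, message format) is illustrative:

```python
from enum import Enum

# Event categories from the article; the message format and sender
# interface are illustrative, not WMHA's actual code.

class Event(Enum):
    ONLINE_SWITCH_START = "online switch started"
    ONLINE_SWITCH_END = "online switch finished"
    FAILOVER_START = "fault failover started"
    FAILOVER_END = "fault failover finished"

def notify(event, cluster, senders):
    """Fan one categorized event out to every configured channel."""
    msg = f"[{cluster}] {event.value}"
    for send in senders:  # e.g. SMS, email, future IM channels
        send(msg)
    return msg
```

Keeping channels as a list of callables makes adding an IM sender later a one‑line change.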
WMHA adopts a per‑cluster directory layout, placing status files in each instance’s work directory and organizing configuration, logs, and binaries separately for easier management.
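The described layout might be initialized along these lines; the actual directory and file names WMHA uses are not given in the article, so all names below are placeholders:

```python
from pathlib import Path

# Hypothetical per-cluster layout: configuration, logs, and binaries in
# separate subdirectories, with the status file in the instance workdir.

def init_cluster_dirs(base, cluster):
    """Create the per-cluster directory tree and an empty status file."""
    root = Path(base) / cluster
    for sub in ("conf", "logs", "bin", "workdir"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "workdir" / f"{cluster}.status").touch()
    return root
```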
A maximum queue length parameter limits concurrent failovers to prevent large‑scale switches caused by network issues, logging “Too many failover…” when the threshold is exceeded.
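The switch‑queue guard can be sketched as a sliding‑window counter that rejects new failovers past a threshold. Parameter names and the time window are assumptions; the log message follows the article's (truncated) form:

```python
import time
from collections import deque

# Sketch of the switch-queue guard: once the number of failovers inside
# a recent time window reaches max_queue_len, further failovers are
# refused, so a network blip cannot trigger a mass switch.

class FailoverQueue:
    def __init__(self, max_queue_len, window_sec=300):
        self.max_queue_len = max_queue_len
        self.window_sec = window_sec
        self.events = deque()  # (timestamp, cluster) pairs

    def try_enqueue(self, cluster, now=None):
        now = time.time() if now is None else now
        while self.events and now - self.events[0][0] > self.window_sec:
            self.events.popleft()  # expire events outside the window
        if len(self.events) >= self.max_queue_len:
            return False, "Too many failover ..."  # log line per the article
        self.events.append((now, cluster))
        return True, "enqueued"
```

Expiring old entries means the guard recovers on its own once the network stabilizes, without DBA intervention.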
Future plans include using etcd to build a highly available WMHA cluster and deploying sentinel agents across network segments to mitigate risks from poor connectivity.
In conclusion, while MHA remains a widely used HA solution, WMHA enhances its robustness, reduces manual DBA intervention, and better aligns with modern, high‑scale database environments.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.