Improving MHA Network Tolerance: Testing ping_interval and secondary_check_script
This article analyzes how adjusting the MHA ping_interval and enabling secondary_check_script can increase MySQL high‑availability cluster tolerance to network packet loss and corruption, presenting test setups, parameter configurations, log observations, and conclusions on their impact on failover behavior.
Problem Description
Although MHA is no longer updated, it remains popular for its strong high‑availability capabilities. A customer experienced severe network packet loss that triggered a failover in an MHA‑managed MySQL master‑slave cluster. Because MHA requires nodes to be re‑added manually after each failover, frequent network issues can disrupt production.
Customer Question
How can the tolerance of MHA be improved if similar network problems recur?
Test Environment
The test environment mirrors the customer's production setup as closely as possible to achieve realistic results.
Test Parameters
ping_interval: The interval (in seconds) between manager‑node ping checks to the master. After three consecutive missed pings, the manager declares the master down. Default is 3 seconds.
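The arithmetic behind this parameter can be sketched as follows. This is a minimal illustration, not MHA code: the function name is hypothetical, the intervals 5 and 10 are example values not taken from the tests, and the result is only a rough lower bound because connection timeouts can lengthen real detection.

```python
def detection_window(ping_interval: float, missed_pings: int = 3) -> float:
    """Rough lower bound, in seconds, on how long the master must stay
    unreachable before the manager declares it down.

    Assumes the rule stated above: `missed_pings` consecutive misses,
    spaced `ping_interval` seconds apart.
    """
    if ping_interval <= 0:
        raise ValueError("ping_interval must be positive")
    if missed_pings < 1:
        raise ValueError("missed_pings must be at least 1")
    return ping_interval * missed_pings


if __name__ == "__main__":
    # 3 is the default; 5 and 10 are illustrative values only.
    for interval in (3, 5, 10):
        window = detection_window(interval)
        print(f"ping_interval={interval}s -> about {window:.0f}s of missed pings")
```

A larger ping_interval widens this window, so a short burst of packet loss is less likely to span three consecutive checks.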
secondary_check_script: An external script (masterha_secondary_check) that allows the manager to verify master availability via multiple remote hosts, adding extra network routes for health checks. Example setting:
secondary_check_script = masterha_secondary_check -s remote_host1 -s remote_host2
The manager uses both remote_host1 and remote_host2 to assess master health; only when both checks fail is the master considered unavailable.
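The decision rule described above can be modeled in a few lines. This is a sketch of the rule as stated in this article, not the actual masterha_secondary_check implementation; the function name and the boolean-per-route representation are assumptions made for illustration.

```python
from typing import Mapping


def master_considered_down(route_results: Mapping[str, bool]) -> bool:
    """Return True only when every secondary route failed to reach the master.

    `route_results` maps a remote host name to True if the master was
    reachable through that host, False otherwise (hypothetical encoding).
    """
    if not route_results:
        raise ValueError("at least one remote host is required")
    # A single successful route is enough to keep the master "alive".
    return not any(route_results.values())


if __name__ == "__main__":
    # One route still reaches the master: failover is suppressed.
    print(master_considered_down({"remote_host1": False, "remote_host2": True}))
    # Both routes fail: the master is treated as unavailable.
    print(master_considered_down({"remote_host1": False, "remote_host2": False}))
```

Requiring every route to fail is what makes the check tolerant of a fault on the single path between the manager and the master.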
Test Results and Analysis
Packet loss, corruption, retransmission, and latency scenarios were tested under a sysbench workload. Packet loss and packet corruption each triggered an MHA failover within the 5‑minute observation window.
Corrupted Packet Scenario
Failover occurred when packet corruption reached 70%.
At a 70% corruption rate with secondary_check_script disabled, increasing ping_interval reduced the likelihood of failover. Enabling secondary_check_script also improved tolerance.
Packet Loss Scenario
Failover occurred when packet loss reached 50%.
With the default ping_interval (3 s), MHA failed over during the test; increasing ping_interval and enabling secondary_check_script prevented failover under the same loss conditions.
Log Output Analysis
ping_interval Log Sample
10.186.63.40(10.186.63.40:7788) (current master)
+--10.186.63.153(10.186.63.153:7788)
+--10.186.63.52(10.186.63.52:7788)
Thu Jan 28 16:18:17 2021 - [info] Set master ping interval 3 seconds.
... (additional log lines showing warnings, timeouts, and failover decisions) ...
After setting ping_interval, the manager logs a ping every 3 seconds; three consecutive failures trigger a master‑unreachable warning and initiate failover.
secondary_check_script Log Sample
10.186.63.40(10.186.63.40:7788) (current master)
+--10.186.63.153(10.186.63.153:7788)
+--10.186.63.52(10.186.63.52:7788)
Mon Jan 25 15:51:36 2021 - [info] Set secondary check script: masterha_secondary_check -s 10.186.63.153 -s 10.186.63.52
... (additional log lines showing secondary checks, SSH results, and continued availability) ...
When secondary_check_script is configured, the manager performs additional network checks, and as long as at least one remote host reports the master reachable, failover is suppressed.
Conclusion
Single‑variable experiments show that both increasing ping_interval and enabling secondary_check_script extend the decision window and reduce the frequency of MHA failovers under poor network conditions. While tolerance improves, complete elimination of failover cannot be guaranteed.
The test environment differs from production; results have stochastic elements and should be used as reference.
secondary_check_script adds multiple network routes for master health verification, significantly enhancing tolerance.
Increasing ping_interval delays fault detection. That trade‑off may be acceptable for less critical workloads, but it also lengthens the time a genuine outage goes unnoticed.
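One way to see why a longer ping_interval lowers failover frequency is a toy probability model. It is purely illustrative and was not part of the tests: it assumes each ping check fails independently with a hypothetical probability p_fail (which is not the same as the packet loss rate, since TCP retransmits), and it asks how likely three consecutive failures are within the 5‑minute observation window.

```python
def prob_failover(p_fail: float, checks: int, run_needed: int = 3) -> float:
    """Probability of at least one run of `run_needed` consecutive failed
    checks among `checks` independent checks, each failing with `p_fail`."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be within [0, 1]")
    # state[j]: probability that the trailing failure run has length j
    # and no full run has been seen yet.
    state = [1.0] + [0.0] * (run_needed - 1)
    triggered = 0.0
    for _ in range(checks):
        new_state = [0.0] * run_needed
        for run_length, prob in enumerate(state):
            new_state[0] += prob * (1.0 - p_fail)   # a success resets the run
            if run_length + 1 == run_needed:
                triggered += prob * p_fail          # the run is complete
            else:
                new_state[run_length + 1] += prob * p_fail
        state = new_state
    return triggered


if __name__ == "__main__":
    window_seconds = 300   # the 5-minute observation window
    p_fail = 0.2           # hypothetical per-check failure probability
    for interval in (3, 10):   # 10 is an illustrative value
        checks = window_seconds // interval
        print(interval, checks, round(prob_failover(p_fail, checks), 3))
```

Under these assumptions, a longer interval means fewer checks in the same window and therefore fewer chances for three failures in a row, at the cost of slower detection of a real outage.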
Aikesheng Open Source Community