Improving MHA Network Tolerance: Testing ping_interval and secondary_check_script

This article analyzes how adjusting the MHA ping_interval and enabling secondary_check_script can increase MySQL high‑availability cluster tolerance to network packet loss and corruption, presenting test setups, parameter configurations, log observations, and conclusions on their impact on failover behavior.

Aikesheng Open Source Community

Mar 15, 2021

Problem Description

MHA is no longer updated, but it remains popular for its strong high‑availability capabilities. A customer experienced severe network packet loss that triggered a failover in an MHA‑managed MySQL master‑slave cluster. Because nodes must be manually re‑added to MHA after each failover, frequent network issues can disrupt production.

Customer Question

How can the tolerance of MHA be improved if similar network problems recur?

Test Environment

The test environment mirrors the customer's production setup as closely as possible to achieve realistic results.

Test Parameters

ping_interval : How often (in seconds) the manager node pings the master. After three consecutive missed pings, the manager declares the master down. The default is 3 seconds.

secondary_check_script : An external script (masterha_secondary_check) that allows the manager to verify master availability via multiple remote hosts, adding extra network routes for health checks.

secondary_check_script = masterha_secondary_check -s remote_host1 -s remote_host2

The manager uses both remote_host1 and remote_host2 to assess master health; only when both checks fail is the master considered unavailable.
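As a minimal sketch of how the two parameters sit together in a manager configuration, the snippet below renders a [server default] fragment. The helper name render_server_default and the ping_interval value of 5 are illustrative assumptions, not settings taken from the tests; remote_host1 and remote_host2 are the placeholder hosts from the example above.

```python
from __future__ import annotations


def render_server_default(ping_interval: int, check_hosts: list[str]) -> str:
    """Render a [server default] fragment with the two parameters under test.

    Hypothetical helper for illustration; it only formats text.
    """
    if ping_interval <= 0:
        raise ValueError("ping_interval must be a positive number of seconds")
    lines = ["[server default]", f"ping_interval={ping_interval}"]
    if check_hosts:
        # One -s option per remote host, as in the example above.
        routes = " ".join(f"-s {host}" for host in check_hosts)
        lines.append(f"secondary_check_script=masterha_secondary_check {routes}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 5 is an arbitrary example value, not a recommendation.
    print(render_server_default(5, ["remote_host1", "remote_host2"]), end="")
```

Passing an empty host list omits secondary_check_script, which corresponds to the "disabled" case in the tests.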

Test Results and Analysis

Packet loss, corruption, retransmission, and latency scenarios were simulated using sysbench. Of these, packet loss and corruption each caused an MHA failover within the 5‑minute observation window.

Corrupted Packet Scenario

Failover occurred when packet corruption reached 70%.

At a 70% corruption rate with secondary_check_script disabled, increasing ping_interval reduced the likelihood of failover. Enabling secondary_check_script also improved tolerance.

Packet Loss Scenario

Failover occurred when packet loss reached 50%.

With the default ping_interval (3 s), MHA failed over during the test; increasing ping_interval and enabling secondary_check_script prevented failover under the same loss conditions.
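One way to see why a longer ping_interval lowers the chance of failover within a fixed observation window is a toy probability model. The sketch below assumes each ping fails independently with a fixed probability q, which real TCP‑based health checks do not strictly satisfy, and it counts a failover as three consecutive failed pings. The function names, the value of q, and the interval of 10 seconds are all illustrative assumptions, not values from the tests.

```python
def pings_in_window(window_seconds: int, ping_interval: int) -> int:
    """Number of pings that fit in the observation window."""
    return window_seconds // ping_interval


def prob_run_of_failures(n_pings: int, q: float, k: int = 3) -> float:
    """Probability of at least k consecutive failures among n_pings pings.

    Toy model: each ping fails independently with probability q.
    """
    # state[j] = probability of currently having j trailing failures
    # without having seen a run of k so far.
    state = [1.0] + [0.0] * (k - 1)
    hit = 0.0  # probability that a run of k has already occurred
    for _ in range(n_pings):
        new = [0.0] * k
        for j, p in enumerate(state):
            new[0] += p * (1 - q)      # a successful ping resets the streak
            if j + 1 == k:
                hit += p * q           # this failure completes a run of k
            else:
                new[j + 1] += p * q    # streak grows by one
        state = new
    return hit


if __name__ == "__main__":
    window = 300  # the 5-minute observation window, in seconds
    q = 0.3       # assumed per-ping failure probability (illustrative only)
    for interval in (3, 10):
        n = pings_in_window(window, interval)
        print(interval, n, prob_run_of_failures(n, q))
```

Under these assumptions, fewer pings fit in the window at a longer interval, so there are fewer opportunities for three failures in a row. The model says nothing about how long a real outage goes undetected.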

Log Output Analysis

ping_interval Log Sample

10.186.63.40(10.186.63.40:7788) (current master)
+--10.186.63.153(10.186.63.153:7788)
+--10.186.63.52(10.186.63.52:7788)
Thu Jan 28 16:18:17 2021 - [info] Set master ping interval 3 seconds.
... (additional log lines showing warnings, timeouts, and failover decisions) ...

After setting ping_interval, the manager logs a ping every 3 seconds; three consecutive failures trigger a master‑unreachable warning and initiate failover.
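When reviewing manager logs like the sample above, it can help to confirm which ping interval was actually in effect. The following is a minimal sketch that parses lines in the timestamp, level, and message layout shown in the sample and extracts the interval. The function names are hypothetical, and the pattern is based only on the single "Set master ping interval" line shown here, so other MHA versions may format it differently.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Thu Jan 28 16:18:17 2021 - [info] Set master ping interval 3 seconds.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
    r" - \[(?P<level>\w+)\] (?P<msg>.*)$"
)
INTERVAL_RE = re.compile(r"Set master ping interval (\d+) seconds")


def parse_line(line):
    """Return (timestamp, level, message), or None for non-matching lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. the topology lines at the top of the sample
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, m.group("level"), m.group("msg")


def ping_interval_from_log(lines):
    """Return the first ping interval announced in the log, or None."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        m = INTERVAL_RE.search(parsed[2])
        if m:
            return int(m.group(1))
    return None


if __name__ == "__main__":
    sample = [
        "10.186.63.40(10.186.63.40:7788) (current master)",
        "+--10.186.63.153(10.186.63.153:7788)",
        "Thu Jan 28 16:18:17 2021 - [info] Set master ping interval 3 seconds.",
    ]
    print(ping_interval_from_log(sample))
```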

secondary_check_script Log Sample

10.186.63.40(10.186.63.40:7788) (current master)
+--10.186.63.153(10.186.63.153:7788)
+--10.186.63.52(10.186.63.52:7788)
Mon Jan 25 15:51:36 2021 - [info] Set secondary check script: masterha_secondary_check -s 10.186.63.153 -s 10.186.63.52
... (additional log lines showing secondary checks, SSH results, and continued availability) ...

When secondary_check_script is configured, the manager performs additional network checks, and as long as at least one remote host reports the master reachable, failover is suppressed.
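The decision rule described above can be written as a small predicate. This is a sketch of the logic as the article describes it, not MHA's actual implementation; the function and parameter names are hypothetical.

```python
from __future__ import annotations


def master_considered_dead(direct_check_ok: bool,
                           route_reports_reachable: list[bool]) -> bool:
    """Decide whether failover should proceed.

    direct_check_ok: result of the manager's own ping checks.
    route_reports_reachable: one entry per remote host given to the
        secondary check; True means that host could reach the master.
        An empty list models secondary_check_script being disabled.
    """
    if direct_check_ok:
        return False
    # Failover is suppressed if at least one remote route still sees
    # the master; with no routes, the direct check alone decides.
    return not any(route_reports_reachable)


if __name__ == "__main__":
    print(master_considered_dead(False, [True, False]))   # one route still works
    print(master_considered_dead(False, [False, False]))  # all routes fail
```

With an empty list the function falls back to the direct check alone, which mirrors the tests run with secondary_check_script disabled.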

Conclusion

Single‑variable experiments show that both increasing ping_interval and enabling secondary_check_script extend the decision window and reduce the frequency of MHA failovers under poor network conditions. While tolerance improves, complete elimination of failover cannot be guaranteed.

The test environment differs from production and the results contain a random component, so they should be treated as a reference only.

secondary_check_script adds multiple network routes for master health verification, significantly enhancing tolerance.

Increasing ping_interval delays fault detection, which may be desirable for less critical workloads but prolongs outage detection.
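The cost of a longer interval can be estimated with simple arithmetic: if the manager needs three consecutive missed pings, the time before it reacts is roughly three times ping_interval. The sketch below computes that rough lower bound. It ignores connection timeouts and any time spent in secondary checks, and the interval values other than the default of 3 are illustrative assumptions.

```python
def detection_delay_seconds(ping_interval: float, missed_pings: int = 3) -> float:
    """Rough lower bound on the time before the master is declared down.

    Assumes the master is declared down after `missed_pings` consecutive
    missed pings spaced `ping_interval` seconds apart; timeouts and
    secondary checks are not modeled.
    """
    if ping_interval <= 0 or missed_pings <= 0:
        raise ValueError("ping_interval and missed_pings must be positive")
    return ping_interval * missed_pings


if __name__ == "__main__":
    for interval in (3, 5, 10):  # 3 is the default; 5 and 10 are examples
        print(interval, detection_delay_seconds(interval))
```

At the default of 3 seconds this estimate is 9 seconds. Each extra second of ping_interval adds about three seconds before a genuine master failure is acted on.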
