
Designing a Zookeeper‑MHA MySQL High‑Availability Architecture: Key Insights

This talk explains how Lianjia redesigned its MySQL high‑availability setup by replacing VIP‑based failover with a Zookeeper‑driven naming service, detailing the original MHA architecture, its shortcomings, the new components, workflow, operational challenges, and a DNS‑based refinement.

ITPUB

Classic MHA‑Based MySQL HA Architecture

The traditional high‑availability solution uses MHA to monitor a MySQL master‑slave cluster. An external virtual IP (VIP) is presented to applications as the write endpoint; a separate VIP may be used for read traffic. Inside the cluster the master and standby share a VIP that is moved by a heartbeat mechanism when the master fails.

Issues with the VIP‑Centric Design

The VIP is a single point of failure: if the component providing the VIP crashes, applications lose connectivity even though the MySQL nodes remain alive.

Keepalived‑based VIP failover can suffer split‑brain, causing unstable connections, possible dirty reads, and inconsistent routing.

Managing many VIPs (read/write, multiple clusters) leads to IP waste, configuration sprawl, and complex coordination during failover.

Refactor: Introducing a Naming Service (Zookeeper)

Instead of exposing VIPs, the architecture registers each MySQL instance as a service in Zookeeper. Applications query a logical name (e.g., mysql-prod) and receive the current IP address, making topology changes transparent.

Hide MySQL cluster topology from applications.

Make underlying MySQL changes invisible to the upper layer.

Core Components

MHA continues to provide centralized monitoring and automated master‑to‑standby switchover. When monitoring starts, MHA writes the cluster’s IP, port, and role to Zookeeper and updates the data on each failover.

Zookeeper (Name Service) stores service information under a hierarchical path, for example:

/mysql/3307/instance1  -> 192.168.10.21:3306
/mysql/3307/instance2  -> 192.168.10.22:3306

The port number (3307) is used as a unique identifier for the cluster; any naming convention may be applied.
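As a minimal sketch of how a consumer might interpret these entries, the Python below splits a path and its value into a small record. The `InstanceRecord` type and `parse_entry` function are hypothetical names introduced for illustration, and the `host:port` value format is taken only from the example above; how the role is encoded in the node is not specified here, so the sketch leaves it out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceRecord:
    cluster_id: str  # e.g. "3307", the cluster identifier taken from the path
    instance: str    # e.g. "instance1"
    host: str
    port: int


def parse_entry(path: str, value: str) -> InstanceRecord:
    """Split '/mysql/<cluster>/<instance>' and 'host:port' into a record."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "mysql":
        raise ValueError(f"unexpected path: {path!r}")
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"unexpected value: {value!r}")
    return InstanceRecord(parts[1], parts[2], host, int(port))


record = parse_entry("/mysql/3307/instance1", "192.168.10.21:3306")
print(record.cluster_id, record.host, record.port)
```

Rejecting malformed paths and values early keeps a bad registry entry from silently turning into a bad address on the application side.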

MZAgent is a lightweight Java agent deployed on every application server. It subscribes to Zookeeper nodes, resolves the logical name to an IP, and writes the mapping into /etc/hosts. When Zookeeper data changes, the agent rewrites the hosts file as soon as it receives the change notification.

ZKClient Functions Used by MZAgent

subscribeChildChanges() – watches for addition or removal of child nodes (e.g., cluster scaling).

subscribeDataChanges() – watches leaf‑node value changes (IP updates) and triggers hosts‑file rewrite.
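The difference between the two subscription types can be illustrated with a toy in-memory registry. This is not the ZKClient API and does not talk to Zookeeper; `ToyRegistry` and its method names are hypothetical stand-ins that only mimic the two kinds of notification, so the sketch runs without a server.

```python
from typing import Callable, Dict, List

ChildListener = Callable[[str, List[str]], None]
DataListener = Callable[[str, str], None]


class ToyRegistry:
    """In-memory stand-in for a Zookeeper tree (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._child_listeners: Dict[str, List[ChildListener]] = {}
        self._data_listeners: Dict[str, List[DataListener]] = {}

    def subscribe_child_changes(self, parent: str, listener: ChildListener) -> None:
        self._child_listeners.setdefault(parent, []).append(listener)

    def subscribe_data_changes(self, path: str, listener: DataListener) -> None:
        self._data_listeners.setdefault(path, []).append(listener)

    def children(self, parent: str) -> List[str]:
        prefix = parent.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self._data if p.startswith(prefix))
        return sorted(n for n in names if "/" not in n)

    def set(self, path: str, value: str) -> None:
        is_new = path not in self._data
        changed = self._data.get(path) != value
        self._data[path] = value
        parent = path.rsplit("/", 1)[0]
        if is_new:  # a node appeared: notify child listeners on the parent
            for listener in self._child_listeners.get(parent, []):
                listener(parent, self.children(parent))
        if changed:  # the value differs: notify data listeners on the node
            for listener in self._data_listeners.get(path, []):
                listener(path, value)


registry = ToyRegistry()
registry.subscribe_child_changes("/mysql/3307", lambda p, kids: print("children:", kids))
registry.set("/mysql/3307/instance1", "192.168.10.21:3306")
registry.subscribe_data_changes("/mysql/3307/instance1", lambda p, v: print("data:", v))
registry.set("/mysql/3307/instance1", "192.168.10.22:3306")
```

An agent needs both kinds of listener: the child listener tells it when to subscribe to a newly added instance, and the data listener tells it when an existing instance's address has changed.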

Service Registration Flow

MHA starts monitoring the MySQL cluster.

It writes a node under /mysql/<port>/<instance> containing the instance’s IP and role.

MZAgent on each app server subscribes to the parent path, resolves the logical name, and populates /etc/hosts with entries such as 192.168.10.21 mysql-prod (hosts-file lines put the IP address first, then the name).
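The last step of this flow can be sketched as a pure function that regenerates only an agent-owned block of the hosts file, leaving every other line untouched. The marker comments and the `render_hosts` name are assumptions made for illustration; the source does not describe how MZAgent delimits its entries. Lines are written in hosts-file order, address first and name second.

```python
BEGIN = "# BEGIN mzagent-managed"  # hypothetical marker, not from the source
END = "# END mzagent-managed"      # hypothetical marker, not from the source


def render_hosts(existing: str, mapping: dict) -> str:
    """Return hosts-file text with the managed block rebuilt from `mapping`.

    `mapping` maps logical name -> IP address.
    """
    kept, inside = [], False
    for line in existing.splitlines():
        if line.strip() == BEGIN:
            inside = True
        elif line.strip() == END:
            inside = False
        elif not inside:
            kept.append(line)  # keep everything outside the managed block
    block = [BEGIN]
    block += [f"{ip}\t{name}" for name, ip in sorted(mapping.items())]
    block.append(END)
    return "\n".join(kept + block) + "\n"


print(render_hosts("127.0.0.1\tlocalhost\n", {"mysql-prod": "192.168.10.21"}))
```

Confining the agent to a marked block is one way to keep its rewrites from clobbering entries that operators maintain by hand.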

MySQL Failover Flow

When the master fails, MHA promotes a standby, updates the Zookeeper node with the new master’s IP, and publishes the change. MZAgent receives the notification, rewrites /etc/hosts, and the application reconnects using the same logical name without any code change.
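Because applications may resolve the name at any moment during a failover, the rewrite should never leave a half-written hosts file behind. The sketch below shows one common way to do that in Python: write a temporary file in the same directory, then rename it over the target. Whether MZAgent works this way is not stated in the source; `atomic_write` is a hypothetical helper.

```python
import os
import tempfile


def atomic_write(path: str, content: str) -> None:
    """Replace `path` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".hosts-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes reach the disk
        os.chmod(tmp_path, 0o644)    # mkstemp creates 0600; hosts must be readable
        os.replace(tmp_path, path)   # atomic rename on POSIX, same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Usage would look like `atomic_write("/etc/hosts", new_text)` run with sufficient privileges. The rename can fail where the hosts file is a bind mount, as in some container setups, so that case would need an in-place write instead.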

Benefits of the Naming‑Service Architecture

No single point of failure for the naming layer – loss of an MZAgent does not affect MySQL availability.

Avoids VIP split‑brain scenarios.

Simplifies management of multiple clusters on a single host; each cluster is identified by a unique Zookeeper path.

Eliminates wasted IP addresses.

Operational Challenges Discovered

Scaling the MySQL cluster requires updating /etc/hosts on every application server, which is error‑prone.

Manual edits to the hosts file can break DB connectivity.

Different business units may need custom agent configurations, increasing maintenance overhead.

DNS‑Based Refinement

To remove host‑file management, the team replaced MZAgent with an internal DNS service (Dnsmasq). Each MySQL logical name is added as a DNS record that resolves to the current master IP. Short TTL values (3–5 seconds) or explicit cache purges keep the window in which clients still resolve the old master short after a failover.

Using DNS eliminates the need for per‑host agents, provides built‑in load‑balancing for read replicas (by returning a random IP from a pool), and centralises configuration.
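As a rough sketch of what the centralised configuration could look like, the Python below renders two pieces of text: a hosts-format records file and a Dnsmasq fragment that loads it. It relies on the Dnsmasq options `addn-hosts` (an additional hosts-format file) and `local-ttl` (the TTL for answers served from local files). The file path, function names, and the choice of this mechanism are assumptions; the source does not say how the team's records were defined.

```python
def render_records(masters: dict) -> str:
    """Hosts-format lines, address first, one per logical name."""
    return "".join(f"{ip} {name}\n" for name, ip in sorted(masters.items()))


def render_dnsmasq_fragment(records_file: str, ttl_seconds: int) -> str:
    """Config fragment pointing Dnsmasq at the records file with a short TTL."""
    if ttl_seconds < 0:
        raise ValueError("TTL must be non-negative")
    return f"addn-hosts={records_file}\nlocal-ttl={ttl_seconds}\n"


# Hypothetical path and name, for illustration only.
print(render_records({"db-prod.example.internal": "192.168.10.21"}), end="")
print(render_dnsmasq_fragment("/etc/mysql-dns/records.hosts", 3), end="")
```

With this layout a failover only has to regenerate the records file and ask Dnsmasq to reload it. Dnsmasq re-reads hosts files and clears its cache on SIGHUP, which is worth verifying against the installed version.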

Key Considerations for DNS Deployment

Configure a very short TTL (e.g., 3 s) to ensure failover records are refreshed quickly.

Optionally purge the DNS cache programmatically during a switchover to guarantee immediate consistency.

Assign a unique internal domain name to each MySQL cluster (e.g., db-prod.example.internal).
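The first two considerations can be made concrete with a small model of a caching resolver. `TtlCache` below is a hypothetical class, not part of any DNS library. It shows that a cached answer can stay stale for at most one TTL after a switchover, and that an explicit purge removes even that delay.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Caches name -> IP answers for `ttl` seconds (illustrative model only)."""

    def __init__(self, lookup: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup, self._ttl, self._clock = lookup, ttl, clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._entries.get(name)
        if hit and now < hit[1]:
            return hit[0]              # still fresh: serve the cached answer
        ip = self._lookup(name)        # expired or missing: ask upstream
        self._entries[name] = (ip, now + self._ttl)
        return ip

    def purge(self) -> None:
        self._entries.clear()          # drop everything, e.g. during a switchover


records = {"db-prod.example.internal": "192.168.10.21"}
cache = TtlCache(records.__getitem__, ttl=3.0)
print(cache.resolve("db-prod.example.internal"))
```

The injectable `clock` parameter exists only so the expiry behaviour can be exercised without sleeping.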

Conclusion

The combination of MHA and a naming service (first Zookeeper, later internal DNS) removes the VIP‑related single point of failure and split‑brain problems, simplifies multi‑instance management, and conserves IP resources. Operational experience shows that host‑file based approaches introduce maintenance risk, which can be mitigated by moving to a DNS‑based resolution layer with short TTLs and cache‑purge mechanisms.
