Designing a Zookeeper‑MHA MySQL High‑Availability Architecture: Key Insights
This talk explains how Lianjia redesigned its MySQL high‑availability setup by replacing VIP‑based failover with a Zookeeper‑driven naming service, detailing the original MHA architecture, its shortcomings, the new components, workflow, operational challenges, and a DNS‑based refinement.
Classic MHA‑Based MySQL HA Architecture
The traditional high‑availability solution uses MHA to monitor a MySQL master‑slave cluster. An external virtual IP (VIP) is presented to applications as the write endpoint; a separate VIP may be used for read traffic. Inside the cluster the master and standby share a VIP that is moved by a heartbeat mechanism when the master fails.
Issues with the VIP‑Centric Design
VIP is a single point of failure – if the VIP provider crashes, applications lose connectivity even though the MySQL nodes remain alive.
Keepalived‑based VIP failover can suffer split‑brain, causing unstable connections, possible dirty reads, and inconsistent routing.
Managing many VIPs (read/write, multiple clusters) leads to IP waste, configuration sprawl, and complex coordination during failover.
Refactor: Introducing a Naming Service (Zookeeper)
Instead of exposing VIPs, the architecture registers each MySQL instance as a service in Zookeeper. Applications query a logical name (e.g., mysql-prod) and receive the current IP address, making topology changes transparent. The design has two goals:
Hide MySQL cluster topology from applications.
Make underlying MySQL changes invisible to the upper layer.
Core Components
MHA continues to provide centralized monitoring and automated master‑to‑standby switchover. When monitoring starts, MHA writes each instance’s IP, port, and role to Zookeeper and updates the data on each failover.
Zookeeper (Name Service) stores service information under a hierarchical path, for example:
/mysql/3307/instance1 -> 192.168.10.21:3306
/mysql/3307/instance2 -> 192.168.10.22:3306
The port number (3307) is used as a unique identifier for the cluster; any naming convention may be applied.
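The path layout above can be sketched in a few lines of Python. This is only an illustration of the naming convention: the `Endpoint` type and the helper names `znode_path` and `parse_entry` are hypothetical and do not come from the talk.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    cluster: str   # cluster identifier, e.g. the port-based id "3307"
    instance: str  # znode name, e.g. "instance1"
    host: str
    port: int


def znode_path(cluster: str, instance: str, root: str = "/mysql") -> str:
    """Build the znode path for one instance of a cluster."""
    return f"{root}/{cluster}/{instance}"


def parse_entry(path: str, value: str) -> Endpoint:
    """Turn a (path, "ip:port") pair into a structured endpoint.

    Assumes the three-level layout shown above: /<root>/<cluster>/<instance>.
    """
    _, _root, cluster, instance = path.split("/")
    host, port = value.rsplit(":", 1)
    return Endpoint(cluster, instance, host, int(port))
```

Keeping the cluster identifier as a path segment means a client can list one parent node to discover every instance of that cluster.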
MZAgent is a lightweight Java agent deployed on every application server. It subscribes to Zookeeper nodes, resolves the logical name to an IP, and writes the mapping into /etc/hosts. When Zookeeper data changes, the agent rewrites the hosts file instantly.
ZKClient Functions Used by MZAgent
subscribeChildChanges() – watches for addition or removal of child nodes (e.g., cluster scaling).
subscribeDataChanges() – watches leaf‑node value changes (IP updates) and triggers a hosts‑file rewrite.
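To make the difference between the two subscriptions concrete, here is a toy in-memory stand-in written in Python. It is not ZkClient and not a real Zookeeper client; the class `InMemoryRegistry` and its method names are hypothetical and only mimic the two kinds of notification described above.

```python
from typing import Callable, Dict, List


class InMemoryRegistry:
    """Toy stand-in for a Zookeeper subtree (hypothetical, not a real client)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._child_watchers: Dict[str, List[Callable[[str, List[str]], None]]] = {}
        self._data_watchers: Dict[str, List[Callable[[str, str], None]]] = {}

    def subscribe_child_changes(self, parent: str, callback) -> None:
        """Call callback(parent, children) when a child node is added."""
        self._child_watchers.setdefault(parent, []).append(callback)

    def subscribe_data_changes(self, path: str, callback) -> None:
        """Call callback(path, value) when an existing node's value changes."""
        self._data_watchers.setdefault(path, []).append(callback)

    def children(self, parent: str) -> List[str]:
        """Return the sorted names of the direct children of `parent`."""
        prefix = parent.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self._data if p.startswith(prefix))
        return sorted(n for n in names if "/" not in n)

    def set(self, path: str, value: str) -> None:
        """Create or update a node and notify the matching watchers."""
        is_new = path not in self._data
        self._data[path] = value
        if is_new:
            parent = path.rsplit("/", 1)[0]
            for cb in self._child_watchers.get(parent, []):
                cb(parent, self.children(parent))
        else:
            for cb in self._data_watchers.get(path, []):
                cb(path, value)
```

In this sketch, adding an instance fires the child watcher on the cluster path, while changing the address stored in an existing node fires the data watcher on that node. An agent would typically react to both by regenerating its local mapping.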
Service Registration Flow
MHA starts monitoring the MySQL cluster.
It writes a node under /mysql/<port>/<instance> containing the instance’s IP and role.
MZAgent on each app server subscribes to the parent path, resolves the logical name, and populates /etc/hosts with entries such as 192.168.10.21 mysql-prod (address first, then name).
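The hosts-file step of this flow might look like the following Python sketch. The talk does not describe how MZAgent edits the file, so the managed-block markers and the function name `render_hosts` are assumptions made for illustration. Hosts files list the address first and the name second, and the sketch follows that order.

```python
BEGIN = "# BEGIN mysql-naming (managed)"  # hypothetical marker lines
END = "# END mysql-naming (managed)"


def render_hosts(current: str, mapping: dict) -> str:
    """Return hosts-file text with the managed block replaced by `mapping`.

    `mapping` maps logical names to IP addresses. Lines outside the
    managed block are preserved as they are.
    """
    kept = []
    inside = False
    for line in current.splitlines():
        if line.strip() == BEGIN:
            inside = True
        elif line.strip() == END:
            inside = False
        elif not inside:
            kept.append(line)
    block = [BEGIN]
    block += [f"{ip}\t{name}" for name, ip in sorted(mapping.items())]
    block.append(END)
    return "\n".join(kept + block) + "\n"
```

Confining the agent's entries to a marked block keeps the rewrite idempotent and leaves manually maintained lines alone. Writing the result to the real file (ideally via a temporary file that is then moved into place) is left out of the sketch.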
MySQL Failover Flow
When the master fails, MHA promotes a standby, updates the Zookeeper node with the new master’s IP, and publishes the change. MZAgent receives the notification, rewrites /etc/hosts, and the application reconnects using the same logical name without any code change.
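The registry update during failover can be sketched in Python as follows. The record shape (`ip` and `role` keys), the role values, and the choice to drop the failed master from the advertised set are all assumptions for illustration; the talk only states that MHA writes the new master's IP to Zookeeper.

```python
from typing import Dict

Record = Dict[str, str]  # assumed shape: {"ip": ..., "role": "master" or "slave"}


def current_master(records: Dict[str, Record]) -> str:
    """Return the IP of the single instance marked as master."""
    masters = [r["ip"] for r in records.values() if r["role"] == "master"]
    if len(masters) != 1:
        raise ValueError(f"expected exactly one master, found {len(masters)}")
    return masters[0]


def promote(records: Dict[str, Record], new_master: str) -> Dict[str, Record]:
    """Return a copy of the records with `new_master` promoted.

    The failed master is dropped from the result, mirroring a registry
    that no longer advertises it (an assumption of this sketch).
    """
    if new_master not in records:
        raise KeyError(new_master)
    if records[new_master]["role"] == "master":
        raise ValueError(f"{new_master} is already the master")
    updated = {}
    for name, rec in records.items():
        if rec["role"] == "master":
            continue  # old master is removed from the advertised set
        role = "master" if name == new_master else rec["role"]
        updated[name] = {"ip": rec["ip"], "role": role}
    return updated
```

After `promote`, a client that resolves the logical name through `current_master` gets the promoted instance's address, which is the value an agent would then write into the hosts file.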
Benefits of the Naming‑Service Architecture
No single point of failure for the naming layer – loss of an MZAgent does not affect MySQL availability.
Avoids VIP split‑brain scenarios.
Simplifies management of multiple clusters on a single host; each cluster is identified by a unique Zookeeper path.
Eliminates wasted IP addresses.
Operational Challenges Discovered
Scaling the MySQL cluster requires updating /etc/hosts on every application server, which is error‑prone.
Manual edits to the hosts file can break DB connectivity.
Different business units may need custom agent configurations, increasing maintenance overhead.
DNS‑Based Refinement
To remove host‑file management, the team replaced MZAgent with an internal DNS service (Dnsmasq). Each MySQL logical name is added as a DNS record that resolves to the current master IP. Short TTL values (3–5 seconds) or explicit cache purges guarantee rapid propagation of failover updates.
Using DNS eliminates the need for per‑host agents, provides built‑in load‑balancing for read replicas (by returning a random IP from a pool), and centralises configuration.
Key Considerations for DNS Deployment
Configure a very short TTL (e.g., 3 s) to ensure failover records are refreshed quickly.
Optionally purge the DNS cache programmatically during a switchover to guarantee immediate consistency.
Assign a unique internal domain name to each MySQL cluster (e.g., db-prod.example.internal).
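As one possible way to apply these considerations, the Python sketch below renders Dnsmasq configuration lines from a name-to-address mapping. It assumes the deployment uses Dnsmasq's `host-record` option with its optional trailing TTL, which is available in recent Dnsmasq versions; check the man page for the version in use. The function names are hypothetical, and the talk does not say how the team actually generated its records.

```python
import ipaddress


def host_record(name: str, ip: str, ttl: int = 3) -> str:
    """Format one Dnsmasq `host-record` line: name, IPv4 address, TTL in seconds."""
    if ttl <= 0:
        raise ValueError("ttl must be a positive number of seconds")
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    return f"host-record={name},{ip},{ttl}"


def render_records(masters: dict, ttl: int = 3) -> str:
    """Render one line per cluster, mapping its domain name to the current master."""
    lines = [host_record(name, ip, ttl) for name, ip in sorted(masters.items())]
    return "\n".join(lines) + "\n"
```

For example, `host_record("db-prod.example.internal", "192.168.10.21")` yields `host-record=db-prod.example.internal,192.168.10.21,3`. A failover would regenerate the file with the new master's address and then make Dnsmasq pick up the change.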
Conclusion
The combination of MHA and a naming service (first Zookeeper, later internal DNS) removes the VIP‑related single point of failure and split‑brain problems, simplifies multi‑instance management, and conserves IP resources. Operational experience shows that host‑file based approaches introduce maintenance risk, which can be mitigated by moving to a DNS‑based resolution layer with short TTLs and cache‑purge mechanisms.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
