Why Zookeeper Is Essential for Master/Slave High Availability in Distributed Systems
The article explains why Zookeeper is a crucial component for building master/slave high-availability architectures in distributed systems: its ephemeral (temporary) nodes, sequence numbers, and cluster coordination eliminate single points of failure and enable reliable failover for the "write" service.
Chen Yehao, senior architect in the cruise vacation division, has over 15 years of software development experience and previously worked at the travel startup "Find Fun". He has used Python for many years, is interested in open-source technologies, and is currently promoting message-driven asynchronous programming in multi-core environments while researching Erlang, Golang, and F#.
Should I add the word "important" before "role" in the title? I hesitated. The functionality Zookeeper provides is so fundamental that if you don't use it in your application, you will eventually end up reimplementing it yourself, so Zookeeper is worth spending (a little) time to master.
Zookeeper was created for "distributed" systems. I keep repeating "distributed" not to chase a trend but because the trend pushes us forward: in any internet production application, even if your company is small and a single server can handle the traffic, you cannot afford to have no standby when that server fails. This is what "eliminating the single point of failure" means, and the moment you need at least two servers you are already running a distributed service.
Recall the previous article on designing read/write separation at the service layer: I split generic services into "read" services, which are clustered for availability and performance, and "write" services, which run on a single server to guarantee the order of transactions.
A single server sounds risky, so today the main character appears: we need Zookeeper.
You may have heard this scenario called master/slave; I prefer primary/backup. There are two servers: only the primary does the work, and the backup takes over when the primary fails. With Zookeeper the process looks like this: Zookeeper provides a directory-and-node service. When the two servers start, each creates an ephemeral (temporary) node under a designated directory; this is registration. Ephemeral nodes are kept alive by heartbeats: if a server stops sending heartbeats, Zookeeper deletes its node. On registration, Zookeeper also assigns each node an increasing sequence number; the server with the smaller number is the primary, and the one with the larger number is the backup.
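The registration-and-election scheme above can be sketched in a few lines. This is a minimal in-memory simulation of the protocol, not real Zookeeper client code; the class and method names (`ElectionRegistry`, `register`, `heartbeat_lost`) are illustrative assumptions, and in practice a client library such as kazoo would create ephemeral sequential nodes against a live Zookeeper ensemble.

```python
# In-memory sketch of ephemeral-sequential-node election:
# each server registers under a directory and receives an increasing
# sequence number; the live node with the smallest number is the primary.

class ElectionRegistry:
    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> server address

    def register(self, address):
        """Simulates creating an ephemeral sequential node; returns the seq."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = address
        return seq

    def heartbeat_lost(self, seq):
        """Simulates Zookeeper deleting the node of a server whose heartbeats stopped."""
        self._nodes.pop(seq, None)

    def primary(self):
        """The live node with the smallest sequence number is the primary."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]

registry = ElectionRegistry()
a = registry.register("10.0.0.1:9000")  # seq 0 -> primary
b = registry.register("10.0.0.2:9000")  # seq 1 -> backup

assert registry.primary() == "10.0.0.1:9000"
registry.heartbeat_lost(a)              # primary dies, its node is removed
assert registry.primary() == "10.0.0.2:9000"  # backup takes over
```

The key design point is that election requires no communication between the two servers themselves; ordering is derived entirely from the sequence numbers Zookeeper hands out.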
When our client (usually a web server) needs the "write" service, it connects to Zookeeper, lists the ephemeral nodes under the designated directory, reads the address of the server with the smallest sequence number (the primary), and then performs its operations. This guarantees it always talks to the primary.
When the primary server fails, Zookeeper deletes its ephemeral node and can notify every interested client of the change, propagating the information quickly and efficiently. Imagine implementing all of this yourself; it would not be nearly as simple.
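The notification mechanism can be sketched the same way. Again this is an in-memory simulation of Zookeeper's watch behaviour, not the real client API; `WatchedDirectory` and its methods are illustrative names, and a real client would register a children watch on the directory instead.

```python
# Sketch of watch-based failover notification: clients subscribe to the
# directory, and whenever Zookeeper deletes or adds a node, every
# subscriber is called back with the new ordered list of live servers.

class WatchedDirectory:
    def __init__(self):
        self._children = {}   # seq -> address
        self._watchers = []   # callbacks fired on every change

    def watch(self, callback):
        self._watchers.append(callback)

    def add(self, seq, address):
        self._children[seq] = address
        self._notify()

    def delete(self, seq):
        self._children.pop(seq, None)
        self._notify()

    def _notify(self):
        live = [self._children[s] for s in sorted(self._children)]
        for cb in self._watchers:
            cb(live)          # each client learns the new primary: live[0]

seen = []
directory = WatchedDirectory()
directory.watch(lambda live: seen.append(live[0] if live else None))
directory.add(0, "10.0.0.1:9000")
directory.add(1, "10.0.0.2:9000")
directory.delete(0)   # primary fails; watchers learn the backup is now first
assert seen[-1] == "10.0.0.2:9000"
```

Pushing the change to subscribers is what makes failover fast: clients do not have to poll, discover a dead connection, and retry before finding the new primary.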
The primary/backup mode we use to eliminate single points of failure depends on Zookeeper, so Zookeeper itself must avoid single points of failure; therefore Zookeeper was designed to run in a cluster, using multiple servers to eliminate its own single‑point‑failure risk.
In summary, in a multi‑core parallel computing model, I consider the message‑driven actor model (originating from Erlang) the correct programming approach; with the actor model, we can easily implement serial operations at the service layer to ensure write operation integrity and consistency. Using the actor model requires a primary/backup deployment to eliminate single points of failure, and the simplest reliable method is to use Zookeeper. Thus my software architecture derives from high‑concurrency demand → asynchronous computation (actor model) → master/slave (Zookeeper).
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.