How to Achieve YARN ResourceManager High Availability with ZooKeeper
YARN’s ResourceManager is a single point of failure. This guide explains how to run it in active/standby mode with ZooKeeper, covering leader election, automatic failover, the false‑dead scenario, and the ZooKeeper features involved: ephemeral nodes, watches, and ACLs.
Problem Description
Hadoop includes a distributed scheduling framework called YARN, which schedules cluster resources and is the foundation for running multiple compute models.
Below is the YARN architecture diagram (illustrative image omitted).
The key point is that the ResourceManager at the center of the diagram is a single point of failure.
It manages and allocates all cluster resources, so it must be highly available.
Solution
The official architecture addresses this with:
active/standby mode + ZooKeeper
In active/standby mode, multiple ResourceManagers run; one is in the active state while the others are in standby, essentially a master‑slave setup.
This raises two questions:
How is the active ResourceManager elected?
How is failover from the active ResourceManager to a standby performed?
Implementation
Leader Election
When each ResourceManager starts, it tries to create an ephemeral node in ZooKeeper, e.g., /YarnActiveResourceManager.
ZooKeeper guarantees that only one client can successfully create a node at a given path. The ResourceManager whose create call succeeds becomes the active one, and the rest become standby.
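The election rule above can be sketched with a small in-memory model. This is only an illustration of the "first successful create wins" semantics, not the ZooKeeper client API and not YARN's actual elector. The names ToyCoordinator, ElectionSketch, and join are hypothetical and exist only for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory stand-in for a ZooKeeper ensemble (assumption: NOT the real client API).
class ToyCoordinator {
    // path -> id of the client that created the node
    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    // Atomically create the node; returns true only for the first caller on a path.
    boolean createEphemeral(String path, String owner) {
        return nodes.putIfAbsent(path, owner) == null;
    }

    String ownerOf(String path) {
        return nodes.get(path);
    }
}

class ElectionSketch {
    static final String LOCK_PATH = "/YarnActiveResourceManager";

    // Returns "active" if this ResourceManager won the race, otherwise "standby".
    static String join(ToyCoordinator zk, String rmId) {
        return zk.createEphemeral(LOCK_PATH, rmId) ? "active" : "standby";
    }

    public static void main(String[] args) {
        ToyCoordinator zk = new ToyCoordinator();
        for (String rm : new String[] {"rm1", "rm2", "rm3"}) {
            System.out.println(rm + " -> " + join(zk, rm));
        }
    }
}
```

In this model the outcome depends only on who calls first. No ResourceManager needs to know about the others, which is what makes the create-if-absent primitive convenient for election.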
Failover
Standby ResourceManagers register a watch on the /YarnActiveResourceManager node.
When the active ResourceManager crashes, its ZooKeeper session ends and its ephemeral /YarnActiveResourceManager node is deleted. ZooKeeper then notifies the standby ResourceManagers, which each attempt to create the node again. The one that succeeds becomes the new active ResourceManager.
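The watch-and-retry cycle described above can be sketched as a single-threaded toy model. It assumes one-shot watches (a watch fires once and must be registered again) and simulates a crash by expiring a session. ToyZk, ToyResourceManager, campaign, and expireSession are hypothetical names for this example, not ZooKeeper or YARN APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of ephemeral nodes and one-shot delete watches (assumption: NOT the real client API).
class ToyZk {
    private final Map<String, String> owners = new HashMap<>();          // path -> owning session
    private final Map<String, List<Runnable>> watches = new HashMap<>(); // path -> pending watches

    boolean createEphemeral(String path, String session) {
        return owners.putIfAbsent(path, session) == null;
    }

    // Register a one-shot callback that fires when the node at this path is deleted.
    void watchDelete(String path, Runnable callback) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
    }

    // Simulate a session ending: its ephemeral nodes vanish and pending watches fire once.
    void expireSession(String session) {
        List<String> gone = new ArrayList<>();
        owners.forEach((path, owner) -> { if (owner.equals(session)) gone.add(path); });
        for (String path : gone) {
            owners.remove(path);
            List<Runnable> pending = watches.remove(path);
            if (pending != null) pending.forEach(Runnable::run);
        }
    }

    String ownerOf(String path) { return owners.get(path); }
}

class ToyResourceManager {
    static final String LOCK_PATH = "/YarnActiveResourceManager";
    final String id;
    final ToyZk zk;
    boolean active;

    ToyResourceManager(String id, ToyZk zk) { this.id = id; this.zk = zk; }

    // Try to take the lock; on failure stay standby and wait for the node to be deleted.
    void campaign() {
        active = zk.createEphemeral(LOCK_PATH, id);
        if (!active) zk.watchDelete(LOCK_PATH, this::campaign);
    }
}

class FailoverSketch {
    public static void main(String[] args) {
        ToyZk zk = new ToyZk();
        ToyResourceManager rm1 = new ToyResourceManager("rm1", zk);
        ToyResourceManager rm2 = new ToyResourceManager("rm2", zk);
        rm1.campaign();
        rm2.campaign();
        zk.expireSession("rm1"); // simulated crash of the active ResourceManager
        System.out.println("new active: " + zk.ownerOf(ToyResourceManager.LOCK_PATH));
    }
}
```

The model notifies watchers in registration order, so the earliest standby always wins here. In a real cluster whichever standby's create request reaches ZooKeeper first would win. The model also never updates the crashed ResourceManager's own `active` flag, which is exactly the gap the false‑dead discussion addresses.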
False‑Dead Issue
If the active ResourceManager is overloaded and stops responding to ZooKeeper in time, it can appear dead while it is still running (a "false‑dead" situation). ZooKeeper then expires its session and deletes its ephemeral node, prompting the standby ResourceManagers to elect a new active one.
When the original active ResourceManager recovers, it still believes it is active and may try to modify shared data, conflicting with the newly elected active ResourceManager. Having two nodes act as active at the same time is commonly called split‑brain.
Resolution
The fix relies on ZooKeeper's ACL (access control list) mechanism: the /YarnActiveResourceManager node is created with an ACL tied to the creator's credentials, so that only the creator can access it. This effectively locks the node.
If the false‑dead ResourceManager comes back, its credentials no longer match the ACL of the node created by the new active ResourceManager. It detects that the lock has changed hands and switches itself to standby.
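The fencing behavior can be illustrated with a toy model in which a node accepts writes only from the credential that created it. This is a sketch of the idea under simplified assumptions, not ZooKeeper's ACL API and not YARN's actual fencing code. GuardedNode, FencedResourceManager, and their methods are hypothetical names.

```java
// Toy model of a node whose ACL admits only its creator (assumption: NOT the real client API).
class GuardedNode {
    private String ownerCredential; // null means the node does not exist
    private String data = "";

    // Create the node with an ACL that admits only the creator's credential.
    synchronized boolean create(String credential) {
        if (ownerCredential != null) return false;
        ownerCredential = credential;
        return true;
    }

    // Simulate session expiry removing the ephemeral node together with its data.
    synchronized void delete() {
        ownerCredential = null;
        data = "";
    }

    // A write succeeds only for the credential stored in the ACL.
    synchronized boolean write(String credential, String value) {
        if (ownerCredential == null || !ownerCredential.equals(credential)) return false;
        data = value;
        return true;
    }

    synchronized String read() { return data; }
}

class FencedResourceManager {
    final String credential;
    final GuardedNode node;
    boolean active;

    FencedResourceManager(String credential, GuardedNode node) {
        this.credential = credential;
        this.node = node;
    }

    void campaign() { active = node.create(credential); }

    // A rejected write tells a stale active that it has been replaced, so it steps down.
    boolean update(String value) {
        if (!active) return false;
        boolean accepted = node.write(credential, value);
        if (!accepted) active = false;
        return accepted;
    }
}

class FencingSketch {
    public static void main(String[] args) {
        GuardedNode node = new GuardedNode();
        FencedResourceManager rm1 = new FencedResourceManager("rm1-secret", node);
        FencedResourceManager rm2 = new FencedResourceManager("rm2-secret", node);
        rm1.campaign();
        node.delete();      // rm1 is judged dead while it is still running
        rm2.campaign();     // rm2 becomes the new active
        rm1.update("stale");
        System.out.println("rm1 active: " + rm1.active + ", rm2 active: " + rm2.active);
    }
}
```

The design point is that the stale ResourceManager does not need to be told it lost leadership. Its first rejected write is enough for it to step down, and the data written by the new active ResourceManager is never overwritten.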
Summary
Key ZooKeeper features used:
Unique node creation – only one client succeeds when multiple try to create the same path.
Ephemeral nodes – they disappear automatically when the client session ends.
Watchers – clients can listen for node changes and react accordingly.
Node permissions – ACLs can restrict access to nodes based on credentials.
Java High-Performance Architecture
