Big Data 5 min read

How to Achieve YARN ResourceManager High Availability with Zookeeper

YARN’s ResourceManager is a single point of failure, so this guide explains how to configure active/standby mode using Zookeeper, covering leader election, automatic failover, handling false‑dead scenarios, and the essential Zookeeper features such as temporary nodes, watches, and ACLs.

Java High-Performance Architecture

Nov 23, 2016

How to Achieve YARN ResourceManager High Availability with Zookeeper

Problem Description

Hadoop includes a distributed scheduling framework called YARN , which is fundamental for supporting multiple compute models and resource scheduling.

Below is the YARN architecture diagram (illustrative image omitted).

The key point is that the middle ResourceManager is a single point of failure.

From the diagram, the ResourceManager plays a critical role in managing and allocating all cluster resources and must be highly available.

Solution

The official architecture shows the solution:

active/standby mode + Zookeeper

In active/standby mode, multiple ResourceManagers run; one is in the active state while the others are in standby, essentially a master‑slave setup.

This raises two questions:

How to elect the active ResourceManager?

How to perform master‑standby failover?

Implementation

Leader Election

When each ResourceManager starts, it creates an ephemeral node in Zookeeper, e.g., /YarnActiveResourceManager.

Zookeeper guarantees that only one client can successfully create this node; that ResourceManager becomes the active one, and the rest become standby.

Failover

Standby ResourceManagers register a watch on the /YarnActiveResourceManager node.

When the /YarnActiveResourceManager node is deleted (because the active ResourceManager crashed and its ephemeral node disappears), Zookeeper notifies the standby ResourceManagers, which then attempt to create the node again, electing a new active ResourceManager.

False‑Dead Issue

If the active ResourceManager becomes overloaded, it may appear to be dead (a "false‑dead" situation). Zookeeper may think it has failed and delete its temporary node, prompting standby managers to elect a new active manager.

When the original active manager recovers, it still believes it is the master and may attempt to modify data, causing a conflict with the newly elected active manager.

Resolution

The solution is to use Zookeeper's ACL (access control list) mechanism when creating the /YarnActiveResourceManager node, embedding credentials so that only the creator can access it, effectively locking the node.

If the false‑dead manager comes back, it will detect that the lock has changed and will switch itself to standby.

Summary

Key Zookeeper features used:

Unique node creation – only one client succeeds when multiple try to create the same path.

Ephemeral nodes – they disappear automatically when the client session ends.

Watchers – clients can listen for node changes and react accordingly.

Node permissions – ACLs can restrict access to nodes based on credentials.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

ZooKeeper YARN ResourceManager

Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.