
Guide to Setting Up Hadoop High Availability (HA) Cluster with HDFS and YARN

This article provides a step-by-step tutorial on configuring Hadoop high availability, covering the HDFS HA architecture, Quorum Journal Manager synchronization, NameNode failover, YARN HA, prerequisites, cluster planning, configuration files, service startup, and verification.

Architecture Digest

Big data remains a hot field, and this guide walks through building a highly available Hadoop cluster using HDFS and YARN. The source material originates from the GitHub repository BigData-Notes by author heibaiying.

1. High Availability Overview

Hadoop HA consists of HDFS HA and YARN HA. HDFS HA is more complex because the NameNode must guarantee data consistency. The architecture includes Active and Standby NameNodes, a ZKFailoverController, a Zookeeper ensemble, shared storage (QJM or NFS), and DataNode reporting.

1.1 Overall HA Architecture

The HDFS HA architecture is illustrated below (image omitted). Key components:

Active NameNode and Standby NameNode – one serves read/write, the other is standby.

ZKFailoverController – runs as a separate process, monitors NameNode health via HAServiceProtocol RPC, and triggers failover using Zookeeper.

Zookeeper ensemble – provides leader election for failover.

Shared storage system – stores the EditLog and metadata; both NameNodes synchronize through it.

DataNode – reports block locations to both NameNodes.

1.2 Data Synchronization with QJM

Hadoop can use either the Quorum Journal Manager (QJM) or NFS as the shared storage. With QJM, the Active NameNode writes the EditLog to a cluster of JournalNodes, and the Standby NameNode periodically pulls and replays it. A write succeeds once a majority of JournalNodes acknowledge it, so at least three JournalNodes are required. In general, a cluster of 2N+1 JournalNodes tolerates the failure of up to N of them, which is why an odd number is deployed.

1.3 NameNode Failover Process

The failover workflow includes:

HealthMonitor initializes and periodically calls HAServiceProtocol to check NameNode health.

On health change, HealthMonitor notifies ZKFailoverController.

ZKFailoverController triggers ActiveStandbyElector for automatic election.

ActiveStandbyElector interacts with Zookeeper to elect the new active.

After election, ActiveStandbyElector informs ZKFailoverController of the new role.

ZKFailoverController calls HAServiceProtocol to switch the NameNode to Active or Standby.

1.4 YARN High Availability

YARN ResourceManager HA follows the same active/standby principle, but because the ResourceManager holds far less state than the NameNode, it simply persists that state in Zookeeper rather than requiring a separate shared-storage system like QJM.

2. Cluster Planning

To meet the HA requirements, the cluster needs two NameNodes (one active, one standby), two ResourceManagers (one active, one standby), and at least three JournalNodes. This plan uses three hosts (hadoop001, hadoop002, hadoop003), each running multiple roles.

3. Prerequisites

All servers must have JDK installed.

Zookeeper ensemble must be set up.

SSH password‑less login must be configured between all servers.

4. Cluster Configuration

4.1 Download and Extract

Download a CDH version of Hadoop, e.g., from http://archive.cloudera.com/cdh5/cdh/5/ and extract:

# tar -zvxf hadoop-2.6.0-cdh5.15.2.tar.gz

4.2 Set Environment Variables

Edit /etc/profile and add:

export HADOOP_HOME=/usr/app/hadoop-2.6.0-cdh5.15.2
export PATH=${HADOOP_HOME}/bin:$PATH

Apply the changes with source /etc/profile.

4.3 Modify Configuration Files

Update the following configuration files under ${HADOOP_HOME}/etc/hadoop:

hadoop-env.sh – set export JAVA_HOME=/usr/java/jdk1.8.0_201/

core-site.xml – define fs.defaultFS and hadoop.tmp.dir.
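For illustration, a minimal core-site.xml might look like the following. The nameservice name (mycluster), the temporary directory, and the Zookeeper quorum endpoints are assumptions to adapt to your environment:

```xml
<configuration>
    <!-- Point the default filesystem at the HA nameservice, not a single host -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>
    <!-- Base directory for Hadoop temporary files (assumed path) -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp</value>
    </property>
    <!-- Zookeeper ensemble used by the ZKFailoverController (assumed hosts/ports) -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoop001:2181,hadoop002:2181,hadoop003:2181</value>
    </property>
</configuration>
```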

hdfs-site.xml – configure replication, NameNode directories, HA nameservice, shared edits directory (QJM), fencing methods, and automatic failover.
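As a sketch, the HA-related portion of hdfs-site.xml could look like this. The nameservice name mycluster, the NameNode IDs nn1/nn2, the host and port assignments, the local paths, and the SSH key location are all illustrative assumptions:

```xml
<configuration>
    <property><name>dfs.replication</name><value>3</value></property>
    <!-- Logical nameservice and its two NameNodes -->
    <property><name>dfs.nameservices</name><value>mycluster</value></property>
    <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>hadoop001:8020</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>hadoop002:8020</value></property>
    <property><name>dfs.namenode.http-address.mycluster.nn1</name><value>hadoop001:50070</value></property>
    <property><name>dfs.namenode.http-address.mycluster.nn2</name><value>hadoop002:50070</value></property>
    <!-- QJM shared edits directory: all three JournalNodes -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop001:8485;hadoop002:8485;hadoop003:8485/mycluster</value>
    </property>
    <property><name>dfs.journalnode.edits.dir</name><value>/home/hadoop/journalnode/data</value></property>
    <!-- Fence the old active NameNode over SSH before promoting the standby -->
    <property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
    <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/root/.ssh/id_rsa</value></property>
    <!-- Client-side proxy that resolves the currently active NameNode -->
    <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
</configuration>
```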

yarn-site.xml – enable auxiliary services, log aggregation, enable RM HA, set cluster ID, RM IDs, hostnames, web UI addresses, Zookeeper address, and recovery store class.
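A hedged example of the corresponding yarn-site.xml follows; the cluster ID, RM IDs, host assignments, and Zookeeper addresses are assumptions for this three-host plan:

```xml
<configuration>
    <!-- Shuffle service required by MapReduce, plus log aggregation -->
    <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
    <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
    <!-- ResourceManager HA: two RMs on hadoop002 and hadoop003 (assumed placement) -->
    <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
    <property><name>yarn.resourcemanager.cluster-id</name><value>my-yarn-cluster</value></property>
    <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
    <property><name>yarn.resourcemanager.hostname.rm1</name><value>hadoop002</value></property>
    <property><name>yarn.resourcemanager.hostname.rm2</name><value>hadoop003</value></property>
    <property><name>yarn.resourcemanager.webapp.address.rm1</name><value>hadoop002:8088</value></property>
    <property><name>yarn.resourcemanager.webapp.address.rm2</name><value>hadoop003:8088</value></property>
    <!-- Zookeeper ensemble used for RM state storage and leader election -->
    <property><name>yarn.resourcemanager.zk-address</name><value>hadoop001:2181,hadoop002:2181,hadoop003:2181</value></property>
    <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>
```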

mapred-site.xml – set mapreduce.framework.name=yarn.
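The mapred-site.xml change is a single property telling MapReduce to run on YARN:

```xml
<configuration>
    <!-- Run MapReduce jobs on YARN instead of the local runner -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```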

slaves – list hostnames/IPs of all DataNode/NodeManager hosts (one per line).
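If all three hosts run a DataNode and a NodeManager (an assumption consistent with the three-host plan), the slaves file would simply be:

```
hadoop001
hadoop002
hadoop003
```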

4.4 Distribute Packages

Copy the Hadoop installation directory to the other two servers:

# scp -r /usr/app/hadoop-2.6.0-cdh5.15.2/ hadoop002:/usr/app/
# scp -r /usr/app/hadoop-2.6.0-cdh5.15.2/ hadoop003:/usr/app/

5. Start the Cluster

5.1 Start Zookeeper

Start the Zookeeper service on all three hosts:

zkServer.sh start

5.2 Start JournalNode

Start a JournalNode process on each of the three hosts:

hadoop-daemon.sh start journalnode

5.3 Initialize NameNode

On the first NameNode host, format HDFS:

hdfs namenode -format

Then copy the formatted metadata directory to the standby NameNode so both start from the same namespace (alternatively, run hdfs namenode -bootstrapStandby on the standby host):

scp -r /home/hadoop/namenode/data hadoop002:/home/hadoop/namenode/

5.4 Initialize HA State in Zookeeper

On one of the NameNode hosts, create the failover state znode in Zookeeper:

hdfs zkfc -formatZK

5.5 Start HDFS

start-dfs.sh

5.6 Start YARN

start-yarn.sh

If the standby ResourceManager is not started automatically on its host, launch it there manually:

yarn-daemon.sh start resourcemanager

6. Verify the Cluster

6.1 Check Processes

Run jps on each host; you should see NameNode, DataNode, JournalNode, ZKFailoverController, ResourceManager, NodeManager, etc.

6.2 Check Web UI

The HDFS web UI is available at http://<namenode-host>:50070 and the YARN web UI at http://<resourcemanager-host>:8088. The active NameNode and ResourceManager report their state as active, while standby instances are shown as standby.

7. Restarting the Cluster

After the initial setup, subsequent restarts are simpler. Ensure Zookeeper is running, then start HDFS with start-dfs.sh , start YARN with start-yarn.sh , and manually start any missing ResourceManager as needed.

