
Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This guide walks you through building a fault‑tolerant HDFS high‑availability architecture: configuring dual NameNodes with ZooKeeper and JournalNode clusters, tuning YARN resource schedulers, implementing monitoring, testing automated failover, and optimizing performance, all backed by real‑world production experience and code examples.

MaGe Linux Operations

Introduction: Why does your big data platform fail at critical moments?

At 3 a.m. a colleague calls: the HDFS NameNode is down, the whole data platform is unavailable, and real‑time reports have stopped. This article shares a production‑proven HA and YARN optimization approach that raised platform availability from 99.5% to 99.99% and cut annual downtime from roughly 43 hours to under one hour.
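
The availability gain sounds incremental, but it translates directly into a downtime budget. A quick sanity check of those numbers (plain arithmetic, nothing Hadoop‑specific):

```python
# Convert an availability target into its maximum annual downtime.
HOURS_PER_YEAR = 365 * 24  # 8760

def max_downtime_hours(availability_percent: float) -> float:
    """Annual downtime budget implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

print(f"99.5%  -> {max_downtime_hours(99.5):.1f} hours/year")
print(f"99.99% -> {max_downtime_hours(99.99) * 60:.0f} minutes/year")
```

At 99.5% the budget is about 43.8 hours of downtime per year; at 99.99% it is under an hour, which matches the figures above.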

1. HDFS High Availability Architecture: Ensuring Your Data Is Never Lost

1.1 Traditional HDFS single point of failure

In the classic setup a single NameNode manages the namespace and client access, making the NameNode a critical single point of failure.

1.2 HA Architecture Overview

                ┌─────────────────┐
                │   ZooKeeper     │
                │    Cluster      │
                │  (3-5 nodes)    │
                └────────┬────────┘
                         │
           ┌─────────────┴──────────────┐
           │                            │
    ┌──────▼───────┐          ┌────────▼──────┐
    │  Active NN   │◄────────►│  Standby NN   │
    │  10.0.1.10   │          │  10.0.1.11    │
    └──────┬───────┘          └────────┬──────┘
           │                           │
           │   JournalNode Cluster     │
           │   ┌───────────────────┐   │
           └──►│ 10.0.1.20         │◄──┘
               │ 10.0.1.21         │
               │ 10.0.1.22         │
               └─────────┬─────────┘
                         │
                ┌────────▼────────┐
                │    DataNodes    │
                │    (N nodes)    │
                └─────────────────┘

The core design includes:

Dual NameNode hot standby: the Active NN handles client requests while the Standby NN keeps metadata in sync.

JournalNode cluster: guarantees metadata consistency and persistence.

ZooKeeper cluster: provides fault detection and automatic failover.

ZKFC process: monitors NN health and triggers the active‑standby switch.
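
Both the ZooKeeper ensemble and the JournalNode cluster are quorum‑based, which is why they are always deployed in odd numbers (3 or 5): writes succeed only while a majority of nodes is alive. The fault tolerance is simple to compute:

```python
def tolerated_failures(nodes: int) -> int:
    """A quorum system of n nodes keeps working while a majority survives,
    i.e. it tolerates floor((n - 1) / 2) simultaneous failures."""
    return (nodes - 1) // 2

for n in (3, 5):
    print(f"{n}-node ensemble tolerates {tolerated_failures(n)} failure(s)")
```

So the 3‑node JournalNode cluster configured below survives the loss of one node; going to 5 nodes buys tolerance of two.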

1.3 Practical HA configuration

Step 1: Configure hdfs-site.xml

<configuration>
  <!-- Specify HDFS nameservice -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>

  <!-- Define both NameNode IDs -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>

  <!-- RPC address for nn1 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>node1:8020</value>
  </property>

  <!-- RPC address for nn2 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>node2:8020</value>
  </property>

  <!-- JournalNode quorum -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://node1:8485;node2:8485;node3:8485/mycluster</value>
  </property>

  <!-- Failover proxy provider -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

  <!-- SSH fencing; sshfence also needs the NameNode user's SSH private key
       (dfs.ha.fencing.ssh.private-key-files), adjust the path as needed -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>
      sshfence
      shell(/bin/true)
    </value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hdfs/.ssh/id_rsa</value>
  </property>

  <!-- Enable automatic failover -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>

Step 2: Configure core-site.xml

<configuration>
  <!-- Default filesystem -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>

  <!-- ZooKeeper quorum -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>

  <!-- Hadoop temporary directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>

1.4 Failover testing and tuning

Test scenario 1: normal switch

# View current Active NN
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manual switch
hdfs haadmin -transitionToStandby nn1
hdfs haadmin -transitionToActive nn2

# Verify states
hdfs haadmin -getServiceState nn1  # should show standby
hdfs haadmin -getServiceState nn2  # should show active
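
These state checks are easy to script. The helper below is our own sketch (the function names are not part of Hadoop); it shells out to `hdfs haadmin` and normalizes the result, so it only runs on a host with a configured HA client:

```python
import subprocess

def parse_state(output: str) -> str:
    """Normalize `hdfs haadmin -getServiceState` output to 'active'/'standby'."""
    state = output.strip().lower()
    if state not in ("active", "standby"):
        raise ValueError(f"unexpected haadmin output: {output!r}")
    return state

def get_service_state(nn_id: str) -> str:
    """Query the HA state of a NameNode ID (e.g. 'nn1') via hdfs haadmin."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", nn_id],
        capture_output=True, text=True, check=True,
    )
    return parse_state(result.stdout)
```

After the manual switch above, `get_service_state('nn1')` should return 'standby' and `get_service_state('nn2')` 'active'.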

Test scenario 2: simulate failure

# Kill the active NameNode process (run this on the current active NN host)
jps | grep -w NameNode | awk '{print $1}' | xargs kill -9

# Simulate network outage
iptables -A INPUT -p tcp --dport 8020 -j DROP
iptables -A OUTPUT -p tcp --sport 8020 -j DROP

# Monitor automatic failover logs
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-*.log

Key tuning parameters

<!-- Reduce the ZKFC health-monitor RPC timeout (set in core-site.xml) -->
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>5000</value>
</property>

<!-- Increase JournalNode sync timeout -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>

<!-- Client failover attempts -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>
</property>

<!-- Failover sleep base -->
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
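
The last two client settings interact: with exponential backoff, 15 attempts starting from a 500 ms base can keep a client retrying for minutes. A rough upper‑bound estimate (Hadoop's real backoff adds jitter, and each sleep is capped by dfs.client.failover.sleep.max.millis, commonly 15 s, so treat this as an approximation, not the exact formula):

```python
def worst_case_wait_ms(attempts: int, base_ms: int, cap_ms: int = 15_000) -> int:
    """Upper-bound total retry wait, assuming base_ms * 2**attempt per try,
    with each individual sleep capped at cap_ms. Approximation only."""
    return sum(min(base_ms * 2 ** i, cap_ms) for i in range(attempts))

total = worst_case_wait_ms(15, 500)
print(f"~{total / 1000:.0f} s worst case across 15 attempts")
```

In other words, clients will ride out a failover lasting a couple of minutes before giving up, which is the intent of these values.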

2. YARN Resource Scheduling: Make every compute unit count

Common challenges in large Hadoop clusters include low CPU utilization (<40%), long task wait times, uneven queue distribution, and chaotic priority handling.

2.1 Scheduler comparison

FIFO Scheduler: simple with no extra overhead, but low utilization and no multi‑tenant support.

Capacity Scheduler: guarantees resources and supports elastic sharing, but configuration is more complex.

Fair Scheduler: dynamic load balancing and flexible configuration, yet can cause resource fragmentation.

In our production experience, about 90% of environments are best served by the Capacity Scheduler, which offers the strongest isolation and resource guarantees.

2.2 Capacity Scheduler best practice

Example queue hierarchy for a financial company handling real‑time risk control, batch reports, and ad‑hoc queries (percentages are each queue's share of the whole cluster):

root (100%)
├── production (70%)
│   ├── realtime (30%)   # real‑time risk control, minimum resources
│   ├── batch (25%)      # offline reports, scheduled jobs
│   └── adhoc (15%)      # on‑demand queries, elastic scaling
├── development (20%)
│   ├── dev-team1 (10%)
│   └── dev-team2 (10%)
└── maintenance (10%)   # ops tasks
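
One subtlety: the tree above expresses each queue's share of the whole cluster, but yarn.scheduler.capacity.&lt;queue&gt;.capacity is relative to the parent queue, and sibling capacities must sum to 100. Converting is a one‑liner (a sketch using the numbers from this hierarchy):

```python
def relative_capacity(parent_cluster_share: float,
                      child_cluster_share: float) -> float:
    """Convert a queue's absolute cluster share into the parent-relative
    percentage that capacity-scheduler.xml expects."""
    return round(child_cluster_share / parent_cluster_share * 100, 2)

# production holds 70% of the cluster; its children target 30/25/15%
for name, share in (("realtime", 30), ("batch", 25), ("adhoc", 15)):
    print(name, relative_capacity(70, share))
```

This yields 42.86 / 35.71 / 21.43 for realtime / batch / adhoc, which sum to 100 as the scheduler requires.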

Key capacity-scheduler.xml settings:

<configuration>
  <!-- Root queues -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,development,maintenance</value>
  </property>

  <!-- Production queue -->
  <property>
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.maximum-capacity</name>
    <value>90</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.queues</name>
    <value>realtime,batch,adhoc</value>
  </property>

  <!-- Realtime sub-queue. Capacity is relative to the parent queue and
       siblings must sum to 100, so a 30% cluster share under a 70%
       production queue is configured as 30/70 ≈ 42.86 -->
  <property>
    <name>yarn.scheduler.capacity.root.production.realtime.capacity</name>
    <value>42.86</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.realtime.maximum-capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.realtime.user-limit-factor</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.realtime.priority</name>
    <value>1000</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.realtime.disable-preemption</name>
    <value>false</value>
  </property>
</configuration>

2.3 Advanced scheduling strategies

Node‑label based scheduling for heterogeneous clusters:

# Add node labels
yarn rmadmin -addToClusterNodeLabels "GPU,SSD,MEMORY"

# Assign labels to nodes
yarn rmadmin -replaceLabelsOnNode "node1=GPU node2=GPU"
yarn rmadmin -replaceLabelsOnNode "node3=SSD node4=SSD"
yarn rmadmin -replaceLabelsOnNode "node5=MEMORY node6=MEMORY"

Then restrict the realtime queue to the SSD label in capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.root.production.realtime.accessible-node-labels</name>
  <value>SSD</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.production.realtime.accessible-node-labels.SSD.capacity</name>
  <value>100</value>
</property>

Dynamic resource pool script (Python) that adjusts queue capacities based on time of day and current load:

# dynamic_resource_manager.py (excerpt)
import subprocess
from datetime import datetime

class DynamicResourceManager:
    def __init__(self):
        # (start_hour, end_hour) windows on a 24h clock
        self.peak_hours = [(8, 12), (14, 18)]

    def get_current_load(self):
        # Query YARN metrics; return e.g. {'realtime': 85, 'batch': 45}
        ...

    def adjust_queue_capacity(self, queue, capacity):
        # Rewrite the queue's capacity in capacity-scheduler.xml (elided),
        # then apply the change without restarting the ResourceManager
        subprocess.run(["yarn", "rmadmin", "-refreshQueues"], check=True)

    def auto_scale(self):
        hour = datetime.now().hour
        load = self.get_current_load() or {}
        if any(start <= hour < end for start, end in self.peak_hours):
            if load.get('realtime', 0) > 80:
                # Peak hours: shift capacity toward the realtime queue
                self.adjust_queue_capacity('root.production.realtime', 50)
                self.adjust_queue_capacity('root.production.batch', 15)
        else:
            # Off-peak: give batch jobs more room
            self.adjust_queue_capacity('root.production.realtime', 20)
            self.adjust_queue_capacity('root.production.batch', 40)

if __name__ == "__main__":
    DynamicResourceManager().auto_scale()

3. Monitoring and Fault Diagnosis: Detection matters more than fixing

A layered monitoring system should cover application, service, system, and infrastructure layers.

┌─────────────────────────────────────┐
│          Application Layer          │
│ (job success rate / latency / usage)│
├─────────────────────────────────────┤
│            Service Layer            │
│        (NN/RM/DN/NM health)         │
├─────────────────────────────────────┤
│            System Layer             │
│      (CPU/memory/disk/network)      │
├─────────────────────────────────────┤
│        Infrastructure Layer         │
│     (datacenter/network/power)      │
└─────────────────────────────────────┘

Key Prometheus alerts for HDFS and YARN include NameNode down, high HDFS capacity usage, DataNode disk failures, YARN queue blockage, and memory pressure.

# prometheus-alerts.yml (excerpt)
groups:
- name: hdfs_alerts
  rules:
  - alert: NameNodeDown
    expr: up{job="namenode"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "NameNode {{ $labels.instance }} is down"
  - alert: HDFSCapacityHigh
    expr: hdfs_cluster_capacity_used_percent > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "HDFS capacity usage is {{ $value }}%"
- name: yarn_alerts
  rules:
  - alert: YarnQueueBlocked
    expr: yarn_queue_pending_applications > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Queue {{ $labels.queue }} has {{ $value }} pending applications"
  - alert: YarnMemoryPressure
    expr: yarn_cluster_memory_used_percent > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "YARN cluster memory usage is {{ $value }}%"
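
The same thresholds can be exercised offline before you wire them into Prometheus. The sketch below is a hypothetical helper (metric names mirror the rules above; it deliberately ignores the `for:` hold durations, which Prometheus itself enforces):

```python
# Threshold and severity per metric, mirroring the alert rules above.
THRESHOLDS = {
    "hdfs_cluster_capacity_used_percent": (85, "warning"),
    "yarn_queue_pending_applications": (100, "warning"),
    "yarn_cluster_memory_used_percent": (90, "critical"),
}

def firing_alerts(metrics: dict) -> list:
    """Return (metric, severity) for every metric over its threshold.
    Unlike Prometheus, this ignores the 'for:' hold duration."""
    return [(name, severity)
            for name, (limit, severity) in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

print(firing_alerts({"hdfs_cluster_capacity_used_percent": 92,
                     "yarn_cluster_memory_used_percent": 45}))
# -> [('hdfs_cluster_capacity_used_percent', 'warning')]
```

Feeding it a metrics snapshot from your exporter is a cheap way to unit‑test threshold changes before deploying new alert rules.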

Typical failure playbooks cover NameNode safe‑mode recovery, YARN application hangs, and corrupt block handling, with step‑by‑step commands and automated remediation scripts.

# NameNode safe mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -report
# Force exit safe mode (use with caution)
hdfs dfsadmin -safemode leave

# YARN application troubleshooting
yarn application -list
yarn application -status <app_id>
yarn logs -applicationId <app_id>
yarn application -kill <app_id>
yarn rmadmin -refreshQueues
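
Hunting for stuck applications by eye gets old quickly; the output of `yarn application -list` is tab‑separated and easy to parse. A sketch (the helper names are ours, and the column layout assumes the default format, so verify against your Hadoop version):

```python
def parse_application_list(output: str) -> list:
    """Extract (application_id, state) pairs from `yarn application -list`.
    Assumes the default tab-separated columns: Id, Name, Type, User,
    Queue, State, Final-State, Progress, Tracking-URL."""
    apps = []
    for line in output.splitlines():
        if line.startswith("application_"):
            fields = line.split("\t")
            apps.append((fields[0].strip(), fields[5].strip()))
    return apps

def stuck_applications(output: str) -> list:
    """Applications sitting in ACCEPTED, i.e. queued but never scheduled."""
    return [app_id for app_id, state in parse_application_list(output)
            if state == "ACCEPTED"]
```

Pipe the command's stdout into these helpers, then inspect the offenders with `yarn logs -applicationId` or clear them with `yarn application -kill`.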

4. Automation Toolchain

Examples include Bash health‑check scripts, Python capacity predictors, and dynamic scaling utilities that integrate with mail or enterprise messaging for alerts.

# hdfs_health_check.sh (excerpt)
LOG_FILE="/var/log/hdfs_health_check_$(date +%Y%m%d).log"

log_info()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] INFO: $1"  | tee -a "$LOG_FILE"; }
log_error() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] ERROR: $1" | tee -a "$LOG_FILE"; }

check_namenode() { ... }
check_hdfs_capacity() { ... }
check_corrupt_blocks() { ... }
send_alert() { echo "$1" | mail -s "HDFS Alert" [email protected]; }

main() { log_info "Starting HDFS health check..."; check_namenode; check_hdfs_capacity; check_corrupt_blocks; log_info "Health check completed"; }
main

Conclusion

By prioritizing high availability, proactive monitoring, and automation, you can transform a fragile big‑data platform into a resilient service that rarely requires manual intervention.

Tags: high availability, resource scheduling, YARN, HDFS, big data operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
