Big Data 13 min read

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

Big Data Technology & Architecture

Sep 17, 2021

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

You should first read the related series "Hadoop Key Difficulties: HDFS Read/Write, NN, 2NN, DN".

HDFS – Reliability

HDFS reliability mainly includes the following aspects:

Replica redundancy strategy

Rack awareness strategy

Heartbeat mechanism

Safe mode

Checksum

Trash

Metadata protection

Snapshot mechanism

1. Replica Redundancy Strategy

The replication factor can be set in hdfs-site.xml to specify the number of replicas.

All data blocks can be replicated.

When a DataNode starts, it scans the local file system, builds a block‑to‑file mapping (block report) and reports it to the NameNode.

2. Rack Awareness Strategy

HDFS detects whether two nodes are on the same rack by exchanging a small packet.

Typically one replica is placed on the local rack and another on a different rack to avoid data loss from rack failure and improve bandwidth utilization.

3. Heartbeat Mechanism

NameNode periodically receives heartbeat messages and block reports from DataNodes.

Based on the reports, NameNode validates metadata.

DataNodes that miss heartbeats are marked dead and will no longer receive I/O requests.

If a DataNode failure reduces replica count below the configured threshold, NameNode detects the shortage and triggers re‑replication at an appropriate time.

Re‑replication can also be caused by corrupted replicas, disk errors, or an increased replication factor.

4. Safe Mode

When the NameNode starts, it first enters a "safe mode" phase.

No data writes occur during safe mode.

During this phase, the NameNode collects reports from all DataNodes; once a sufficient number of blocks reach the minimum replica count, they are considered "safe".

After a configurable proportion of blocks are marked safe and a waiting period elapses, safe mode ends.

If blocks are found with insufficient replicas, they are re‑replicated until the minimum replica count is met.

5. Checksum

When a file is created, each block generates a checksum.

The checksum is stored as a hidden file in the namespace.

Clients can verify the checksum to detect corrupted blocks.

If a block is corrupted, the client can read from another replica.

6. Trash

Deleted files are moved to the /trash directory.

Files in the trash can be quickly restored.

A configurable retention time determines when trash files are permanently deleted and their space reclaimed.

7. Metadata Protection

The image file and edit log are core NameNode data and can be configured to have multiple replicas.

Having replicas improves safety but may reduce NameNode performance.

The NameNode remains a single point of failure; manual failover is required if it crashes.

YARN – Failover

Failure Types

Application issues

Process crashes

Hardware problems

Failure Handling

Task failure

Runtime exceptions or JVM exits are reported to the ApplicationMaster.

Heartbeats detect hung tasks (timeout); multiple checks (configurable) are performed before marking a task as failed.

If a job's task failure rate exceeds the configured threshold, the job is considered failed.

Failed tasks or jobs are relaunched by a new ApplicationMaster.

ApplicationMaster failure

The ApplicationMaster sends periodic heartbeats to the ResourceManager; after a configurable number of missed heartbeats, it is considered failed.

When the ApplicationMaster fails, the ResourceManager starts a new ApplicationMaster.

The new ApplicationMaster recovers the previous state (e.g., yarn.app.mapreduce.am.job.recovery.enable=true) by loading saved state from shared storage; the ResourceManager does not handle task state persistence.

The client also polls the ApplicationMaster for progress; upon detecting failure, it asks the ResourceManager for the new ApplicationMaster.

NodeManager failure

NodeManagers send heartbeats to the ResourceManager; if a heartbeat is missed for a configured interval, the ResourceManager removes the NodeManager.

Tasks and ApplicationMasters running on the failed NodeManager are recovered on other NodeManagers.

If a NodeManager fails repeatedly, the ApplicationMaster may blacklist it (ResourceManager does not perform blacklisting).

ResourceManager failure

State is periodically checkpointed to disk; upon failure, the ResourceManager restarts from the checkpoint.

High availability is achieved via ZooKeeper synchronization of state.

Overall, lower‑level modules are monitored and recovered via heartbeats, while the top‑level module uses checkpointing, state sync, and ZooKeeper for HA.

Hadoop Shuffle

MapReduce – Shuffle

Map results are sorted and transferred to Reduce for processing. Instead of writing directly to disk, Map uses an in‑memory buffer for pre‑sorting, may invoke a Combiner, compress data, partition by key, and minimize the amount transferred. Each Map task notifies the framework when it finishes, allowing Reduce to start.

Map Side

Map output is first cached in memory for sorting before being written to disk.

Each Map task has a circular memory buffer (default 100 MB); when 80 % full, a background thread spills the buffer to a file while the Map continues producing output. If the buffer fills completely, the Map must wait.

Spill files are written using a round‑robin approach after partitioning data by Reduce. Within each partition, records are sorted by key; if a Combiner is configured, it runs after sorting to reduce data size.

When the buffer threshold is reached, a spill file is created; at the end of the Map task many spill files may exist and are merged and sorted. If more than three files exist, an additional Combiner run occurs.

If compression is enabled, the final spill file is compressed before being written.

When the Map finishes, it notifies the task manager, allowing the Reduce phase to begin copying the results.

Reduce Side

Map output files reside on the local disks of the machines where the Map tasks ran.

If a Map's output is small, it stays in memory; otherwise it is written to disk.

A background thread merges and sorts these files into a larger file (decompressing if necessary).

After all Map outputs are copied and merged, the Reduce function is invoked.

Reduce output is written back to HDFS.

Optimization Tips

Allocate as much memory as possible to shuffle while ensuring Map and Reduce tasks have sufficient memory.

For Map, avoid disk writes by using a Combiner and increasing io.sort.mb.

For Reduce, keep Map results in memory as much as possible; adjust mapred.inmem.merge.threshold and mapred.job.reduce.input.buffer.percent as needed.

Monitor the "Spilled Records" counter to track how much data is written to disk (includes both Map and Reduce).

Enable compression for Map output and increase the I/O buffer size ( io.file.buffer.size, default 4 KB) to improve performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Reliability MapReduce YARN HDFS Shuffle

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.