Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process
This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.
You should first read the related series "Hadoop Key Difficulties: HDFS Read/Write, NN, 2NN, DN".
HDFS – Reliability
HDFS reliability mainly includes the following aspects:
Replica redundancy strategy
Rack awareness strategy
Heartbeat mechanism
Safe mode
Checksum
Trash
Metadata protection
Snapshot mechanism
1. Replica Redundancy Strategy
The replication factor can be set in hdfs-site.xml to specify the number of replicas.
All data blocks can be replicated.
When a DataNode starts, it scans the local file system, builds a block‑to‑file mapping (block report) and reports it to the NameNode.
2. Rack Awareness Strategy
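HDFS does not discover rack membership on its own: the administrator supplies a topology mapping (a script or Java class, configured through net.topology.script.file.name or net.topology.node.switch.mapping.impl) that resolves each node's address to a rack ID.
With the default replication factor of 3, one replica is typically placed on the writer's node, a second on a node in a different rack, and a third on another node in that same remote rack. This protects against the loss of an entire rack while limiting cross-rack write traffic.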
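The placement idea above can be sketched in a few lines. This is a simplified illustration, not the actual HDFS placement policy: the topology dictionary (node name to rack ID), the function name, and the fixed random seed are all assumptions made for the example.

```python
import random


def choose_replica_nodes(topology, writer, rng=None):
    """Pick three nodes: the writer, a remote-rack node, and a rack peer of that node.

    topology is a hypothetical {node: rack_id} mapping used only for this sketch.
    """
    rng = rng or random.Random(0)
    local_rack = topology[writer]

    # Candidates for the second replica: any node outside the writer's rack.
    remote_nodes = sorted(n for n, rack in topology.items() if rack != local_rack)
    if not remote_nodes:
        raise ValueError("need at least one node on a second rack")
    second = rng.choice(remote_nodes)

    # Third replica: another node on the same remote rack, if one exists.
    peers = [n for n in remote_nodes
             if topology[n] == topology[second] and n != second]
    if not peers:
        # Assumed fallback for this sketch: any node not already chosen.
        peers = sorted(n for n in topology if n not in (writer, second))
    if not peers:
        raise ValueError("need at least three nodes")
    third = rng.choice(peers)
    return [writer, second, third]
```

With two racks of two nodes each, the result always spans both racks, so losing either rack leaves at least one replica.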
3. Heartbeat Mechanism
The NameNode periodically receives heartbeat messages and block reports from DataNodes.
Based on the block reports, the NameNode checks its block-to-DataNode mapping against what each DataNode actually stores.
A DataNode that stops sending heartbeats is marked dead after a timeout (about 10.5 minutes with default settings) and no longer receives I/O requests.
If a DataNode failure drops a block's replica count below its configured replication factor, the NameNode detects the shortage and schedules re-replication.
Re‑replication can also be caused by corrupted replicas, disk errors, or an increased replication factor.
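As a rough illustration of the detection logic described above, the sketch below computes the dead-node timeout from the two default intervals (3 seconds for dfs.heartbeat.interval and 5 minutes for dfs.namenode.heartbeat.recheck-interval) and then lists blocks that fall short of their replication factor. The function names and the in-memory dictionaries are assumptions for the example; the real NameNode keeps far richer state.

```python
HEARTBEAT_INTERVAL_S = 3    # dfs.heartbeat.interval default
RECHECK_INTERVAL_S = 300    # dfs.namenode.heartbeat.recheck-interval default (300000 ms)


def dead_node_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Timeout after which a silent DataNode is considered dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return nodes whose last heartbeat is older than the timeout."""
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > timeout_s)


def under_replicated(block_locations, dead_nodes, replication):
    """Map each block to the number of replicas it is missing (only if > 0)."""
    dead = set(dead_nodes)
    missing = {}
    for block, nodes in block_locations.items():
        live = sum(1 for n in nodes if n not in dead)
        if live < replication:
            missing[block] = replication - live
    return missing
```

With the defaults, the timeout works out to 2 × 300 + 10 × 3 = 630 seconds, which is where the figure of roughly 10.5 minutes comes from.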
4. Safe Mode
When the NameNode starts, it first enters a "safe mode" phase.
The namespace is read-only during safe mode: no writes, deletions, or block replication take place.
During this phase, the NameNode collects block reports from the DataNodes; a block is considered "safe" once its minimum replica count (dfs.namenode.replication.min, default 1) has been reported.
After a configurable proportion of blocks is safe (dfs.namenode.safemode.threshold-pct, default 0.999) and an additional waiting period elapses (dfs.namenode.safemode.extension, default 30 seconds), safe mode ends.
After leaving safe mode, the NameNode re-replicates any blocks that still have too few replicas.
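The exit condition for safe mode can be expressed as a small check. This is a sketch under simplifying assumptions: the reported-replica dictionary and both function names are hypothetical, and the real NameNode tracks these counters incrementally rather than recomputing them.

```python
def safe_block_fraction(reported_replicas, min_replicas=1):
    """Fraction of blocks whose reported replica count meets the minimum."""
    total = len(reported_replicas)
    if total == 0:
        return 1.0  # an empty namespace has nothing to wait for
    safe = sum(1 for count in reported_replicas.values() if count >= min_replicas)
    return safe / total


def can_leave_safe_mode(reported_replicas, threshold=0.999, min_replicas=1,
                        seconds_above_threshold=0, extension_s=30):
    """True once enough blocks are safe and the extension period has elapsed."""
    enough_blocks = safe_block_fraction(reported_replicas, min_replicas) >= threshold
    return enough_blocks and seconds_above_threshold >= extension_s
```

The defaults mirror the threshold, minimum replica count, and extension described above; lowering the threshold makes the NameNode leave safe mode sooner at the cost of serving a namespace with more unavailable blocks.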
5. Checksum
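When a client writes a file, checksums are computed for the data in each block (in practice per small chunk, 512 bytes by default via dfs.bytes-per-checksum).
The checksums are stored separately from the data and are hidden from normal directory listings; on each DataNode they are kept in a metadata file next to the block file.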
Clients can verify the checksum to detect corrupted blocks.
If a block is corrupted, the client can read from another replica.
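The verify-then-fall-back behavior can be sketched as follows. Note the assumptions: zlib.crc32 is used here as a stand-in (HDFS defaults to a different CRC variant, CRC32C), replicas are modeled as in-memory byte strings, and all function names are invented for the example.

```python
import zlib


def chunk_checksums(data, bytes_per_checksum=512):
    """Compute one CRC per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + bytes_per_checksum])
            for i in range(0, len(data), bytes_per_checksum)]


def first_corrupt_chunk(data, expected, bytes_per_checksum=512):
    """Return the index of the first mismatching chunk, or None if all match."""
    actual = chunk_checksums(data, bytes_per_checksum)
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    if len(actual) != len(expected):
        return min(len(actual), len(expected))  # data is truncated or too long
    return None


def read_verified(replicas, expected, bytes_per_checksum=512):
    """Return the first replica that passes verification, trying each in turn."""
    for data in replicas:
        if first_corrupt_chunk(data, expected, bytes_per_checksum) is None:
            return data
    raise IOError("all replicas failed checksum verification")
```

Checksumming per chunk rather than per block means a reader can verify data as it streams and pinpoint which part of a block is damaged.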
6. Trash
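When trash is enabled, deleted files are moved to a trash directory (by default .Trash under the user's home directory, for example /user/<username>/.Trash) instead of being removed immediately.
Files in the trash can be quickly restored.
A configurable retention time (fs.trash.interval, in minutes) determines when trash files are permanently deleted and their space reclaimed.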
7. Metadata Protection
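The image file (fsimage) and edit log are the core NameNode data and can be written to multiple locations, for example by listing several directories in dfs.namenode.name.dir.
Keeping multiple copies improves safety but may reduce NameNode performance, because every metadata update must be written to each copy.
Without an HA configuration, the NameNode remains a single point of failure and must be recovered manually if it crashes.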
YARN – Failover
Failure Types
Application issues
Process crashes
Hardware problems
Failure Handling
Task failure
Runtime exceptions or JVM exits are reported to the ApplicationMaster.
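Hung tasks are detected through missing heartbeats or progress updates: a task that stays silent past a configurable timeout (mapreduce.task.timeout, 10 minutes by default) is marked failed.
If a job's task failure rate exceeds the configured threshold, the job is considered failed.
A failed task is rescheduled by the ApplicationMaster, up to a configurable number of attempts (4 by default) and preferably on a different node.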
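A minimal sketch of the retry and job-failure rules described above, assuming a task is any callable that raises on failure. The function names are hypothetical; the defaults mirror the usual MapReduce settings of 4 attempts per task and a 0 % tolerated task-failure rate.

```python
def run_with_retries(task, max_attempts=4):
    """Call task(attempt) until it succeeds or attempts run out.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task(attempt)
            return True, attempt
        except Exception:
            continue  # a real ApplicationMaster would reschedule on another node
    return False, max_attempts


def job_failed(task_results, max_failures_percent=0):
    """True if the share of failed tasks exceeds the tolerated percentage."""
    failed = sum(1 for ok in task_results if not ok)
    return failed * 100 > max_failures_percent * len(task_results)
```

Raising the tolerated percentage lets a job finish with partial results, which can be acceptable for sampling-style workloads but not for jobs that need every input record.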
ApplicationMaster failure
The ApplicationMaster sends periodic heartbeats to the ResourceManager; after a configurable number of missed heartbeats, it is considered failed.
When the ApplicationMaster fails, the ResourceManager starts a new ApplicationMaster.
The new ApplicationMaster recovers the previous state (e.g., yarn.app.mapreduce.am.job.recovery.enable=true) by loading saved state from shared storage; the ResourceManager does not handle task state persistence.
The client also polls the ApplicationMaster for progress; upon detecting failure, it asks the ResourceManager for the new ApplicationMaster.
NodeManager failure
NodeManagers send heartbeats to the ResourceManager; if a heartbeat is missed for a configured interval, the ResourceManager removes the NodeManager.
Tasks and ApplicationMasters running on the failed NodeManager are recovered on other NodeManagers.
If a NodeManager fails repeatedly, the ApplicationMaster may blacklist it (ResourceManager does not perform blacklisting).
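The two mechanisms in this section, heartbeat expiry on the ResourceManager side and blacklisting on the ApplicationMaster side, can be sketched as below. The class and function names are invented for the example; the default values (a 10-minute expiry and 3 task failures per node) are the usual ones, but check your cluster's configuration before relying on them.

```python
from collections import Counter


def expired_nodes(last_heartbeat, now, expiry_s=600):
    """Return NodeManagers whose last heartbeat is older than the expiry interval."""
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > expiry_s)


class NodeBlacklist:
    """Tracks task failures per node, as an ApplicationMaster might."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = Counter()

    def record_failure(self, node):
        """Count one failure and report whether the node is now blacklisted."""
        self.failures[node] += 1
        return self.is_blacklisted(node)

    def is_blacklisted(self, node):
        return self.failures[node] >= self.max_failures

    def usable(self, nodes):
        """Filter out blacklisted nodes from a list of candidates."""
        return [n for n in nodes if not self.is_blacklisted(n)]
```

Keeping the blacklist inside the ApplicationMaster means one misbehaving job cannot cause a node to be excluded for every other application on the cluster.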
ResourceManager failure
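With recovery enabled, the ResourceManager persists application state to a state store (for example, a file system or ZooKeeper); after a failure, the restarted ResourceManager reloads that state.
High availability uses an active/standby pair of ResourceManagers, with ZooKeeper coordinating leader election and holding the shared state.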
Overall, lower‑level modules are monitored and recovered via heartbeats, while the top‑level module uses checkpointing, state sync, and ZooKeeper for HA.
Hadoop Shuffle
MapReduce – Shuffle
Map results are sorted and transferred to Reduce for processing. Instead of writing directly to disk, Map uses an in‑memory buffer for pre‑sorting, may invoke a Combiner, compress data, partition by key, and minimize the amount transferred. Each Map task notifies the framework when it finishes, allowing Reduce to start.
Map Side
Map output is first cached in memory for sorting before being written to disk.
Each Map task has a circular memory buffer (default 100 MB); when 80 % full, a background thread spills the buffer to a file while the Map continues producing output. If the buffer fills completely, the Map must wait.
Before a spill is written, the data is partitioned by target Reduce task. Within each partition, records are sorted by key; if a Combiner is configured, it runs on the sorted output to reduce data size. Spill files are written in round-robin fashion across the configured local directories.
Each time the buffer threshold is reached, a new spill file is created, so several spill files may exist by the end of the Map task; they are merged into a single partitioned, sorted output file. If there are at least three spill files (configurable via mapreduce.map.combine.minspills), the Combiner runs again during this merge.
If Map output compression is enabled, the data is compressed as it is written to disk.
When the Map finishes, it notifies its ApplicationMaster (the TaskTracker in MapReduce 1), allowing Reduce tasks to begin copying its output.
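The Map-side steps above (buffer, partition, sort, combine, spill, final merge) can be simulated in memory. This is a sketch under several assumptions: the buffer is measured in records rather than bytes, spilling is synchronous rather than done by a background thread, the Combiner is a simple sum that always runs during the final merge, and the CRC-based partitioner stands in for Hadoop's hash partitioner. All names are invented for the example.

```python
import heapq
import zlib
from itertools import groupby


def partition_for(key, num_reducers):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(sorted_records):
    """Sum the values of adjacent records that share (partition, key)."""
    return [(part, key, sum(value for _, _, value in group))
            for (part, key), group in groupby(sorted_records, key=lambda r: (r[0], r[1]))]


def map_side_shuffle(records, num_reducers, buffer_capacity=10, spill_percent=0.8):
    """Return (merged_output, number_of_spills) for a stream of (key, value) records."""
    spills, buffer = [], []
    threshold = int(buffer_capacity * spill_percent)
    for key, value in records:
        buffer.append((partition_for(key, num_reducers), key, value))
        if len(buffer) >= threshold:
            # Spill: sort by (partition, key), run the combiner, start a new buffer.
            spills.append(combine(sorted(buffer)))
            buffer = []
    if buffer:
        spills.append(combine(sorted(buffer)))
    # Final merge of the sorted spills, followed by one more combiner pass.
    merged = combine(list(heapq.merge(*spills)))
    return merged, len(spills)
```

Because every spill is already sorted by partition and key, the final merge is a streaming k-way merge and never needs to hold all records in memory at once.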
Reduce Side
Map output files reside on the local disks of the machines where the Map tasks ran.
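Each Reduce task copies its partition from those machines. A Map output that is small enough is held in the Reduce task's memory; otherwise it is written to the Reduce task's local disk.
As copies accumulate, a background thread merges them into larger sorted files (decompressing compressed Map output if necessary).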
After all Map outputs are copied and merged, the Reduce function is invoked.
Reduce output is written back to HDFS.
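The Reduce-side merge can be sketched as a k-way merge of already sorted Map outputs followed by grouping on the key. The function name and the in-memory lists are assumptions for the example; a real Reduce task merges in several rounds and streams from disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(map_outputs, reduce_fn):
    """Merge key-sorted Map outputs and apply reduce_fn(key, values) per key.

    map_outputs is a list of lists of (key, value) pairs, each sorted by key.
    """
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [(key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(merged, key=itemgetter(0))]
```

For example, calling reduce_side with a summing reduce_fn on two Map outputs that both contain the key "a" yields a single ("a", total) pair, which is the record that would be written to HDFS.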
Optimization Tips
Allocate as much memory as possible to shuffle while ensuring Map and Reduce tasks have sufficient memory.
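For Map, minimize spills to disk by using a Combiner and increasing io.sort.mb (mapreduce.task.io.sort.mb in Hadoop 2 and later).
For Reduce, keep Map results in memory as much as possible; adjust mapred.inmem.merge.threshold and mapred.job.reduce.input.buffer.percent as needed (mapreduce.reduce.merge.inmem.threshold and mapreduce.reduce.input.buffer.percent in Hadoop 2 and later).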
Monitor the "Spilled Records" counter to track how much data is written to disk (includes both Map and Reduce).
Enable compression for Map output and increase the I/O buffer size (io.file.buffer.size, default 4 KB) to improve performance.
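To see why a larger sort buffer helps, the back-of-the-envelope estimate below relates Map output size to the number of spills. It is only an approximation under stated assumptions: it ignores the per-record metadata that also occupies the buffer, assumes no Combiner, and uses a hypothetical function name.

```python
import math


def estimated_spills(map_output_mb, sort_mb=100, spill_percent=0.8):
    """Rough number of spill files for one Map task's output."""
    if map_output_mb <= 0:
        return 0
    # Each spill drains roughly sort_mb * spill_percent megabytes of output.
    return math.ceil(map_output_mb / (sort_mb * spill_percent))
```

Under these assumptions, 300 MB of Map output needs about four spills with the default 100 MB buffer but only one with a 400 MB buffer, which removes the intermediate merge pass entirely.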
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
