How to Diagnose and Resolve HDFS Safe Mode Issues
This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.
Problem Phenomenon
When a DataNode crashes, the block replicas it hosted become unavailable. If the fraction of adequately replicated blocks drops below the NameNode's safety threshold, HDFS automatically switches to safe mode, which can be observed on the HDFS homepage as "Safe mode is ON".
What Is Safe Mode?
HDFS safe mode is a special read‑only state: the file system serves read requests, while delete, modify, and block‑replication operations are blocked. The purpose is to guarantee data consistency and prevent data loss while the cluster stabilizes.
How Safe Mode Is Entered
Manual entry – an administrator triggers safe mode explicitly, typically for maintenance or expansion, using the command
hdfs dfsadmin -safemode enter
and later exits with
hdfs dfsadmin -safemode leave
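As a minimal sketch, the manual enter/leave cycle can be wrapped in small shell helpers. The function names here are ours, and the `hdfs` CLI is assumed to be on PATH with sufficient privileges:

```shell
# Hypothetical helpers around the safe-mode subcommands of `hdfs dfsadmin`.
# Assumes the `hdfs` CLI is on PATH and the caller has HDFS superuser rights.

enter_safemode() {
  hdfs dfsadmin -safemode enter
}

leave_safemode() {
  hdfs dfsadmin -safemode leave
}

# Prints "ON" or "OFF" by parsing the "Safe mode is ON/OFF" status line.
safemode_status() {
  hdfs dfsadmin -safemode get | grep -Eo 'ON|OFF' | head -n1
}
```

Calling `safemode_status` before and after maintenance confirms the state actually changed rather than assuming the enter/leave commands succeeded.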
Automatic entry – the NameNode enters safe mode on its own during startup, or when the cluster no longer meets the required safety thresholds. The system leaves safe mode only after several conditions are satisfied:
The number of live DataNodes meets the threshold defined by dfs.namenode.safemode.min.datanodes.
The percentage of blocks that have reached the minimum replication factor exceeds dfs.namenode.safemode.threshold-pct (default 0.999, i.e., 99.9%).
The minimum replication count per block meets dfs.namenode.replication.min (default 1).
After the above are met, the cluster must remain stable for the period set by dfs.namenode.safemode.extension (default 30000 ms, i.e., 30 seconds).
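These thresholds live in hdfs-site.xml. A sketch showing the four properties at their stock default values (tune them deliberately; as the summary cautions, do not change them merely to force an exit):

```xml
<!-- hdfs-site.xml: safe-mode related properties at their default values -->
<configuration>
  <property>
    <name>dfs.namenode.safemode.min.datanodes</name>
    <value>0</value> <!-- live DataNodes required; 0 disables this check -->
  </property>
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>0.999</value> <!-- fraction of blocks meeting minimal replication -->
  </property>
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value> <!-- minimal replication count per block -->
  </property>
  <property>
    <name>dfs.namenode.safemode.extension</name>
    <value>30000</value> <!-- ms the cluster must stay stable before leaving -->
  </property>
</configuration>
```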
Typical direct causes for entering safe mode include:
Failed DataNode startup or loss of heartbeat to the NameNode.
Disk failures on DataNode storage volumes.
Disk partitions running out of space.
How to Solve
Analyze the cause
1. Check the HDFS Web UI for cluster and DataNode status.
2. Review logs (usually under /var/log/) for error details.
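For step 2, a quick way to surface the relevant lines is to grep the NameNode log for safe-mode messages. A sketch, where the log path is an assumption and should be adjusted to your distribution's layout:

```shell
# Sketch: pull safe-mode related lines out of a NameNode log file.
# The default path below is an assumption -- adjust to your installation.
NAMENODE_LOG="${NAMENODE_LOG:-/var/log/hadoop-hdfs/hadoop-hdfs-namenode.log}"

# Case-insensitive match catches variants like "Safe mode is ON" and
# "Safe mode extension entered".
safemode_log_lines() {
  grep -i 'safe mode' "$1"
}
```

Usage: `safemode_log_lines "$NAMENODE_LOG"`.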
Fix the issue
Depending on the identified problem, take appropriate actions:
If a DataNode failed to start, repair and restart it.
If a disk partition is full, expand the storage.
If a storage volume is faulty, repair or replace it (note that data on the failed volume may be lost).
If data loss occurs, list corrupted blocks and their files with
hdfs fsck / -list-corruptfileblocks
or
hdfs fsck / -files -blocks -locations
then delete the affected files using hdfs fsck / -delete after exiting safe mode:
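Before deleting anything, it helps to capture just the affected file paths so they can be restored from upstream sources later. A sketch, assuming -list-corruptfileblocks emits one "blk_&lt;id&gt; &lt;path&gt;" pair per line (verify against your Hadoop version's actual output format):

```shell
# Hypothetical helper: extract the file paths named in the output of
# `hdfs fsck / -list-corruptfileblocks`.
# Assumes data lines have the form "blk_<id> <path>"; other lines are skipped.
corrupt_file_paths() {
  hdfs fsck / -list-corruptfileblocks 2>/dev/null \
    | awk '$1 ~ /^blk_/ { print $2 }' \
    | sort -u
}
```

Redirecting the result to a file (`corrupt_file_paths > /tmp/lost-files.txt`) gives a restore checklist before any destructive cleanup.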
Exit safe mode:
sudo -u hdfs hdfs dfsadmin -safemode leave
Delete corrupted files:
sudo -u hdfs hdfs fsck / -delete
After fixing or deleting the problematic blocks, restart the cluster; HDFS should exit safe mode and resume normal read/write operations.
Complete Production Recovery Process
All commands must be executed as the hdfs user (e.g., su - hdfs).
Leave safe mode:
sudo -u hdfs hdfs dfsadmin -safemode leave
Check cluster status:
hdfs dfsadmin -report
List corrupted blocks:
hdfs fsck / -list-corruptfileblocks
Run a health check:
hdfs fsck /
Inspect specific corrupted blocks:
hdfs fsck /path/to/corrupt/file -locations -blocks -files
Delete bad blocks:
hdfs fsck / -delete
Verify health again; if still unhealthy, repeat after some time.
If blocks remain, manually remove their files:
hdfs dfs -rm "/File/Path/of/the/missing/blocks"
Following these steps should restore HDFS to a healthy state.
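The steps above can be sketched as a single script. The function name and the HEALTHY check on the fsck output are our assumptions; run it as the hdfs user rather than via sudo inside the script:

```shell
#!/usr/bin/env bash
# Sketch of the production recovery flow above; run as the hdfs user
# (e.g., su - hdfs). Assumes `hdfs fsck /` prints a line containing
# "HEALTHY" when the filesystem is clean.
set -euo pipefail

recover_hdfs() {
  hdfs dfsadmin -safemode leave             # leave safe mode
  hdfs dfsadmin -report                     # check cluster status
  hdfs fsck / -list-corruptfileblocks      # list corrupted blocks
  if hdfs fsck / | grep -q 'HEALTHY'; then # overall health check
    echo "healthy"
  else
    hdfs fsck / -delete                    # delete irrecoverable files
    echo "deleted-corrupt-files; re-run the health check"
  fi
}
```

If the second pass still reports corruption, remove the remaining files manually with `hdfs dfs -rm` as described above.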
Summary
Maintain at least two replicas for production HDFS blocks.
Monitor DataNode disk usage and expand storage before thresholds are exceeded.
Always analyze the root cause before forcibly exiting safe mode.
If block corruption occurs, attempt replication recovery first; only delete irrecoverable blocks and restore data from upstream sources.
Modify safe‑mode parameters with caution; avoid changing them just to force an exit.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.