How to Diagnose and Resolve HDFS Safe Mode Issues
This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.
Problem Phenomenon
When a DataNode crashes, the block replicas it hosted become unavailable. If the fraction of adequately replicated blocks drops below the NameNode's safety threshold, HDFS automatically switches to safe mode, which can be observed on the HDFS homepage as "Safe mode is ON".
What Is Safe Mode?
HDFS safe mode is a special read‑only state: the file system serves read requests, while delete, modify, and block‑replication operations are blocked. The purpose is to guarantee data consistency and prevent data loss while the cluster stabilizes.
How Safe Mode Is Entered
Manual entry – an administrator triggers safe mode explicitly, typically for maintenance or expansion, using the command
hdfs dfsadmin -safemode enter
and later exits with
hdfs dfsadmin -safemode leave
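As a minimal sketch, the manual enter/leave cycle can be wrapped in small shell helpers. The function names here are ours, and the `hdfs` CLI is assumed to be on PATH with sufficient privileges:

```shell
# Hypothetical helpers around the safe-mode subcommands of `hdfs dfsadmin`.
# Assumes the `hdfs` CLI is on PATH and the caller has HDFS superuser rights.

enter_safemode() {
  hdfs dfsadmin -safemode enter
}

leave_safemode() {
  hdfs dfsadmin -safemode leave
}

# Prints "ON" or "OFF" by parsing the "Safe mode is ON/OFF" status line.
safemode_status() {
  hdfs dfsadmin -safemode get | grep -Eo 'ON|OFF' | head -n1
}
```

Calling `safemode_status` before and after maintenance confirms the state actually changed rather than assuming the enter/leave commands succeeded.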
Automatic entry – the NameNode enters safe mode on its own during startup, or when the cluster no longer meets the required safety thresholds. The system leaves safe mode only after several conditions are satisfied:
The number of live DataNodes meets the threshold defined by dfs.namenode.safemode.min.datanodes.
The percentage of blocks that have reached the minimum replication factor exceeds dfs.namenode.safemode.threshold-pct (default 0.999, i.e., 99.9%).
The minimum replication count per block meets dfs.namenode.replication.min (default 1).
After the above are met, the cluster must remain stable for the period set by dfs.namenode.safemode.extension (default 30000 ms, i.e., 30 seconds).
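These thresholds live in hdfs-site.xml. A sketch showing the four properties at their stock default values (tune them deliberately; as the summary cautions, do not change them merely to force an exit):

```xml
<!-- hdfs-site.xml: safe-mode related properties at their default values -->
<configuration>
  <property>
    <name>dfs.namenode.safemode.min.datanodes</name>
    <value>0</value> <!-- live DataNodes required; 0 disables this check -->
  </property>
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>0.999</value> <!-- fraction of blocks meeting minimal replication -->
  </property>
  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value> <!-- minimal replication count per block -->
  </property>
  <property>
    <name>dfs.namenode.safemode.extension</name>
    <value>30000</value> <!-- ms the cluster must stay stable before leaving -->
  </property>
</configuration>
```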
Typical direct causes for entering safe mode include:
Failed DataNode startup or loss of heartbeat to the NameNode.
Disk failures on DataNode storage volumes.
Disk partitions running out of space.
How to Solve
Analyze the cause
1. Check the HDFS Web UI for cluster and DataNode status.
2. Review logs (usually under /var/log/) for error details.
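For step 2, a quick way to surface the relevant lines is to grep the NameNode log for safe-mode messages. A sketch, where the log path is an assumption and should be adjusted to your distribution's layout:

```shell
# Sketch: pull safe-mode related lines out of a NameNode log file.
# The default path below is an assumption -- adjust to your installation.
NAMENODE_LOG="${NAMENODE_LOG:-/var/log/hadoop-hdfs/hadoop-hdfs-namenode.log}"

# Case-insensitive match catches variants like "Safe mode is ON" and
# "Safe mode extension entered".
safemode_log_lines() {
  grep -i 'safe mode' "$1"
}
```

Usage: `safemode_log_lines "$NAMENODE_LOG"`.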
Fix the issue
Depending on the identified problem, take appropriate actions:
If a DataNode failed to start, repair and restart it.
If a disk partition is full, expand the storage.
If a storage volume is faulty, repair or replace it (note that data on the failed volume may be lost).
If data loss occurs, list corrupted blocks and their files with
hdfs fsck / -list-corruptfileblocks
or
hdfs fsck / -files -blocks -locations
then delete the affected files using hdfs fsck / -delete after exiting safe mode:
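Before deleting anything, it helps to capture just the affected file paths so they can be restored from upstream sources later. A sketch, assuming -list-corruptfileblocks emits one "blk_&lt;id&gt; &lt;path&gt;" pair per line (verify against your Hadoop version's actual output format):

```shell
# Hypothetical helper: extract the file paths named in the output of
# `hdfs fsck / -list-corruptfileblocks`.
# Assumes data lines have the form "blk_<id> <path>"; other lines are skipped.
corrupt_file_paths() {
  hdfs fsck / -list-corruptfileblocks 2>/dev/null \
    | awk '$1 ~ /^blk_/ { print $2 }' \
    | sort -u
}
```

Redirecting the result to a file (`corrupt_file_paths > /tmp/lost-files.txt`) gives a restore checklist before any destructive cleanup.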
Exit safe mode:
sudo -u hdfs hdfs dfsadmin -safemode leave
Delete corrupted files:
sudo -u hdfs hdfs fsck / -delete
After fixing or deleting the problematic blocks, restart the cluster; HDFS should exit safe mode and resume normal read/write operations.
Complete Production Recovery Process
All commands must be executed as the hdfs user (e.g., su - hdfs).
Leave safe mode:
sudo -u hdfs hdfs dfsadmin -safemode leave
Check cluster status:
hdfs dfsadmin -report
List corrupted blocks:
hdfs fsck / -list-corruptfileblocks
Run a health check:
hdfs fsck /
Inspect specific corrupted blocks:
hdfs fsck /path/to/corrupt/file -locations -blocks -files
Delete bad blocks:
hdfs fsck / -delete
Verify health again; if still unhealthy, repeat after some time.
If blocks remain, manually remove their files:
hdfs dfs -rm "/File/Path/of/the/missing/blocks"
Following these steps should restore HDFS to a healthy state.
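The steps above can be sketched as a single script. The function name and the HEALTHY check on the fsck output are our assumptions; run it as the hdfs user rather than via sudo inside the script:

```shell
#!/usr/bin/env bash
# Sketch of the production recovery flow above; run as the hdfs user
# (e.g., su - hdfs). Assumes `hdfs fsck /` prints a line containing
# "HEALTHY" when the filesystem is clean.
set -euo pipefail

recover_hdfs() {
  hdfs dfsadmin -safemode leave             # leave safe mode
  hdfs dfsadmin -report                     # check cluster status
  hdfs fsck / -list-corruptfileblocks      # list corrupted blocks
  if hdfs fsck / | grep -q 'HEALTHY'; then # overall health check
    echo "healthy"
  else
    hdfs fsck / -delete                    # delete irrecoverable files
    echo "deleted-corrupt-files; re-run the health check"
  fi
}
```

If the second pass still reports corruption, remove the remaining files manually with `hdfs dfs -rm` as described above.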
Summary
Maintain at least two replicas for production HDFS blocks.
Monitor DataNode disk usage and expand storage before thresholds are exceeded.
Always analyze the root cause before forcibly exiting safe mode.
If block corruption occurs, attempt replication recovery first; only delete irrecoverable blocks and restore data from upstream sources.
Modify safe‑mode parameters with caution; avoid changing them just to force an exit.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.