Why Did Our HDFS Standby NameNode Crash? A Deep Dive into Block Recovery Bugs
A recent HDFS outage crashed both the Standby and Observer NameNodes after heavy client load triggered block‑recovery failures. The incident exposed a bug in commitBlockSynchronization that leads to mismatched block IDs and edit‑log inconsistencies; the fix is HDFS‑17861.
During a production incident, an HDFS cluster experienced a cascade of crashes: both the Standby and Observer NameNodes went down shortly after a Trino task opened thousands of connections to several DataNodes, overloading them and causing heartbeat loss.
Failure Chain Analysis
The overloaded DataNodes stalled, breaking heartbeats, incremental block reports (IBRs), and full block reports (FBRs) to the NameNodes.
Because the NameNode could not receive block reports, some blocks were not closed properly, triggering block‑recovery and lease‑recovery processes.
During block recovery the last block group entered an unrecoverable state. The DataNode sent a commitBlockSynchronization RPC with the deleteBlock flag set to true, causing the problematic block group to be deleted. This behavior is documented in HDFS‑17358 (https://issues.apache.org/jira/browse/HDFS-17358).
The NameNode handled the commitBlockSynchronization request, deleted the block as instructed, and then attempted to close the file. Because the required replication factor was not satisfied, the close operation failed and no edit‑log entry was written. Consequently the Active NameNode recorded one fewer block than the Standby, creating a state mismatch.
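The divergence can be modeled in a few lines. The sketch below is a hypothetical simplification (class and method names are made up, not the real NameNode code): the Active mutates its in‑memory block list when deleteBlock is set, but journals an OP_CLOSE only if the min‑replication check passes, so a failed close leaves the Standby permanently behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model (not the real NameNode code) of how a block deletion
// that is never journaled leaves the Standby with a stale block list.
public class DivergenceSketch {

    // Active-side handling of commitBlockSynchronization with deleteBlock=true:
    // mutate the in-memory block list, then journal OP_CLOSE only if the
    // min-replication check passes. Returns true if an edit was logged.
    static boolean commitBlockSynchronization(List<Long> blocks, List<String> editLog,
                                              boolean deleteBlock,
                                              int liveReplicas, int minReplication) {
        if (deleteBlock) {
            blocks.remove(blocks.size() - 1);   // block removed from memory
        }
        if (liveReplicas >= minReplication) {
            editLog.add("OP_CLOSE blocks=" + blocks.size());
            return true;
        }
        return false;  // close failed: nothing journaled, Standby never told
    }

    public static void main(String[] args) {
        List<Long> activeBlocks  = new ArrayList<>(List.of(1001L, 1002L));
        List<Long> standbyBlocks = new ArrayList<>(List.of(1001L, 1002L));
        List<String> editLog     = new ArrayList<>();

        // Recovery deletes the last block, but 1 live replica < minReplication 2.
        commitBlockSynchronization(activeBlocks, editLog, true, 1, 2);

        // The Active now tracks one block fewer than the Standby, and the
        // edit log carries no record of the deletion.
        System.out.println("active=" + activeBlocks.size()
                + " standby=" + standbyBlocks.size()
                + " editLog=" + editLog);
    }
}
```

The key property the sketch captures is that the state change and the journal write are not atomic: the deletion survives the failed close, but the record of it does not.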
The Active NameNode retried block recovery, succeeded, and wrote a CLOSE edit‑log entry.
The Standby NameNode fetched the new edit log from the JournalNode, applied it, detected the inconsistency, threw an exception, and crashed.
Bug Fix
The root cause was a bug in the handling of commitBlockSynchronization that allowed the deleteBlock flag to corrupt block state on standby/observer nodes. The bug was fixed in HDFS‑17861 (https://issues.apache.org/jira/browse/HDFS-17861). The patch is available in Apache Hadoop pull request #8120 (https://github.com/apache/hadoop/pull/8120).
Relevant Log Excerpts
2025-11-18 17:47:51,092 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/user/luca/data-pre/dc/Bee-Training-Data-Stage1/dir=data/.20251118161754_13.tgz.downloading, ...]
java.io.IOException: Mismatched block IDs or generation stamps, attempting to replace block blk_-9223372036154398096_120505068 with blk_-9223372036154398096_120522931 as block # 1/2 of /user/luca/data-pre/dc/Bee-Training-Data-Stage1/dir=data/.20251118161754_13.tgz.downloading
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1145)
...
2025-11-18 17:46:49,763 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(oldBlock=BP-1615086879-xxx-1719219336538:blk_-9223372036154398096_120505068, newgenerationstamp=120522931, newlength=125829120, newtargets=[xxx:50010, null:0, null:0, xxx:50010, xxx:50010, xxx:50010], closeFile=true, deleteBlock=false) successful

Code Path Inspection
The exception occurs when the Standby NameNode applies an OP_CLOSE edit‑log operation. The handling logic in FSEditLogLoader.applyEditLogOp is:
case OP_CLOSE: {
  AddCloseOp addCloseOp = (AddCloseOp) op;
  final String path = renameReservedPathsOnUpgrade(addCloseOp.path, logVersion);
  final INodesInPath iip = fsDir.getINodesInPath(path, DirOp.READ);
  final INodeFile file = INodeFile.valueOf(iip.getLastINode(), path);
  // Update file attributes
  file.setAccessTime(addCloseOp.atime, Snapshot.CURRENT_STATE_ID, false);
  file.setModificationTime(addCloseOp.mtime, Snapshot.CURRENT_STATE_ID);
  ErasureCodingPolicy ecPolicy = FSDirErasureCodingOp.unprotectedGetErasureCodingPolicy(fsDir.getFSNamesystem(), iip);
  updateBlocks(fsDir, addCloseOp, iip, file, ecPolicy);
  // Close the file
  if (!file.isUnderConstruction() && logVersion <= LayoutVersion.BUGFIX_HDFS_2991_VERSION) {
    throw new IOException("File is not under construction: " + path);
  }
  if (file.isUnderConstruction()) {
    fsNamesys.getLeaseManager().removeLease(file.getId());
    file.toCompleteFile(file.getModificationTime(), 0, fsNamesys.getBlockManager().getMinReplication());
  }
  break;
}

The updateBlocks method throws the mismatched‑block‑ID exception when the old and new block IDs or generation stamps differ and the isGenStampUpdate flag is false. The flag is defined as:
boolean isGenStampUpdate = oldBlocks.length == newBlocks.length;

In the incident, the block‑list length stored in the Standby's in‑memory INodeFile did not match the length carried by the OP_CLOSE operation, so isGenStampUpdate was false, the exception was thrown, and the Standby crashed.
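The gate can be illustrated in isolation. The following is a simplified stand‑in for the check in FSEditLogLoader.updateBlocks (the Block record and surrounding class are made up for the sketch): generation‑stamp changes are accepted only when the two block lists have equal length, so a Standby whose in‑memory list is longer than the list in the OP_CLOSE rejects the edit.

```java
import java.io.IOException;

// Simplified stand-in for the mismatch check in FSEditLogLoader.updateBlocks:
// block IDs / generation stamps may only differ when the two lists have equal
// length, i.e. when the edit is a pure generation-stamp update.
public class UpdateBlocksSketch {
    record Block(long id, long genStamp) {}

    static void updateBlocks(Block[] oldBlocks, Block[] newBlocks) throws IOException {
        // Equal lengths => the op may legitimately bump generation stamps.
        boolean isGenStampUpdate = oldBlocks.length == newBlocks.length;
        for (int i = 0; i < Math.min(oldBlocks.length, newBlocks.length); i++) {
            Block oldB = oldBlocks[i];
            Block newB = newBlocks[i];
            if ((oldB.id() != newB.id() || oldB.genStamp() != newB.genStamp())
                    && !isGenStampUpdate) {
                throw new IOException("Mismatched block IDs or generation stamps, "
                        + "attempting to replace block " + oldB + " with " + newB);
            }
        }
    }

    public static void main(String[] args) {
        // Standby's in-memory list has 2 blocks, the OP_CLOSE carries only 1:
        // lengths differ, so the generation-stamp change is rejected.
        Block[] inMemory = { new Block(101L, 1L), new Block(102L, 1L) };
        Block[] fromOp   = { new Block(101L, 2L) };
        try {
            updateBlocks(inMemory, fromOp);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this simplification, the same generation‑stamp bump would be accepted if both lists held one block each, which is why the Active (whose list already lost the deleted block) applied the retried CLOSE cleanly while the Standby could not.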
Conclusion
The crash was caused by a combination of extreme client load, incomplete block replication, and a bug in commitBlockSynchronization that mishandles the deleteBlock flag during recovery. Applying the HDFS‑17861 patch resolves the issue. Operators should monitor client connection counts and block replication health to prevent similar cascades.
Big Data Technology Tribe
Focused on computer science and cutting‑edge tech, we distill complex knowledge into clear, actionable insights. We track tech evolution, share industry trends and deep analysis, helping you keep learning, boost your technical edge, and ride the digital wave forward.