
Why Did Multiple HDFS DataNodes Crash? Memory, GC, and Block Overload Explained

This article analyzes a midnight HDFS DataNode failure caused by excessive GC and OOM due to Spark batch jobs, examines how an unexpected surge in block count overloaded default memory settings, and presents concrete remediation steps and optimization recommendations to stabilize the cluster.

Data Thinking Notes

Problem Description

In the early hours, a DataNode (hadoop24) in an HDFS cluster went down, and four other DataNodes (hadoop1, hadoop5, hadoop21, hadoop22) appeared to be deadlocked: their processes were still running, but they had stopped sending heartbeats to the NameNode, which marked them offline.

[Figure: HDFS DataNode status in the WebUI]

Investigation Process

2.1 Direct Causes

(1) Log inspection of the failed DataNode showed frequent GC and OOM events just before the crash.

[Figure: GC and OOM logs]

The initial hypothesis was that a Spark batch job running at night consumed excessive memory, affecting the DataNode. After restarting the DataNode around 10 am when resources were less constrained, the service recovered.

(2) About an hour after restart, the DataNode crashed again, this time without any Spark job running, indicating the issue stemmed from the DataNode’s own memory limits rather than resource sharing.

[Figure: DataNode crash with no Spark job running]

Further inspection of the HDFS Overview showed that the total block count had surged past 5.7 million (up from about 2 million the previous day), with a single DataNode holding more than 900,000 blocks. This far exceeded what the default 1 GB DataNode heap could manage, leading to the crashes.

[Figure: Block count overload in the HDFS Overview]

Resolution: Increase NameNode and DataNode memory to 3 GB and reduce Spark Worker memory allocation on the same nodes, then restart the cluster.
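A sketch of what this change could look like in hadoop-env.sh (variable names follow Hadoop 2.x, where the default WebUI port is 50070; Hadoop 3 renames them HDFS_NAMENODE_OPTS / HDFS_DATANODE_OPTS; the Spark Worker value is illustrative, not from the article):

```shell
# hadoop-env.sh — raise NameNode and DataNode heaps to 3 GB
export HADOOP_NAMENODE_OPTS="-Xms3G -Xmx3G $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Xms3G -Xmx3G $HADOOP_DATANODE_OPTS"

# spark-env.sh — shrink Spark Worker memory on the co-located nodes
export SPARK_WORKER_MEMORY=4g
```

Both HDFS and Spark need a restart for the new heap sizes to take effect.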

[Figure: Cluster status after the memory adjustment]

2.2 Cause of Block Count Explosion

Comparing the original table with the cleaned-data table for the same date, the block count jumped from 68 to 4,736. The Spark cleaning job read eight source tables and wrote each result into the same output table, creating many small files.

[Figure: Spark job generating many small files]

Solution: Union the eight RDDs, then call repartition(32) before writing to HDFS, which normalized the block count per partition.
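Why the union-plus-repartition fix collapses the file count can be shown with a small stand-alone simulation (plain Python, no Spark; the per-table partition counts are hypothetical):

```python
# Each cleaned source table is modeled as a list of output partitions;
# when written separately, every partition becomes at least one HDFS
# file, and every file occupies at least one block.
tables = [list(range(n)) for n in (200, 150, 300, 250, 180, 220, 160, 140)]

# Before the fix: each table's partitions are written out independently.
files_before = sum(len(t) for t in tables)

# After the fix: union all eight tables into one dataset, then
# repartition into 32 partitions, so the job writes exactly 32 files
# regardless of how many partitions the inputs had.
unioned = [row for t in tables for row in t]
files_after = 32  # repartition(32) fixes the output partition count

print(files_before, files_after)  # 1600 files shrink to 32
```

The same arithmetic explains the block-count drop in the article: fewer, larger output files mean fewer blocks for the NameNode and DataNodes to track.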

[Figure: Repartitioned Spark output]

After this change, the total block count returned to normal levels.

[Figure: Normalized block count]

Optimization Recommendations

(1) Add monitoring metrics for HDFS block count.
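One lightweight way to implement such a metric is to poll the NameNode's JMX endpoint, whose FSNamesystem bean exposes a BlocksTotal counter. A minimal sketch (the hadoop1 hostname and the 5,000,000 alert threshold are assumptions for illustration):

```python
import json
import urllib.request

def blocks_total(jmx_payload: dict) -> int:
    """Extract BlocksTotal from a parsed NameNode /jmx response."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == "Hadoop:service=NameNode,name=FSNamesystem":
            return int(bean["BlocksTotal"])
    raise KeyError("FSNamesystem bean not found in /jmx response")

def fetch_blocks_total(namenode: str = "http://hadoop1:50070") -> int:
    """Poll the NameNode JMX endpoint for the cluster block count.

    50070 is the Hadoop 2.x default WebUI port; Hadoop 3 uses 9870.
    """
    url = namenode + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    with urllib.request.urlopen(url) as resp:
        return blocks_total(json.load(resp))

# Example alerting rule (threshold chosen for illustration):
# if fetch_blocks_total() > 5_000_000: raise an alert to the on-call channel
```

Running this on a schedule would have surfaced the overnight jump from 2 million to 5.7 million blocks before the DataNodes started failing.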

(2) Align NameNode JVM parameters with the expected number of filesystem objects (files + blocks):

File object count 10,000,000:

-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M

File object count 20,000,000:

-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G

File object count 50,000,000:

-Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G

File object count 100,000,000:

-Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G

(3) Recommended JVM settings for a DataNode based on its average block count:

2,000,000 blocks:

-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M

5,000,000 blocks:

-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G
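The two sizing tables above can be encoded as a small lookup helper (a sketch only; the tier boundaries and JVM options are taken verbatim from the tables, and counts between tiers round up to the next documented tier):

```python
# (object_count_ceiling, JVM options) pairs from the tables above.
NAMENODE_TIERS = [
    (10_000_000, "-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M"),
    (20_000_000, "-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G"),
    (50_000_000, "-Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G"),
    (100_000_000, "-Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G"),
]

DATANODE_TIERS = [
    (2_000_000, "-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M"),
    (5_000_000, "-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G"),
]

def jvm_opts(count: int, tiers: list) -> str:
    """Return the JVM options of the smallest tier that covers `count`."""
    for ceiling, opts in tiers:
        if count <= ceiling:
            return opts
    raise ValueError(f"{count} exceeds the largest documented tier")
```

For example, a NameNode tracking 15 million objects falls into the 20-million tier, and a DataNode holding 900,000 blocks (as in this incident) already calls for the 6 GB tier rather than the 1 GB default.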

(4) When multiple tables write to the same result table in Spark, monitor the resulting block count and apply repartition before persisting if necessary.

Tags: Big Data, memory management, garbage collection, HDFS, Spark, DataNode, block overload
Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.