
Mastering HDFS Monitoring on JD Cloud: Key Metrics, Tools, and Best Practices

This article presents a comprehensive guide to monitoring Hadoop Distributed File System (HDFS) on JD Cloud, covering challenges, recommended toolchains, essential metrics, configuration tips, and real‑world case studies to help engineers ensure reliability and performance of large‑scale data clusters.

dbaplus Community

HDFS Monitoring Challenges

HDFS is part of the Hadoop ecosystem, so a monitoring solution must also cover YARN, HBase, Hive and other related services.

The HDFS JMX/HTTP API exposes a large number of metrics. Only a subset needs real‑time collection, but critical metrics must be immediately available during incidents.

Log collection from Hadoop components (NameNode, DataNode, JournalNode, etc.) is essential for troubleshooting and audit.

The monitoring stack should provide fault‑location metrics such as DataNode health, block replication status and rack awareness.

Monitoring Stack Selection

In practice the most widely used products (CDH, Ambari) cannot be customized for specific Hadoop versions and often do not scale to very large clusters. The chosen open‑source stack therefore consists of:

Metrics collection: HadoopExporter and the Hadoop HTTP JMX endpoint (e.g., http://{domain}:{port}/jmx).

Log aggregation: ELK (Elasticsearch + Logstash + Kibana) for global log search and keyword‑based error detection.

Time‑series storage: Prometheus.

Visualization: Grafana dashboards, the native HDFS UI and Hue.

Alerting: Integration with a cloud‑based alert system (e.g., JD Cloud alerts).
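As a minimal sketch of the metrics‑collection step, the snippet below pulls beans from the JMX endpoint and keeps only a chosen subset. The helper names (fetch_jmx, find_bean, pick_metrics) are illustrative and are not part of HadoopExporter; the code assumes the endpoint returns JSON with a top‑level "beans" list and accepts a qry parameter to filter by bean name.

```python
import json
from urllib.request import urlopen


def fetch_jmx(base_url, query=None, timeout=5):
    """Fetch the JMX servlet output and return its list of beans.

    base_url is something like "http://{domain}:{port}" (an assumption:
    substitute the NameNode or DataNode HTTP address of your cluster).
    """
    url = f"{base_url}/jmx"
    if query:
        url += f"?qry={query}"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["beans"]


def find_bean(beans, name):
    """Return the first bean whose 'name' field matches, or None."""
    for bean in beans:
        if bean.get("name") == name:
            return bean
    return None


def pick_metrics(bean, keys):
    """Keep only the requested keys that are present in the bean."""
    return {k: bean[k] for k in keys if k in bean}
```

For example, fetch_jmx("http://{domain}:{port}", "Hadoop:service=NameNode,name=FSNamesystem") would be expected to return the FSNamesystem bean, from which pick_metrics can extract MissingBlocks, UnderReplicatedBlocks and FilesTotal for real‑time collection.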

Key HDFS Monitoring Metrics

1. Overview of Main Metrics

HDFS main metrics overview

2. Black‑Box Metrics

These metrics verify the end‑to‑end file lifecycle (create, read, modify, delete) and detect functional anomalies.

Write the current timestamp into a test file, then read it back and compare. A mismatch signals a functional failure, and the time elapsed between write and read indicates write latency.

Ensure temporary test files are cleaned up, otherwise a large number of short‑lived files can exhaust NameNode heap.
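The black‑box check described above can be sketched as a small probe. This is an illustration under assumptions, not the authors' implementation: run_probe and its write/read/delete callables are hypothetical, and in practice they might wrap an HDFS client or shell commands such as hdfs dfs -put, -cat and -rm. The measured value is the write‑plus‑read round trip, and the finally block removes the test file even when the probe fails.

```python
import time


def run_probe(write, read, delete, path, clock=time.time):
    """Write a timestamp, read it back, and always clean up the test file.

    Returns a dict with 'ok' (bool) and 'latency' (seconds, or None).
    """
    stamp = repr(clock())  # repr() round-trips a float exactly
    ok = False
    latency = None
    try:
        write(path, stamp)
        if read(path) == stamp:
            # Round-trip time from just before the write to just after the read.
            latency = clock() - float(stamp)
            ok = True
    except OSError:
        ok = False
    finally:
        # Clean up so short-lived probe files do not pile up on the NameNode.
        try:
            delete(path)
        except OSError:
            pass
    return {"ok": ok, "latency": latency}
```

Passing the storage operations in as callables keeps the probe logic testable without a cluster; an in‑memory dictionary is enough to exercise both the success and the failure path.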

3. White‑Box Metrics

Block‑related errors

MissingBlocks: Indicates block loss and potential file corruption. Monitor UnderReplicatedBlocks to anticipate missing‑block risk.

Unavailable DataNode ratio: A high ratio shrinks the pool of healthy DataNodes available for block placement. The selection logic (see BlockPlacementPolicyDefault.isGoodTarget) checks node liveness and available space.
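One way to derive the unavailable DataNode ratio is from the live, dead and stale node counts that the NameNode reports (for example NumLiveDataNodes, NumDeadDataNodes and NumStaleDataNodes). The function below is a sketch: unavailable_ratio is a hypothetical helper, and counting stale nodes as unavailable is a design choice made here, not something the article prescribes.

```python
def unavailable_ratio(live, dead, stale=0):
    """Fraction of known DataNodes that are dead or stale.

    Assumes stale nodes are a subset of live nodes, so the total is
    live + dead and the unavailable count is dead + stale.
    """
    total = live + dead
    if total == 0:
        return 0.0
    return min(1.0, (dead + stale) / total)
```

With 95 live and 5 dead nodes the ratio is 0.05; an alert threshold on this value is cluster‑specific and would need tuning.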

Log keyword monitoring

Search Hadoop logs for common exception keywords such as IOException, NoRouteToHostException, SafeModeException, UnknownHostException.
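In the stack described here this search runs in ELK, but the idea can be illustrated with a few lines of Python. The sketch below counts how many log lines mention each keyword; count_keywords is a hypothetical helper and the sample log lines in the usage note are made up.

```python
from collections import Counter

# Exception keywords named in the text above.
KEYWORDS = (
    "IOException",
    "NoRouteToHostException",
    "SafeModeException",
    "UnknownHostException",
)


def count_keywords(lines, keywords=KEYWORDS):
    """Count the log lines that contain each keyword (substring match)."""
    counts = Counter()
    for line in lines:
        for kw in keywords:
            if kw in line:
                counts[kw] += 1
    return counts
```

Given two lines containing java.io.IOException and one containing java.net.UnknownHostException, the counter would report 2 and 1 respectively; an alert rule could then fire when a count exceeds a chosen threshold within a time window.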

UnderReplicatedBlocks

Counts blocks that have fewer replicas than the configured replication factor, typically caused by DataNode failures or network partitions.

Full Garbage Collection (FGC)

Track FGC events; frequent full GCs indicate memory pressure on the NameNode JVM.

Read/Write Success Rate

Collected via monitor_write.status and monitor_read.status. These counters are used to compute SLA‑level success ratios.
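A success ratio can be computed from the probe results along these lines. The sketch assumes each sample of monitor_write.status or monitor_read.status is 1 for success and 0 for failure; the article does not spell out the encoding, so treat that as an assumption.

```python
def success_rate(statuses):
    """Return the fraction of successful probes, or None with no samples.

    Each status is assumed to be 1 for success and 0 for failure.
    """
    if not statuses:
        return None
    successes = sum(1 for s in statuses if s == 1)
    return successes / len(statuses)
```

Returning None for an empty window keeps "no data" distinct from "0 % success", which matters when the probe itself stops reporting.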

Disk Failure (NumFailedVolumes)

Example: a 1 000‑node cluster with 12 disks per node has 12 000 disks. At an average quarterly failure rate of 1.65 % (Backblaze data), about 198 disks fail per quarter, or roughly 66 per month. Automated detection and repair workflows are therefore required.

Capacity (PercentUsed)

Tracks overall space usage.

Be aware of “phantom” free space caused by decommissioned nodes whose capacity is still counted.

Reserve space using dfs.datanode.du.reserved, dfs.datanode.du.reserved.calculator or dfs.datanode.du.reserved.pct.

Formula:

Configured Capacity = Total Disk Space – Reserved Space = Remaining Space + DFS Used + Non‑DFS Used

NameNode Heap Usage

Metric: HeapMemoryUsage.used / HeapMemoryUsage.committed. High heap usage slows NameNode startup and raises FGC risk. Mitigation strategies include:

Increase the heap allocation.

Implement file‑lifecycle management to delete obsolete files.

Merge small files to reduce metadata overhead.

Deploy HDFS Federation for horizontal scaling.
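The heap‑usage ratio described above can be computed from the JVM memory bean (java.lang:type=Memory) returned by the JMX endpoint. The sketch assumes HeapMemoryUsage arrives as a nested object with used and committed fields; heap_usage_ratio is an illustrative helper name.

```python
def heap_usage_ratio(memory_bean):
    """Return HeapMemoryUsage.used / HeapMemoryUsage.committed, or None.

    memory_bean is the parsed java.lang:type=Memory bean as a dict.
    """
    heap = memory_bean["HeapMemoryUsage"]
    committed = heap["committed"]
    if committed <= 0:
        return None
    return heap["used"] / committed
```

A sustained ratio close to 1.0 would be the signal to apply one of the mitigation strategies listed above.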

Data Balance

Measured by the standard deviation of space usage across DataNodes. An imbalance can degrade performance and increase risk of data loss. Prior to Hadoop 3.0 the balancer only moved whole blocks between nodes, not between disks on the same node.
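The balance measure can be sketched with the standard library. The function below takes a mapping of DataNode name to percentage of space used (how those values are collected is left open here) and returns the population standard deviation; usage_spread is a hypothetical name.

```python
from statistics import pstdev


def usage_spread(percent_used_by_node):
    """Population standard deviation of per-DataNode space usage (in %)."""
    values = list(percent_used_by_node.values())
    if not values:
        return 0.0
    return pstdev(values)
```

Two nodes at 40 % and 60 % give a spread of 10 percentage points, while a perfectly balanced cluster gives 0. The population form is used because the input covers every DataNode rather than a sample.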

RPC Queue Length

Metric: CallQueueLength. A growing queue indicates back‑pressure on the NameNode RPC layer.

File Count (FilesTotal)

Each filesystem object consumes ~150 bytes of NameNode heap. Monitoring FilesTotal helps estimate when the NameNode will run out of memory.
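A back‑of‑the‑envelope estimate follows directly from the ~150 bytes figure. The sketch below treats files, directories and blocks as objects of equal cost, which is a simplification; FilesTotal covers files and directories, and passing a block count (for example BlocksTotal) is optional. The function name is illustrative.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above


def estimate_metadata_heap_bytes(files_total, blocks_total=0,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Rough NameNode heap needed for namespace objects, in bytes."""
    return (files_total + blocks_total) * bytes_per_object
```

For instance, 100 million files and directories plus 100 million blocks come to about 30 GB of heap under this rule of thumb, before any JVM overhead is considered.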

Decommissioning DataNodes (NumDecommissioningDataNodes)

Tracking this metric helps plan capacity and avoid unnecessary cost in large clusters.

Generic server health

Include JVM metrics, CPU, memory, and the health of dependent services such as ZooKeeper and DNS.

Implementation Details

Grafana dashboards (custom HDFS template) are used for service inspection and fault location. The official HDFS Grafana template provides a limited set of metrics, so additional panels are added for the metrics listed above.

Grafana HDFS dashboard

The ELK‑Hadoop stack enables full‑text search across all Hadoop component logs and keyword‑based alerting.

ELK log search

Hue and the native HDFS UI provide interactive file‑system browsing and quick access to block‑placement diagnostics.

Case Studies

Case 1 – DNS Data Corruption Causing NameNode HA Failure

Detection: SLA metric anomalies and functional monitoring alerts.

Root cause: Corrupt DNS records produced wrong hostnames, causing HA failover to target an unreachable NameNode.

Remediation: Deploy a reliable internal DNS service (e.g., Dnsmasq) and avoid ad‑hoc edits to /etc/hosts.

Case 2 – Improper Rack Grouping Preventing Writes

Detection: Intermittent write‑error alerts from functional monitoring.

Root cause: Rack awareness was enabled but rack groups were unevenly provisioned, exhausting storage on some racks and leaving no suitable DataNode for new blocks.

Remediation: Balance instance counts across racks, monitor rack‑group capacity, or disable rack awareness for small clusters.

Custom Monitoring Tasks

Repository with example HDFS monitoring jobs and scripts:

https://github.com/cloud-op/monitor

