
Mastering HDFS Disk Balancer: Optimize DataNode Storage in Hadoop 3

This article explains the new HDFS disk balancer feature introduced in Hadoop 3, covering its purpose, supported volume‑selection policies, step‑by‑step usage, planning and execution commands, and how it helps maintain balanced storage across DataNode disks.


1. Introduction

Starting with CDH 5.8.2, HDFS includes a comprehensive storage capacity management approach for moving data across nodes, across storage types, and across the disks of a single DataNode. A DataNode stores blocks in local file system directories specified by <code>dfs.datanode.data.dir</code>; typically each directory (a volume) resides on a separate device such as an HDD or SSD.

When writing new blocks, DataNode selects a disk using a volume‑selection policy. Two policies are supported:

Round‑robin

Available space (HDFS‑1804)

The round‑robin policy distributes new blocks evenly across available disks, while the available‑space policy prefers disks with the highest free space percentage.
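As an illustration, switching a DataNode to the available-space policy is a one-property change in hdfs-site.xml. The property and class names below come from stock Hadoop; the fragment is printed here rather than installed, since where and how you deploy configuration depends on your cluster management tooling:

```shell
# Illustrative hdfs-site.xml fragment selecting the available-space
# volume-selection policy (HDFS-1804). Printed for reference only.
cat <<'EOF'
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
EOF
```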

By default, DataNode uses round-robin, but long-running clusters can still become imbalanced, for example after large file deletions or when a new disk is added. Even the available-space policy can cause trouble: it directs new blocks to the emptiest disk, which can turn a freshly added disk into a write hotspot.

To address this, the Apache Hadoop community developed an online disk balancer (HDFS‑1312) that rebalances volumes on a running DataNode without taking it offline.

2. How to Use the Disk Balancer

First, ensure <code>dfs.disk.balancer.enabled</code> is set to <code>true</code> on all DataNodes. In CDH 5.8.2+, this can be configured via the HDFS section in Cloudera Manager.
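For clusters managed by hand, the equivalent hdfs-site.xml fragment looks like the following. The property name is the real one from Hadoop; the fragment is printed rather than written to a file, since the file location depends on your installation:

```shell
# Illustrative hdfs-site.xml fragment enabling the disk balancer on a
# DataNode. Printed for reference only.
cat <<'EOF'
<property>
  <name>dfs.disk.balancer.enabled</name>
  <value>true</value>
</property>
EOF
```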

Example scenario: an existing data directory resides on <code>/mnt/disk1</code>, and a new disk is added and mounted as <code>/mnt/disk2</code>. Each HDFS data directory resides on a separate disk, which can be verified with <code>df</code>.
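A quick way to see the imbalance is to compare the Use% column across mounts; the mount points below are the example ones from this article, and on a real DataNode you would look for a large gap between the old and new data disks:

```shell
# Show usage for all mounted filesystems; on the example DataNode,
# /mnt/disk1 would show high Use% and the new /mnt/disk2 near 0%.
df -h
```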

The disk balancer workflow consists of three steps, all executed via the <code>hdfs diskbalancer</code> command: plan, execute, and query.

During planning, the HDFS client reads DataNode information from the NameNode and generates a JSON plan file that lists source and target volumes and the amount of data to move. The default planner is the GreedyPlanner, which moves data from the most-used device to the least-used until the distribution is even.

Users can set a space-utilization threshold; if the difference between disks falls below it, the planner considers them balanced. An optional <code>-bandwidth</code> flag limits the I/O impact of data moves.
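Putting the planning options together, an invocation might look like the following. The hostname is hypothetical, and the command is echoed rather than executed here so the shape of the invocation is clear without a live cluster:

```shell
# Hypothetical DataNode host; substitute one of your own.
DN=datanode1.example.com

# Consider disks balanced within 10%, and cap move bandwidth at 50 MB/s.
echo "hdfs diskbalancer -plan $DN -thresholdPercentage 10 -bandwidth 50"
```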

The generated plan is stored under <code>/system/diskbalancer</code>. To execute the plan on a DataNode, pass the plan file directly to <code>-execute</code>:

<code>hdfs diskbalancer -execute /system/diskbalancer/plan.json</code>

This submits the JSON plan to the DataNode, which executes it in a background <code>BlockMover</code> thread.

To check the task status, use the query command against the target DataNode:

<code>hdfs diskbalancer -query <datanode-host:port></code>

The output <code>PLAN_DONE</code> indicates completion. Verify effectiveness by running <code>df -h</code> again; the disk-usage difference between the volumes should now be below 10%.
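For unattended runs, the query step can be wrapped in a simple polling loop. In this sketch, <code>query_status</code> is a stand-in for <code>hdfs diskbalancer -query</code> against a real DataNode; here it just returns <code>PLAN_DONE</code> so the loop structure can be demonstrated without a cluster:

```shell
# Stand-in for `hdfs diskbalancer -query <datanode>`; replace with the
# real command on a live cluster.
query_status() { echo "PLAN_DONE"; }

# Poll until the DataNode reports PLAN_DONE.
until query_status | grep -q 'PLAN_DONE'; do
  sleep 60   # re-check once a minute
done
echo "plan finished"
```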

3. Summary

With the internal DataNode disk balancer introduced in HDFS‑1312, CDH 5.8.2+ provides a full storage capacity management solution that supports three types of data movement: across nodes (balancer), across storage types (Mover), and between disks within a single DataNode (disk balancer).

4. Acknowledgements

HDFS‑1312 was developed by Anu, Zhou Xiaobin, and Arpit Agarwal from Hortonworks, together with Lei (Eddy) Xu and Manoj Govindasamy from Cloudera.

Tags: Big Data, HDFS, Hadoop, Storage Management, Disk Balancer
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
