Understanding HDFS Data Blocks, Rack Awareness, and Dynamic Node Addition
This article explains how HDFS stores files as replicated data blocks and how rack awareness improves reliability and performance, walks through the required core-site.xml configuration and a sample topology script, and demonstrates how to add new DataNode machines without restarting the NameNode.
Correction: In the previous article the NM in the department‑head analogy should have been RM. Thanks to DN for the correction.
1. Data Block
HDFS is Hadoop's distributed file system, designed with MapReduce in mind: beyond the high reliability expected of any distributed file system, it must also deliver efficient read/write performance for MapReduce workloads. How does HDFS achieve this?
First, HDFS stores each file as a series of data blocks, each of which has multiple replicas distributed across different nodes. This block‑level storage plus replication strategy is key to HDFS’s reliability and performance because:
1. Because files are stored as blocks, reads can proceed at block granularity, which improves random-read efficiency and concurrent-read throughput.
2. Storing several replicas of each block on different machines ensures reliability and also increases concurrent‑read efficiency for the same block.
3. Splitting files into blocks aligns naturally with MapReduce's task-splitting model, and the replica placement policy is crucial for HDFS's high reliability and performance (you can inspect a file's actual blocks and replica locations with the command shown below).
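To make this concrete, fsck can list a file's blocks and the DataNodes holding each replica. A minimal sketch, assuming a running cluster; the path /user/test/demo.txt is illustrative:

#!/bin/bash
# Show the blocks that make up the file and the location of every replica.
# /user/test/demo.txt is an example path; substitute one from your cluster.
bin/hadoop fsck /user/test/demo.txt -files -blocks -locations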
2. Rack Awareness
HDFS uses a rack‑awareness strategy to improve data reliability, availability, and network‑bandwidth utilization. Through this process, the NameNode can determine the rack ID of each DataNode (which is why the NameNode stores DataNode information in a NetworkTopology data structure).
A simple, unoptimized policy places replicas on different racks, preventing data loss when an entire rack fails and allowing reads to take advantage of bandwidth from multiple racks. This policy distributes replicas evenly across the cluster, which helps balance load when components fail, but because a write operation must transfer data to multiple racks, the write cost increases.
In most cases the replication factor is 3. HDFS's placement strategy puts the first replica on a node in the local rack, the second on another node in the same rack, and the third on a node in a different rack. This cuts inter-rack data transfer and improves write efficiency. Since rack failures are far less common than node failures, the strategy does not compromise data reliability or availability, and because a block's replicas end up on only two distinct racks, the aggregate network bandwidth needed for reads is also reduced. Under this strategy replicas are not uniformly distributed across racks: one third of the replicas sit on a single node, two thirds sit within a single rack, and the remaining third is spread across the other racks. This improves write performance without harming reliability or read performance.
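The replication factor itself can also be adjusted per file after the fact. A small sketch, again with an illustrative path:

#!/bin/bash
# Raise (or lower) the replication factor of a file; -w waits until
# every block actually reaches the requested replica count.
bin/hadoop fs -setrep -w 3 /user/test/demo.txt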
3. Configuration
If rack awareness is not configured, the NameNode prints a log like the following:
2016-07-17 17:27:26,423 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.147.92:50010
Each IP's rack ID is "/default-rack". Enabling rack awareness requires adding the following entry to core-site.xml:
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>

The value points to a shell script that receives a DataNode's IP address (or hostname) and prints the corresponding rack ID. When the NameNode starts, it checks whether rack awareness is enabled; if so, it loads the script and, upon receiving a DataNode heartbeat, passes the node's IP to the script and stores the returned rack ID in an internal map. A simple script example:
#!/bin/bash
# Map each node IP or hostname passed on the command line to a rack ID,
# using the lookup table in topology.data.
HADOOP_CONF=/etc/hadoop          # directory holding topology.data

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec < ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    # Match on either the IP (column 1) or the hostname (column 2).
    if [ "${ar[0]}" = "$nodeArg" ] || [ "${ar[1]}" = "$nodeArg" ] ; then
      result="${ar[2]}"
    fi
  done
  shift
  # Fall back to the default rack if the node is not in the table.
  # The trailing space keeps the output whitespace-separated when the
  # NameNode passes several nodes in one invocation.
  if [ -z "${result}" ] ; then
    echo -n "/default-rack "
  else
    echo -n "${result} "
  fi
done

The topology.data file has the following format (node IP, node hostname, rack ID):
192.168.147.91 tbe192168147091 /dc1/rack1
192.168.147.92 tbe192168147092 /dc1/rack1
192.168.147.93 tbe192168147093 /dc1/rack2
192.168.147.94 tbe192168147094 /dc1/rack3
192.168.147.95 tbe192168147095 /dc1/rack3
192.168.147.96 tbe192168147096 /dc1/rack3

You can view the resulting rack layout with ./hadoop dfsadmin -printTopology.
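Before relying on the script in production, it is worth invoking it by hand. A quick check against the script and topology.data shown above (10.0.0.1 stands in for an arbitrary unlisted address):

#!/bin/bash
# A known node should resolve to its rack from topology.data:
/etc/hadoop/topology.sh 192.168.147.93
# prints: /dc1/rack2

# An unlisted node should fall back to the default rack:
/etc/hadoop/topology.sh 10.0.0.1
# prints: /default-rack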
4. Dynamically Adding Nodes
How can a new DataNode be added to the cluster without restarting the NameNode? In a rack‑aware cluster you can do the following:
Assume the Hadoop cluster is deployed on 192.168.147.68 with both NameNode and DataNode, rack awareness is enabled, and bin/hadoop dfsadmin -printTopology shows:
Rack: /dc1/rack1
192.168.147.68:50010 (dbj68)

Now suppose you want to add a DataNode at 192.168.147.69, located in rack2, without restarting the NameNode. First, edit the NameNode's topology.data file and add the new entry:
192.168.147.68 dbj68 /dc1/rack1
192.168.147.69 dbj69 /dc1/rack2

Then start the DataNode process on the new machine with sbin/hadoop-daemon.sh start datanode (hadoop-daemon.sh starts the daemon on the local node; the plural hadoop-daemons.sh would act on every node in the slaves file). After it starts, running bin/hadoop dfsadmin -printTopology on any node shows:
Rack: /dc1/rack1
192.168.147.68:50010 (dbj68)
Rack: /dc1/rack2
192.168.147.69:50010 (dbj69)

This shows that Hadoop has recognized the newly added node. If the new node's entry is not added to topology.data first, the DataNode logs will contain errors and the node will fail to start.
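As a further check, you can confirm the new DataNode has registered and is reporting capacity:

#!/bin/bash
# Summarize the cluster: live/dead DataNodes with per-node capacity and usage;
# the newly added node should appear in the output.
bin/hadoop dfsadmin -report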
