Understanding HDFS Data Blocks, Rack Awareness, and Dynamic Node Addition
This article explains how HDFS stores files as replicated data blocks and how rack awareness improves reliability and performance, walks through the required core-site.xml configuration and a sample topology script, and demonstrates how to add new DataNode machines without restarting the NameNode.
Correction: In the previous article the NM in the department‑head analogy should have been RM. Thanks to DN for the correction.
1. Data Block
HDFS is Hadoop's distributed file system, designed with MapReduce in mind: beyond the high reliability expected of any distributed file system, it must also deliver efficient read/write performance for MapReduce workloads. How does HDFS achieve this?
First, HDFS stores each file as a series of data blocks, each of which has multiple replicas distributed across different nodes. This block‑level storage plus replication strategy is key to HDFS’s reliability and performance because:
1. Because files are stored as blocks, reads can proceed at block granularity, which improves random-read efficiency and concurrent-read throughput.
2. Storing several replicas of each block on different machines ensures reliability and also increases concurrent‑read efficiency for the same block.
3. Splitting files into blocks aligns naturally with MapReduce's task-splitting model, and the replica placement policy is crucial for HDFS's high reliability and performance (you can inspect a file's actual blocks and replica locations with the command shown below).
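To make this concrete, fsck can list a file's blocks and the DataNodes holding each replica. A minimal sketch, assuming a running cluster; the path /user/test/demo.txt is illustrative:

#!/bin/bash
# Show the blocks that make up the file and the location of every replica.
# /user/test/demo.txt is an example path; substitute one from your cluster.
bin/hadoop fsck /user/test/demo.txt -files -blocks -locations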
2. Rack Awareness
HDFS uses a rack‑awareness strategy to improve data reliability, availability, and network‑bandwidth utilization. Through this process, the NameNode can determine the rack ID of each DataNode (which is why the NameNode stores DataNode information in a NetworkTopology data structure).
A simple, unoptimized policy places replicas on different racks, preventing data loss when an entire rack fails and allowing reads to take advantage of bandwidth from multiple racks. This policy distributes replicas evenly across the cluster, which helps balance load when components fail, but because a write operation must transfer data to multiple racks, the write cost increases.
In most cases the replication factor is 3. HDFS's placement strategy puts the first replica on a node in the local rack, the second on another node in the same rack, and the third on a node in a different rack. This cuts inter-rack data transfer and improves write efficiency. Since rack failures are far less common than node failures, the strategy does not compromise data reliability or availability, and because a block's replicas end up on only two distinct racks, the aggregate network bandwidth needed for reads is also reduced. Under this strategy replicas are not uniformly distributed across racks: one third of the replicas sit on a single node, two thirds sit within a single rack, and the remaining third is spread across the other racks. This improves write performance without harming reliability or read performance.
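The replication factor itself can also be adjusted per file after the fact. A small sketch, again with an illustrative path:

#!/bin/bash
# Raise (or lower) the replication factor of a file; -w waits until
# every block actually reaches the requested replica count.
bin/hadoop fs -setrep -w 3 /user/test/demo.txt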
3. Configuration
If rack awareness is not configured, the NameNode prints a log like the following:
2016-07-17 17:27:26,423 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.147.92:50010
Each IP's rack ID is "/default-rack". Enabling rack awareness requires adding the following entry to core-site.xml:
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>

The value points to a shell script that receives a DataNode's IP address (or hostname) and prints the corresponding rack ID. When the NameNode starts, it checks whether rack awareness is enabled; if so, it loads the script and, upon receiving a DataNode heartbeat, passes the node's IP to the script and stores the returned rack ID in an internal map. A simple script example:
#!/bin/bash
# Map each node IP or hostname passed on the command line to a rack ID,
# using the lookup table in topology.data.
HADOOP_CONF=/etc/hadoop          # directory holding topology.data

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec < ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    # Match on either the IP (column 1) or the hostname (column 2).
    if [ "${ar[0]}" = "$nodeArg" ] || [ "${ar[1]}" = "$nodeArg" ] ; then
      result="${ar[2]}"
    fi
  done
  shift
  # Fall back to the default rack if the node is not in the table.
  # The trailing space keeps the output whitespace-separated when the
  # NameNode passes several nodes in one invocation.
  if [ -z "${result}" ] ; then
    echo -n "/default-rack "
  else
    echo -n "${result} "
  fi
done

The topology.data file has the following format (node IP, node hostname, rack ID):
192.168.147.91 tbe192168147091 /dc1/rack1
192.168.147.92 tbe192168147092 /dc1/rack1
192.168.147.93 tbe192168147093 /dc1/rack2
192.168.147.94 tbe192168147094 /dc1/rack3
192.168.147.95 tbe192168147095 /dc1/rack3
192.168.147.96 tbe192168147096 /dc1/rack3

You can view the resulting rack layout with ./hadoop dfsadmin -printTopology.
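Before relying on the script in production, it is worth invoking it by hand. A quick check against the script and topology.data shown above (10.0.0.1 stands in for an arbitrary unlisted address):

#!/bin/bash
# A known node should resolve to its rack from topology.data:
/etc/hadoop/topology.sh 192.168.147.93
# prints: /dc1/rack2

# An unlisted node should fall back to the default rack:
/etc/hadoop/topology.sh 10.0.0.1
# prints: /default-rack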
4. Dynamically Adding Nodes
How can a new DataNode be added to the cluster without restarting the NameNode? In a rack‑aware cluster you can do the following:
Assume the Hadoop cluster is deployed on 192.168.147.68 with both NameNode and DataNode, rack awareness is enabled, and bin/hadoop dfsadmin -printTopology shows:
Rack: /dc1/rack1
192.168.147.68:50010 (dbj68)

Now suppose you want to add a DataNode at 192.168.147.69, located in rack2, without restarting the NameNode. First, edit the NameNode's topology.data file and add the new entry:
192.168.147.68 dbj68 /dc1/rack1
192.168.147.69 dbj69 /dc1/rack2

Then start the DataNode process on the new machine with sbin/hadoop-daemon.sh start datanode (hadoop-daemon.sh starts the daemon on the local node; the plural hadoop-daemons.sh would act on every node in the slaves file). After it starts, running bin/hadoop dfsadmin -printTopology on any node shows:
Rack: /dc1/rack1
192.168.147.68:50010 (dbj68)
Rack: /dc1/rack2
192.168.147.69:50010 (dbj69)

This shows that Hadoop has recognized the newly added node. If the new node's entry is not added to topology.data first, the DataNode logs will contain errors and the node will fail to start.
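As a further check, you can confirm the new DataNode has registered and is reporting capacity:

#!/bin/bash
# Summarize the cluster: live/dead DataNodes with per-node capacity and usage;
# the newly added node should appear in the output.
bin/hadoop dfsadmin -report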
