Master HDFS: Theory, Shell Commands, and Java API Hands‑On Guide
This comprehensive tutorial explains HDFS fundamentals, its metadata management and advantages, then walks you through setting up a Hadoop environment, executing core shell commands, and using the Java API with complete code examples, enabling you to confidently operate HDFS in practice.
HDFS Fundamentals
HDFS (Hadoop Distributed File System) stores large data sets across multiple machines, providing high reliability through data replication and high read throughput through parallel access.
Data Replication
Files are split into blocks, and each block is stored on several DataNodes. Example layout for four blocks across DataNodes A, B, C and D, with three replicas per block:
Block 1: A B C
Block 2: A B D
Block 3: B C D
Block 4: A C D
Replication ensures fault tolerance and improves concurrent reads.
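The fault-tolerance claim can be checked against the example layout with a few lines of Java. This is a minimal sketch, not HDFS code: the class name ReplicaLayout and its methods are hypothetical, and the layout is simply copied from the example rather than produced by HDFS's real placement policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReplicaLayout {
    // Block-to-DataNode layout copied from the example (three replicas per block).
    static final Map<String, List<String>> LAYOUT = new LinkedHashMap<>();
    static {
        LAYOUT.put("Block 1", Arrays.asList("A", "B", "C"));
        LAYOUT.put("Block 2", Arrays.asList("A", "B", "D"));
        LAYOUT.put("Block 3", Arrays.asList("B", "C", "D"));
        LAYOUT.put("Block 4", Arrays.asList("A", "C", "D"));
    }

    // For each block, lists the replicas that remain when the given nodes are down.
    static Map<String, List<String>> replicasAfterFailure(Set<String> failedNodes) {
        Map<String, List<String>> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : LAYOUT.entrySet()) {
            List<String> alive = new ArrayList<>(entry.getValue());
            alive.removeAll(failedNodes);
            remaining.put(entry.getKey(), alive);
        }
        return remaining;
    }

    // True if every block still has at least one live replica.
    static boolean allBlocksReadable(Set<String> failedNodes) {
        for (List<String> alive : replicasAfterFailure(failedNodes).values()) {
            if (alive.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(replicasAfterFailure(Collections.singleton("A")));
        Set<String> twoDown = new HashSet<>(Arrays.asList("A", "B"));
        System.out.println("Readable with A and B down: " + allBlocksReadable(twoDown));
    }
}
```

With three replicas per block, any two DataNodes can fail and every block still has a live copy; losing three nodes (for example A, B and C) makes Block 1 unreadable.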
Metadata Management
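The NameNode keeps the namespace and the file‑to‑block mappings in memory and persists them in a metadata file and an edit log. Block locations are held in memory only and are rebuilt from the block reports that DataNodes send. DataNodes store the actual block bytes.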
Client → HDFS → NameNode → DataNode
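To make the two kinds of metadata concrete, here is a toy model of the lookup a client triggers. It is only a sketch under stated assumptions: MiniNameNode, addFile, blockReport and locate are hypothetical names, and a real NameNode tracks far more state (permissions, leases, replication targets).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.TreeSet;

class MiniNameNode {
    // Namespace: file path -> ordered block IDs.
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> DataNodes holding a replica, filled in from DataNode reports.
    private final Map<Long, Set<String>> blockLocations = new HashMap<>();

    void addFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
    }

    // A DataNode announces which blocks it currently stores.
    void blockReport(String dataNode, List<Long> blockIds) {
        for (Long id : blockIds) {
            blockLocations.computeIfAbsent(id, k -> new TreeSet<String>()).add(dataNode);
        }
    }

    // Answers a client lookup: for each block of the file, where are its replicas?
    List<Set<String>> locate(String path) {
        List<Long> blocks = fileToBlocks.get(path);
        if (blocks == null) {
            throw new NoSuchElementException("No such file: " + path);
        }
        List<Set<String>> result = new ArrayList<>();
        for (Long id : blocks) {
            result.add(blockLocations.getOrDefault(id, Collections.<String>emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        MiniNameNode nameNode = new MiniNameNode();
        nameNode.addFile("/test/mytest.txt", Arrays.asList(1L, 2L));
        nameNode.blockReport("A", Arrays.asList(1L, 2L));
        nameNode.blockReport("B", Arrays.asList(1L));
        System.out.println(nameNode.locate("/test/mytest.txt")); // prints [[A, B], [A]]
    }
}
```

The client receives only locations from this lookup; the block bytes themselves are then fetched directly from the DataNodes.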
Advantages
Linear capacity scaling
High reliability via replication
Unified namespace simplifies user access
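The scaling and replication trade-off comes down to simple arithmetic, sketched below. The class CapacityMath is hypothetical; the 128 MB block size and replication factor of 3 used in main are the Hadoop 2.x defaults, and the disk sizes are made-up illustration values.

```java
class CapacityMath {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Number of blocks needed for a file; the last block may be only partly filled.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw disk consumed across the cluster: every byte is stored once per replica.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    // Usable capacity grows linearly with the number of DataNodes.
    static long usableBytes(int dataNodes, long diskBytesPerNode, int replication) {
        return (long) dataNodes * diskBytesPerNode / replication;
    }

    public static void main(String[] args) {
        long file = 300 * MB;
        System.out.println("Blocks: " + blockCount(file, 128 * MB));
        System.out.println("Raw MB used: " + rawBytes(file, 3) / MB);
        System.out.println("Usable GB, 4 nodes: " + usableBytes(4, 3000 * GB, 3) / GB);
        System.out.println("Usable GB, 8 nodes: " + usableBytes(8, 3000 * GB, 3) / GB);
    }
}
```

Doubling the number of DataNodes doubles the usable capacity, while each file costs its size multiplied by the replication factor in raw disk.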
Practical Environment Setup
To experiment with HDFS you can use a pre‑packaged Hadoop 2.7.3 virtual machine. Setup steps:
Install VirtualBox (download from https://www.virtualbox.org/wiki/Downloads).
Install Vagrant.
Download the Hadoop VM image (e.g., a .box file).
Add the box to Vagrant: vagrant box add hadoop D:\hadoop.box
Create a working directory (e.g., D:\hdfstest) and initialize:
cd D:\hdfstest
vagrant init hadoop
vagrant up
After the VM boots, SSH into it using the displayed IP, port 22, username root and password vagrant.
Shell Command Operations
Start HDFS inside the VM:
start-dfs.sh
Common commands:
hdfs dfs -help – show help
hdfs dfs -ls /test – list directory
hdfs dfs -mkdir /test or hdfs dfs -mkdir -p /aa/bb – create directories (-p also creates missing parents)
hdfs dfs -put localPath hdfsPath – upload file
hdfs dfs -cat /test/mytest.txt – display file
hdfs dfs -get /test/mytest.txt ./mytest2.txt – download file
hdfs dfs -getmerge /test/log.* ./log – merge matching files and download them as one local file
hdfs dfs -cp /test/mytest.txt /aa/mytest.txt.2 – copy within HDFS
hdfs dfs -mv /aa/mytest.txt.2 /aa/bb – move file
hdfs dfs -rm -r /aa/bb/mytest.txt.2 – delete
hdfs dfs -chmod 666 /test/mytest.txt and hdfs dfs -chown user:group /test/mytest.txt – change permissions and ownership
hdfs dfs -df -h / – check space
hdfs dfs -du -s -h /test – directory size
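The octal mode given to -chmod corresponds to the rwx string that -ls prints for each path. The helper below is a small illustration of that mapping; ModeString and toSymbolic are hypothetical names, not part of Hadoop.

```java
class ModeString {
    // Converts a three-digit octal mode such as "666" into the rwx form shown by -ls.
    static String toSymbolic(String octal) {
        if (!octal.matches("[0-7]{3}")) {
            throw new IllegalArgumentException("Expected three octal digits: " + octal);
        }
        StringBuilder symbolic = new StringBuilder();
        for (char digit : octal.toCharArray()) {
            int bits = digit - '0';
            symbolic.append((bits & 4) != 0 ? 'r' : '-'); // read
            symbolic.append((bits & 2) != 0 ? 'w' : '-'); // write
            symbolic.append((bits & 1) != 0 ? 'x' : '-'); // execute
        }
        return symbolic.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic("666")); // prints rw-rw-rw-
        System.out.println(toSymbolic("755")); // prints rwxr-xr-x
    }
}
```

So -chmod 666 grants read and write access to the owner, the group and everyone else, with no execute bit.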
Java API Operations
To let a client outside the VM reach HDFS, edit /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml on the VM and set fs.defaultFS to hdfs://VM_IP:9000, where VM_IP is the VM's IP address, then restart HDFS.
Create a Maven project hdfstest with a pom.xml and source directory src/main/java. Sample programs:
Ls.java – list files recursively
Mkdir.java – create directories
Put.java – upload a file
Get.java – download a file
Del.java – delete a file
Rename.java – rename a path
StreamGet.java – read part of a file using an input stream
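The listings for these programs are not reproduced here, so the following is a condensed sketch of the same operations in a single class rather than the original code. It assumes the hadoop-client 2.7.3 dependency is declared in pom.xml, that the HDFS URI is passed as the first argument (for example through -Dexec.args), that the HDFS user is root as in the VM login, and that a local file mytest.txt longer than 5 bytes exists in the working directory. The class name HdfsTour is hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;

public class HdfsTour {
    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the fs.defaultFS value, i.e. the hdfs:// URI of the VM.
        URI uri = URI.create(args[0]);
        Configuration conf = new Configuration();

        // Assumption: act as HDFS user "root", matching the VM login.
        try (FileSystem fs = FileSystem.get(uri, conf, "root")) {
            // Mkdir: create a directory including missing parents.
            fs.mkdirs(new Path("/aa/bb"));

            // Put: upload a local file.
            fs.copyFromLocalFile(new Path("mytest.txt"), new Path("/test/mytest.txt"));

            // Ls: list files recursively.
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/test"), true);
            while (files.hasNext()) {
                System.out.println(files.next().getPath());
            }

            // StreamGet: skip the first 5 bytes, then copy at most 10 bytes to stdout.
            try (FSDataInputStream in = fs.open(new Path("/test/mytest.txt"))) {
                in.seek(5);
                IOUtils.copyBytes(in, System.out, 10L, false);
                System.out.println();
            }

            // Get: download the file.
            fs.copyToLocalFile(new Path("/test/mytest.txt"), new Path("mytest2.txt"));

            // Rename: move the file to a new path.
            fs.rename(new Path("/test/mytest.txt"), new Path("/aa/bb/mytest.txt"));

            // Del: delete the file (false = not recursive).
            fs.delete(new Path("/aa/bb/mytest.txt"), false);
        }
    }
}
```

Every operation goes through the same FileSystem handle, so splitting the sketch into one class per operation mainly means repeating the connection setup.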
Compile and run with Maven:
mvn compile
mvn exec:java -Dexec.mainClass="ClassName" -Dexec.cleanupDaemonThreads=false
Write Mechanism
When a client writes a file, it contacts the NameNode to obtain a list of DataNodes for each block. The client then streams each block to the first DataNode in the list, which forwards it to the next, forming a pipeline (e.g., A → B → C). Each block is replicated according to the configured replication factor.
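The store-and-forward idea can be simulated in memory. This is a rough sketch, not the real protocol: WritePipeline and its nested DataNode class are hypothetical, the same pipeline is reused for every block (the NameNode may pick different DataNodes per block), and acknowledgements are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WritePipeline {
    // One simulated DataNode: stores the blocks it receives, keyed by block index.
    static class DataNode {
        final String name;
        final Map<Integer, byte[]> blocks = new HashMap<>();

        DataNode(String name) {
            this.name = name;
        }
    }

    // The head of the pipeline stores the block, then forwards it to the rest.
    static void sendThroughPipeline(int blockIndex, byte[] data, List<DataNode> pipeline) {
        if (pipeline.isEmpty()) {
            return;
        }
        pipeline.get(0).blocks.put(blockIndex, data.clone());
        sendThroughPipeline(blockIndex, data, pipeline.subList(1, pipeline.size()));
    }

    // Splits the file into fixed-size blocks and writes each through the pipeline.
    static int write(byte[] file, int blockSize, List<DataNode> pipeline) {
        int blockCount = 0;
        for (int offset = 0; offset < file.length; offset += blockSize) {
            int end = Math.min(offset + blockSize, file.length);
            sendThroughPipeline(blockCount, Arrays.copyOfRange(file, offset, end), pipeline);
            blockCount++;
        }
        return blockCount;
    }

    public static void main(String[] args) {
        List<DataNode> pipeline = Arrays.asList(
                new DataNode("A"), new DataNode("B"), new DataNode("C"));
        byte[] file = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        int blocks = write(file, 4, pipeline); // tiny block size, for illustration only
        System.out.println(blocks + " blocks, each stored on " + pipeline.size() + " nodes");
    }
}
```

The client sends each block only once; the copies on B and C are produced by forwarding between DataNodes, which keeps the client's outbound traffic independent of the replication factor.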
Read Mechanism
The client asks the NameNode for block locations, selects nearby DataNodes, opens sockets, retrieves block data, buffers locally, and writes to the target file until the entire file is read.
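The read path can be sketched the same way. In this simplified model the distance of each DataNode from the client is a made-up integer, and ReadPath, nearest and readFile are hypothetical names; real HDFS derives proximity from the network topology.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReadPath {
    // Picks the replica closest to the client. The distance values are hypothetical.
    static String nearest(List<String> replicas, Map<String, Integer> distance) {
        String best = null;
        for (String node : replicas) {
            if (best == null || distance.get(node) < distance.get(best)) {
                best = node;
            }
        }
        if (best == null) {
            throw new IllegalStateException("Block has no live replica");
        }
        return best;
    }

    // Fetches block i from its nearest replica and appends the bytes in block order.
    static byte[] readFile(List<List<String>> blockLocations,
                           Map<String, Map<Integer, byte[]>> storage,
                           Map<String, Integer> distance) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int i = 0; i < blockLocations.size(); i++) {
            String node = nearest(blockLocations.get(i), distance);
            byte[] block = storage.get(node).get(i);
            buffer.write(block, 0, block.length);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] block0 = "hello ".getBytes(StandardCharsets.UTF_8);
        byte[] block1 = "hdfs".getBytes(StandardCharsets.UTF_8);

        // What each DataNode stores: block index -> bytes.
        Map<String, Map<Integer, byte[]>> storage = new HashMap<>();
        for (String node : Arrays.asList("A", "B", "C")) {
            storage.put(node, new HashMap<Integer, byte[]>());
        }
        storage.get("A").put(0, block0);
        storage.get("B").put(0, block0);
        storage.get("B").put(1, block1);
        storage.get("C").put(1, block1);

        // What the NameNode would return: replica locations per block.
        List<List<String>> locations = Arrays.asList(
                Arrays.asList("A", "B"), Arrays.asList("B", "C"));

        Map<String, Integer> distance = new HashMap<>();
        distance.put("A", 2);
        distance.put("B", 1);
        distance.put("C", 0);

        byte[] file = readFile(locations, storage, distance);
        System.out.println(new String(file, StandardCharsets.UTF_8)); // prints hello hdfs
    }
}
```

Here block 0 is read from B and block 1 from C, because those are the closest replicas of each block.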
NameNode Mechanism
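The NameNode stores metadata in three forms: in‑memory structures, a persistent metadata file, and an edit log. Periodically the edit log is merged into the metadata file to produce a new snapshot. Because this merge is costly, it is offloaded to a Secondary NameNode (SecondaryNameNode). The snapshot it keeps can help recover metadata, but the Secondary NameNode is a checkpointing helper, not a hot standby that takes over when the NameNode fails.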
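The relationship between the snapshot and the edit log can be shown with a toy replay function. This is a sketch under simplifying assumptions: the namespace is reduced to a set of paths, the log entries use a made-up "ADD path" / "DEL path" format, and the class name Checkpoint is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class Checkpoint {
    // Applies logged operations ("ADD path" or "DEL path") to a copy of the image.
    static Set<String> replay(Set<String> image, List<String> editLog) {
        Set<String> result = new TreeSet<>(image);
        for (String entry : editLog) {
            String[] parts = entry.split(" ", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
            if (parts[0].equals("ADD")) {
                result.add(parts[1]);
            } else if (parts[0].equals("DEL")) {
                result.remove(parts[1]);
            } else {
                throw new IllegalArgumentException("Unknown operation: " + entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> image = new TreeSet<>(Arrays.asList("/test"));
        List<String> editLog = new ArrayList<>(
                Arrays.asList("ADD /aa", "ADD /aa/bb", "DEL /test"));

        // Checkpoint: merge the log into a new image, then start an empty log.
        Set<String> newImage = replay(image, editLog);
        editLog.clear();
        editLog.add("ADD /logs"); // changes after the checkpoint go to the fresh log

        // Restart: namespace = latest image + edits logged since that image.
        System.out.println(replay(newImage, editLog)); // prints [/aa, /aa/bb, /logs]
    }
}
```

Without checkpoints the log grows without bound and every restart has to replay all of it, which is why the merge is worth running periodically.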
dbaplus Community
