Master HDFS: Theory, Shell Commands, and Java API Hands‑On Guide

This comprehensive tutorial explains HDFS fundamentals, its metadata management and advantages, then walks you through setting up a Hadoop environment, executing core shell commands, and using the Java API with complete code examples, enabling you to confidently operate HDFS in practice.

dbaplus Community

HDFS Fundamentals

HDFS (Hadoop Distributed File System) stores large data sets across multiple machines. It achieves high reliability by replicating data, and high throughput by letting clients read from many machines in parallel.

Data Replication

Files are split into blocks, and each block is stored on several DataNodes. Example layout with four DataNodes (A, B, C, D) and three replicas per block:

Block 1: A B C

Block 2: A B D

Block 3: B C D

Block 4: A C D

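Replication provides fault tolerance and improves concurrent reads: in the layout above, losing any single DataNode still leaves at least two replicas of every block, and different readers can be served by different replicas.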
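The fault-tolerance claim can be checked against the example layout with a small model. This is only an illustrative sketch: ReplicaLayout and its methods are hypothetical names, not part of the Hadoop API, and the layout is hard-coded from the example above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: not part of the Hadoop API.
class ReplicaLayout {

    // Block -> DataNodes holding a replica, copied from the example layout.
    static Map<String, List<String>> exampleLayout() {
        Map<String, List<String>> layout = new LinkedHashMap<>();
        layout.put("Block 1", Arrays.asList("A", "B", "C"));
        layout.put("Block 2", Arrays.asList("A", "B", "D"));
        layout.put("Block 3", Arrays.asList("B", "C", "D"));
        layout.put("Block 4", Arrays.asList("A", "C", "D"));
        return layout;
    }

    // Smallest number of replicas any block still has after one DataNode fails.
    static int minReplicasAfterFailure(Map<String, List<String>> layout, String failedNode) {
        int min = Integer.MAX_VALUE;
        for (List<String> holders : layout.values()) {
            int alive = 0;
            for (String node : holders) {
                if (!node.equals(failedNode)) {
                    alive++;
                }
            }
            min = Math.min(min, alive);
        }
        return min;
    }

    public static void main(String[] args) {
        Map<String, List<String>> layout = exampleLayout();
        for (String node : Arrays.asList("A", "B", "C", "D")) {
            System.out.println("If DataNode " + node + " fails, every block keeps at least "
                    + minReplicasAfterFailure(layout, node) + " replicas");
        }
    }
}
```

Whichever single DataNode is removed, the method reports two surviving replicas as the minimum, because each node is absent from exactly one of the four blocks.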

Metadata Management

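The NameNode keeps the namespace, the file‑to‑block mappings and the block locations in memory. The namespace and the file‑to‑block mappings are also persisted in a metadata file (the fsimage) and an edit log; block locations are not persisted and are rebuilt from the block reports that DataNodes send. DataNodes store the actual block bytes.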

Client → HDFS → NameNode → DataNode

Advantages

Linear capacity scaling

High reliability via replication

Unified namespace simplifies user access

Practical Environment Setup

To experiment with HDFS you can use a pre‑packaged Hadoop 2.7.3 virtual machine. Setup steps:

Install VirtualBox (download from https://www.virtualbox.org/wiki/Downloads).

Install Vagrant.

Download the Hadoop VM image (e.g., a .box file).

Add the box to Vagrant:

vagrant box add hadoop D:\hadoop.box

Create a working directory (e.g., D:\hdfstest) and initialize:

cd D:\hdfstest
vagrant init hadoop
vagrant up

After the VM boots, SSH into it using the displayed IP, port 22, username root and password vagrant.

Shell Command Operations

Start HDFS inside the VM:

start-dfs.sh

Common commands:

hdfs dfs -help – show help

hdfs dfs -ls /test – list a directory

hdfs dfs -mkdir /test or hdfs dfs -mkdir -p /aa/bb – create a directory (-p also creates missing parent directories)

hdfs dfs -put localPath hdfsPath – upload a file

hdfs dfs -cat /test/mytest.txt – display a file

hdfs dfs -get /test/mytest.txt ./mytest2.txt – download a file

hdfs dfs -getmerge /test/log.* ./log – merge several files and download the result

hdfs dfs -cp /test/mytest.txt /aa/mytest.txt.2 – copy within HDFS

hdfs dfs -mv /aa/mytest.txt.2 /aa/bb – move a file

hdfs dfs -rm -r /aa/bb/mytest.txt.2 – delete

hdfs dfs -chmod 666 /test/mytest.txt – change permissions

hdfs dfs -chown user:group /test/mytest.txt – change owner and group

hdfs dfs -df -h / – check free space

hdfs dfs -du -s -h /test – show the size of a directory

Java API Operations

Configure the client to reach the VM by editing /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml and setting fs.defaultFS to hdfs://VM_IP:9000 (VM_IP stands for the VM's IP address), then restart HDFS.

Create a Maven project hdfstest with a pom.xml and source directory src/main/java. Sample programs:

Ls.java – list files recursively

Mkdir.java – create directories

Put.java – upload a file

Get.java – download a file

Del.java – delete a file

Rename.java – rename a path

StreamGet.java – read part of a file using an input stream
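A condensed sketch of the operations these programs perform is given here as a single class using the Hadoop FileSystem API. Several things are assumptions rather than details from the article: the class name HdfsTour is hypothetical, the hadoop-client 2.7.3 dependency is assumed to be declared in pom.xml, the NameNode URI (hdfs://VM_IP:9000) is passed as the first argument, a local file mytest.txt is assumed to exist in the working directory, and the client connects as user root, which is reasonable only on a throwaway test VM. The paths are borrowed from the shell examples.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Hypothetical class name; requires hadoop-client on the classpath and a running HDFS.
public class HdfsTour {
    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the NameNode URI, e.g. hdfs://VM_IP:9000
        URI uri = URI.create(args[0]);
        Configuration conf = new Configuration();

        // Assumption: act as user "root", matching the VM login used earlier.
        try (FileSystem fs = FileSystem.get(uri, conf, "root")) {
            // Mkdir: create /aa/bb, including missing parents
            fs.mkdirs(new Path("/aa/bb"));

            // Put: upload a local file
            fs.copyFromLocalFile(new Path("mytest.txt"), new Path("/test/mytest.txt"));

            // Ls: list files recursively from the root
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/"), true);
            while (files.hasNext()) {
                LocatedFileStatus status = files.next();
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }

            // StreamGet: read only the first bytes of the file through a stream
            try (FSDataInputStream in = fs.open(new Path("/test/mytest.txt"))) {
                byte[] buf = new byte[16];
                int n = in.read(buf); // may return fewer bytes, or -1 for an empty file
                if (n > 0) {
                    System.out.write(buf, 0, n);
                    System.out.println();
                }
            }

            // Get: download the file
            fs.copyToLocalFile(new Path("/test/mytest.txt"), new Path("mytest2.txt"));

            // Rename: move the file to a new path
            fs.rename(new Path("/test/mytest.txt"), new Path("/aa/mytest.txt.2"));

            // Del: delete a single file (false = not recursive)
            fs.delete(new Path("/aa/mytest.txt.2"), false);
        }
    }
}
```

With the exec plugin, the URI argument can be supplied through -Dexec.args="hdfs://VM_IP:9000".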

Compile and run with Maven:

mvn compile
mvn exec:java -Dexec.mainClass="ClassName" -Dexec.cleanupDaemonThreads=false

Write Mechanism

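When a client writes a file, it asks the NameNode for a list of DataNodes for each block. The client then streams the block to the first DataNode in that list, which forwards the data to the second, and so on down a pipeline (e.g., A → B → C); acknowledgements travel back along the pipeline to the client. The length of the pipeline equals the configured replication factor.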
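The pipeline idea can be modelled in a few lines. This is a toy sketch, not the Hadoop API or the real wire protocol: WritePipelineSketch, Node and the packet strings are all hypothetical, and acknowledgements are reduced to a count returned up the call chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: not the Hadoop API and not the real wire protocol.
class WritePipelineSketch {

    // A stand-in DataNode that stores packets and forwards them downstream.
    static class Node {
        final String name;
        final List<String> stored = new ArrayList<>();
        Node next; // next DataNode in the pipeline, or null for the last one

        Node(String name) {
            this.name = name;
        }

        // Store the packet, forward it, and return how many nodes acknowledged it.
        int receive(String packet) {
            stored.add(packet);
            int downstreamAcks = (next == null) ? 0 : next.receive(packet);
            return downstreamAcks + 1;
        }
    }

    // Chain the nodes in order; the client only ever talks to the first one.
    static Node buildPipeline(List<Node> nodes) {
        for (int i = 0; i < nodes.size() - 1; i++) {
            nodes.get(i).next = nodes.get(i + 1);
        }
        return nodes.get(0);
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(new Node("A"), new Node("B"), new Node("C"));
        Node head = buildPipeline(nodes);
        for (String packet : Arrays.asList("packet-1", "packet-2")) {
            int acks = head.receive(packet);
            System.out.println(packet + " acknowledged by " + acks + " DataNodes");
        }
        for (Node node : nodes) {
            System.out.println(node.name + " stored " + node.stored);
        }
    }
}
```

The point of the design is that the client sends each packet once: the DataNodes, not the client, carry the cost of producing the additional replicas.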

Read Mechanism

When a client reads a file, it asks the NameNode for the block locations. For each block it selects a nearby DataNode that holds a replica, opens a socket to it, retrieves the block data, buffers it locally, and writes it to the target file, repeating until the entire file has been read.
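The read path can be sketched the same way. Everything in this model is hypothetical: ReadPathSketch is not a Hadoop class, the numeric distances stand in for whatever notion of network proximity the cluster uses, and byte arrays in maps stand in for blocks on DataNodes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: not the Hadoop API.
class ReadPathSketch {

    // Pick the replica holder with the smallest (hypothetical) network distance.
    static String nearest(List<String> holders, Map<String, Integer> distance) {
        String best = holders.get(0);
        for (String node : holders) {
            if (distance.get(node) < distance.get(best)) {
                best = node;
            }
        }
        return best;
    }

    // Read the blocks in order, each from its nearest holder, into one buffer.
    static byte[] readFile(List<List<String>> blockLocations,
                           Map<String, Map<Integer, byte[]>> dataNodes,
                           Map<String, Integer> distance) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int blockId = 0; blockId < blockLocations.size(); blockId++) {
            String node = nearest(blockLocations.get(blockId), distance);
            byte[] data = dataNodes.get(node).get(blockId);
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }

    // Build a two-block file spread over DataNodes A to D and read it back.
    static String demo() {
        // What the stand-in NameNode would return: replica holders per block.
        List<List<String>> locations = Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("B", "C", "D"));
        Map<String, Integer> distance = new HashMap<>();
        distance.put("A", 4);
        distance.put("B", 2);
        distance.put("C", 0);
        distance.put("D", 2);

        String[] contents = {"hello ", "hdfs"};
        Map<String, Map<Integer, byte[]>> dataNodes = new HashMap<>();
        for (int blockId = 0; blockId < locations.size(); blockId++) {
            byte[] bytes = contents[blockId].getBytes(StandardCharsets.UTF_8);
            for (String node : locations.get(blockId)) {
                dataNodes.computeIfAbsent(node, k -> new HashMap<>()).put(blockId, bytes);
            }
        }
        return new String(readFile(locations, dataNodes, distance), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that the stand-in NameNode only hands out locations; the block bytes come from the DataNode maps, mirroring the split between metadata and data.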

NameNode Mechanism

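The NameNode stores metadata in three forms: in‑memory structures, a persistent metadata file, and an edit log that records every change made since that file was written. Periodically the edit log is merged into the metadata file to produce a new snapshot (a checkpoint). Because the merge is costly, it is offloaded to the Secondary NameNode (SecondaryNameNode). The Secondary NameNode therefore holds a recent copy of the metadata, but it is a checkpoint helper rather than a hot standby: it does not take over automatically if the NameNode fails.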
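The relationship between the metadata file and the edit log can be illustrated with a toy model. CheckpointSketch is a hypothetical class, not Hadoop code: the image is reduced to a set of paths, and the log to MKDIR and DELETE records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model only: not the Hadoop API or the real on-disk formats.
class CheckpointSketch {
    // Stand-in for the persistent metadata file: the paths in the namespace.
    final Set<String> image = new TreeSet<>();
    // Stand-in for the edit log: operations recorded since the last checkpoint.
    final List<String> editLog = new ArrayList<>();

    void mkdir(String path) {
        editLog.add("MKDIR " + path);
    }

    void delete(String path) {
        editLog.add("DELETE " + path);
    }

    // Replay the logged operations over an image without modifying it.
    static Set<String> replay(Set<String> image, List<String> edits) {
        Set<String> result = new TreeSet<>(image);
        for (String edit : edits) {
            String[] parts = edit.split(" ", 2);
            if (parts[0].equals("MKDIR")) {
                result.add(parts[1]);
            } else if (parts[0].equals("DELETE")) {
                result.remove(parts[1]);
            }
        }
        return result;
    }

    // The live namespace is always the last image plus the edits made since.
    Set<String> currentNamespace() {
        return replay(image, editLog);
    }

    // Checkpoint: store the merged image and start a fresh, empty edit log.
    void checkpoint() {
        Set<String> merged = replay(image, editLog);
        image.clear();
        image.addAll(merged);
        editLog.clear();
    }

    public static void main(String[] args) {
        CheckpointSketch nn = new CheckpointSketch();
        nn.mkdir("/test");
        nn.mkdir("/aa");
        nn.delete("/aa");
        System.out.println("before checkpoint: image=" + nn.image + " edits=" + nn.editLog.size());
        nn.checkpoint();
        System.out.println("after checkpoint:  image=" + nn.image + " edits=" + nn.editLog.size());
    }
}
```

The longer the log grows between checkpoints, the more work the replay step has to do, which is why the merge is worth moving off the NameNode.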
