
How to Set Up Hadoop Java Development on Windows and Access HDFS via Java API

This guide walks through installing Hadoop on Windows, configuring environment variables and XML files, adding the required winutils binaries, verifying the setup with HDFS shell commands, and then building a Maven project that uses the Java API to list and inspect files in HDFS.


Install Hadoop on Windows using the same version as the CentOS cluster (2.8.0). Download the tarball, extract it with 7‑zip (first to .tar, then to a directory), and move the resulting folder to a path without Chinese characters or spaces.

Download the matching hadoop.dll and winutils.exe (e.g., from https://github.com/cdarlint/winutils) and copy them into the Hadoop bin directory.

Add the Hadoop bin and sbin directories to the system PATH environment variable.
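If the IDE or a Java process does not inherit the updated PATH, the Hadoop client libraries may still fail to locate winutils.exe. A minimal sketch of a common workaround, assuming the install directory D:\SoftWare\hadoop-2.8.0 used later in this guide, is to set hadoop.home.dir programmatically before any FileSystem call:

// Sketch: point the Hadoop client at the local install so it can locate
// bin\winutils.exe even when HADOOP_HOME/PATH is not visible to the JVM.
// The path below is the install directory used in this guide; adjust as needed.
public class WindowsHadoopHome {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "D:\\SoftWare\\hadoop-2.8.0");
        // Any FileSystem call made after this point can find the winutils binaries.
    }
}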

Configuration files

hadoop-env.cmd: set JAVA_HOME to the JDK path, e.g., set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_202.

core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.148.128:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>D:\SoftWare\hadoop-2.8.0\hdfs\tmp</value>
    </property>
</configuration>
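If core-site.xml is also placed on the Java project's classpath (for example under src/main/resources), a plain new Configuration() picks up fs.defaultFS automatically. A quick sketch to confirm which NameNode address the client resolves:

import org.apache.hadoop.conf.Configuration;

// Sketch: prints the fs.defaultFS value the client will use. With core-site.xml
// on the classpath this should be hdfs://192.168.148.128:9000; without it, the
// value falls back to the local default (file:///).
public class ShowDefaultFs {
    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        System.out.println(configuration.get("fs.defaultFS"));
    }
}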

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
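dfs.replication is a client-side default applied when new files are written; existing files can also be inspected or re-replicated through the Java API. A small sketch, using a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect and change the replication factor of an existing HDFS file.
// The file path below is only an illustration.
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://192.168.148.128:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        Path file = new Path("/some/existing/file.txt"); // hypothetical path
        System.out.println(fileSystem.getFileStatus(file).getReplication()); // current replicas
        fileSystem.setReplication(file, (short) 2); // ask the NameNode for 2 replicas of this file
        fileSystem.close();
    }
}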

mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
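This guide only reads HDFS, but if a MapReduce job is later submitted from the same Windows machine, mapreduce.framework.name is what routes it to YARN rather than the local runner. A minimal sketch of how the setting is consumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: a Job built from a Configuration with mapreduce.framework.name=yarn
// (set here explicitly, or loaded from mapred-site.xml on the classpath)
// is submitted to the YARN ResourceManager instead of running locally.
public class FrameworkNameSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(configuration, "framework-check");
        System.out.println(job.getConfiguration().get("mapreduce.framework.name"));
    }
}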

yarn-site.xml (replace all ResourceManager addresses with the master IP 192.168.148.128):

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.148.128:18040</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>192.168.148.128:18030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>192.168.148.128:18025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>192.168.148.128:18141</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>192.168.148.128:18088</value>
    </property>
</configuration>

Verify the installation in a Windows cmd prompt:

hadoop version

List the HDFS root directory:

hdfs dfs -ls /

Maven project for Java API access

Create a Maven project in IntelliJ IDEA. The pom.xml must contain Hadoop dependencies matching the cluster version (2.8.0) and JUnit for testing:

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs-client</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Create package com.badao.hdfsdemo and class hellohdfs with the following source:

package com.badao.hdfsdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.io.IOException;

public class hellohdfs {
    public static void main(String[] args) throws IOException {
        FileSystem fileSystem = getFileSystem();
        RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new Path("/"), true); // true = recursive: descend into subdirectories (files only)
        while (listFiles.hasNext()) {
            LocatedFileStatus status = listFiles.next();
            System.out.println(status.getPath().getName()); // file name
            System.out.println(status.getLen()); // length
            System.out.println(status.getPermission()); // permission
            System.out.println(status.getOwner()); // owner
            System.out.println(status.getGroup()); // group
            System.out.println(status.getModificationTime()); // modification time
            BlockLocation[] blockLocations = status.getBlockLocations();
            for (BlockLocation blockLocation : blockLocations) {
                String[] hosts = blockLocation.getHosts();
                for (String host : hosts) {
                    System.out.println(host);
                }
            }
        }
        fileSystem.close();
    }

    /** Obtain the HDFS FileSystem instance. */
    public static FileSystem getFileSystem() throws IOException {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://192.168.148.128:9000");
        configuration.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        System.setProperty("HADOOP_USER_NAME", "root");
        return FileSystem.get(configuration);
    }
}
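Because the pom already pulls in JUnit, the same connectivity check can be written as a test. A minimal sketch (the test class and method names are illustrative):

package com.badao.hdfsdemo;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Assert;
import org.junit.Test;

// Sketch: reuses getFileSystem() from hellohdfs and asserts that the HDFS root
// is reachable from the Windows workstation.
public class HelloHdfsTest {
    @Test
    public void rootIsReachable() throws Exception {
        FileSystem fileSystem = hellohdfs.getFileSystem();
        Assert.assertTrue(fileSystem.exists(new Path("/")));
        fileSystem.close();
    }
}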

Key configuration lines in getFileSystem():

configuration.set("fs.defaultFS", "hdfs://192.168.148.128:9000")

must match the NameNode address defined in core-site.xml.

configuration.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")

binds the hdfs:// scheme to DistributedFileSystem and prevents the runtime error No FileSystem for scheme: hdfs.

System.setProperty("HADOOP_USER_NAME", "root")

makes the client act as user root, avoiding the AccessControlException that otherwise occurs when the local Windows user has no permissions in HDFS.
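If setting a system property is undesirable, the same effect can be achieved by passing the user to FileSystem.get directly; a sketch of an alternative factory method (the class and method names are illustrative):

package com.badao.hdfsdemo;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: this FileSystem.get overload takes the NameNode URI and the remote
// user name directly, so neither fs.defaultFS on the Configuration nor the
// HADOOP_USER_NAME system property is required.
public class HelloHdfsAsUser {
    public static FileSystem getFileSystemAsRoot() throws Exception {
        Configuration configuration = new Configuration();
        return FileSystem.get(URI.create("hdfs://192.168.148.128:9000"), configuration, "root");
    }
}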

Running the main method prints the name, size, permissions, owner, group, modification timestamp, and the host nodes for each block of every file under the HDFS root, confirming that the Java API can successfully interact with the Hadoop cluster from a Windows workstation.
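For reference, listFiles(new Path("/"), true) descends recursively and returns only files; the closer Java analogue of hdfs dfs -ls / is listStatus, which lists a single level and includes directories. A sketch:

package com.badao.hdfsdemo;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: non-recursive listing of the HDFS root, including directories,
// reusing getFileSystem() from hellohdfs.
public class ListRootSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fileSystem = hellohdfs.getFileSystem();
        for (FileStatus status : fileSystem.listStatus(new Path("/"))) {
            System.out.println((status.isDirectory() ? "d " : "- ") + status.getPath().getName());
        }
        fileSystem.close();
    }
}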
