How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)

This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, Zookeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.

JD Cloud Developers

Feb 23, 2023

1. Overview of the Runtime Environment

Before diving into big‑data technologies, it is worth building a personal local Hadoop and Spark environment. The original article lists the software packages, tool versions, deployment topology, and per‑node process status in tables, which were published as images and are not reproduced here.

2. Basic System Preparation

Step 1: Install CentOS 7 on VirtualBox (details omitted).

Step 2: Configure the host

Set the hostname: vim /etc/hostname

Configure /etc/hosts

Install JDK 8u212, extract it to /usr/local/jdk/jdk1.8.0_212, and set JAVA_HOME in /etc/profile
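
The host configuration above could be sketched roughly as follows. The hostnames master, slave1, and slave2 and their mapping to the three IPs are assumptions (the article only shows master and slave1 by name), and the script writes fragments to a scratch directory for review instead of editing /etc directly.

```shell
#!/bin/sh
# Sketch: stage the /etc/hosts entries and the JAVA_HOME lines for review.
# Assumption: master/slave1/slave2 map to 192.168.0.20/.21/.22 in that order.

STAGE_DIR="${STAGE_DIR:-$(mktemp -d)}"   # scratch directory, not /etc

# Name resolution entries, identical on all three nodes.
cat > "$STAGE_DIR/hosts.fragment" <<'EOF'
192.168.0.20 master
192.168.0.21 slave1
192.168.0.22 slave2
EOF

# JDK location taken from the article; PATH line is a common convention.
cat > "$STAGE_DIR/profile.fragment" <<'EOF'
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_212
export PATH=$JAVA_HOME/bin:$PATH
EOF

echo "Review the fragments in $STAGE_DIR, then append them as root, e.g.:"
echo "  cat $STAGE_DIR/hosts.fragment >> /etc/hosts"
echo "  cat $STAGE_DIR/profile.fragment >> /etc/profile"
```

After appending the profile fragment, running source /etc/profile followed by java -version is a quick way to confirm the JDK is picked up.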

Copy the VM twice to obtain three machines with IPs 192.168.0.20, .21, .22, and configure password‑less SSH.
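
The article does not show the SSH commands, so the following is only a sketch of one common way to set up password‑less login. The root user and the hostnames are assumptions; by default the script prints the commands instead of running them (set DRY_RUN=0 to execute).

```shell
#!/bin/sh
# Sketch: password-less SSH from the current node to every node.
# Assumptions: user root, hostnames master/slave1/slave2.

NODES="master slave1 slave2"
SSH_USER="root"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_passwordless_ssh() {
  # Create a key pair with an empty passphrase (skip if one already exists).
  run ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
  # Install the public key on every node, including this one.
  for node in $NODES; do
    run ssh-copy-id "$SSH_USER@$node"
  done
}

setup_passwordless_ssh
```

The same steps would be repeated on each node that needs to reach the others, which for the start scripts is at least the master.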

Install Zookeeper 3.4.10, extract to /usr/local/zookeeper, set ZOOKEEPER_HOME, create data/myid files (1, 2, 3) on the three nodes, and start Zookeeper with zkServer.sh start.
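
As an illustration of the Zookeeper step, the sketch below stages a zoo.cfg and one myid file per node in a scratch directory. The data directory path, the hostnames, and the timing values (which mirror the sample configuration shipped with Zookeeper) are assumptions, not settings taken from the article.

```shell
#!/bin/sh
# Sketch: stage zoo.cfg and the per-node myid files for a three-node ensemble.
# Assumptions: hostnames master/slave1/slave2, data dir under /usr/local/zookeeper.

ZK_STAGE="${ZK_STAGE:-$(mktemp -d)}"
ZK_DATA_DIR="/usr/local/zookeeper/data"   # assumed location of the myid file

{
  echo "tickTime=2000"
  echo "initLimit=10"
  echo "syncLimit=5"
  echo "dataDir=$ZK_DATA_DIR"
  echo "clientPort=2181"
} > "$ZK_STAGE/zoo.cfg"

id=1
for node in master slave1 slave2; do
  # server.<id> must match the number stored in that node's myid file.
  echo "server.$id=$node:2888:3888" >> "$ZK_STAGE/zoo.cfg"
  mkdir -p "$ZK_STAGE/$node"
  echo "$id" > "$ZK_STAGE/$node/myid"
  id=$((id + 1))
done

echo "Staged files in $ZK_STAGE"
```

Once the files are in place and zkServer.sh start has run on all three nodes, zkServer.sh status should report one leader and two followers.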

3. Hadoop Installation & Deployment

3.1 Install Hadoop

Upload hadoop-3.3.4.tar.gz, extract to /usr/local/hadoop/hadoop-3.3.4, and add JDK environment variables to /etc/profile.

Edit the six core configuration files:

hadoop-env.sh – set JAVA_HOME

core-site.xml – temporary directory and Zookeeper quorum

hdfs-site.xml – HDFS settings

mapred-site.xml – MapReduce and DFS permissions

yarn-site.xml – YARN resource scheduler

workers – list of worker node hostnames
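
The article's actual file contents are not shown, so the sketch below only illustrates three of the six files, restricted to the settings named above. hadoop.tmp.dir and ha.zookeeper.quorum are standard Hadoop property names; the temporary directory path, the hostnames, and the Zookeeper port 2181 are assumptions. A working cluster also needs fs.defaultFS and the HA settings in hdfs-site.xml, which are left out here because their values are not given.

```shell
#!/bin/sh
# Sketch: stage hadoop-env.sh, core-site.xml and workers content for review.
# Assumptions: hostnames master/slave1/slave2, tmp dir /usr/local/hadoop/tmp.

HADOOP_STAGE="${HADOOP_STAGE:-$(mktemp -d)}"

# Line to add to hadoop-env.sh (JDK path from the article).
echo 'export JAVA_HOME=/usr/local/jdk/jdk1.8.0_212' > "$HADOOP_STAGE/hadoop-env.fragment"

# core-site.xml: only the two settings the article mentions.
cat > "$HADOOP_STAGE/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value> <!-- assumed path -->
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>master:2181,slave1:2181,slave2:2181</value> <!-- assumed hosts -->
  </property>
</configuration>
EOF

# workers: one worker hostname per line.
printf '%s\n' master slave1 slave2 > "$HADOOP_STAGE/workers"

echo "Staged files in $HADOOP_STAGE"
```

The staged files would then be merged into /usr/local/hadoop/hadoop-3.3.4/etc/hadoop on every node.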

3.2 Start Hadoop

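Run the following commands; where given, the target nodes are noted in parentheses:

hadoop-daemon.sh start journalnode

hadoop-daemon.sh start namenode (master & slave1)

hadoop-daemon.sh start datanode (all nodes)

start-yarn.sh (master)

hdfs zkfc -formatZK

hadoop-daemon.sh start zkfc (master)

In Hadoop 3.x, hadoop-daemon.sh still works but prints a deprecation warning; the current form is hdfs --daemon start <daemon>. On a first start, the NameNode also has to be formatted with hdfs namenode -format (and the standby initialized with hdfs namenode -bootstrapStandby) before it will run; the commands above do not show those steps.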

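Verify the HDFS web UI at http://192.168.0.20:50070 (Active) and http://192.168.0.21:50070 (Standby). Port 50070 presumably comes from the author's hdfs-site.xml; an unconfigured Hadoop 3.x NameNode serves its UI on port 9870.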
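
Besides the web UI, running jps on each node is a quick way to see which daemons are up. The helper below is a small sketch that reads jps output on standard input and prints any expected daemon that is missing. The function name is made up for illustration, and which daemons belong on which node depends on the deployment topology, so the expected names are passed as arguments.

```shell
#!/bin/sh
# Sketch: report expected JVM daemons that do not appear in jps output.
# Usage on a node: jps | missing_daemons NameNode DataNode JournalNode

missing_daemons() {
  jps_output=$(cat)
  for daemon in "$@"; do
    # jps prints "<pid> <MainClass>"; compare against the second column.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Example with canned output instead of a live jps call.
printf '2101 NameNode\n2230 JournalNode\n2544 Jps\n' |
  missing_daemons NameNode DataNode JournalNode
```

An empty result means every daemon named on the command line was found.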

3.3 Verify HDFS Usage

Typical commands:

hdfs dfs -ls /

hdfs dfs -mkdir /input

hdfs dfs -put ./test.txt /input

hdfs dfs -get /input/test.txt ./tmp

hdfs dfs -text /input/test.txt
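
The commands above assume a local test.txt already exists. A minimal round‑trip check might look like the sketch below: it creates a small sample file (the contents are made up) and, only if the hdfs command is on the PATH, uploads it and compares what HDFS returns with the local copy.

```shell
#!/bin/sh
# Sketch: create a sample file and verify an HDFS upload/download round trip.

WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
printf 'hello hadoop\nhello spark\n' > "$WORK_DIR/test.txt"

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /input                      # -p: no error if it exists
  hdfs dfs -put -f "$WORK_DIR/test.txt" /input   # -f: overwrite an old copy
  # Compare the bytes stored in HDFS with the local file.
  if hdfs dfs -cat /input/test.txt | cmp -s - "$WORK_DIR/test.txt"; then
    echo "round trip OK"
  else
    echo "round trip FAILED"
  fi
else
  echo "hdfs not on PATH; only created $WORK_DIR/test.txt"
fi
```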

4. Spark Installation & Deployment

4.1 Install Scala

Upload and extract the Scala package, verify with scala -version, and copy the /usr/local/scala directory and /etc/profile to the slave machines via scp.
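
The copy step could be scripted along these lines. The root user and the slave hostnames are assumptions, and the script only prints the scp commands unless DRY_RUN=0 is set. Note that copying /etc/profile overwrites the file on the target, which is safe here only because the slaves are clones of the same VM.

```shell
#!/bin/sh
# Sketch: push the Scala directory and /etc/profile to the other nodes.
# Assumptions: user root, hostnames slave1/slave2.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

distribute_scala() {
  for node in slave1 slave2; do
    run scp -r /usr/local/scala "root@$node:/usr/local/"
    run scp /etc/profile "root@$node:/etc/profile"
  done
}

distribute_scala
```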

4.2 Install Spark

Upload spark-3.3.1.tgz, extract to /usr/local/spark, edit spark-env.sh to set JAVA_HOME and SCALA_HOME, and create a workers file listing the three node hostnames.
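
For illustration, the two Spark files could be staged as below. The SCALA_HOME value and the hostnames are assumptions; the article does not give the extracted Scala directory name, so adjust the path to match the actual installation.

```shell
#!/bin/sh
# Sketch: stage spark-env.sh and workers for the Spark conf directory.
# Assumptions: hostnames master/slave1/slave2, Scala under /usr/local/scala.

SPARK_STAGE="${SPARK_STAGE:-$(mktemp -d)}"

cat > "$SPARK_STAGE/spark-env.sh" <<'EOF'
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_212
# Assumed path: point this at the extracted Scala directory.
export SCALA_HOME=/usr/local/scala
EOF

# workers: one hostname per line, each runs a Spark Worker.
printf '%s\n' master slave1 slave2 > "$SPARK_STAGE/workers"

echo "Copy both files into /usr/local/spark/spark-3.3.1/conf on every node."
```

Spark ships spark-env.sh.template and workers.template in its conf directory, which can serve as starting points for the same two files.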

4.3 Start Spark

On the master node:

cd /usr/local/spark/spark-3.3.1/sbin

./start-all.sh

On slave1, start a second Master process from the same sbin directory:

./start-master.sh

Open the Spark master web UI (port 8080 by default) to confirm that the cluster is running and the workers have registered.

4.4 Verify Spark WordCount

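Start the Spark shell against the cluster and run a word count on a file stored in HDFS. The path below reads test2.txt, so either upload that file to /input first or change the path to the test.txt uploaded earlier: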

cd /usr/local/spark/spark-3.3.1/bin

./spark-shell --master spark://master:7077

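// Count words in an HDFS file and save them ordered by descending count.
sc.textFile("hdfs://master:9000/input/test2.txt")
    .flatMap(_.split(" "))               // split each line into words
    .map(word => (word, 1))              // pair every word with a count of 1
    .reduceByKey(_ + _)                  // sum the counts per word
    .sortBy(_._2, ascending = false)     // highest count first
    .saveAsTextFile("hdfs://master:9000/spark_out")  // fails if spark_out already exists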
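
To sanity‑check the Spark result on a small file, the same count can be reproduced locally with standard tools. The function below is a sketch with a made‑up name and sample text; it prints (word,count) lines in the format Spark writes for tuples. Unlike the Spark job, it drops empty tokens produced by repeated spaces, and the order of words with equal counts may differ.

```shell
#!/bin/sh
# Sketch: local word count for cross-checking the Spark output.

local_wordcount() {
  # One word per line, drop empties, count, then sort by count descending.
  tr ' ' '\n' < "$1" |
    grep -v '^$' |
    sort |
    uniq -c |
    sort -k1,1nr -k2 |
    awk '{print "(" $2 "," $1 ")"}'
}

# Example on a throwaway sample file.
SAMPLE=$(mktemp)
printf 'hello hadoop\nhello spark\n' > "$SAMPLE"
local_wordcount "$SAMPLE"
```

The Spark side can be inspected with hdfs dfs -cat /spark_out/part-* and compared against this output.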

5. Conclusion

Big‑data technologies evolve rapidly, driven by digital transformation and the explosion of data. Mastering the fundamentals—such as building a personal Hadoop and Spark cluster—provides a solid foundation for further exploration and innovation in the big‑data ecosystem.

