How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)
This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, Zookeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.
1. Overview of the Runtime Environment
The article begins by emphasizing the importance of building a personal local Hadoop and Spark environment before diving into big‑data technologies. It provides tables (shown as images) that list software packages, tool versions, deployment topology, and process status.
2. Basic System Preparation
Step 1: Install CentOS 7 on VirtualBox (details omitted).
Step 2: Configure the host
Set hostname: vim /etc/hostname Configure /etc/hosts Install JDK 8u212, extract to /usr/local/jdk/jdk1.8.0_212 and set JAVA_HOME in
/etc/profileCopy the VM twice to obtain three machines with IPs 192.168.0.20, .21, .22, and configure password‑less SSH.
Install Zookeeper 3.4.10, extract to /usr/local/zookeeper, set ZOOKEEPER_HOME, create data/myid files (1, 2, 3) on the three nodes, and start Zookeeper with zkServer.sh start.
3. Hadoop Installation & Deployment
3.1 Install Hadoop
Upload hadoop-3.3.4.tar.gz, extract to /usr/local/hadoop/hadoop-3.3.4, and add JDK environment variables to /etc/profile.
Edit the six core configuration files: hadoop-env.sh – set
JAVA_HOME core-site.xml– temporary directory and Zookeeper quorum hdfs-site.xml – HDFS settings mapred-site.xml – MapReduce and DFS permissions yarn-site.xml – YARN resource scheduler workers – list of worker node hostnames
3.2 Start Hadoop
Run the following commands on each node:
hadoop-daemon.sh start journalnode hadoop-daemon.sh start namenode(master & slave1) hadoop-daemon.sh start datanode (all nodes) start-yarn.sh (master)
hdfs zkfc -formatZK hadoop-daemon.sh start zkfc(master)
Verify the HDFS UI at http://192.168.0.20:50070 (Active) and http://192.168.0.21:50070 (Standby).
3.3 Verify HDFS Usage
Typical commands:
hdfs dfs -ls / hdfs dfs -mkdir /input hdfs dfs -put ./test.txt /input hdfs dfs -get /input/test.txt ./tmp hdfs dfs -text /input/test.txt4. Spark Installation & Deployment
4.1 Install Scala
Upload and extract the Scala package, verify with scala -version, and copy the /usr/local/scala directory and /etc/profile to the slave machines via scp.
4.2 Install Spark
Upload spark-3.3.1.tgz, extract to /usr/local/spark, edit spark-env.sh to set JAVA_HOME and SCALA_HOME, and create a workers file listing the three node hostnames.
4.3 Start Spark
On the master node:
cd /usr/local/spark/spark-3.3.1/sbin ./start-all.shOn slave1 start the master process:
./start-master.shAccess the Spark UI to confirm the cluster is running.
4.4 Verify Spark WordCount
Run the Spark shell and execute a word‑count on a file stored in HDFS:
cd /usr/local/spark/spark-3.3.1/bin ./spark-shell --master spark://master:7077 sc.textFile("hdfs://master:9000/input/test2.txt")
.flatMap(_.split(" "))
.map(word => (word,1))
.reduceByKey(_+_)
.map(pair => (pair._2,pair._1))
.sortByKey(false)
.map(pair => (pair._2,pair._1))
.saveAsTextFile("hdfs://master:9000/spark_out")5. Conclusion
Big‑data technologies evolve rapidly, driven by digital transformation and the explosion of data. Mastering the fundamentals—such as building a personal Hadoop and Spark cluster—provides a solid foundation for further exploration and innovation in the big‑data ecosystem.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
