
Compiling and Deploying Spark 3.3.0 on CDH 6.3.2 (Cloudera) – Step‑by‑Step Guide

This guide explains how to download JDK, Maven, Scala and Spark 3.3.0, modify the Spark pom and configuration files for CDH 6.3.2, compile Spark with Maven, deploy the binaries to a client node, set up spark‑sql and spark‑submit scripts, and address common runtime issues.


Background – Cloudera stopped releasing CDH as open source after 6.3.2, so newer components such as Spark 3 must be compiled manually. Spark 3 offers significant performance improvements and Adaptive Query Execution (AQE), which mitigates data skew at runtime.
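The adaptive-execution features are controlled through runtime properties. A hedged example of settings you could later add to spark-defaults.conf (these are standard Spark 3 property names; the values are illustrative, not from this guide):

```
# Adaptive Query Execution (standard Spark 3 properties; values illustrative)
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.sql.adaptive.skewJoin.enabled             true
```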

Download software – Required versions: JDK 1.8, Maven 3.8.4, Scala 2.12.15, Spark 3.3.0. Do not change the minor versions of Maven and Scala unless you also adjust the pom files.

wget http://distfiles.macports.org/scala2.12/scala-2.12.15.tgz
wget https://archive.apache.org/dist/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0.tgz

Place the tarballs under /opt, extract them, and set environment variables for the JDK, Scala, and Maven in /etc/profile.

vim /etc/profile

export JAVA_HOME=/opt/jdk1.8.0_181-cloudera
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
export MAVEN_HOME=/opt/maven-3.8.4
export SCALA_HOME=/opt/scala-2.12.15
export PATH=$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin:$HADOOP_CONF_DIR:$HADOOP_HOME:$MAVEN_HOME/bin

Compile Spark 3 – Edit /opt/spark-3.3.0/pom.xml to add the Cloudera Maven repository (plus a mirror such as Aliyun) and set the Hadoop version to 3.0.0-cdh6.3.2. Adjust make-distribution.sh to increase Maven memory and point to the custom Maven binary.

<repository>
  <id>aliyun</id>
  <url>https://maven.aliyun.com/nexus/content/groups/public</url>
  <releases><enabled>true</enabled></releases>
  <snapshots><enabled>false</enabled></snapshots>
</repository>
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
  <releases><enabled>true</enabled></releases>
  <snapshots><enabled>false</enabled></snapshots>
</repository>

Change the Hadoop version in the pom:

<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>

Modify make-distribution.sh to set MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g" and use the custom Maven path.

export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
MVN="/opt/maven-3.8.4/bin/mvn"

Set the Scala version to 2.12 and start the build:

cd /opt/spark-3.3.0
./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phadoop-3.0 -Phive -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2 -X

The script drives Maven under the hood: --tgz produces a tarball, --name appends a custom suffix to the distribution name (here the CDH Hadoop version), -Pyarn enables YARN support, and -X turns on verbose Maven output.

After a long compilation you will obtain a directory containing the Spark 3 binaries.
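The output tarball name is assembled from the Spark version and the --name argument. A simplified sketch of how make-distribution.sh composes the filename (in the real script, VERSION is derived from the Maven pom):

```shell
# Simplified sketch of the distribution tarball naming scheme.
# In the real make-distribution.sh, VERSION is read from the pom.
VERSION="3.3.0"
NAME="3.0.0-cdh6.3.2"   # value passed via --name
TARBALL="spark-${VERSION}-bin-${NAME}.tgz"
echo "$TARBALL"   # spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz
```

This is why the deployment step below unpacks a file named spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz.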

Deploy Spark 3 client – Transfer the tarball to the client machine, extract it under the CDH parcels directory, and rename the folder to spark3. Copy the existing spark-env.sh from the CDH cluster, adjust SPARK_HOME, and ensure it is executable.

tar -zxvf spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib
cd /opt/cloudera/parcels/CDH/lib
mv spark-3.3.0-bin-3.0.0-cdh6.3.2/ spark3
cp /etc/spark/conf/spark-env.sh /opt/cloudera/parcels/CDH/lib/spark3/conf
chmod +x /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
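Putting the pieces together, a minimal spark-env.sh for the Spark 3 client might look like the following (a sketch assuming the CDH parcel layout used in this guide; adjust paths to your cluster):

```shell
# Hypothetical minimal spark-env.sh for the Spark 3 client.
# Paths assume the CDH parcel layout from this guide.
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
export YARN_CONF_DIR=${YARN_CONF_DIR:-/etc/hadoop/conf}
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-$SPARK_HOME/conf}
```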

Copy hive-site.xml from the gateway node to spark3/conf without modification.

cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/

Create spark‑sql wrapper – Write a small bash script that sets Hadoop/YARN configuration variables and forwards the call to the Spark 3 spark-submit command for the Hive ThriftServer driver.

#!/bin/bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
... (script body as in source) ...
exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

Make it executable and register it with alternatives for a system‑wide spark-sql command.

chmod +x /opt/cloudera/parcels/CDH/bin/spark-sql
alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1

Configure Spark conf – Enable logging, copy the default spark-defaults.conf, remove unnecessary listeners, and add spark.yarn.jars=hdfs:///spark/3versionJars/*. Upload the Spark 3 jars to HDFS.

cd /opt/cloudera/parcels/CDH/lib/spark3/conf
mv log4j2.properties.template log4j2.properties
cp /opt/cloudera/parcels/CDH/lib/spark/conf/spark-defaults.conf ./
# edit spark-defaults.conf as described
hadoop fs -mkdir -p /spark/3versionJars
cd /opt/cloudera/parcels/CDH/lib/spark3/jars
hadoop fs -put *.jar /spark/3versionJars
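After these steps, the relevant spark-defaults.conf entries might look like this (the spark.yarn.jars line follows the guide; the master and event-log settings are illustrative and depend on your cluster):

```
spark.master               yarn
spark.yarn.jars            hdfs:///spark/3versionJars/*
# Event-log settings are illustrative; point them at your own history directory.
spark.eventLog.enabled     true
spark.eventLog.dir         hdfs:///user/spark/applicationHistory
```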

Create spark3‑submit wrapper – As with spark‑sql, a bash script forwards arguments to spark3/bin/spark-class org.apache.spark.deploy.SparkSubmit. Register it with alternatives.

#!/usr/bin/env bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
... (script body as in source) ...
exec $LIB_DIR/spark3/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
chmod +x /opt/cloudera/parcels/CDH/bin/spark3-submit
alternatives --install /usr/bin/spark3-submit spark3-submit /opt/cloudera/parcels/CDH/bin/spark3-submit 1
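Both wrappers end with "$@", which forwards every original argument intact, including values containing spaces. A minimal demonstration of why quoted "$@" (rather than $* or unquoted $@) is used:

```shell
# Demonstrates the "$@" forwarding used in the wrapper scripts:
# each original argument survives as a single word, spaces included.
forward() { printf '%s\n' "$@"; }
set -- --name "My App" --queue root.default
COUNT=$(forward "$@" | wc -l)
echo "$COUNT"   # 4: "My App" is forwarded as one argument
```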

Test spark3‑submit – Run a Spark Pi example on YARN in cluster mode.

spark3-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
  --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  --queue root.default /opt/cloudera/parcels/CDH/lib/spark3/examples/jars/spark-examples*.jar 10

Notes – If Spark Dynamic Allocation is enabled, Spark 3 executors register with the cluster's existing Spark 2.x external shuffle service, whose fetch protocol differs, so jobs may fail with FetchFailed errors. Adding spark.shuffle.useOldFetchProtocol=true to spark-defaults.conf makes Spark 3 fall back to the old protocol and resolves the issue.

With these steps, the Hadoop cluster now runs both CDH‑bundled Spark 2.4.0 and the newly compiled Apache Spark 3.3.0.

Recruitment – The article ends with a hiring call for the Zero technology team in Hangzhou, inviting interested engineers to contact [email protected] .

Written by 政采云技术 (ZCY Technology Team)

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
