Compiling and Deploying Spark 3.3.0 on CDH 6.3.2 (Cloudera) – Step‑by‑Step Guide
This guide explains how to download the JDK, Maven, Scala, and the Spark 3.3.0 sources; modify the Spark pom and configuration files for CDH 6.3.2; compile Spark with Maven; deploy the binaries to a client node; set up spark-sql and spark3-submit wrapper scripts; and address common runtime issues.
Background – CDH 6.3.2 is the last freely available release, so newer components such as Spark 3 are not shipped with it and must be compiled manually. Spark 3 offers performance improvements and adaptive query execution (AQE), which mitigates data skew.
Download software – Required versions: JDK 1.8, Maven 3.8.4, Scala 2.12.15, Spark 3.3.0. Do not change the minor versions of Maven and Scala unless you also adjust the pom files.
wget http://distfiles.macports.org/scala2.12/scala-2.12.15.tgz
wget https://archive.apache.org/dist/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0.tgz
Place the tarballs under /opt, extract them, and set environment variables for the JDK, Scala, and Maven.
vim /etc/profile
export JAVA_HOME=/opt/jdk1.8.0_181-cloudera
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
export MAVEN_HOME=/opt/maven-3.8.4
export SCALA_HOME=/opt/scala-2.12.15
export PATH=$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin:$HADOOP_CONF_DIR:$HADOOP_HOME:$MAVEN_HOME/bin
Compile Spark 3 – Edit /opt/spark-3.3.0/pom.xml to add Cloudera Maven repositories and set the Hadoop version to 3.0.0-cdh6.3.2. Adjust make-distribution.sh to increase Maven memory and point to the custom Maven binary.
<repository>
<id>aliyun</id>
<url>https://maven.aliyun.com/nexus/content/groups/public</url>
<releases><enabled>true</enabled></releases>
<snapshots><enabled>false</enabled></snapshots>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<releases><enabled>true</enabled></releases>
<snapshots><enabled>false</enabled></snapshots>
</repository>
Change the Hadoop version in the pom:
<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
Modify make-distribution.sh to set MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g" and use the custom Maven path.
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
MVN="/opt/maven-3.8.4/bin/mvn"
Reset the Scala version and start the build:
cd /opt/spark-3.3.0
./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phadoop-3.0 -Phive -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2 -X
The script drives Maven under the hood; --tgz produces a tarball, --name sets the suffix of the distribution name (here the CDH Hadoop version), -Pyarn enables YARN support, and -X turns on verbose Maven debug output.
After a long compilation you will obtain a directory containing the Spark 3 binaries.
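If the build fails while resolving CDH artifacts, first confirm the pom edits actually took. A minimal sed check, shown here against a stand-in file rather than the real /opt/spark-3.3.0/pom.xml:

```shell
# Stand-in pom fragment -- substitute the real pom.xml path in practice.
cat > pom-sample.xml <<'EOF'
<properties>
  <hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
</properties>
EOF
# Extract hadoop.version; prints 3.0.0-cdh6.3.2 if the edit is in place.
sed -n 's:.*<hadoop.version>\(.*\)</hadoop.version>.*:\1:p' pom-sample.xml
```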
Deploy Spark 3 client – Transfer the tarball to the client machine, extract it under the CDH parcels directory, and rename the folder to spark3. Copy the existing spark-env.sh from the CDH cluster, adjust SPARK_HOME, and ensure it is executable.
tar -zxvf spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib
cd /opt/cloudera/parcels/CDH/lib
mv spark-3.3.0-bin-3.0.0-cdh6.3.2/ spark3
cp /etc/spark/conf/spark-env.sh /opt/cloudera/parcels/CDH/lib/spark3/conf
chmod +x /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
Copy hive-site.xml from the gateway node to spark3/conf without modification.
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
Create spark-sql wrapper – Write a small bash script that sets Hadoop/YARN configuration variables and forwards the call to the Spark 3 spark-submit command with the Hive ThriftServer CLI driver as the main class.
#!/bin/bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
... (script body as in source) ...
exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
Make it executable and register it with alternatives for a system-wide spark-sql command.
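Before registering it, the forwarding behavior can be exercised with a throwaway stub standing in for the real spark-submit (all paths and file names below are illustrative, not part of the deployment):

```shell
# Throwaway sanity check: a stub spark-submit that echoes its arguments,
# plus an invocation shaped like the wrapper's exec line.
mkdir -p demo/spark3/bin
cat > demo/spark3/bin/spark-submit <<'EOF'
#!/bin/sh
echo "spark-submit $*"
EOF
chmod +x demo/spark3/bin/spark-submit
LIB_DIR="$PWD/demo"
# Mirrors the wrapper's exec line; the stub echoes what it would run.
"$LIB_DIR/spark3/bin/spark-submit" --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -e "select 1"
```

Once the echoed command line looks right, point the wrapper at the real spark3/bin/spark-submit.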
chmod +x /opt/cloudera/parcels/CDH/bin/spark-sql
alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1
Configure Spark conf – Enable logging, copy the default spark-defaults.conf, remove unnecessary listeners, and add spark.yarn.jars=hdfs:///spark/3versionJars/* . Upload the Spark 3 jars to HDFS.
cd /opt/cloudera/parcels/CDH/lib/spark3/conf
mv log4j2.properties.template log4j2.properties
cp /opt/cloudera/parcels/CDH/lib/spark/conf/spark-defaults.conf ./
# edit spark-defaults.conf as described
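A sketch of the spark-defaults.conf addition described above (the spark.yarn.jars path comes from this guide; any listener entries copied over from the CDH Spark 2 defaults that reference Cloudera-only classes should be removed):

```properties
# Point YARN executors at the Spark 3 jars staged in HDFS
spark.yarn.jars=hdfs:///spark/3versionJars/*
```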
hadoop fs -mkdir -p /spark/3versionJars
cd /opt/cloudera/parcels/CDH/lib/spark3/jars
hadoop fs -put *.jar /spark/3versionJars
Create spark3-submit wrapper – Similar to spark-sql, a bash script forwards arguments to spark3/bin/spark-class org.apache.spark.deploy.SparkSubmit. Register it with alternatives.
#!/usr/bin/env bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
... (script body as in source) ...
exec $LIB_DIR/spark3/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
chmod +x /opt/cloudera/parcels/CDH/bin/spark3-submit
alternatives --install /usr/bin/spark3-submit spark3-submit /opt/cloudera/parcels/CDH/bin/spark3-submit 1
Test spark3-submit – Run the SparkPi example on YARN in cluster mode.
spark3-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster \
--driver-memory 4g --executor-memory 2g --executor-cores 1 \
--queue root.default /opt/cloudera/parcels/CDH/lib/spark3/examples/jars/spark-examples*.jar 10
Notes – If Spark Dynamic Allocation is enabled, you may encounter a FetchFailed error, because Spark 3 defaults to a shuffle fetch protocol that the Spark 2-era external shuffle service in CDH 6 does not understand. Adding spark.shuffle.useOldFetchProtocol=true to spark-defaults.conf resolves the issue.
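The corresponding spark-defaults.conf line (property name as given above):

```properties
# Fall back to the old shuffle fetch protocol so Spark 3 executors can talk
# to the CDH 6 external shuffle service under dynamic allocation
spark.shuffle.useOldFetchProtocol=true
```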
With these steps, the Hadoop cluster now runs both CDH‑bundled Spark 2.4.0 and the newly compiled Apache Spark 3.3.0.
Recruitment – The article ends with a hiring call for the Zero technology team in Hangzhou, inviting interested engineers to contact [email protected] .
ZCY Technology
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.