Integrating Apache Kyuubi with CDH 6 and Spark 3: Deployment, Configuration, and Performance Tuning
This guide walks through deploying Apache Kyuubi on a CDH 6 cluster: replacing HiveServer2 with Kyuubi, integrating Spark 3, applying the necessary patches, configuring the environment and Spark settings, and tuning engine sharing for different workloads, with complete code snippets and step-by-step instructions.
Kyuubi is an open-source big-data project incubating at the Apache Software Foundation. It provides a multi-tenant client/server architecture that encapsulates Spark compute resources for downstream services, with the aim of democratizing big-data processing.
Typical usage scenarios include replacing HiveServer2 for 10‑100× performance gains, building serverless Spark platforms, and constructing unified data‑lake exploration and analysis platforms.
CDH 6.3.1 ships with Hadoop 3.0.0, Hive 2.1.1, and Spark 2.4.0. Following Spark 3.0's release, this article describes how to integrate Spark 3 into CDH 6.3.1 (without Kerberos) and use Kyuubi in place of HiveServer2 for seamless HiveQL-to-SparkSQL migration.
ORC Compatibility Fix
When Hive reads ORC files written by Presto or Spark, it can fail with ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException: 6. The issue is fixed upstream (ORC-125). Patched JARs are provided; replace /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec-2.1.1-cdh6.3.1.jar and hive-orc-2.1.1-cdh6.3.1.jar on all nodes.
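Swapping the jars on every node is easy to get wrong on one host, so a small loop helps. This is only a sketch: the host names and the local patched/ directory are placeholders for your environment, and with DRY_RUN=1 (the default) it just prints the commands so the loop can be sanity-checked first.

```shell
#!/usr/bin/env bash
# Distribute the patched Hive jars to every node.
# DRY_RUN=1 (default) only echoes the scp commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
HIVE_LIB=/opt/cloudera/parcels/CDH/lib/hive/lib
for host in cdh-worker1 cdh-worker2; do   # placeholder: substitute the real node list
  for jar in hive-exec-2.1.1-cdh6.3.1.jar hive-orc-2.1.1-cdh6.3.1.jar; do
    cmd="scp patched/$jar $host:$HIVE_LIB/$jar"
    if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
  done
done
```

Remember to back up the original jars before overwriting them, so the change can be rolled back.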
Spark 3 Adjustments
Spark 3 uses the Hadoop shaded client (Hadoop 3.2) to avoid dependency conflicts. Apply patch SPARK‑33212 to enable the shaded client. Additionally, apply CDH‑71907 to adapt Spark’s HiveShim to CDH’s modified Hive 2.1.1 signatures.
When Spark 3 interacts with CDH's older External Shuffle Service, set spark.shuffle.useOldFetchProtocol=true to avoid IllegalArgumentException: Unexpected message type: <number>.
Refer to the official Spark migration guides before upgrading.
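If only some jobs talk to the old shuffle service, the flag can also be passed per job instead of being set globally. A sketch using the stock SparkPi example (the /opt/spark3 paths assume the install layout described in the next section, and the examples jar is located with a glob since its exact version suffix depends on the Spark 3 build):

```shell
# Per-job override instead of spark-defaults.conf; paths assume Spark 3 under /opt/spark3
/opt/spark3/bin/spark-submit \
  --master yarn \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark3/examples/jars/spark-examples_*.jar 100
```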
Spark Deployment on YARN
Only the YARN client node needs Spark 3 installed. Configure Hadoop and Hive configuration files via symbolic links:
ln -s /etc/hadoop/conf/core-site.xml /opt/spark3/conf/
ln -s /etc/hadoop/conf/hdfs-site.xml /opt/spark3/conf/
ln -s /etc/hadoop/conf/yarn-site.xml /opt/spark3/conf/
ln -s /etc/hive/conf/hive-site.xml /opt/spark3/conf/

Set environment variables in /opt/spark3/conf/spark-env.sh:
#!/usr/bin/env bash
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn:/etc/hive/conf

Key Spark defaults (excerpt):
spark.authenticate=false
spark.io.encryption.enabled=false
spark.network.crypto.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory
spark.driver.memory=2G
spark.executor.cores=6
spark.executor.memory=8G
spark.shuffle.service.enabled=true
spark.shuffle.useOldFetchProtocol=true
spark.sql.adaptive.enabled=true
... (additional settings omitted for brevity)

Validate the installation with /opt/spark3/bin/spark-shell, confirming that the Spark UI loads and SQL queries succeed.
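A quick smoke test, assuming the configs symlinked above, might look like the following (run against YARN; expect each query to print a small result table):

```shell
# Pipe two trivial SQL statements through spark-shell on YARN
/opt/spark3/bin/spark-shell --master yarn <<'EOF'
spark.sql("SELECT 1 AS ok").show()
spark.sql("SHOW DATABASES").show()
EOF
```

If SHOW DATABASES returns the Hive databases, the hive-site.xml symlink and metastore connectivity are working.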
Kyuubi Deployment
Download kyuubi-1.3.0-incubating-bin.tgz, extract it to /opt, and create a symlink /opt/kyuubi -> /opt/kyuubi-1.3.0-incubating-bin:
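The steps above as commands (the download URL is an assumption; verify it against the Apache archive for the 1.3.0-incubating release):

```shell
# Download, extract, and symlink Kyuubi under /opt
cd /opt
wget https://archive.apache.org/dist/incubator/kyuubi/kyuubi-1.3.0-incubating/kyuubi-1.3.0-incubating-bin.tgz
tar -xzf kyuubi-1.3.0-incubating-bin.tgz
ln -s /opt/kyuubi-1.3.0-incubating-bin /opt/kyuubi
chown -R hive:hive /opt/kyuubi-1.3.0-incubating-bin
```

The resulting layout should match the listing below.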
[hive@cdh-external opt]$ ls -l /opt | grep kyuubi
lrwxrwxrwx 1 root root 32 Aug 18 17:36 kyuubi -> /opt/kyuubi-1.3.0-incubating-bin
drwxrwxr-x 13 hive hive 4096 Aug 18 17:52 kyuubi-1.3.0-incubating-bin

Edit /opt/kyuubi/conf/kyuubi-env.sh to set the Java, Spark, Hadoop, and Kyuubi directories:
#!/usr/bin/env bash
export JAVA_HOME=/usr/java/default
export SPARK_HOME=/opt/spark3
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export KYUUBI_PID_DIR=/data/log/service/kyuubi/pid
export KYUUBI_LOG_DIR=/data/log/service/kyuubi/logs
export KYUUBI_WORK_DIR_ROOT=/data/log/service/kyuubi/work
export KYUUBI_MAX_LOG_FILES=10

Configure core Kyuubi settings in /opt/kyuubi/conf/kyuubi-defaults.conf (excerpt):
kyuubi.authentication=NONE
kyuubi.engine.share.level=USER
kyuubi.frontend.bind.host=0.0.0.0
kyuubi.frontend.bind.port=10009
kyuubi.ha.zookeeper.quorum=cdh-master1:2181,cdh-master2:2181,cdh-master3:2181
kyuubi.ha.zookeeper.namespace=kyuubi
kyuubi.session.engine.idle.timeout=PT10H
spark.master=yarn
spark.submit.deployMode=cluster
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0

Engine Sharing Levels
Kyuubi offers three sharing levels—CONNECTION, USER (default), and SERVER—allowing trade‑offs between isolation and resource utilization. Example configurations for three typical workloads:
Ad‑hoc queries via HUE (USER level, dynamic allocation, idle timeout 1 h).
Batch jobs via Beeline (CONNECTION level, min executors 5, max 30).
Superset federated queries (USER level, longer idle timeout, min 6, max 10).
The corresponding spark.dynamicAllocation.* settings follow the same pattern as the kyuubi-defaults.conf excerpt above, tuned per workload.
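As a sketch, the per-workload settings implied by the three bullets above could look like the following kyuubi-defaults.conf excerpts. The values are taken directly from the list; treat them as starting points rather than tuned numbers, and note that CONNECTION-level overrides are typically passed per session via the JDBC URL rather than server-wide.

```properties
# Ad-hoc queries via HUE: shared per-user engine, scale to zero when idle
kyuubi.engine.share.level=USER
kyuubi.session.engine.idle.timeout=PT1H
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0

# Batch jobs via Beeline: isolated engine per connection, guaranteed capacity
kyuubi.engine.share.level=CONNECTION
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=30

# Superset federated queries: per-user engine kept warm for dashboards
kyuubi.engine.share.level=USER
kyuubi.session.engine.idle.timeout=PT10H
spark.dynamicAllocation.minExecutors=6
spark.dynamicAllocation.maxExecutors=10
```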
Starting Kyuubi
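The launcher script supports the subcommands listed below; a typical start-and-verify sequence is simply (the log file glob is an assumption, since the exact log name includes the user and host):

```shell
# Start the server as a daemon, check it, and tail the logs from kyuubi-env.sh's KYUUBI_LOG_DIR
/opt/kyuubi/bin/kyuubi start
/opt/kyuubi/bin/kyuubi status
tail -f /data/log/service/kyuubi/logs/kyuubi-*.out   # log name pattern may vary
```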
[hive@cdh-kyuubi]$ /opt/kyuubi/bin/kyuubi --help
Usage: bin/kyuubi command
commands:
start - Run a Kyuubi server as a daemon
run - Run a Kyuubi server in the foreground
stop - Stop the Kyuubi daemon
status - Show status of the Kyuubi daemon
    -h | --help - Show this help message

Connecting with Beeline
beeline -u jdbc:hive2://cdh-kyuubi:10009 -n bigdata

Custom connection with extra Spark options:
beeline -u "jdbc:hive2://cdh-master2:10009/;?spark.driver.memory=8G#spark.app.name=batch_001;kyuubi.engine.share.level=CONNECTION" -n batch

Options embedded in the JDBC URL override the server-side defaults for that session.

Connecting with Hue
[desktop]
app_blacklist=zookeeper,hbase,impala,search,sqoop,security
use_new_editor=true
[[interpreters]]
[[[sparksql]]]
name=Spark SQL
interface=hiveserver2
[[[hive]]]
name=Hive
interface=hiveserver2
# other interpreters
...
[spark]
sql_server_host=kyuubi
sql_server_port=10009

After deployment, Spark 3 on CDH 6 works like the built-in Spark 2.4 for spark-submit and spark-shell, though spark-sql and the Spark ThriftServer are not supported unless the appropriate hive-service-rpc-3.1.2.jar is added to /opt/spark3/jars.
Finally, run SQL queries to experience the performance boost offered by Spark SQL through Kyuubi.