Integrating Apache Kyuubi with CDH 6 and Spark 3: Deployment, Configuration, and Performance Tuning
This guide walks through deploying Apache Kyuubi on a CDH 6 cluster: replacing HiveServer2 with Kyuubi, integrating Spark 3, applying the necessary patches, configuring the environment and Spark settings, and tuning engine sharing for different workloads, with complete code snippets and step-by-step instructions.
Kyuubi is an open-source big-data project incubating at the Apache Software Foundation. It provides a multi-tenant client/server architecture that encapsulates Spark compute resources for downstream services, with the aim of democratizing big-data processing.
Typical usage scenarios include replacing HiveServer2 for 10‑100× performance gains, building serverless Spark platforms, and constructing unified data‑lake exploration and analysis platforms.
CDH 6.3.1 ships with Hadoop 3.0.0, Hive 2.1.1, and Spark 2.4.0. Following Spark 3.0's release, this article describes how to integrate Spark 3 into CDH 6.3.1 (without Kerberos) and use Kyuubi in place of HiveServer2 for seamless HiveQL-to-SparkSQL migration.
ORC Compatibility Fix
When Hive reads ORC files written by Presto or Spark, it can fail with ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException: 6. The issue is fixed upstream (ORC-125). Patched JARs are provided; replace /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec-2.1.1-cdh6.3.1.jar and hive-orc-2.1.1-cdh6.3.1.jar on all nodes.
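Swapping the jars on every node is easy to get wrong on one host, so a small loop helps. This is only a sketch: the host names and the local patched/ directory are placeholders for your environment, and with DRY_RUN=1 (the default) it just prints the commands so the loop can be sanity-checked first.

```shell
#!/usr/bin/env bash
# Distribute the patched Hive jars to every node.
# DRY_RUN=1 (default) only echoes the scp commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
HIVE_LIB=/opt/cloudera/parcels/CDH/lib/hive/lib
for host in cdh-worker1 cdh-worker2; do   # placeholder: substitute the real node list
  for jar in hive-exec-2.1.1-cdh6.3.1.jar hive-orc-2.1.1-cdh6.3.1.jar; do
    cmd="scp patched/$jar $host:$HIVE_LIB/$jar"
    if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
  done
done
```

Remember to back up the original jars before overwriting them, so the change can be rolled back.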
Spark 3 Adjustments
Spark 3 uses the Hadoop shaded client (Hadoop 3.2) to avoid dependency conflicts. Apply patch SPARK‑33212 to enable the shaded client. Additionally, apply CDH‑71907 to adapt Spark’s HiveShim to CDH’s modified Hive 2.1.1 signatures.
When Spark 3 interacts with CDH's older External Shuffle Service, set spark.shuffle.useOldFetchProtocol=true to avoid IllegalArgumentException: Unexpected message type: <number>.
Refer to the official Spark migration guides before upgrading.
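If only some jobs talk to the old shuffle service, the flag can also be passed per job instead of being set globally. A sketch using the stock SparkPi example (the /opt/spark3 paths assume the install layout described in the next section, and the examples jar is located with a glob since its exact version suffix depends on the Spark 3 build):

```shell
# Per-job override instead of spark-defaults.conf; paths assume Spark 3 under /opt/spark3
/opt/spark3/bin/spark-submit \
  --master yarn \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark3/examples/jars/spark-examples_*.jar 100
```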
Spark Deployment on YARN
Only the YARN client node needs Spark 3 installed. Configure Hadoop and Hive configuration files via symbolic links:
ln -s /etc/hadoop/conf/core-site.xml /opt/spark3/conf/
ln -s /etc/hadoop/conf/hdfs-site.xml /opt/spark3/conf/
ln -s /etc/hadoop/conf/yarn-site.xml /opt/spark3/conf/
ln -s /etc/hive/conf/hive-site.xml /opt/spark3/conf/

Set environment variables in /opt/spark3/conf/spark-env.sh:
#!/usr/bin/env bash
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn:/etc/hive/conf

Key Spark defaults (excerpt):
spark.authenticate=false
spark.io.encryption.enabled=false
spark.network.crypto.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory
spark.driver.memory=2G
spark.executor.cores=6
spark.executor.memory=8G
spark.shuffle.service.enabled=true
spark.shuffle.useOldFetchProtocol=true
spark.sql.adaptive.enabled=true
... (additional settings omitted for brevity)

Validate the installation with /opt/spark3/bin/spark-shell, confirming that the Spark UI loads and SQL queries succeed.
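A quick smoke test, assuming the configs symlinked above, might look like the following (run against YARN; expect each query to print a small result table):

```shell
# Pipe two trivial SQL statements through spark-shell on YARN
/opt/spark3/bin/spark-shell --master yarn <<'EOF'
spark.sql("SELECT 1 AS ok").show()
spark.sql("SHOW DATABASES").show()
EOF
```

If SHOW DATABASES returns the Hive databases, the hive-site.xml symlink and metastore connectivity are working.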
Kyuubi Deployment
Download kyuubi-1.3.0-incubating-bin.tgz, extract it to /opt, and create a symlink /opt/kyuubi -> /opt/kyuubi-1.3.0-incubating-bin:
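The steps above as commands (the download URL is an assumption; verify it against the Apache archive for the 1.3.0-incubating release):

```shell
# Download, extract, and symlink Kyuubi under /opt
cd /opt
wget https://archive.apache.org/dist/incubator/kyuubi/kyuubi-1.3.0-incubating/kyuubi-1.3.0-incubating-bin.tgz
tar -xzf kyuubi-1.3.0-incubating-bin.tgz
ln -s /opt/kyuubi-1.3.0-incubating-bin /opt/kyuubi
chown -R hive:hive /opt/kyuubi-1.3.0-incubating-bin
```

The resulting layout should match the listing below.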
[hive@cdh-external opt]$ ls -l /opt | grep kyuubi
lrwxrwxrwx 1 root root 32 Aug 18 17:36 kyuubi -> /opt/kyuubi-1.3.0-incubating-bin
drwxrwxr-x 13 hive hive 4096 Aug 18 17:52 kyuubi-1.3.0-incubating-bin

Edit /opt/kyuubi/conf/kyuubi-env.sh to set the Java, Spark, Hadoop, and Kyuubi directories:
#!/usr/bin/env bash
export JAVA_HOME=/usr/java/default
export SPARK_HOME=/opt/spark3
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export KYUUBI_PID_DIR=/data/log/service/kyuubi/pid
export KYUUBI_LOG_DIR=/data/log/service/kyuubi/logs
export KYUUBI_WORK_DIR_ROOT=/data/log/service/kyuubi/work
export KYUUBI_MAX_LOG_FILES=10

Configure core Kyuubi settings in /opt/kyuubi/conf/kyuubi-defaults.conf (excerpt):
kyuubi.authentication=NONE
kyuubi.engine.share.level=USER
kyuubi.frontend.bind.host=0.0.0.0
kyuubi.frontend.bind.port=10009
kyuubi.ha.zookeeper.quorum=cdh-master1:2181,cdh-master2:2181,cdh-master3:2181
kyuubi.ha.zookeeper.namespace=kyuubi
kyuubi.session.engine.idle.timeout=PT10H
spark.master=yarn
spark.submit.deployMode=cluster
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0

Engine Sharing Levels
Kyuubi offers three sharing levels—CONNECTION, USER (default), and SERVER—allowing trade‑offs between isolation and resource utilization. Example configurations for three typical workloads:
Ad‑hoc queries via HUE (USER level, dynamic allocation, idle timeout 1 h).
Batch jobs via Beeline (CONNECTION level, min executors 5, max 30).
Superset federated queries (USER level, longer idle timeout, min 6, max 10).
The corresponding spark.dynamicAllocation.* settings follow the same pattern as the kyuubi-defaults.conf excerpt above, tuned per workload.
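As a sketch, the per-workload settings implied by the three bullets above could look like the following kyuubi-defaults.conf excerpts. The values are taken directly from the list; treat them as starting points rather than tuned numbers, and note that CONNECTION-level overrides are typically passed per session via the JDBC URL rather than server-wide.

```properties
# Ad-hoc queries via HUE: shared per-user engine, scale to zero when idle
kyuubi.engine.share.level=USER
kyuubi.session.engine.idle.timeout=PT1H
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0

# Batch jobs via Beeline: isolated engine per connection, guaranteed capacity
kyuubi.engine.share.level=CONNECTION
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=30

# Superset federated queries: per-user engine kept warm for dashboards
kyuubi.engine.share.level=USER
kyuubi.session.engine.idle.timeout=PT10H
spark.dynamicAllocation.minExecutors=6
spark.dynamicAllocation.maxExecutors=10
```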
Starting Kyuubi
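The launcher script supports the subcommands listed below; a typical start-and-verify sequence is simply (the log file glob is an assumption, since the exact log name includes the user and host):

```shell
# Start the server as a daemon, check it, and tail the logs from kyuubi-env.sh's KYUUBI_LOG_DIR
/opt/kyuubi/bin/kyuubi start
/opt/kyuubi/bin/kyuubi status
tail -f /data/log/service/kyuubi/logs/kyuubi-*.out   # log name pattern may vary
```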
[hive@cdh-kyuubi]$ /opt/kyuubi/bin/kyuubi --help
Usage: bin/kyuubi command
commands:
start - Run a Kyuubi server as a daemon
run - Run a Kyuubi server in the foreground
stop - Stop the Kyuubi daemon
status - Show status of the Kyuubi daemon
    -h | --help - Show this help message

Connecting with Beeline
beeline -u jdbc:hive2://cdh-kyuubi:10009 -n bigdata

Custom connection with extra Spark options:
beeline -u "jdbc:hive2://cdh-master2:10009/;?spark.driver.memory=8G#spark.app.name=batch_001;kyuubi.engine.share.level=CONNECTION" -n batch

Options embedded in the JDBC URL override the server-side defaults for that session.

Connecting with Hue
[desktop]
app_blacklist=zookeeper,hbase,impala,search,sqoop,security
use_new_editor=true
[[interpreters]]
[[[sparksql]]]
name=Spark SQL
interface=hiveserver2
[[[hive]]]
name=Hive
interface=hiveserver2
# other interpreters
...
[spark]
sql_server_host=kyuubi
sql_server_port=10009

After deployment, Spark 3 on CDH 6 works like the built-in Spark 2.4 for spark-submit and spark-shell, though spark-sql and the Spark ThriftServer are not supported unless the appropriate hive-service-rpc-3.1.2.jar is added to /opt/spark3/jars.
Finally, run SQL queries to experience the performance boost offered by Spark SQL through Kyuubi.