Integrating SparkSQL with Hive: Configuration, MetaStore Setup, and Example Scala Code
This article explains the difference between Spark on Hive and Hive on Spark, then walks through configuring the Hive MetaStore, setting up SparkSQL to use it, and running a complete Scala program that creates a Hive table, loads data, and queries it.
Differences between Spark on Hive and Hive on Spark
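Spark on Hive: SparkSQL parses and executes the statements on Spark (ultimately as RDD operations) and relies on Hive only for its MetaStore, that is, the table metadata and warehouse location. Hive on Spark: Hive still parses and plans the query, but its execution engine is switched from MapReduce to Spark, which requires recompilation and additional JARs.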
Prerequisites
Refer to the official Apache Spark documentation for SQL data sources with Hive tables: http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html .
Configuration of Hive
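Edit hive/conf/hive-site.xml to set the warehouse directory, mark the MetaStore as remote (hive.metastore.local = false), and give the Thrift URI of the MetaStore service. Hive 0.10 and later ignore hive.metastore.local and decide between local and remote mode from hive.metastore.uris alone, so on those versions the property is harmless but optional.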
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node01:9083</value>
  </property>
</configuration>
Start the Hive MetaStore service:
nohup /export/servers/hive/bin/hive --service metastore >> /var/log.log 2>&1 &
SparkSQL integration with Hive MetaStore
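Since Spark will connect to the address given in hive.metastore.uris, it can help to confirm that the Thrift port is open before running any SparkSQL code. The sketch below is one way to do that from Scala using only the JDK. The object name MetastoreCheck and the 2-second timeout are illustrative choices, not part of the original setup, and a successful TCP connection only shows that something is listening on node01:9083, not that it is a healthy MetaStore.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper, not part of Spark or Hive.
object MetastoreCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isReachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Covers refused connections, timeouts, and unresolvable host names
      case _: IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // node01 and 9083 are taken from hive.metastore.uris in hive-site.xml
    val ok = isReachable("node01", 9083)
    println(if (ok) "MetaStore port is open" else "MetaStore port is not reachable")
  }
}
```

If the check fails, the nohup log file is the first place to look for startup errors.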
Copy the Hive and Hadoop configuration files into Spark’s configuration directory so Spark can access the MetaStore and HDFS warehouse.
cp /export/servers/hive-1.1.0-cdh5.14.0/conf/hive-site.xml /export/servers/spark/conf
cp /export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop/core-site.xml /export/servers/spark/conf
cp /export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop/hdfs-site.xml /export/servers/spark/conf
Tip: when testing locally in IDEA, place these files in the resources directory.
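When running from an IDE, a common source of confusion is that hive-site.xml never made it onto the classpath, so Spark does not pick up the remote MetaStore settings from it. A minimal sketch for checking this is shown here, assuming the file sits in the resources directory as the tip suggests. HiveSiteCheck and readProperty are hypothetical names introduced for illustration; only JDK XML classes are used.

```scala
import java.io.InputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper, not part of Spark or Hive.
object HiveSiteCheck {
  /** Returns the <value> of the <property> whose <name> equals key, if present. */
  def readProperty(in: InputStream, key: String): Option[String] = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in)
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength).iterator
      .map(i => props.item(i).asInstanceOf[Element])
      .find(p => childText(p, "name").contains(key))
      .flatMap(p => childText(p, "value"))
  }

  // Trimmed text of the first child element with the given tag, if any
  private def childText(parent: Element, tag: String): Option[String] = {
    val nodes = parent.getElementsByTagName(tag)
    if (nodes.getLength > 0) Some(nodes.item(0).getTextContent.trim) else None
  }

  def main(args: Array[String]): Unit = {
    // Looks for hive-site.xml at the root of the classpath (e.g. src/main/resources)
    Option(getClass.getClassLoader.getResourceAsStream("hive-site.xml")) match {
      case Some(in) =>
        try println(readProperty(in, "hive.metastore.uris").getOrElse("hive.metastore.uris is not set"))
        finally in.close()
      case None =>
        println("hive-site.xml not found on the classpath")
    }
  }
}
```

With the configuration shown earlier, the value read back should match the thrift://node01:9083 entry in hive-site.xml.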
Example Scala program
import org.apache.spark.sql.SparkSession

object HiveSupport {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession with Hive support
    val spark = SparkSession.builder()
      .appName("HiveSupport")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "hdfs://node01:8020/user/hive/warehouse")
      .config("hive.metastore.uris", "thrift://node01:9083")
      .enableHiveSupport() // connect to the Hive MetaStore and enable HiveQL syntax
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // Show existing tables
    spark.sql("show tables").show()
    // Create a new table whose fields are separated by a single space
    spark.sql("CREATE TABLE person (id int, name string, age int) row format delimited fields terminated by ' '")
    // Load data from a local file (path is relative to the working directory)
    spark.sql("LOAD DATA LOCAL INPATH 'in/person.txt' INTO TABLE person")
    // Query the table
    spark.sql("select * from person").show()

    spark.stop()
  }
}
Before running the program, check the existing tables in the Hive shell (e.g., show tables;). After it runs, the new person table appears in Hive, and its contents can be verified both in the SparkSQL output and from the Hive CLI.
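The program expects in/person.txt to exist, with one record per line and the three fields separated by a single space so that they match the table definition. The article does not show the file, so the sketch below generates one. The rows are made up purely for illustration, and SamplePersonFile is a hypothetical helper name. Names must not contain spaces, since a space is the field delimiter.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper for producing test input; not part of the original program.
object SamplePersonFile {
  // Made-up sample rows: (id, name, age)
  val rows: Seq[(Int, String, Int)] =
    Seq((1, "zhangsan", 20), (2, "lisi", 29), (3, "wangwu", 25))

  /** Formats one row as "id name age", matching FIELDS TERMINATED BY ' '. */
  def format(row: (Int, String, Int)): String =
    s"${row._1} ${row._2} ${row._3}"

  /** Writes all rows to path, creating parent directories if needed. */
  def write(path: Path): Path = {
    Option(path.getParent).foreach(dir => Files.createDirectories(dir))
    val content = rows.map(format).mkString("", "\n", "\n")
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
  }

  def main(args: Array[String]): Unit = {
    // Same relative path that the LOAD DATA statement uses
    println(s"Wrote ${write(Paths.get("in/person.txt"))}")
  }
}
```

Run it from the same working directory as the Spark program so that the relative path in LOAD DATA LOCAL INPATH resolves to the generated file.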
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
