Integration Methods of Hive and Spark SQL (Potential Interview Topics)
This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.
Integration of Hive and Spark SQL
The article explains several ways to combine Hive and Spark SQL, which are frequently asked in big‑data interviews, and includes practical code snippets and configuration details.
1. Integration Methods
Two compatible approaches are presented: hive on spark – see the official Hive‑on‑Spark getting‑started page. spark on hive – refer to Spark’s documentation on Hive tables.
2. spark-shell and spark-sql Operations
When running on a cluster with TEZ, add the LZO jar to spark-defaults.conf:
spark.jars=/export/servers/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.20.jarStart spark‑shell in YARN mode:
bin/spark-shell --master yarn3. Using HiveServer2 + Beeline
Because spark‑sql output can be unfriendly, you can start a ThriftServer and connect via Beeline:
sbin/start-thriftserver.sh bin/beeline
!connect jdbc:hive2://hadoop102:100004. Script Execution with spark-sql
Typical spark‑sql commands behave exactly like standard SQL; screenshots illustrate query results.
5. Reading and Writing Hive Data in IDEA (Scala)
Add the Spark‑Hive dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>Example for reading data from Hive:
import org.apache.spark.sql.SparkSession
object HiveRead {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[*]")
.appName("HiveRead")
.enableHiveSupport()
.getOrCreate()
spark.sql("show databases")
spark.sql("use guli")
spark.sql("select count(*) from gulivideo_orc").show()
spark.close()
}
}Example for writing data to Hive using an INSERT statement:
object HiveWrite {
def main(args: Array[String]): Unit = {
System.setProperty("HADOOP_USER_NAME", "root")
val spark = SparkSession.builder()
.master("local[*]")
.appName("HiveWrite")
.enableHiveSupport()
.config("spark.sql.warehouse.dir", "hdfs://hadoop102:9000/user/hive/warehouse")
.getOrCreate()
spark.sql("create database spark1602")
spark.sql("use spark1602")
spark.sql("create table user1(id int, name string)")
spark.sql("insert into user1 values(10,'lisi')").show()
spark.close()
}
}Writing a DataFrame to Hive with saveAsTable (common case):
val df = spark.read.json("D:\\idea\\spark-sql\\input\\user.json")
df.write.saveAsTable("user2")
// df.write.mode("append").saveAsTable("user2")Writing a DataFrame with insertInto (position‑based mapping):
df.write.insertInto("user2")6. Partition Number After Aggregation
Running a grouped aggregation shows that Spark creates 200 partitions by default, which may be wasteful for small datasets. Reduce the number of output files by coalescing: df2.coalesce(1).write.saveAsTable("a4") Resulting screenshot confirms the default partition count and the effect of coalescing.
Copyright statement: This article is authored by the Big Data Technology & Architecture team and may not be reproduced without permission.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
