Big Data 7 min read

Integration Methods of Hive and Spark SQL (Potential Interview Topics)

This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Integration Methods of Hive and Spark SQL (Potential Interview Topics)

Integration of Hive and Spark SQL

The article explains several ways to combine Hive and Spark SQL, which are frequently asked in big‑data interviews, and includes practical code snippets and configuration details.

1. Integration Methods

Two compatible approaches are presented: hive on spark – see the official Hive‑on‑Spark getting‑started page. spark on hive – refer to Spark’s documentation on Hive tables.

2. spark-shell and spark-sql Operations

When running on a cluster with TEZ, add the LZO jar to spark-defaults.conf:

spark.jars=/export/servers/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.20.jar

Start spark‑shell in YARN mode:

bin/spark-shell --master yarn

3. Using HiveServer2 + Beeline

Because spark‑sql output can be unfriendly, you can start a ThriftServer and connect via Beeline:

sbin/start-thriftserver.sh
bin/beeline
!connect jdbc:hive2://hadoop102:10000

4. Script Execution with spark-sql

Typical spark‑sql commands behave exactly like standard SQL; screenshots illustrate query results.

5. Reading and Writing Hive Data in IDEA (Scala)

Add the Spark‑Hive dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Example for reading data from Hive:

import org.apache.spark.sql.SparkSession

object HiveRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("HiveRead")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases")
    spark.sql("use guli")
    spark.sql("select count(*) from gulivideo_orc").show()
    spark.close()
  }
}

Example for writing data to Hive using an INSERT statement:

object HiveWrite {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("HiveWrite")
      .enableHiveSupport()
      .config("spark.sql.warehouse.dir", "hdfs://hadoop102:9000/user/hive/warehouse")
      .getOrCreate()

    spark.sql("create database spark1602")
    spark.sql("use spark1602")
    spark.sql("create table user1(id int, name string)")
    spark.sql("insert into user1 values(10,'lisi')").show()
    spark.close()
  }
}

Writing a DataFrame to Hive with saveAsTable (common case):

val df = spark.read.json("D:\\idea\\spark-sql\\input\\user.json")
df.write.saveAsTable("user2")
// df.write.mode("append").saveAsTable("user2")

Writing a DataFrame with insertInto (position‑based mapping):

df.write.insertInto("user2")

6. Partition Number After Aggregation

Running a grouped aggregation shows that Spark creates 200 partitions by default, which may be wasteful for small datasets. Reduce the number of output files by coalescing: df2.coalesce(1).write.saveAsTable("a4") Resulting screenshot confirms the default partition count and the effect of coalescing.

Copyright statement: This article is authored by the Big Data Technology & Architecture team and may not be reproduced without permission.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataHiveData IntegrationSparkSpark SQLScalaHiveServer2
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.