Big Data 10 min read

How to Write Spark DataFrames to Hive Tables and Partitions

This article explains how to persist Spark DataFrames into Hive tables and specific partitions, covering the relevant write APIs, the need to select a database, and providing step‑by‑step Scala code examples for both Spark 1.6 and Spark 2.x versions, along with Hive table creation syntax.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
How to Write Spark DataFrames to Hive Tables and Partitions

When a DataFrame is written to Hive without specifying a database, it defaults to Hive's default database; the insertInto method alone cannot target a specific database or partition.

1. Write a DataFrame to a Hive table

Spark provides several write APIs related to Hive, such as registerTempTable, insertInto, and saveAsTable. The registerTempTable method creates a temporary Spark table, while insertInto writes data to an existing Hive table but cannot specify the database or partition directly.

To write data to a specific Hive database, first switch the database with hiveContext.sql("use DataBaseName"), then use insertInto on the DataFrame.

registerTempTable(tableName: String): Unit
insertInto(tableName: String): Unit
insertInto(tableName: String, overwrite: Boolean): Unit
saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

Example: write to a specific database table (Spark 1.6)

case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().insertInto("tableName")

This code defines a case class, reads raw text, converts it to a DataFrame, switches to the target database, and inserts the data into the Hive table.

2. Write a DataFrame to a Hive partitioned table

First write the DataFrame to a temporary table, then use a Hive SQL statement to insert into the partitioned table.

case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits _
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition(date='2015-04-02') select name, col1, col2 from table1")

The above code works with Spark 1.6.

Example for Spark 2.0 and later

val session = SparkSession.builder().appName("WarehouseInventoryByNewMysqlSnap").enableHiveSupport().getOrCreate()
val sc: SparkContext = session.sparkContext
session.sql("use bi_work")
import session.implicits._
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
session.sql("insert into table2 partition(date='2015-04-02') select name, col1, col2 from table1")

Hive partition tables are created by specifying the PARTITIONED BY clause. Each partition corresponds to a directory under the table’s folder, column names are case‑insensitive, and the partition column can be inspected with desc table_name.

Creating a Hive external partitioned table

CREATE EXTERNAL TABLE bi_work.`dw_inventory_snap`
(
  `warehouse_id` string COMMENT '',
  `internal_id` string COMMENT '',
  `logical_inventory` string COMMENT '',
  `create_time` timestamp COMMENT ''
)
PARTITIONED BY (`snap_time` string)
row format delimited fields terminated by '\t'

— THE END —

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataSQLHivedataframeSparkScala
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.