How to Write Spark DataFrames to Hive Tables and Partitions
This article explains how to persist Spark DataFrames into Hive tables and specific partitions, covering the relevant write APIs, the need to select a database, and providing step‑by‑step Scala code examples for both Spark 1.6 and Spark 2.x versions, along with Hive table creation syntax.
When a DataFrame is written to Hive without specifying a database, it defaults to Hive's default database; the insertInto method alone cannot target a specific database or partition.
1. Write a DataFrame to a Hive table
Spark provides several write APIs related to Hive, such as registerTempTable, insertInto, and saveAsTable. The registerTempTable method creates a temporary Spark table, while insertInto writes data to an existing Hive table but cannot specify the database or partition directly.
To write data to a specific Hive database, first switch the database with hiveContext.sql("use DataBaseName"), then use insertInto on the DataFrame.
registerTempTable(tableName: String): Unit insertInto(tableName: String): Unit insertInto(tableName: String, overwrite: Boolean): Unit saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): UnitExample: write to a specific database table (Spark 1.6)
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().insertInto("tableName")This code defines a case class, reads raw text, converts it to a DataFrame, switches to the target database, and inserts the data into the Hive table.
2. Write a DataFrame to a Hive partitioned table
First write the DataFrame to a temporary table, then use a Hive SQL statement to insert into the partitioned table.
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits _
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition(date='2015-04-02') select name, col1, col2 from table1")The above code works with Spark 1.6.
Example for Spark 2.0 and later
val session = SparkSession.builder().appName("WarehouseInventoryByNewMysqlSnap").enableHiveSupport().getOrCreate()
val sc: SparkContext = session.sparkContext
session.sql("use bi_work")
import session.implicits._
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
session.sql("insert into table2 partition(date='2015-04-02') select name, col1, col2 from table1")Hive partition tables are created by specifying the PARTITIONED BY clause. Each partition corresponds to a directory under the table’s folder, column names are case‑insensitive, and the partition column can be inspected with desc table_name.
Creating a Hive external partitioned table
CREATE EXTERNAL TABLE bi_work.`dw_inventory_snap`
(
`warehouse_id` string COMMENT '',
`internal_id` string COMMENT '',
`logical_inventory` string COMMENT '',
`create_time` timestamp COMMENT ''
)
PARTITIONED BY (`snap_time` string)
row format delimited fields terminated by '\t'— THE END —
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
