Big Data 16 min read

Understanding Spark SQL: Origin, Features, and Columnar Storage

This article explains the evolution of Spark SQL from Shark, describes its key features such as SchemaRDD and in‑memory columnar storage, compares row‑based and column‑based storage, and provides practical Scala code examples for creating DataFrames and loading data from various sources.

Big Data Technology & Architecture

Jun 10, 2019

Understanding Spark SQL: Origin, Features, and Columnar Storage

1. Origin of SparkSQL

SparkSQL originated from the Shark project, which was built on Hive to give RDBMS‑familiar users a faster way to run SQL on Hadoop. Because Shark depended heavily on Hive components, the Spark team created SparkSQL to remove that dependency and achieve a unified stack.

2. Features of SparkSQL

SparkSQL introduces a new RDD type called SchemaRDD (later renamed DataFrame) that lets users define schemas like traditional databases. It supports mixing data from HiveQL and other sources, embeds a query optimizer that converts SQL to logical plans and finally to RDD execution, and provides significant performance gains.

2.1 In‑Memory Columnar Storage

Instead of storing rows as JVM objects, SparkSQL stores each column in a contiguous memory array. This reduces memory overhead, improves cache locality, and eliminates the large GC pressure caused by millions of row objects.

2.2 Storage Implementation

Primitive column types are stored in native arrays, while complex types (array, map, etc.) are serialized into a byte array. Columnar layout enables efficient compression (dictionary, run‑length, etc.) and faster CPU‑bound type conversion.

2.3 Row vs. Column Storage

Row‑based storage writes data in a single pass, guaranteeing atomicity but may read unnecessary columns. Column‑based storage writes each column separately, incurring more I/O for writes but allowing reads to fetch only required columns, which is ideal for OLAP workloads.

Advantages of Columnar Storage

Reduced I/O during reads because only needed columns are accessed.

Higher compression ratios due to homogeneous data types.

Better CPU efficiency for analytical queries.

Use Cases

Columnar databases excel in OLAP scenarios where queries touch millions of rows but only a few columns, such as “top‑20 sales items”. They are less suitable for OLTP workloads that require frequent row‑level updates.

3. Creating DataFrames

DataFrames are read‑only tables built from RDDs. Example for a single column:

scala> val rdd = sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.toDF("id")
res0: org.apache.spark.sql.DataFrame = [id: int]
scala> res0.show
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
+---+

Creating a multi‑column DataFrame:

scala> val rdd = sc.parallelize(List((1,"beijing"),(2,"shanghai")))
res3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[5] at parallelize at <console>:22
scala> rdd.toDF("id","name")
res4: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> res4.show
+---+--------+
| id|    name|
+---+--------+
|  1|  beijing|
|  2| shanghai|
+---+--------+

Three‑column example:

scala> val rdd = sc.parallelize(List((1,"beijing",100780),(2,"shanghai",560090),(3,"xi'an",600329)))
res6: org.apache.spark.rdd.RDD[(Int, String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:22
scala> rdd.toDF("id","name","postcode")
res7: org.apache.spark.sql.DataFrame = [id: int, name: string, postcode: int]
scala> res7.show
+---+--------+--------+
| id|    name|postcode|
+---+--------+--------+
|  1|  beijing| 100780|
|  2| shanghai| 560090|
|  3|   xi'an| 600329|
+---+--------+--------+

4. Building DataFrames from External Files

4.1 Text Files

Read a text file, map each line to a tuple, then convert to a DataFrame:

val rdd = sc.textFile("/root/words.txt")
  .map(x => (x,1))
  .reduceByKey(_+_)
val df = rdd.toDF("word","count")
df.show()

4.2 JSON Files

import org.apache.spark.sql.SQLContext
val sqlc = new SQLContext(sc)
val df = sqlc.read.json("/home/software/people.json")
df.show()

4.3 Parquet Files

val df = sqlc.read.parquet("/home/software/users.parquet")
df.show()

4.4 JDBC Sources

import org.apache.spark.sql.SQLContext
val sqlc = new SQLContext(sc)
val prop = new java.util.Properties
prop.put("user","root")
prop.put("password","root")
val df = sqlc.read.jdbc("jdbc:mysql://hadoop01:3306/test","tabx",prop)
df.show()

These examples demonstrate how SparkSQL can ingest data from diverse sources and expose them as DataFrames for SQL‑like analytics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data SparkSQL JDBC dataframe Scala Parquet

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.