Understanding Spark SQL: Origin, Features, and Columnar Storage
This article explains the evolution of Spark SQL from Shark, describes its key features such as SchemaRDD and in‑memory columnar storage, compares row‑based and column‑based storage, and provides practical Scala code examples for creating DataFrames and loading data from various sources.
1. Origin of SparkSQL
SparkSQL originated from the Shark project, which was built on Hive to give RDBMS‑familiar users a faster way to run SQL on Hadoop. Because Shark depended heavily on Hive components, the Spark team created SparkSQL to remove that dependency and achieve a unified stack.
2. Features of SparkSQL
SparkSQL introduces a new RDD type called SchemaRDD (later renamed DataFrame) that lets users define schemas like traditional databases. It supports mixing data from HiveQL and other sources, embeds a query optimizer that converts SQL to logical plans and finally to RDD execution, and provides significant performance gains.
2.1 In‑Memory Columnar Storage
Instead of storing rows as JVM objects, SparkSQL stores each column in a contiguous memory array. This reduces memory overhead, improves cache locality, and eliminates the large GC pressure caused by millions of row objects.
2.2 Storage Implementation
Primitive column types are stored in native arrays, while complex types (array, map, etc.) are serialized into a byte array. Columnar layout enables efficient compression (dictionary, run‑length, etc.) and faster CPU‑bound type conversion.
2.3 Row vs. Column Storage
Row‑based storage writes data in a single pass, guaranteeing atomicity but may read unnecessary columns. Column‑based storage writes each column separately, incurring more I/O for writes but allowing reads to fetch only required columns, which is ideal for OLAP workloads.
Advantages of Columnar Storage
Reduced I/O during reads because only needed columns are accessed.
Higher compression ratios due to homogeneous data types.
Better CPU efficiency for analytical queries.
Use Cases
Columnar databases excel in OLAP scenarios where queries touch millions of rows but only a few columns, such as “top‑20 sales items”. They are less suitable for OLTP workloads that require frequent row‑level updates.
3. Creating DataFrames
DataFrames are read‑only tables built from RDDs. Example for a single column:
scala> val rdd = sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.toDF("id")
res0: org.apache.spark.sql.DataFrame = [id: int]
scala> res0.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
+---+Creating a multi‑column DataFrame:
scala> val rdd = sc.parallelize(List((1,"beijing"),(2,"shanghai")))
res3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[5] at parallelize at <console>:22
scala> rdd.toDF("id","name")
res4: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> res4.show
+---+--------+
| id| name|
+---+--------+
| 1| beijing|
| 2| shanghai|
+---+--------+Three‑column example:
scala> val rdd = sc.parallelize(List((1,"beijing",100780),(2,"shanghai",560090),(3,"xi'an",600329)))
res6: org.apache.spark.rdd.RDD[(Int, String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:22
scala> rdd.toDF("id","name","postcode")
res7: org.apache.spark.sql.DataFrame = [id: int, name: string, postcode: int]
scala> res7.show
+---+--------+--------+
| id| name|postcode|
+---+--------+--------+
| 1| beijing| 100780|
| 2| shanghai| 560090|
| 3| xi'an| 600329|
+---+--------+--------+4. Building DataFrames from External Files
4.1 Text Files
Read a text file, map each line to a tuple, then convert to a DataFrame:
val rdd = sc.textFile("/root/words.txt")
.map(x => (x,1))
.reduceByKey(_+_)
val df = rdd.toDF("word","count")
df.show()4.2 JSON Files
import org.apache.spark.sql.SQLContext
val sqlc = new SQLContext(sc)
val df = sqlc.read.json("/home/software/people.json")
df.show()4.3 Parquet Files
val df = sqlc.read.parquet("/home/software/users.parquet")
df.show()4.4 JDBC Sources
import org.apache.spark.sql.SQLContext
val sqlc = new SQLContext(sc)
val prop = new java.util.Properties
prop.put("user","root")
prop.put("password","root")
val df = sqlc.read.jdbc("jdbc:mysql://hadoop01:3306/test","tabx",prop)
df.show()These examples demonstrate how SparkSQL can ingest data from diverse sources and expose them as DataFrames for SQL‑like analytics.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
