Understanding Spark DataFrames: Creation Methods, Optimizations, and Common Operations

This article explains the origins of Spark DataFrames, compares them with RDDs, describes how Spark SQL optimizes DataFrame execution, and provides detailed examples of creating DataFrames from RDDs, files, and JDBC sources along with common DataFrame operations and code snippets.

Big Data Technology & Architecture

Dec 15, 2021

DataFrames arrived in Spark 1.3 as the renamed successor to SchemaRDD, which Spark SQL had introduced earlier. They offer a higher‑level API than raw RDDs for handling large‑scale structured data. The DataFrame DSL is less expressive than the arbitrary higher‑order functions available on RDDs, but because its operations are declarative and carry a schema, Spark SQL can apply heuristic (rule‑based) and runtime‑based optimizations to improve performance.
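
To make the contrast with RDDs concrete, here is a minimal sketch that computes the same per-group average twice, once with RDD higher‑order functions and once with the DataFrame DSL. The sample data, the column names dept and age, and the object name are assumptions made for illustration; they do not come from the original article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object RddVsDataFrameSketch {
  // Hypothetical sample data: (department, age)
  val sample = Seq(("dev", 30), ("dev", 40), ("ops", 50))

  // RDD version: the averaging logic lives in lambdas that Spark cannot inspect
  def rddAverages(spark: SparkSession): Map[String, Double] =
    spark.sparkContext.parallelize(sample)
      .mapValues(age => (age.toDouble, 1L))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
      .collect()
      .toMap

  // DataFrame version: a declarative aggregation that Spark SQL can plan and optimize
  def dataFrameAverages(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    sample.toDF("dept", "age")
      .groupBy("dept")
      .agg(avg("age").as("avg_age"))
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd vs dataframe").master("local[*]").getOrCreate()
    println(rddAverages(spark))
    println(dataFrameAverages(spark))
    spark.stop()
  }
}
```

Both functions should return the same averages; the difference is that only the DataFrame version gives Spark SQL a structured description of the computation to work with.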

Spark SQL builds on Spark Core's execution engine: it translates SQL queries and DataFrame operations into RDD operations and relies on Core's task scheduling, storage, and shuffle capabilities.
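
One way to observe this layering is to ask a DataFrame for its query plans and then for the RDD it compiles down to. The sketch below is illustrative only; the sample data and the object and method names are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainSketch {
  // Hypothetical query: names of people older than 18
  def buildQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = Seq(("alice", 18), ("bob", 20)).toDF("name", "age")
    people.filter($"age" > 18).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explain sketch").master("local[*]").getOrCreate()
    val query = buildQuery(spark)

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan
    query.explain(true)

    // The same query exposed as an RDD[Row], with its lineage of Spark Core RDDs
    println(query.rdd.toDebugString)

    spark.stop()
  }
}
```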

DataFrame creation methods:

createDataFrame – Use spark.createDataFrame(rdd, schema), where the RDD must be of type RDD[Row]. Example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumes an active SparkSession named `spark` (as in spark-shell)
// Schema: column names, types, and nullability
val schema = StructType(List(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("birthday", DateType, nullable = false)
))
// Each Row must match the schema in field order and type
val rdd = spark.sparkContext.parallelize(Seq(
  Row("小明", 18, java.sql.Date.valueOf("1990-01-01")),
  Row("小芳", 20, java.sql.Date.valueOf("1999-02-01"))
))
val df = spark.createDataFrame(rdd, schema)
df.show()

toDF – Import spark.implicits._ and call .toDF("col1", "col2", ...) on an RDD or Seq. Example:

import spark.implicits._
val df = Seq(("小明", 18, java.sql.Date.valueOf("1990-01-01")),
             ("小芳", 20, java.sql.Date.valueOf("1999-02-01")))
  .toDF("name", "age", "birthday")
df.show()
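
The toDF example above starts from a Seq of tuples. As a hedged sketch of two other variants, the code below calls toDF on a Seq of case class instances, where column names come from the field names, and on an RDD of tuples. The Person case class and the sample values are hypothetical and not part of the original article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type; defined at top level so Spark can derive an encoder for it
case class Person(name: String, age: Int)

object ToDfSketch {
  def fromCaseClasses(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Column names are taken from the case class fields: name, age
    Seq(Person("alice", 18), Person("bob", 20)).toDF()
  }

  def fromTupleRdd(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // toDF also works on an RDD of tuples; column names are given explicitly
    spark.sparkContext.parallelize(Seq(("alice", 18), ("bob", 20))).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("toDF sketch").master("local[*]").getOrCreate()
    fromCaseClasses(spark).printSchema()
    fromTupleRdd(spark).printSchema()
    spark.stop()
  }
}
```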

From files – Use spark.read.format(...).option(...).load(path), where the format can be csv, json, parquet, and so on. CSV example:

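import org.apache.spark.sql.SparkSession

// Local session for experimentation; drop .master(...) when submitting to a cluster
val spark = SparkSession.builder().appName("csv reader").master("local").getOrCreate()
val result = spark.read.format("csv")
  .option("delimiter", ",")        // field separator
  .option("header", "true")        // first line holds column names
  .option("nullValue", "\\N")      // treat the literal \N as null
  .option("inferSchema", "true")   // extra pass over the data to guess column types
  .load("path/demo.csv")
result.show()
result.printSchema()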
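
Schema inference is convenient but costs an extra pass over the file, so a common alternative is to supply the schema explicitly. The following is a minimal sketch under stated assumptions: it writes a small temporary CSV file so that it can run on its own, and the file contents, column names, and object name are invented for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object CsvSchemaSketch {
  // Explicit schema, used instead of inferSchema
  val schema = StructType(List(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  def readWithSchema(spark: SparkSession): DataFrame = {
    // Hypothetical input: a header line, one complete row, one row with a \N null marker
    val path = Files.createTempFile("demo", ".csv")
    Files.write(path, "name,age\nalice,18\nbob,\\N\n".getBytes(StandardCharsets.UTF_8))

    spark.read.format("csv")
      .option("header", "true")
      .option("nullValue", "\\N")
      .schema(schema)
      .load(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv schema sketch").master("local[*]").getOrCreate()
    val df = readWithSchema(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```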

From external data sources – Load from JDBC, e.g., MySQL:

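// Requires the MySQL JDBC driver (Connector/J) on the classpath
val url = "jdbc:mysql://localhost:3306/test"
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "test")     // table to read
  .option("user", "admin")       // placeholder credentials
  .option("password", "admin")
  .load()
df.show()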
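
For anything beyond a local test, it usually helps to keep credentials out of the source and to let Spark split the read into parallel range queries. The sketch below shows one possible shape for that. The environment variable names DB_USER and DB_PASSWORD, the numeric id column, and the bounds are all assumptions rather than details from the original article, and running it still requires a reachable database and its JDBC driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcReadSketch {
  // Builds the reader options; DB_USER / DB_PASSWORD and the `id` column are hypothetical
  def jdbcOptions(url: String, table: String): Map[String, String] = Map(
    "url" -> url,
    "dbtable" -> table,
    "user" -> sys.env.getOrElse("DB_USER", "admin"),
    "password" -> sys.env.getOrElse("DB_PASSWORD", "admin"),
    // Split the read into 4 range queries over the numeric partition column
    "partitionColumn" -> "id",
    "lowerBound" -> "1",
    "upperBound" -> "100000",
    "numPartitions" -> "4"
  )

  def readTable(spark: SparkSession, url: String, table: String): DataFrame =
    spark.read.format("jdbc").options(jdbcOptions(url, table)).load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc sketch").master("local[*]").getOrCreate()
    readTable(spark, "jdbc:mysql://localhost:3306/test", "test").show()
    spark.stop()
  }
}
```

The bounds only decide how the id range is divided among partitions; rows outside the bounds are still read.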

Common DataFrame operations:

Single‑row query using temporary view and SQL.

Group‑by aggregation with group by and avg.

Window functions for ranking and calculating running aggregates.
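
The three operations listed above can be sketched as follows. This is an illustration, not the original article's code: the employees view, the name, dept, and salary columns, and the sample rows are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, rank, sum}

object CommonOpsSketch {
  // Hypothetical sample data: (name, dept, salary)
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("alice", "dev", 100),
      ("bob", "dev", 80),
      ("carol", "ops", 90)
    ).toDF("name", "dept", "salary")
  }

  // 1. Single-row query through a temporary view and SQL
  def singleRow(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("employees")
    spark.sql("SELECT name, salary FROM employees WHERE name = 'alice'")
  }

  // 2. Group-by aggregation with avg
  def avgSalaryByDept(df: DataFrame): DataFrame =
    df.groupBy("dept").agg(avg("salary").as("avg_salary"))

  // 3. Window functions: rank within each dept and a running salary total
  def ranked(df: DataFrame): DataFrame = {
    val byDept = Window.partitionBy("dept").orderBy(col("salary").desc)
    val running = byDept.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df.withColumn("rank", rank().over(byDept))
      .withColumn("running_total", sum("salary").over(running))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("common ops sketch").master("local[*]").getOrCreate()
    val df = sample(spark)
    singleRow(spark, df).show()
    avgSalaryByDept(df).show()
    ranked(df).show()
    spark.stop()
  }
}
```

The explicit rowsBetween frame makes the running total accumulate row by row in salary order within each department.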

The original article provides full Scala code snippets for each of these operations.

The article closes by recapping what it covered (the origins of Spark SQL, DataFrame creation techniques, and frequently used operators) and points to future topics such as the Catalyst optimizer, the Tungsten execution engine, and join strategy selection.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
