
From SQL to RDD: Understanding Spark's Internal Architecture

This article explains how Spark converts SQL queries into RDD operations. It walks through a small example (creating a SparkSession, registering a temporary view, executing SQL) and then describes the InternalRow, TreeNode, and Expression structures that underpin the Catalyst optimizer.

Big Data Technology & Architecture

May 28, 2020


The article demonstrates the end‑to‑end process of converting SQL statements into RDD operations in Apache Spark. It starts by creating a SparkSession, reading a JSON file, registering it as a temporary view, and then running a SQL query.

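import org.apache.spark.sql.SparkSession

// Create a SparkSession. Since Spark 2.0 it has gradually replaced SparkContext as the entry point of a Spark application.
val spark = SparkSession.builder().appName("appName").master("local").getOrCreate()
// Create a table (temporary view) and load the data
spark.read.json("./test.json").createOrReplaceTempView("test_table")
// Analyze the data with SQL; any syntactically valid statement can be used
spark.sql("select name from test_table where a > 1").show()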

When the SQL query is submitted, Spark parses it into an unresolved logical plan, resolves table and column names during analysis, rewrites the resolved plan with the Catalyst optimizer, and finally generates a physical plan that executes as RDD transformations.
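The stages described above can be inspected through the `queryExecution` field of a DataFrame. The following is a minimal sketch, not code from the original article: the object name `PlanStagesSketch` and the two in-memory rows standing in for `test.json` are assumptions made so the example runs without an input file.

```scala
import org.apache.spark.sql.SparkSession

object PlanStagesSketch {
  // Registers a tiny stand-in table, runs the article's query, prints each
  // planning stage, and returns the number of rows in the resulting RDD.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical data replacing ./test.json: only the columns the query uses.
    Seq(("alice", 1L), ("bob", 2L)).toDF("name", "a")
      .createOrReplaceTempView("test_table")

    val qe = spark.sql("select name from test_table where a > 1").queryExecution

    println(qe.logical)       // plan as produced by the parser (names unresolved)
    println(qe.analyzed)      // plan after names and types are resolved
    println(qe.optimizedPlan) // plan after Catalyst optimizer rules
    println(qe.executedPlan)  // physical plan chosen for execution

    // The physical plan is executed as an RDD of InternalRow.
    qe.toRdd.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-stages").master("local[1]").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Calling `explain(true)` on the DataFrame prints roughly the same information in one step; the individual fields are used here only to make each stage visible on its own.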

The article includes visual diagrams of the SQL‑to‑RDD conversion steps and the actual transformation process (images omitted for brevity).

Internally, Spark represents each row using the InternalRow hierarchy. Classes such as BaseGenericInternalRow, GenericInternalRow, SpecificInternalRow, MutableUnsafeRow, JoinedRow, and UnsafeRow provide different storage and mutability characteristics, enabling efficient processing and code generation.
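To make the row classes more concrete, here is a small sketch that builds a `GenericInternalRow`, converts it to an `UnsafeRow`, and combines both in a `JoinedRow`. These classes live in Catalyst's internal packages, so their exact behaviour may differ between Spark versions; the object name `InternalRowSketch`, the two-column schema, and the sample values are assumptions for illustration.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

object InternalRowSketch {
  // Hypothetical schema: an integer column and a string column.
  val schema: StructType = StructType(Seq(
    StructField("a", IntegerType),
    StructField("name", StringType)))

  // GenericInternalRow keeps values in a plain Array[Any].
  // Strings are held as UTF8String inside Catalyst, not java.lang.String.
  def genericRow(): InternalRow =
    new GenericInternalRow(Array[Any](2, UTF8String.fromString("bob")))

  // UnsafeProjection writes the same fields into Spark's compact binary row format.
  // copy() is used because a projection may reuse its output buffer between calls.
  def toUnsafe(row: InternalRow): UnsafeRow =
    UnsafeProjection.create(schema).apply(row).copy()

  def main(args: Array[String]): Unit = {
    val row = genericRow()
    val unsafe = toUnsafe(row)
    // JoinedRow presents two rows as one wider row without copying their fields.
    val joined = new JoinedRow(row, unsafe)

    println(row.getInt(0))
    println(unsafe.getUTF8String(1))
    println(joined.numFields)
  }
}
```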

The TreeNode hierarchy forms the backbone of Spark SQL's logical and physical plan trees. Every node inherits common traversal, collection, and transformation methods. Its two main subclasses are Expression, which represents expressions, and QueryPlan, which is the parent of both logical plans (LogicalPlan) and physical plans (SparkPlan).
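The idea of a base class that gives every node the same traversal methods can be sketched in plain Scala. This is a deliberately simplified model and not Spark's actual code: `Node`, `Expr`, `Lit`, `Col`, `Plus`, and `Gt` are hypothetical names, and only `foreach`, `collect`, and `transformUp` are modelled.

```scala
object TreeNodeSketch {
  // Simplified stand-in for a generic tree node: subclasses only describe their
  // children; traversal and rewriting are inherited.
  abstract class Node[T <: Node[T]] { self: T =>
    def children: Seq[T]
    def withNewChildren(newChildren: Seq[T]): T

    // Pre-order traversal: visit this node, then each subtree.
    def foreach(f: T => Unit): Unit = {
      f(this)
      children.foreach(_.foreach(f))
    }

    // Gather the results of a partial function over every node it matches.
    def collect[B](pf: PartialFunction[T, B]): Seq[B] = {
      val out = scala.collection.mutable.ArrayBuffer.empty[B]
      foreach(n => if (pf.isDefinedAt(n)) out += pf(n))
      out.toSeq
    }

    // Bottom-up rewrite: rewrite the children first, then try the rule on the result.
    def transformUp(rule: PartialFunction[T, T]): T = {
      val rewritten = withNewChildren(children.map(_.transformUp(rule)))
      rule.applyOrElse(rewritten, (n: T) => n)
    }
  }

  sealed trait Expr extends Node[Expr]
  case class Lit(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(c: Seq[Expr]): Expr = this
  }
  case class Col(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(c: Seq[Expr]): Expr = this
  }
  case class Plus(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withNewChildren(c: Seq[Expr]): Expr = Plus(c(0), c(1))
  }
  case class Gt(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withNewChildren(c: Seq[Expr]): Expr = Gt(c(0), c(1))
  }

  // a > (1 + 0)
  val tree: Expr = Gt(Col("a"), Plus(Lit(1), Lit(0)))

  // Constant folding as a rewrite rule: the result is Gt(Col("a"), Lit(1)).
  val folded: Expr = tree.transformUp { case Plus(Lit(x), Lit(y)) => Lit(x + y) }

  // Column names referenced anywhere in the tree: Seq("a").
  val columns: Seq[String] = tree.collect { case Col(name) => name }
}
```

The self-type in `Node` mirrors the pattern Catalyst uses so that traversal methods return the concrete tree type rather than a generic node.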

Catalyst also offers node-positioning capabilities: each TreeNode records an origin (line and start position) captured while the SQL text is parsed, which lets developers locate the SQL substring that corresponds to a given node. This is helpful for debugging.
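As a rough illustration of node positioning, the sketch below parses the article's query with Catalyst's parser and lists the recorded position of each plan node and expression. It relies on internal Catalyst classes whose details vary between Spark versions, and the object name `OriginSketch` is an assumption; which nodes carry a position depends on how the parser constructs them, so no particular output is claimed.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

object OriginSketch {
  val sql = "select name from test_table where a > 1"

  // (node name, line, start position) for every plan node and every expression
  // in the parsed, still unresolved, plan.
  def positions(): Seq[(String, Option[Int], Option[Int])] = {
    val plan = CatalystSqlParser.parsePlan(sql)

    val planNodes = plan.collect {
      case p => (p.nodeName, p.origin.line, p.origin.startPosition)
    }
    val expressionNodes = plan.flatMap(_.expressions).flatMap(_.collect {
      case e => (e.nodeName, e.origin.line, e.origin.startPosition)
    })
    planNodes ++ expressionNodes
  }

  def main(args: Array[String]): Unit =
    positions().foreach(println)
}
```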

The Expression subsystem defines various expression types used in query planning, including:

Nondeterministic (e.g., Rand())

Unevaluable (throws an exception when evaluated)

CodegenFallback (cannot be code-generated and falls back to interpreted evaluation, often third-party UDFs)

LeafExpression (no child nodes, e.g., Star, CurrentDate)

UnaryExpression (single child, e.g., Abs)

BinaryExpression (two children)

TernaryExpression (three children)

These expressions are also TreeNode subclasses, so they inherit traversal methods and can be composed into complex expression trees.
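A short sketch of composing and traversing an expression tree with Catalyst's own classes follows. The expression `abs(-3) + 2` and the object name `ExpressionSketch` are chosen for illustration; these are internal APIs and constructor signatures have changed across Spark versions, so treat this as an example under those assumptions rather than a stable recipe.

```scala
import org.apache.spark.sql.catalyst.expressions.{Abs, Add, Expression, Literal}

object ExpressionSketch {
  // abs(-3) + 2: Add is a BinaryExpression, Abs a UnaryExpression,
  // and each Literal a LeafExpression.
  val expr: Expression = Add(Abs(Literal(-3)), Literal(2))

  // Because Expression is a TreeNode, the inherited collect method works on it.
  def literals: Seq[Any] = expr.collect { case l: Literal => l.value }

  def main(args: Array[String]): Unit = {
    println(expr.children.length) // number of direct children of the root
    println(expr.deterministic)   // whether repeated evaluation gives the same result
    println(literals)
    // Interpreted evaluation; no input row is needed because the tree
    // contains only literals.
    println(expr.eval())
  }
}
```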

The article concludes with additional diagrams of Spark's internal data system (images omitted).



Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
