From SQL to RDD: Understanding Spark's Internal Architecture
This article explains how Spark converts SQL queries into RDD operations by creating a SparkSession, registering temporary views, executing SQL, and then detailing the underlying InternalRow, TreeNode, and Expression structures that power the Catalyst optimizer.
The article demonstrates the end‑to‑end process of converting SQL statements into RDD operations in Apache Spark. It starts by creating a SparkSession, reading a JSON file, registering it as a temporary view, and then running a SQL query.
// 创建SparkSession类。从2.0开始逐步替代SparkContext称为Spark应用入口
var spark = SparkSession.builder().appName("appName").master("local").getOrCreate()
//创建数据表并读取数据
spark.read.json("./test.json").createOrReplaceTempView("test_table")
//通过SQL进行数据分析。可输入任何满足语法的语句
spark.sql("select name from test_table where a > 1").show()After the SQL query is executed, Spark translates it into a logical plan, optimizes it with the Catalyst optimizer, and finally generates a physical plan composed of RDD transformations.
The article includes visual diagrams of the SQL‑to‑RDD conversion steps and the actual transformation process (images omitted for brevity).
Internally, Spark represents each row using the InternalRow hierarchy. Classes such as BaseGenericInternalRow, GenericInternalRow, SpecificInternalRow, MutableUnsafeRow, JoinedRow, and UnsafeRow provide different storage and mutability characteristics, enabling efficient processing and code generation.
The TreeNode hierarchy forms the backbone of Spark SQL's logical and physical plan trees. Every node inherits common traversal and collection methods, and specific subclasses like Expression and QueryPlan represent expressions and execution plans respectively.
Catalyst also offers node‑positioning capabilities, allowing developers to locate the exact SQL substring that corresponds to a given TreeNode, which is helpful for debugging.
The Expression subsystem defines various expression types used in query planning, including:
Nondeterministic (e.g., Rand())
Unevaluable (throws an exception when evaluated)
CodegenFallback (cannot be code‑generated, often third‑party UDFs)
LeafExpression (no child nodes, e.g., Star, CurrentDate)
UnaryExpression (single child, e.g., Abs)
BinaryExpression (two children)
TernaryExpression (three children)
These expressions are also TreeNode subclasses, so they inherit traversal methods and can be composed into complex expression trees.
The article concludes with additional diagrams of Spark's internal data system and a reminder to like, share, and bookmark the post.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
