
A Practical Guide to Reading Apache Spark Source Code and Understanding Its Core Design

This article explains why Spark is a mature big‑data framework, recommends which Spark versions to study, lists essential research papers, describes how to set up the development environment, and outlines the key components of Spark’s core architecture for effective source‑code exploration.

Big Data Technology & Architecture

Feb 8, 2020


Spark has become one of the most mature frameworks in big-data computing. Compared with other big-data processing frameworks, it stands out for its stability and for the activity of its community.

Spark has gone through only three major version lines: 1.x, 2.x, and 3.x. The earliest implementation visible on GitHub is version 0.5, which realized Spark's core functionality in just over 10,000 lines of code.

If you have already used Spark and now want to read its source code, pick a 2.x version, preferably 2.3 or 2.4. The codebase has grown dramatically in later versions, so it is essential to focus on the most relevant parts rather than trying to read everything.

Fundamental Concepts

For first‑time readers, understanding Spark’s design philosophy, its abstractions, and the motivations behind the introduction of RDDs is crucial.

Here are a few recommended papers:

Resilient Distributed Datasets: A Fault‑Tolerant Abstraction for In‑Memory Cluster Computing https://fasionchan.com/blog/2017/10/19/yi-wen-tan-xing-fen-bu-shi-shu-ju-ji-yi-zhong-wei-nei-cun-hua-ji-qun-ji-suan-she-ji-de-rong-cuo-mo-xing/

An Architecture for Fast and General Data Processing on Large Clusters https://blog.csdn.net/weixin_44024821/article/details/89948115
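The central idea of the RDD paper is that a dataset records how it was derived (its lineage) instead of replicating its data, so lost results can be recomputed. The sketch below is a minimal toy model of that idea in plain Python. `ToyRDD` and its methods are hypothetical names chosen for illustration and are not Spark classes; real RDDs are partitioned and distributed, which this sketch ignores.

```python
# Toy model of RDD lineage. ToyRDD is an illustrative name, not a Spark class.

class ToyRDD:
    def __init__(self, source=None, parent=None, compute=None):
        self.source = source        # data of the root dataset only
        self.parent = parent        # lineage pointer to the parent dataset
        self.compute = compute      # function deriving rows from parent rows
        self.persisted = False
        self.cached = None

    def map(self, f):
        # Transformations are lazy: they only record lineage.
        return ToyRDD(parent=self, compute=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      compute=lambda rows: [x for x in rows if pred(x)])

    def persist(self):
        # Mark for caching; nothing is computed until an action runs.
        self.persisted = True
        return self

    def collect(self):
        # Action: use the cache if present, otherwise replay the lineage.
        if self.cached is not None:
            return list(self.cached)
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.compute(self.parent.collect())
        if self.persisted:
            self.cached = list(rows)
        return rows


base = ToyRDD(source=[1, 2, 3, 4, 5])
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()

first = even_squares.collect()      # computed and cached
even_squares.cached = None          # simulate losing the cached data
second = even_squares.collect()     # recomputed from lineage
```

Both `first` and `second` hold the same rows, because the second call rebuilds the result from the recorded transformations rather than from a replica.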

Environment Preparation

Setting up a Spark source-code development environment requires a JDK, Scala, and Maven (or another build tool). Expect it to take 1–4 hours, most of which is compilation time. Any online tutorial can guide you through the setup; Maven is recommended over sbt.

Spark Core Design

The following are the most important modules in Spark’s core design:

Spark Initialization

SparkContext, SparkEnv, SparkConf, RpcEnv, SparkStatusTracker, SecurityManager, SparkUI, MetricsSystem, TaskScheduler

Spark Storage System

SerializerManager, BroadcastManager, ShuffleManager, MemoryManager, NettyBlockTransferService, BlockManagerMaster, BlockManager, CacheManager

Spark Memory Management

MemoryManager, MemoryPool, ExecutionMemoryPool, StorageMemoryPool, MemoryStore, UnifiedMemoryManager
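Before reading these classes, it may help to have a rough mental model of unified memory management: execution and storage share one region, storage may use free memory but cannot evict execution, and execution may evict cached data down to a protected storage threshold. The sketch below is a simplified toy under those assumptions. `ToyUnifiedMemory` and its methods are hypothetical names, not Spark's API, and the real implementation also handles per-task accounting, block eviction order, and off-heap memory.

```python
# Toy model of unified memory sharing. All names here are illustrative.

class ToyUnifiedMemory:
    def __init__(self, total, storage_fraction=0.5):
        self.total = total
        # Portion of storage that execution is not allowed to evict.
        self.storage_region = int(total * storage_fraction)
        self.execution_used = 0
        self.storage_used = 0

    def free(self):
        return self.total - self.execution_used - self.storage_used

    def acquire_storage(self, n):
        # Storage can take any free memory but never evicts execution.
        if n > self.free():
            return False
        self.storage_used += n
        return True

    def acquire_execution(self, n):
        # Execution can reclaim storage memory above the protected region.
        evictable = max(0, self.storage_used - self.storage_region)
        granted = min(n, self.free() + evictable)
        shortfall = granted - self.free()
        if shortfall > 0:
            self.storage_used -= shortfall   # evict cached data
        self.execution_used += granted
        return granted


mem = ToyUnifiedMemory(total=100, storage_fraction=0.5)
cached_ok = mem.acquire_storage(80)        # storage borrows beyond its region
granted_1 = mem.acquire_execution(40)      # evicts part of the borrowed storage
cached_again = mem.acquire_storage(10)     # no free memory, execution untouched
granted_2 = mem.acquire_execution(30)      # only storage above the region is evictable
```

The asymmetry is the point of the design: cached data can be recomputed, whereas memory held by a running task cannot be taken away safely.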

Spark Execution System

LiveListenerBus, MapOutputTracker, DAGScheduler, TaskScheduler, ExecutorAllocationManager, OutputCommitCoordinator, ContextCleaner
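One concept worth holding in mind when reading DAGScheduler is that a job is cut into stages at shuffle (wide) dependencies, while narrow dependencies are pipelined inside one stage. The sketch below illustrates only that cutting rule on a linear chain of operations. `split_into_stages` is a hypothetical helper written for this article; the real scheduler walks a dependency graph backward from the final RDD.

```python
# Toy illustration of stage boundaries. split_into_stages is not a Spark function.

def split_into_stages(ops):
    """ops: list of (operation_name, 'narrow' | 'wide') in lineage order."""
    stages = [[]]
    for name, dependency in ops:
        # A wide dependency needs a shuffle, so it starts a new stage.
        if dependency == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


lineage = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("filter", "narrow"),
    ("repartition", "wide"),
]
stages = split_into_stages(lineage)
```

Here the six operations fall into three stages, with each shuffle marking where one stage's output must be fully written before the next stage can read it.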

Spark Deployment Modes

Local, SparkCluster, Standalone, Master/Executor/Worker fault tolerance

Spark Streaming

StreamingContext, Receiver, DStream, window operations
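Window operations are easier to follow in the source once the basic semantics are clear: a DStream is a sequence of micro-batches, and a window combines the most recent batches every slide interval. The sketch below models that with plain lists, measuring window length and slide in numbers of batches. `windowed` is a hypothetical helper, not part of the Spark Streaming API.

```python
# Toy model of a sliding window over micro-batches. Names are illustrative.

def windowed(batches, window_len, slide):
    """Emit one merged window every `slide` batches, covering up to
    `window_len` of the most recent batches."""
    out = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        merged = [x for batch in batches[start:end] for x in batch]
        out.append(merged)
    return out


batches = [[1], [2, 3], [4], [5, 6], [7]]
windows = windowed(batches, window_len=3, slide=2)
```

With a window of three batches sliding by two, consecutive windows overlap by one batch, which is why windowed streams usually benefit from reusing or caching earlier batches.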

Spark SQL

Catalog, TreeNode, lexical parser, RuleExecutor, Analyzer, Optimizer, HiveSQL integration
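TreeNode and RuleExecutor make more sense with the underlying pattern in mind: plans and expressions are immutable trees, rules rewrite subtrees, and a set of rules is applied repeatedly until the tree stops changing. The sketch below is a minimal toy of that pattern. `Lit`, `Attr`, `Add`, and the functions are hypothetical names for illustration and do not mirror the actual Catalyst classes or signatures.

```python
# Toy tree-rewriting sketch in the spirit of rule-based optimization.
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Attr, Add]


def transform_up(expr, rule):
    # Rewrite children first, then apply the rule to the rebuilt node.
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def constant_fold(e):
    if isinstance(e, Add) and isinstance(e.left, Lit) and isinstance(e.right, Lit):
        return Lit(e.left.value + e.right.value)
    return e


def drop_add_zero(e):
    if isinstance(e, Add) and e.right == Lit(0):
        return e.left
    if isinstance(e, Add) and e.left == Lit(0):
        return e.right
    return e


def run_to_fixed_point(expr, rules, max_iterations=100):
    # Apply every rule in order until a full pass changes nothing.
    for _ in range(max_iterations):
        rewritten = expr
        for rule in rules:
            rewritten = transform_up(rewritten, rule)
        if rewritten == expr:
            return expr
        expr = rewritten
    return expr


plan = Add(Attr("x"), Add(Lit(1), Lit(-1)))
optimized = run_to_fixed_point(plan, [constant_fold, drop_add_zero])
```

Constant folding turns the inner sum into a literal zero, which then lets the second rule remove the addition entirely, leaving only the attribute `x`.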

Other

If you are interested in graph processing (Spark GraphX) or machine learning (Spark MLlib), you can explore those components separately; most of the packages relevant to real-time computation are already covered by the modules listed above.

Reading source code is an essential step for every developer, and its benefits are self‑evident.



Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
