
Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext and its DAG and Task schedulers, supports both in‑memory and on‑disk storage, runs on a variety of cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, machine‑learning, and graph processing through its rich ecosystem.

Tencent Cloud Developer

Apache Spark is a fast, general‑purpose engine designed for large‑scale data processing. It is widely used in data mining and machine learning and has grown into a rich ecosystem. The article is authored by Xiong Feng, a Tencent big‑data engineer.

Spark originated from the UC Berkeley AMP Lab and is now a top‑level Apache project. Learning Spark is valuable for three main reasons: high performance, strong fault tolerance, and broad applicability.

Advantages over Hadoop:

High performance – Spark keeps intermediate results in memory, avoiding the disk‑heavy MapReduce model.

High fault tolerance – Spark uses RDD lineage for data recovery and supports checkpointing to limit lineage length.

Generality – Spark provides richer APIs and operations (actions and transformations) than Hadoop.
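The lineage-based recovery described above can be sketched with a toy model (illustrative only, not Spark's API): each derived dataset remembers its parent and the transformation that produced it, so a lost result can be rebuilt by replaying the chain, and checkpointing truncates that chain.

```python
# Toy lineage model: a "ToyRDD" records its parent and the transformation
# used to derive it, so a lost partition can be recomputed from lineage.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent = parent      # upstream ToyRDD, or None for a source
        self.fn = fn              # transformation applied to the parent
        self._data = data         # cached result; may be "lost"

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Recompute from lineage when the cached data is missing.
        if self._data is None:
            self._data = [self.fn(x) for x in self.parent.compute()]
        return self._data

    def checkpoint(self):
        # Materialize the result and cut the lineage, bounding recovery cost.
        self._data = self.compute()
        self.parent = self.fn = None

source = ToyRDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2).map(lambda x: x + 1)
print(doubled.compute())   # [3, 5, 7]
doubled._data = None       # simulate losing the cached partition
print(doubled.compute())   # recovered by replaying the lineage: [3, 5, 7]
```

Real Spark tracks lineage per partition and checkpoints to reliable storage such as HDFS; the mechanism, recompute-from-lineage with optional truncation, is the same idea.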

Spark ecosystem: It supports Java, Python, R, and Scala, and can run in local, standalone, YARN, and Kubernetes modes. Key components include Spark Core, Spark Streaming, Spark SQL, MLlib, SparkR, and GraphX.

Spark Core principles: SparkContext is the entry point that hides network communication, distributed deployment, storage, and computation details. Internally it uses an event bus (based on the observer pattern) and a metrics system for monitoring.
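The observer-pattern event bus mentioned above can be sketched in a few lines (class and method names here are invented for illustration, not Spark's internal `LiveListenerBus` API): components register listeners, and the bus notifies every observer when an event is posted.

```python
# Minimal observer-pattern event bus, the pattern the text says
# SparkContext uses internally. Names are illustrative, not Spark's.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        # Register an observer for one event type.
        self._listeners[event_type].append(listener)

    def post(self, event_type, payload):
        # Notify every registered observer of the event.
        for listener in self._listeners[event_type]:
            listener(payload)

bus = EventBus()
seen = []
bus.subscribe("job_start", seen.append)
bus.post("job_start", {"job_id": 0})
print(seen)   # [{'job_id': 0}]
```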

Storage system: Spark prefers in‑memory storage and spills to disk when needed. It can read/write from HDFS, Hive, AWS S3, HBase, Elasticsearch, MySQL, PostgreSQL, Flume, Kafka, etc., and supports file formats such as txt, json, csv, parquet, orc, and avro.

Scheduling system: The DAG Scheduler splits a job into stages based on wide/narrow dependencies, while the Task Scheduler handles task execution, resource allocation, status tracking, and result collection. Spark provides FIFO and FAIR scheduling algorithms.
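The stage-splitting rule above can be modeled for the simplest case, a straight pipeline of transformations: narrow dependencies stay within a stage, and each wide (shuffle) dependency starts a new one. This is a hedged toy; real Spark handles arbitrary DAGs and places the map side of a shuffle in the upstream stage.

```python
# Toy model of DAG-Scheduler stage splitting for a linear pipeline:
# each op is (name, is_wide); a wide dependency opens a new stage.
def split_into_stages(ops):
    stages, current = [], []
    for name, is_wide in ops:
        if is_wide and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

pipeline = [("map", False), ("filter", False),
            ("reduceByKey", True), ("map", False),
            ("groupByKey", True), ("count", False)]
print(split_into_stages(pipeline))
# [['map', 'filter'], ['reduceByKey', 'map'], ['groupByKey', 'count']]
```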

Spark SQL: Offers DataFrame and Dataset abstractions. DataFrame = data + schema, while Dataset adds compile‑time type safety and benefits from Spark SQL’s optimizer.
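The "DataFrame = data + schema" idea can be illustrated with a tiny Python structure. This is a conceptual sketch only: a real Dataset catches type errors at compile time, which Python can only approximate with a runtime check.

```python
# Illustrative only: a "DataFrame" as rows plus a schema, with a runtime
# type check standing in for Dataset-style compile-time safety.
schema = [("name", str), ("age", int)]
rows = [("Ada", 36), ("Linus", 28)]

def validate(rows, schema):
    # Verify each value against its column's declared type.
    for row in rows:
        for value, (col, typ) in zip(row, schema):
            if not isinstance(value, typ):
                raise TypeError(f"column {col!r} expects {typ.__name__}")
    return True

print(validate(rows, schema))          # True
# validate([("Ada", "36")], schema)    # would raise TypeError on 'age'
```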

Spark Streaming: Enables real‑time stream processing, integrates with Flume and Kafka, and reuses the RDD abstraction for stream workloads.

Key characteristics of Spark:

Computation speed – up to 100× faster than Hadoop MapReduce for in‑memory workloads, and roughly 10× faster for disk‑based workloads.

Ease of use – rich APIs hide low‑level details.

Unified platform – supports batch, streaming, machine learning, and graph processing.

Cluster deployment modes: local (single‑machine mode for development and testing), standalone (Spark's built‑in cluster manager), and the external cluster managers YARN, Mesos, and Kubernetes.

Cluster roles: Cluster Manager, Worker, Executor, Driver, and Application. The Cluster Manager allocates resources, Workers host Executors, Executors run tasks, the Driver creates SparkContext and coordinates the job, and the Application contains the user code.

YARN resource manager components: ResourceManager (global resource allocation), NodeManager (node‑level monitoring), ApplicationMaster (per‑application coordinator), and containers (resource bundles). These components manage resource isolation and task lifecycle.

Spark on YARN execution modes:

Client mode – the Driver runs on the client machine, allowing interactive debugging but increasing network traffic.

Cluster mode – the Driver runs inside YARN, suitable for production as it balances network load; logs are accessed via YARN.
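The two YARN modes are selected with `spark-submit`'s real `--deploy-mode` flag; the jar and class names below are placeholders, not from the original article.

```shell
# Client mode: the Driver runs on the machine where spark-submit is invoked.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp myapp.jar

# Cluster mode: the Driver runs inside a YARN container (production use).
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar

# In cluster mode, driver/executor logs are retrieved through YARN:
yarn logs -applicationId <application_id>
```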

Spark job execution flow: SparkContext requests resources from the Cluster Manager, which allocates containers on Workers. Executors are launched inside containers, run tasks, and return results to the Driver.

RDD iteration process: SparkContext creates RDDs and builds a DAG. The DAG Scheduler partitions the DAG into stages, the Task Scheduler assigns tasks to executors, and executors perform the computation.

