
Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

High Availability Architecture

Apache Spark is a top-level Apache project that serves as a fast, general-purpose engine for large-scale data processing, supporting offline batch jobs, interactive queries, machine learning, streaming, and graph computation.

Compared with MapReduce, Spark uses a DAG execution engine and supports in-memory iterative computation, running up to ten times faster when data is on disk and up to a hundred times faster when it fits in memory.
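As a sketch of why in-memory iteration matters (assuming a live SparkContext `sc` and a hypothetical input file), cache() keeps a parsed RDD in executor memory, so repeated passes avoid re-reading and re-parsing the input:

```scala
// Sketch only: assumes a live SparkContext `sc` and a hypothetical
// "numbers.txt" input; runs inside a Spark application, not standalone.
val nums = sc.textFile("numbers.txt").map(_.toDouble).cache()

var threshold = 0.0
for (_ <- 1 to 10) {
  // Each pass reuses the cached in-memory partitions instead of
  // recomputing the whole lineage from the file.
  threshold = nums.filter(_ > threshold).mean()
}
```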

An RDD (Resilient Distributed Dataset) is a read-only, partitioned collection of records. Transformations build new RDDs lazily, while actions trigger actual execution. The classic word-count example in Scala illustrates basic RDD operations:

val file = sc.textFile(args(0))
val words = file.flatMap(line => line.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))
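The per-record semantics of flatMap/map/reduceByKey can be mirrored with plain Scala collections (no Spark needed); here reduceByKey(_ + _) is emulated with groupMapReduce from the standard library:

```scala
// Local word count mirroring the RDD pipeline above (Scala 2.13+).
val lines  = Seq("to be or", "not to be")
val words  = lines.flatMap(_.split(" "))        // like RDD.flatMap
val counts = words
  .map(w => (w, 1))                             // like RDD.map
  .groupMapReduce(_._1)(_._2)(_ + _)            // emulates reduceByKey(_ + _)
// counts == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)
```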

The DAG of RDD dependencies is split into stages: narrow dependencies are pipelined within a stage, while wide (shuffle) dependencies introduce stage boundaries. Each stage is scheduled as a set of tasks, and cached or persisted intermediate results are stored by the block manager on each executor to avoid recomputation.
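As an illustration (a sketch assuming a live SparkContext `sc` and a hypothetical input file), reduceByKey introduces a wide dependency, so the pipeline below splits into two stages; RDD.toDebugString prints the lineage with its shuffle boundaries:

```scala
// Sketch only: assumes a live SparkContext `sc`.
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))   // narrow dependency: pipelined in the same stage
  .map((_, 1))             // narrow dependency: still the same stage
  .reduceByKey(_ + _)      // wide dependency: shuffle, new stage boundary

// Prints the lineage; indentation marks the shuffle/stage boundaries.
println(counts.toDebugString)
```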

Spark can run under four deployment setups: local, Standalone, YARN, and Mesos. When running on a cluster manager, the driver can be placed on the client (client mode) or inside the cluster (cluster mode); yarn-cluster mode is illustrated as a typical production setup.
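The deployment choice is made at submission time; the spark-submit invocations below sketch the two common placements (class name, resource sizes, and paths are placeholders for illustration):

```shell
# yarn-cluster mode: the driver runs inside the YARN cluster,
# the typical production setup.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  --executor-memory 4g \
  --num-executors 10 \
  wordcount.jar hdfs:///input hdfs:///output

# Standalone client mode: the driver stays on the submitting machine,
# convenient for interactive debugging.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  --class com.example.WordCount \
  wordcount.jar hdfs:///input hdfs:///output
```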

Spark 2.0 introduced several major features: unification of the DataFrame and Dataset APIs (DataFrame becomes a type alias for Dataset[Row]), whole-stage code generation, which compiles multiple physical operators into a single loop for near hand-coded speed, and Structured Streaming, which unifies the batch and streaming APIs on the Dataset engine.
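A minimal sketch of the unified API (the input file and schema are hypothetical, and the code requires a Spark 2.x dependency to run):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder()
  .appName("spark2-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// In Spark 2.0 the DataFrame type is just an alias for Dataset[Row]:
val df: Dataset[Row] = spark.read.json("people.json") // hypothetical input

// A typed Dataset attaches the schema to a case class:
val people: Dataset[Person] = df.as[Person]

// explain() prints the physical plan; operators fused by whole-stage
// code generation are prefixed with "*" (WholeStageCodegen).
people.filter(_.age > 21).groupBy("name").count().explain()
```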

Additional performance improvements include a vectorized Parquet decoder, radix sort for faster sorting, a vectorized hashmap for group‑by, native window functions, elimination of redundant calculations, LZ4 compression, and many SQL syntax enhancements (intersect/except, subqueries, DDL/DML, etc.). Compatibility changes such as removal of Hadoop < 2.2 support, HTTPBroadcast, HashShuffleManager, and Akka RPC are also noted.
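Structured Streaming, mentioned above, models a stream as an unbounded table, so a streaming word count is written with the same Dataset operators as its batch counterpart; a sketch assuming the built-in socket source and console sink:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-demo").getOrCreate()
import spark.implicits._

// Only the source and sink differ from a batch word count.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete") // emit the full updated counts table each trigger
  .format("console")
  .start()
query.awaitTermination()
```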

The article concludes with a Q&A covering Hive‑to‑Spark migration, memory‑management strategies, checkpointing and exactly‑once semantics in Spark Streaming, typical application scenarios, hardware sizing for Spark‑HDFS clusters, RDD lifecycle, and job scheduling considerations.

Tags: performance optimization, Big Data, Spark, RDD, DataFrames, Structured Streaming
Written by

High Availability Architecture

Official account for High Availability Architecture.
