Comprehensive Introduction to Apache Spark: History, Core Concepts, Architecture, and Performance Optimization
This article provides a thorough overview of Apache Spark, covering its origins, comparison with MapReduce, core concepts such as RDD, DAG, Jobs, Stages, and Tasks, the submission process, Web UI, and detailed performance tuning techniques including data skew mitigation.
Author Introduction – Zhou Ming, algorithm engineer at Qunar.com, shares his experience in recommendation algorithms.
What This Article Offers – A quick overview for readers new to Spark, a consolidation of fundamentals for recent adopters, and a knowledge refresher for experienced users.
1. Spark Origin – Big data technologies trace back to Google’s three seminal papers (GFS in 2003, MapReduce in 2004, BigTable in 2006). Hadoop (HDFS + MapReduce) emerged in 2006, followed by the higher‑level tools Pig and Hive. Spark, developed at UC Berkeley’s AMPLab and described in the 2012 RDD paper, addressed MapReduce’s heavy disk I/O and limited expressiveness by leveraging in‑memory computation.
2. MapReduce Limitations – Only Map and Reduce operators, heavy disk I/O, poor support for iterative, interactive, and streaming workloads, and inflexible programming model.
3. Why Spark? – Faster execution (up to 100× faster than Hadoop MapReduce for in‑memory workloads) thanks to the RDD/DAG in‑memory processing model, a rich operator library, multi‑language APIs (Java, Scala, Python, R, SQL), a unified ecosystem (Spark SQL, Spark Streaming, MLlib, GraphX), and portability across cluster managers (standalone, YARN, Mesos) and a wide range of data sources.
4. Core Spark Concepts
RDD – Resilient Distributed Dataset, the fundamental immutable data structure.
DAG – Directed Acyclic Graph representing job stages and dependencies.
Job – The unit of work triggered by an action; transformations are lazy and only extend the DAG until an action forces execution.
Stage – A segment of the DAG bounded by shuffle dependencies; operations within a stage are pipelined without shuffling data.
Task – The smallest unit of execution, one per partition within a stage, run on an executor.
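To make these concepts concrete, here is a minimal pure‑Python sketch (not Spark itself; the MiniRDD class and its methods are illustrative) of how an RDD records a lineage of lazy transformations and only computes when an action such as collect() is called:

```python
# Illustrative sketch of RDD-style lazy evaluation. Real Spark splits data
# into partitions distributed across executors; this toy keeps one partition.

class MiniRDD:
    def __init__(self, data, lineage=None):
        self.data = list(data)          # the source records
        self.lineage = lineage or []    # recorded transformations (the "DAG")

    # Transformations are lazy: they only append to the lineage.
    def map(self, fn):
        return MiniRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self.data, self.lineage + [("filter", pred)])

    # An action triggers execution of the whole lineage (a "Job").
    def collect(self):
        records = self.data
        for op, fn in self.lineage:
            if op == "map":
                records = [fn(r) for r in records]
            else:  # filter
                records = [r for r in records if fn(r)]
        return records

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18]
```

In real Spark the recorded lineage also lets lost partitions be recomputed from their sources, which is where the "resilient" in RDD comes from.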
5. Spark Submission Process – The driver can be launched in three deployment modes: local, yarn‑client (driver runs on the submitting machine), and yarn‑cluster (driver runs inside the cluster). Once started, the driver requests executor resources, splits the job into stages at shuffle boundaries, creates one task per partition for each stage, and schedules those tasks on executors.
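As an illustration, a yarn‑cluster submission might look like the following (the class name, jar path, and resource sizes are placeholders, not recommendations):

```
# Illustrative yarn-cluster submission; class, jar, and sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my-app.jar
```

Switching --deploy-mode to client keeps the driver on the submitting machine, which is convenient for interactive debugging.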
6. Spark Web UI – Provides visibility into job DAGs, stage breakdowns, executor and task metrics, serving as the starting point for performance tuning.
7. Performance Optimization
Reduce data redundancy: filter unnecessary records as early as possible, deduplicate join keys before joining, and cache RDDs that are reused across jobs.
Persistence levels: prefer MEMORY_ONLY, fallback to MEMORY_ONLY_SER, MEMORY_AND_DISK, etc., avoiding DISK_ONLY unless necessary.
Parameter tuning: adjust spark.sql.shuffle.partitions and other Spark configs.
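For example, a few commonly tuned settings can be placed in spark-defaults.conf or passed via --conf at submit time (the values below are illustrative starting points and must be adjusted per workload):

```
# Default is 200; raise for large shuffles (illustrative value)
spark.sql.shuffle.partitions    400
# Parallelism for RDD-level shuffle operations
spark.default.parallelism       400
# Kryo is generally faster and more compact than Java serialization
spark.serializer                org.apache.spark.serializer.KryoSerializer
```
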
Data skew mitigation: identify skewed keys via Web UI, filter or hash‑partition skewed keys, increase shuffle parallelism, use broadcast joins for small tables, or apply key‑hash‑then‑aggregate techniques.
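The key‑hash‑then‑aggregate ("salting") technique can be sketched in plain Python, independent of Spark: attach a random suffix to each key so a hot key's records spread across several buckets, aggregate partially, then strip the suffix and aggregate again. The salted_sum function below is illustrative, not a Spark API:

```python
import random
from collections import defaultdict

def salted_sum(records, num_salts=4):
    """Two-phase aggregation for skewed keys (illustrative sketch).

    records: iterable of (key, value) pairs where one key may dominate.
    """
    # Phase 1: salt each key so a hot key spreads over num_salts buckets;
    # in Spark these buckets would land on different reduce tasks.
    partial = defaultdict(int)
    for key, value in records:
        salted_key = (key, random.randrange(num_salts))
        partial[salted_key] += value

    # Phase 2: strip the salt and combine the now-small partial aggregates.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# A skewed dataset: key "hot" carries almost all of the records.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 5
print(salted_sum(data))  # {'hot': 1000, 'cold': 5}
```

The same two‑phase shape applies in Spark: a first reduceByKey on salted keys, then a second reduceByKey after removing the salt.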
8. Summary – The article covered Spark’s historical background, core concepts, submission workflow, UI, and a suite of optimization strategies to help readers understand and effectively use Spark in real‑world scenarios.
Recruitment Notice – Qunar.com is hiring interns and technical experts across multiple positions.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.