Comprehensive Introduction to Apache Spark: History, Core Concepts, Architecture, and Performance Optimization
This article provides a thorough overview of Apache Spark, covering its origins, comparison with MapReduce, core concepts such as RDD, DAG, Jobs, Stages, and Tasks, the submission process, Web UI, and detailed performance tuning techniques including data skew mitigation.
Author Introduction – Zhou Ming, algorithm engineer at Qunar.com, shares his experience in recommendation algorithms.
What This Article Offers – A quick overview for readers new to Spark, a consolidation of fundamentals for recent adopters, and a knowledge refresher for experienced users.
1. Spark Origin – Big data technologies trace back to Google’s three seminal papers (GFS in 2003, MapReduce in 2004, BigTable in 2006). Hadoop (HDFS + MapReduce) emerged in 2006, followed by the higher‑level tools Pig and Hive. Spark, developed at UC Berkeley’s AMPLab and described in the 2012 RDD paper, addressed MapReduce’s heavy disk I/O and limited expressiveness by leveraging in‑memory computation.
2. MapReduce Limitations – Only Map and Reduce operators, heavy disk I/O, poor support for iterative, interactive, and streaming workloads, and inflexible programming model.
3. Why Spark? – Faster execution (up to 100× faster than Hadoop MapReduce for in‑memory workloads) thanks to the RDD/DAG in‑memory processing model, a rich operator library, multi‑language APIs (Java, Scala, Python, R, SQL), a unified ecosystem (Spark SQL, Spark Streaming, MLlib, GraphX), and portability across cluster managers (standalone, YARN, Mesos) and a wide range of data sources.
4. Core Spark Concepts
RDD – Resilient Distributed Dataset, the fundamental immutable data structure.
DAG – Directed Acyclic Graph representing job stages and dependencies.
Job – The unit of work triggered by an action; transformations are lazy and only extend the DAG until an action forces execution.
Stage – A segment of the DAG bounded by shuffle dependencies; operations within a stage are pipelined without shuffling data.
Task – The smallest unit of execution, one per partition within a stage, run on an executor.
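To make these concepts concrete, here is a minimal pure‑Python sketch (not Spark itself; the MiniRDD class and its methods are illustrative) of how an RDD records a lineage of lazy transformations and only computes when an action such as collect() is called:

```python
# Illustrative sketch of RDD-style lazy evaluation. Real Spark splits data
# into partitions distributed across executors; this toy keeps one partition.

class MiniRDD:
    def __init__(self, data, lineage=None):
        self.data = list(data)          # the source records
        self.lineage = lineage or []    # recorded transformations (the "DAG")

    # Transformations are lazy: they only append to the lineage.
    def map(self, fn):
        return MiniRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self.data, self.lineage + [("filter", pred)])

    # An action triggers execution of the whole lineage (a "Job").
    def collect(self):
        records = self.data
        for op, fn in self.lineage:
            if op == "map":
                records = [fn(r) for r in records]
            else:  # filter
                records = [r for r in records if fn(r)]
        return records

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18]
```

In real Spark the recorded lineage also lets lost partitions be recomputed from their sources, which is where the "resilient" in RDD comes from.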
5. Spark Submission Process – The driver can be launched in three deployment modes: local, yarn‑client (driver runs on the submitting machine), and yarn‑cluster (driver runs inside the cluster). Once started, the driver requests executor resources, splits the job into stages at shuffle boundaries, creates one task per partition for each stage, and schedules those tasks on executors.
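As an illustration, a yarn‑cluster submission might look like the following (the class name, jar path, and resource sizes are placeholders, not recommendations):

```
# Illustrative yarn-cluster submission; class, jar, and sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my-app.jar
```

Switching --deploy-mode to client keeps the driver on the submitting machine, which is convenient for interactive debugging.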
6. Spark Web UI – Provides visibility into job DAGs, stage breakdowns, executor and task metrics, serving as the starting point for performance tuning.
7. Performance Optimization
Reduce data redundancy: filter unnecessary records as early as possible, deduplicate join keys before joining, and cache RDDs that are reused across jobs.
Persistence levels: prefer MEMORY_ONLY, fallback to MEMORY_ONLY_SER, MEMORY_AND_DISK, etc., avoiding DISK_ONLY unless necessary.
Parameter tuning: adjust spark.sql.shuffle.partitions and other Spark configs.
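For example, a few commonly tuned settings can be placed in spark-defaults.conf or passed via --conf at submit time (the values below are illustrative starting points and must be adjusted per workload):

```
# Default is 200; raise for large shuffles (illustrative value)
spark.sql.shuffle.partitions    400
# Parallelism for RDD-level shuffle operations
spark.default.parallelism       400
# Kryo is generally faster and more compact than Java serialization
spark.serializer                org.apache.spark.serializer.KryoSerializer
```
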
Data skew mitigation: identify skewed keys via Web UI, filter or hash‑partition skewed keys, increase shuffle parallelism, use broadcast joins for small tables, or apply key‑hash‑then‑aggregate techniques.
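The key‑hash‑then‑aggregate ("salting") technique can be sketched in plain Python, independent of Spark: attach a random suffix to each key so a hot key's records spread across several buckets, aggregate partially, then strip the suffix and aggregate again. The salted_sum function below is illustrative, not a Spark API:

```python
import random
from collections import defaultdict

def salted_sum(records, num_salts=4):
    """Two-phase aggregation for skewed keys (illustrative sketch).

    records: iterable of (key, value) pairs where one key may dominate.
    """
    # Phase 1: salt each key so a hot key spreads over num_salts buckets;
    # in Spark these buckets would land on different reduce tasks.
    partial = defaultdict(int)
    for key, value in records:
        salted_key = (key, random.randrange(num_salts))
        partial[salted_key] += value

    # Phase 2: strip the salt and combine the now-small partial aggregates.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# A skewed dataset: key "hot" carries almost all of the records.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 5
print(salted_sum(data))  # {'hot': 1000, 'cold': 5}
```

The same two‑phase shape applies in Spark: a first reduceByKey on salted keys, then a second reduceByKey after removing the salt.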
8. Summary – The article covered Spark’s historical background, core concepts, submission workflow, UI, and a suite of optimization strategies to help readers understand and effectively use Spark in real‑world scenarios.
Recruitment Notice – Qunar.com is hiring interns and technical experts across multiple positions.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.