Why Spark Is the Next Big Thing in Big Data: Core Concepts Explained

This article provides a comprehensive overview of Apache Spark, covering its origins, core concepts such as RDDs, transformations, actions, dependencies, execution modes, and key components like Spark SQL, Streaming, MLlib, and GraphX, while also offering practical code examples and visual illustrations.

dbaplus Community

Nov 27, 2015

Introduction

Apache Spark is one of the most talked-about technologies in big data. The project claims up to 100× faster execution than Hadoop MapReduce when data fits in memory, and up to 10× faster on disk. It offers high-level APIs in Java, Scala, Python, and R, supports interactive shells, and runs on Hadoop YARN, Mesos, its own standalone cluster manager, or cloud environments, reading from HDFS, Cassandra, HBase, S3, and other storage systems.

Background: Why Spark Was Created

MapReduce accelerated the big-data era, but it has limitations: it is batch-oriented, it writes intermediate results to disk and reads them back between steps, it follows a rigid map-then-reduce execution model, and it is a poor fit for iterative or interactive workloads. Spark was designed to address these limitations as a single engine that handles batch, interactive, iterative, and streaming processing.

Core Concepts

Resilient Distributed Dataset (RDD)

An RDD is an immutable, partitioned collection of objects that can be operated on in parallel. It is fault tolerant because a lost partition can be recomputed from the RDD's lineage, the recorded chain of transformations that produced it. An RDD can also be cached in memory, which is the foundation for Spark's fast iterative algorithms.

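// `sc` is assumed to be an existing SparkContext (for example, the one spark-shell provides);
// parsePoint is a user-defined parser that is not shown here.
val lines = sc.textFile("hdfs://...")              // reference the file on HDFS (nothing is read yet)
val points = lines.map(line => parsePoint(line))   // transformation: parse each line into a point
points.filter(p => p.x > 100).count()              // action: runs the job and returns the count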

The snippet above reads a text file from HDFS, maps each line to a point object, filters points with x greater than 100, and counts the result.
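The same pipeline can be sketched as a self-contained program. This is only an illustration under stated assumptions: the `Point` case class, the `x,y` line format, and the in-memory input (used here instead of an HDFS path) are hypothetical and not part of the original snippet, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type; the original snippet does not define it.
case class Point(x: Double, y: Double)

object RddBasics {
  // Assumed input format: one "x,y" pair per line.
  def parsePoint(line: String): Point = {
    val Array(x, y) = line.split(",").map(_.trim.toDouble)
    Point(x, y)
  }

  // Builds the map -> filter -> count pipeline on in-memory lines.
  def countLargeX(sc: SparkContext, lines: Seq[String]): Long = {
    // cache() marks the parsed points to be kept in memory once computed,
    // so later actions on `points` would not re-parse the input.
    val points = sc.parallelize(lines).map(parsePoint).cache()
    points.filter(p => p.x > 100).count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-basics")
    val sc = new SparkContext(conf)
    try println(countLargeX(sc, Seq("50,1", "150,2", "300,3")))
    finally sc.stop()
  }
}
```

Caching pays off only when an RDD is reused by more than one action; for a single `count()` it is shown purely to illustrate the call.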

RDD Programming Interface

Operations are divided into Transformations (lazy, return a new RDD) and Actions (trigger execution and return a result).

Transformations

map(func) – apply func to each element.
filter(func) – keep elements where func returns true.
flatMap(func) – each input can produce zero or more outputs.
sample(withReplacement, frac, seed) – random sampling.
union(otherDataset) – combine two RDDs.
groupByKey([numTasks]) – group values by key.
reduceByKey(func, [numTasks]) – aggregate values per key.
join(otherDataset, [numTasks]) – inner join on keys.
cartesian(otherDataset) – Cartesian product.
sortByKey([ascendingOrder]) – sort by key.

Actions

reduce(func) – aggregate all elements using a commutative and associative function.
collect() – return all elements to the driver (use with caution on large datasets).
count() – return the number of elements.
take(n) – return the first n elements to the driver.
first() – same as take(1), but returns the element itself.
saveAsTextFile(path) – write each element as a line of text.
saveAsSequenceFile(path) – write key-value pairs in Hadoop SequenceFile format.
foreach(func) – apply func to each element, usually for its side effects.
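As a rough illustration of how transformations and actions fit together, the sketch below chains several transformations from the list above and then triggers them with a single action. The object and method names are made up for this example, and it assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // Transformations: each call only records a step and returns a new RDD.
    val counts = sc.parallelize(lines)
      .flatMap(_.split("\\s+"))   // one line -> zero or more words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word

    // Action: collect() runs the whole chain and returns the result to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("word-count")
    val sc = new SparkContext(conf)
    try println(wordCounts(sc, Seq("a b a", "b c")))
    finally sc.stop()
  }
}
```

Because the transformations are lazy, no data is processed until `collect()` is called.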

RDD Dependencies

Two types of dependencies exist between parent and child RDDs:

Narrow Dependency – each partition of the parent RDD is used by at most one partition of the child RDD (e.g., map, filter, union), so no data has to move between partitions.

Wide Dependency – a partition of the parent RDD may be used by multiple partitions of the child RDD (e.g., groupByKey), which requires a shuffle.
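One way to see the two dependency types is to inspect an RDD's `dependencies` field. The sketch below is an illustration only: the `kinds` helper is a made-up name, and `NarrowDependency` and `ShuffleDependency` are developer-level Spark classes, so this is meant for exploration rather than application code.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencyKinds {
  // Classifies the direct dependencies of an RDD as "narrow" or "wide".
  def kinds(rdd: RDD[_]): Seq[String] = rdd.dependencies.map {
    case _: NarrowDependency[_]        => "narrow"
    case _: ShuffleDependency[_, _, _] => "wide"
    case other                         => other.getClass.getSimpleName
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("dependency-kinds")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      println(kinds(pairs.mapValues(_ + 1))) // mapValues keeps data in place
      println(kinds(pairs.groupByKey()))     // groupByKey must move data by key
    } finally sc.stop()
  }
}
```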

Stage DAG

When an action is called, Spark submits a job and splits it into stages. Consecutive operations with narrow dependencies are pipelined within a single stage, while each wide dependency marks a stage boundary and requires a shuffle. The resulting directed acyclic graph (DAG) of stages determines the execution order.
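The lineage that Spark turns into stages can be printed with `toDebugString`. The following is a small sketch with made-up names and sample data; the exact text of the output varies between Spark versions, so it is not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  // Builds a small word-count lineage and returns its textual description.
  def lineage(sc: SparkContext): String = {
    val counts = sc.parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))  // narrow: pipelined with the next step
      .map((_, 1))            // narrow: same stage as flatMap
      .reduceByKey(_ + _)     // wide: introduces a shuffle and a new stage
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("lineage-demo")
    val sc = new SparkContext(conf)
    try println(lineage(sc))
    finally sc.stop()
  }
}
```

In the printed lineage, the RDD produced by `reduceByKey` appears as a `ShuffledRDD`, and the RDDs it depends on are shown at a deeper indentation level, which corresponds to the stage boundary.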

Spark Execution Modes

Local mode – runs on a single machine, useful for testing.

Pseudo‑distributed mode – simulates a cluster on one machine.

Cluster mode

Standalone – Spark’s own cluster manager.

YARN – integrates with Hadoop YARN for resource management.

Mesos – runs on Apache Mesos.
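In application code, the execution mode is usually selected through the master URL. The sketch below maps the modes listed above to the URL forms Spark accepts; the host names and ports are placeholders, the `masterFor` and `confFor` helpers are invented for this example, and the plain `yarn` master assumes Spark 2.0 or later (earlier releases used `yarn-client` and `yarn-cluster`). Mesos support was deprecated in later Spark releases.

```scala
import org.apache.spark.SparkConf

object MasterUrls {
  // Host names and ports below are placeholders, not values from the article.
  def masterFor(mode: String): String = mode match {
    case "local"      => "local[*]"                 // all cores of one machine
    case "standalone" => "spark://master-host:7077" // Spark's own cluster manager
    case "yarn"       => "yarn"                     // cluster located via the Hadoop configuration
    case "mesos"      => "mesos://mesos-host:5050"  // Apache Mesos master
    case other        => throw new IllegalArgumentException(s"unknown mode: $other")
  }

  // Builds a configuration for the chosen mode without starting a context.
  def confFor(mode: String): SparkConf =
    new SparkConf().setAppName("mode-demo").setMaster(masterFor(mode))

  def main(args: Array[String]): Unit =
    Seq("local", "standalone", "yarn", "mesos").foreach { mode =>
      println(s"$mode -> ${confFor(mode).get("spark.master")}")
    }
}
```

In practice the master is often passed to `spark-submit` with `--master` instead of being hard-coded, so the same application can move between modes unchanged.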

Spark Components

Spark SQL

Spark SQL is Spark's module for structured data. It provides the DataFrame API and can also act as a distributed SQL query engine. A DataFrame is analogous to a table in a relational database or a data frame in R or Python.
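To make the relationship between DataFrames and SQL concrete, the sketch below runs the same query both ways. It assumes Spark 2.0 or later, where `SparkSession` is the entry point (the Spark 1.x releases current when this article was written used `SQLContext`); the table, column names, and sample rows are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlDemo {
  // Hypothetical sample data: (name, age) rows.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", 34), ("Bob", 19), ("Cid", 45)).toDF("name", "age")
  }

  // Query expressed as SQL text against a temporary view.
  def viaSql(spark: SparkSession, minAge: Int): Seq[String] = {
    import spark.implicits._
    people(spark).createOrReplaceTempView("people")
    spark.sql(s"SELECT name FROM people WHERE age > $minAge ORDER BY name")
      .as[String].collect().toSeq
  }

  // The same query expressed with the DataFrame API.
  def viaDataFrame(spark: SparkSession, minAge: Int): Seq[String] = {
    import spark.implicits._
    people(spark).filter($"age" > minAge).select("name").orderBy("name")
      .as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
    try {
      println(viaSql(spark, 30))
      println(viaDataFrame(spark, 30))
    } finally spark.stop()
  }
}
```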

Spark Streaming

Spark Streaming extends the core API to support scalable, fault-tolerant processing of live data streams. Sources include Kafka, Flume, Twitter, Kinesis, and TCP sockets. Incoming data is divided into micro-batches, and the resulting discretized stream (DStream) is represented internally as a sequence of RDDs.

The processing flow: incoming stream → micro‑batches → Spark engine → results (e.g., files, databases, dashboards).
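Because each micro-batch is an RDD, per-batch logic can be written as an ordinary RDD function and applied to the stream with `transform`. The sketch below illustrates this with a word count over a TCP socket; the host, port, batch interval, and object name are assumptions made for the example, and it requires the `spark-streaming` module on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  // Per-batch logic: a plain RDD-to-RDD function, independent of streaming.
  def countWords(batch: RDD[String]): RDD[(String, Int)] =
    batch.flatMap(_.split("\\s+")).filter(_.nonEmpty).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-word-count")
    val ssc = new StreamingContext(conf, Seconds(1))     // 1-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)  // assumed host and port

    // Apply the RDD function to every micro-batch and print a sample of each result.
    lines.transform((rdd: RDD[String]) => countWords(rdd)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, a text source such as `nc -lk 9999` can feed lines to the socket while the program runs.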

Machine Learning Library (MLlib)

MLlib offers scalable machine‑learning algorithms (classification, regression, clustering, collaborative filtering, dimensionality reduction) and utilities for feature extraction, model evaluation, and pipelines.
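As one small example of the clustering support, the sketch below runs k-means on four hand-made 2-D points. It uses the DataFrame-based `org.apache.spark.ml` API from Spark 2.0 or later, which is an assumption of this example rather than something the article specifies; the data and object name are invented.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansDemo {
  // Clusters four 2-D points into k groups and returns each point's cluster id.
  def clusterIds(spark: SparkSession, k: Int): Seq[Int] = {
    import spark.implicits._
    // Two points near the origin and two points near (9, 9).
    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )
    val points = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    // Fit the model, then add a "prediction" column holding the cluster id.
    val model = new KMeans().setK(k).setSeed(1L).fit(points)
    model.transform(points).select("prediction").as[Int].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("kmeans-demo").getOrCreate()
    try println(clusterIds(spark, 2))
    finally spark.stop()
  }
}
```

Which cluster receives id 0 and which receives id 1 is arbitrary; what matters is that nearby points share an id.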

GraphX

GraphX provides a unified API for graph‑parallel computation, exposing a property graph abstraction (vertices and edges) and a set of graph operators (e.g., subgraph, joinVertices, aggregateMessages). It enables algorithms such as PageRank and connected components on top of Spark’s RDD engine.
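A minimal sketch of the connected-components algorithm mentioned above follows. The edge list and object name are made up for illustration, and the example requires the `spark-graphx` module on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

object ComponentsDemo {
  // Returns vertex id -> component id, where the component id is the
  // smallest vertex id in that connected component.
  def components(sc: SparkContext, edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Build a graph from (source, destination) pairs; vertex attributes
    // are unused here, so every vertex gets the default value 0.
    val graph = Graph.fromEdgeTuples(sc.parallelize(edges), defaultValue = 0)
    graph.connectedComponents().vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("components-demo")
    val sc = new SparkContext(conf)
    try println(components(sc, Seq((1L, 2L), (2L, 3L), (4L, 5L))))
    finally sc.stop()
  }
}
```

With the sample edges, vertices 1, 2, and 3 form one component and vertices 4 and 5 form another.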

Conclusion

By combining RDDs, DAG scheduling, lazy evaluation, and a rich ecosystem of libraries, Spark unifies batch processing, machine learning, streaming, and graph analytics on a single platform. Its high‑level APIs let developers focus on business logic while Spark handles resource management, fault tolerance, and performance optimization.

Speaker Background

Zhuzhi Hui is a Senior Software Engineer at the IBM China Development Center. He has extensive experience in database software design, holds multiple DB2 certifications and an Oracle OCP, and has co-authored books on DB2 design and performance. Since joining IBM in 2007, he has focused on DB2 tools and, more recently, on Spark research.
