Understanding Apache Spark: Architecture, Comparison with Hadoop, Features, and Use Cases

The article explains Apache Spark’s memory‑based distributed computing model, its advantages over Hadoop’s MapReduce, key features, fault tolerance, deployment modes, ecosystem components, and the scenarios where Spark is most effective for large‑scale data analytics.

Qunar Tech Salon

Dec 4, 2014

This article answers several common questions about Apache Spark, such as the algorithm behind its distributed computation, how it differs from MapReduce, why it is more flexible than Hadoop, its limitations, and the situations where it should be used.

What is Spark

Apache Spark is an open‑source, general‑purpose parallel computing framework that originated at the UC Berkeley AMPLab. It builds distributed computation on the MapReduce model, but unlike classic Hadoop MapReduce it can keep intermediate results in memory instead of writing them to HDFS between jobs. Avoiding those repeated reads and writes makes Spark especially well suited to the iterative algorithms used in data mining and machine learning.
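To make the benefit of in‑memory reuse concrete, here is a minimal sketch in plain Python. It is not the Spark API: the Storage class and both run functions are hypothetical names, and the read counter simply stands in for expensive HDFS reads.

```python
class Storage:
    """Hypothetical stand-in for slow distributed storage; counts every read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def run_reloading(storage, iterations):
    """Reload the input on every pass, as a chain of disk-backed jobs would."""
    total = 0.0
    for _ in range(iterations):
        total += sum(storage.load())
    return total


def run_cached(storage, iterations):
    """Load once, keep the data in memory, and reuse it on every pass."""
    cached = storage.load()
    total = 0.0
    for _ in range(iterations):
        total += sum(cached)
    return total
```

Both functions return the same result, but the reloading version touches storage once per iteration while the cached version touches it once in total, so the gap widens as the iteration count grows.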

Spark vs. Hadoop

Spark keeps intermediate data in memory, which greatly improves the efficiency of iterative computations. It also offers a richer set of operations than Hadoop MapReduce, which exposes only map and reduce: transformations such as map, filter, flatMap, groupByKey, and join, and actions such as count, collect, and reduce. This gives developers more flexibility.

Because of the RDD abstraction, Spark is better suited for machine‑learning and data‑mining workloads that require many passes over the same dataset.
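The split between transformations and actions can be sketched with a toy class in plain Python. This is an illustration only, not Spark's implementation: ToyDataset is a hypothetical name, and the sketch assumes the source is a re‑iterable collection such as a list. Transformations only record a step, and nothing is computed until an action runs.

```python
from functools import reduce as _reduce


class ToyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source          # assumed re-iterable, e.g. a list
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        step = lambda it: (fn(x) for x in it)
        return ToyDataset(self._source, self._steps + (step,))

    def filter(self, pred):
        step = lambda it: (x for x in it if pred(x))
        return ToyDataset(self._source, self._steps + (step,))

    def flatMap(self, fn):
        step = lambda it: (y for x in it for y in fn(x))
        return ToyDataset(self._source, self._steps + (step,))

    def _run(self):
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return it

    # Actions: execute the recorded steps and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return _reduce(fn, self._run())


lines = ToyDataset(["spark keeps data", "in memory"])
long_words = lines.flatMap(str.split).filter(lambda w: len(w) > 4)
print(long_words.collect())   # the pipeline runs only here
```

Chaining several transformations before a single action is what lets a pipeline be expressed in a few lines instead of as a sequence of separate map and reduce jobs.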

Generality of Spark

Beyond the basic map and reduce, Spark supports numerous transformations (e.g., sample, union, sortBy, partitionBy) and actions, allowing fine‑grained control over data partitioning, storage, and materialization. This makes the programming model more expressive than Hadoop’s fixed shuffle pattern.
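As an illustration of what control over partitioning means, the sketch below hash‑partitions key‑value pairs in plain Python. It is not Spark's partitioner: the function names are hypothetical, and CRC32 is used only because it gives a hash that is stable across Python processes.

```python
import zlib


def partition_index(key, num_partitions):
    """Map a key to a partition using a stable hash (CRC32, chosen for determinism)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    """Place (key, value) pairs so that equal keys always share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
for index, part in enumerate(partition_by(pairs, 3)):
    print(index, part)
```

Because every record with a given key lands in the same partition, a later per‑key operation on that data can proceed within each partition without moving records again.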

However, because RDDs are immutable, Spark is a poor fit for applications that make asynchronous, fine‑grained updates to shared state, such as the storage layer of a web application or an incremental web crawler and indexer.

Fault Tolerance

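Spark makes distributed datasets fault tolerant through checkpointing, which comes in two forms: checkpointing the data itself, or logging the updates that produced it. With RDDs, the logged updates are the lineage of coarse‑grained transformations, so a lost partition can be recomputed from its parents rather than restored from a replica. Users can choose whichever approach best fits their workload.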
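The two recovery options can be sketched with a toy model in plain Python. This is an assumption‑laden illustration, not Spark's implementation: LineageDataset and its methods are hypothetical names, and only one‑to‑one map steps are modeled.

```python
class LineageDataset:
    """Toy dataset that remembers how each partition derives from its parent."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._base = partitions   # materialized data (sources and checkpoints)
        self._parent = parent     # lineage: the dataset this one was built from
        self._fn = fn             # lineage: the function applied to the parent
        self._cache = {}          # partitions currently held in memory

    def num_partitions(self):
        if self._parent is None:
            return len(self._base)
        return self._parent.num_partitions()

    def map(self, fn):
        return LineageDataset(parent=self, fn=fn)

    def compute(self, index):
        """Return a partition, rebuilding it from lineage if it is not cached."""
        if index in self._cache:
            return self._cache[index]
        if self._parent is None:
            result = list(self._base[index])
        else:
            result = [self._fn(x) for x in self._parent.compute(index)]
        self._cache[index] = result
        return result

    def lose(self, index):
        """Simulate a failure that drops one in-memory partition."""
        self._cache.pop(index, None)

    def checkpoint(self):
        """Materialize all partitions and cut the lineage (checkpointing the data)."""
        self._base = [self.compute(i) for i in range(self.num_partitions())]
        self._parent = None
        self._fn = None
        self._cache = {}


source = LineageDataset(partitions=[[1, 2], [3, 4]])
shifted = source.map(lambda x: x * 2).map(lambda x: x + 1)
shifted.compute(1)
shifted.lose(1)
print(shifted.compute(1))   # rebuilt by replaying the two map steps
```

Replaying lineage costs computation after a failure but nothing beforehand, whereas a checkpoint costs a write up front and shortens recovery, which is why long lineage chains are the usual candidates for checkpointing.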

Usability

Rich APIs for Scala, Java, and Python, together with an interactive shell, make Spark highly usable for developers.

Integration with Hadoop

Spark can read from and write to HDFS directly and can run on YARN, which lets it share resources with an existing Hadoop cluster. It also integrates with Hive through Shark, which provides near‑full Hive compatibility.

Suitable Scenarios

Spark excels in memory‑intensive iterative computations where the same dataset is processed many times. The more iterations required, the greater the performance benefit. For small datasets with low computational intensity, the advantage diminishes.

Because of RDD characteristics, Spark is unsuitable for workloads that need fine‑grained, asynchronous state updates.

Outside that exception, Spark is a general‑purpose engine that applies to a broad range of batch and iterative workloads.

Deployment Modes

Local mode

Standalone mode

Mesos mode

YARN mode
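Each of the four modes above is selected through the master URL given to a Spark application. The helper below is a hypothetical convenience function, not part of Spark, that builds those strings in plain Python. The ports are the customary defaults (7077 for a standalone master, 5050 for Mesos), the host names in the usage lines are made up, and the bare "yarn" value assumes a recent Spark release; older releases used "yarn-client" and "yarn-cluster" instead.

```python
def master_url(mode, host=None, port=None, cores="*"):
    """Build a Spark master URL for one of the four deployment modes (hypothetical helper)."""
    if mode == "local":
        # Run in a single JVM; "*" means one worker thread per available core.
        return f"local[{cores}]"
    if mode in ("standalone", "mesos"):
        if host is None:
            raise ValueError(f"{mode} mode needs the master host")
        if mode == "standalone":
            return f"spark://{host}:{port or 7077}"
        return f"mesos://{host}:{port or 5050}"
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    raise ValueError(f"unknown deployment mode: {mode}")


print(master_url("local"))
print(master_url("standalone", host="master-node"))   # made-up host name
print(master_url("mesos", host="mesos-master"))       # made-up host name
print(master_url("yarn"))
```

The resulting string is what would be passed as the master setting when submitting or configuring an application; the application code itself stays the same across modes.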

Spark Ecosystem

Shark (Hive on Spark) provides HiveQL compatibility by reusing Hive’s parser and logical plan generation, while executing the physical plan on Spark, enabling in‑memory caching of RDDs for faster query performance.

Spark Streaming processes live data by dividing streams into small time slices (micro‑batches) and handling each slice with the regular Spark engine, offering low‑latency processing and fault tolerance comparable to batch jobs.
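The micro‑batch idea can be sketched in a few lines of plain Python. This is a toy model rather than the Spark Streaming API: the function names are hypothetical, events are assumed to be (timestamp, value) pairs with timestamps in seconds, and time slices that contain no events are simply skipped.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length time slices."""
    slices = defaultdict(list)
    for timestamp, value in events:
        slices[int(timestamp // batch_seconds)].append(value)
    # Emit slices in time order; empty slices never appear in the dict.
    return [slices[i] for i in sorted(slices)]


def process_stream(events, batch_seconds, batch_job):
    """Run the same batch function on every slice, one slice at a time."""
    return [batch_job(batch) for batch in micro_batches(events, batch_seconds)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
print(process_stream(events, 1, len))   # events counted per one-second slice
```

Because each slice is handled by an ordinary batch function, the streaming path can reuse the batch engine's code and recovery mechanisms; the slice length sets the floor on end‑to‑end latency.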

Bagel brings Pregel‑style graph computation to Spark, including an example implementation of Google’s PageRank algorithm.
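For readers unfamiliar with the Pregel style, the sketch below runs PageRank as a series of supersteps in plain Python. It is not Bagel's API: the function name is hypothetical, the damping factor of 0.85 is the conventional choice rather than something stated here, and the graph is assumed to have no dangling vertices (every vertex has at least one outgoing link, and every link target is itself a key).

```python
def pregel_pagerank(graph, supersteps=20, damping=0.85):
    """Toy Pregel-style PageRank; graph maps each vertex to the vertices it links to."""
    n = len(graph)
    ranks = {vertex: 1.0 / n for vertex in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends an equal share of its rank along its out-edges.
        inbox = {vertex: 0.0 for vertex in graph}
        for vertex, targets in graph.items():
            share = ranks[vertex] / len(targets)
            for target in targets:
                inbox[target] += share
        # Compute phase: each vertex updates its rank from the messages it received.
        ranks = {
            vertex: (1 - damping) / n + damping * inbox[vertex]
            for vertex in graph
        }
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for vertex, rank in sorted(pregel_pagerank(graph).items()):
    print(vertex, round(rank, 4))
```

Each superstep reads the same graph structure again, which is exactly the access pattern where keeping data in memory between iterations pays off.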

(Source: CSDN Big Data)
