Understanding Apache Spark: Architecture, Comparison with Hadoop, Features, and Use Cases
The article explains Apache Spark’s memory‑based distributed computing model, its advantages over Hadoop’s MapReduce, key features, fault tolerance, deployment modes, ecosystem components, and the scenarios where Spark is most effective for large‑scale data analytics.
This article answers several common questions about Apache Spark, such as the algorithm behind its distributed computation, how it differs from MapReduce, why it is more flexible than Hadoop, its limitations, and the situations where it should be used.
What is Spark
Apache Spark is an open-source, general-purpose parallel computing framework that originated at UC Berkeley's AMPLab. Its distributed computation builds on the MapReduce programming model, but unlike classic Hadoop MapReduce it can keep intermediate results in memory instead of writing them to HDFS between stages, which avoids repeated disk reads and writes. This makes Spark especially well suited to the iterative algorithms used in data mining and machine learning.
Spark vs. Hadoop
Spark keeps intermediate data in memory, which greatly improves the efficiency of iterative computations. It also offers a richer set of operations than Hadoop MapReduce, which exposes only map and reduce: transformations such as map, filter, flatMap, groupByKey, and join, and actions such as count, collect, and reduce. Transformations lazily define a new dataset, while actions trigger the computation and return a result, which gives developers more flexibility.
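The split between transformations and actions can be illustrated with a toy model in plain Python. This is only a sketch: `ToyDataset` and its method names are hypothetical stand-ins, not Spark's API, and everything runs on a single machine.

```python
from functools import reduce as fold
from itertools import chain


class ToyDataset:
    """Hypothetical stand-in for an RDD: records a chain of lazy steps."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self._source, self._steps + (("filter", fn),))

    def flat_map(self, fn):
        return ToyDataset(self._source, self._steps + (("flat_map", fn),))

    def _run(self):
        # Replay the recorded steps over the source data.
        data = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = map(fn, data)
            elif kind == "filter":
                data = filter(fn, data)
            else:  # flat_map: each input element yields zero or more outputs
                data = chain.from_iterable(map(fn, data))
        return data

    # Actions: run the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


lines = ToyDataset(["to be or", "not to be"])
words = lines.flat_map(str.split)   # transformation: nothing runs yet
word_count = words.count()          # action: the pipeline executes here
total_chars = words.map(len).reduce(lambda a, b: a + b)
```

Because each transformation returns a new object and leaves the original untouched, the same `words` dataset can feed several different actions.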
Because the RDD (Resilient Distributed Dataset) abstraction lets a dataset stay in memory across operations, Spark is better suited to machine-learning and data-mining workloads that make many passes over the same dataset.
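The benefit for multi-pass workloads can be sketched by counting how often the input has to be loaded. The example below is plain Python under stated assumptions: `load_points` is a hypothetical placeholder for an expensive read from HDFS, and the gradient-descent loop is an illustrative iterative job. In Spark itself, the comparable step is calling `cache()` or `persist()` on an RDD.

```python
load_calls = 0


def load_points():
    """Pretend this reads and parses a large input file (hypothetical)."""
    global load_calls
    load_calls += 1
    return [1.0, 2.0, 3.0, 4.0]


def fit_mean(get_points, iterations=10, step=0.1):
    """Gradient descent toward the mean; needs the dataset once per pass."""
    estimate = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(estimate - p for p in points) / len(points)
        estimate -= step * gradient
    return estimate


# Without caching: every iteration reloads the input.
load_calls = 0
uncached_estimate = fit_mean(load_points)
uncached_loads = load_calls

# With caching: load once, then reuse the in-memory copy on every pass.
load_calls = 0
cached_points = load_points()
cached_estimate = fit_mean(lambda: cached_points)
cached_loads = load_calls
```

Both runs produce the same estimate, but the uncached run loads the data once per iteration while the cached run loads it once in total, so the gap grows with the number of iterations.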
Generality of Spark
Beyond the basic map and reduce, Spark supports numerous transformations (e.g., sample, union, sortBy, partitionBy) and actions, allowing fine-grained control over data partitioning, storage, and materialization. This makes the programming model more expressive than the fixed map-shuffle-reduce pattern of Hadoop MapReduce.
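As an illustration of what control over partitioning means, the following plain-Python sketch places key-value pairs into partitions by hashing the key, so all values for a key end up together. The `partition_by` function here is a hypothetical helper, and the CRC32 hash is an assumption chosen for determinism; Spark's own partitioners differ in detail.

```python
import zlib


def partition_by(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # A deterministic hash keeps the placement stable across runs.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = partition_by(pairs, 3)
```

Once all records for a key share a partition, a later per-key aggregation or join on that key can proceed without moving those records again.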
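However, because RDDs are immutable and are transformed with coarse-grained operations, Spark is a poor fit for applications that make asynchronous, fine-grained updates to shared state, such as the storage layer of a web service or an incremental web crawler and indexer.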
Fault Tolerance
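Distributed datasets are generally made fault-tolerant in one of two ways: checkpointing the data or logging the updates made to it. RDDs take the second approach in a coarse-grained form: each RDD records the transformations used to build it (its lineage), so a lost partition can be recomputed from its parents. Users can also checkpoint an RDD explicitly, which helps when the lineage chain is long, and can choose whichever approach best fits the workload.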
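One way to picture the "logging updates" approach: each derived partition remembers the function that produced it, so a lost in-memory copy can be recomputed instead of being restored from a replica. The sketch below is plain Python, and the `LineagePartition` class is hypothetical, not a Spark type.

```python
class LineagePartition:
    """Hypothetical partition that remembers how it was derived."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that rebuilds the data
        self._data = None

    def get(self):
        if self._data is None:  # never built, or lost
            self._data = self._compute()
        return self._data

    def lose(self):
        """Simulate a node failure that drops the in-memory copy."""
        self._data = None

    def map(self, fn):
        # The child records its parent and the function, not the result.
        return LineagePartition(lambda: [fn(x) for x in self.get()])


source = LineagePartition(lambda: [1, 2, 3])
squared = source.map(lambda x: x * x)
doubled = squared.map(lambda x: 2 * x)

first = doubled.get()

# Drop both derived partitions, then ask for the result again.
squared.lose()
doubled.lose()
recovered = doubled.get()
```

Recovery here walks back up the chain only as far as needed: `source` still holds its data, so only the two lost steps are replayed.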
Usability
Spark provides rich APIs in Scala, Java, and Python, along with an interactive shell, which makes it approachable for developers.
Integration with Hadoop
Spark can read from and write to HDFS directly and can run on YARN, which lets it share resources with existing Hadoop clusters. It also integrates with Hive through Shark, which provides near-complete Hive compatibility.
Suitable Scenarios
Spark excels in memory‑intensive iterative computations where the same dataset is processed many times. The more iterations required, the greater the performance benefit. For small datasets with low computational intensity, the advantage diminishes.
Because of RDD characteristics, Spark is unsuitable for workloads that need fine‑grained, asynchronous state updates.
Outside those cases, Spark is a general-purpose engine that applies to a broad range of workloads.
Deployment Modes
Local mode
Standalone mode
Mesos mode
YARN mode
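Each mode is selected through the master URL passed to Spark. The helper below is a hypothetical sketch in plain Python that only builds those strings: the `master_url` name and the example host are assumptions, while the URL shapes and default ports follow recent Spark releases (older releases also accepted `yarn-client` and `yarn-cluster`).

```python
def master_url(mode, host=None, port=None):
    """Build the master URL string for a deployment mode (hypothetical helper)."""
    if mode == "local":
        return "local[*]"  # run in-process, one worker thread per core
    if mode == "yarn":
        return "yarn"      # cluster location comes from the Hadoop configuration
    defaults = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if mode not in defaults:
        raise ValueError(f"unknown deployment mode: {mode}")
    if host is None:
        raise ValueError(f"{mode} mode needs a master host")
    scheme, default_port = defaults[mode]
    return f"{scheme}://{host}:{port or default_port}"


standalone = master_url("standalone", "master.example.com")
mesos = master_url("mesos", "master.example.com", 6000)
```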
Spark Ecosystem
Shark (Hive on Spark) provides HiveQL compatibility by reusing Hive’s parser and logical plan generation, while executing the physical plan on Spark, enabling in‑memory caching of RDDs for faster query performance.
Spark Streaming processes live data by dividing streams into small time slices (micro‑batches) and handling each slice with the regular Spark engine, offering low‑latency processing and fault tolerance comparable to batch jobs.
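The micro-batch idea can be sketched in plain Python: events are grouped into fixed-width time slices, and each slice is then handed to an ordinary batch function. The names `micro_batches` and `count_values` are hypothetical, the event data is invented for illustration, and none of this is the Spark Streaming API.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, value in events:
        slice_index = int(timestamp // batch_seconds)
        slices[slice_index].append(value)
    # Return the slices in time order; empty slices are skipped in this sketch.
    return [slices[i] for i in sorted(slices)]


def count_values(batch):
    """An ordinary batch job: count occurrences of each value."""
    return dict(Counter(batch))


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]
results = [count_values(batch) for batch in micro_batches(events, 1.0)]
```

The batch function knows nothing about streaming, which is the point of the design: the same logic, and the same recovery mechanism, can serve both batch and streaming jobs.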
Bagel brings Pregel‑style graph computation to Spark, including an example implementation of Google’s PageRank algorithm.
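A Pregel-style PageRank proceeds in supersteps: every vertex sends a share of its rank to its neighbors, then combines the messages it received into a new rank. The following is a minimal single-machine sketch in plain Python, not Bagel's API. It assumes every vertex has at least one outgoing link and every link target is itself a key of `links`; the damping factor and the sample graph are illustrative choices.

```python
def pagerank(links, supersteps=20, damping=0.85):
    """links maps each vertex to the list of vertices it points to."""
    n = len(links)
    ranks = {vertex: 1.0 / n for vertex in links}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank / out_degree to every neighbor.
        inbox = {vertex: 0.0 for vertex in links}
        for vertex, targets in links.items():
            share = ranks[vertex] / len(targets)
            for target in targets:
                inbox[target] += share
        # Update phase: each vertex combines its messages into a new rank.
        ranks = {
            vertex: (1 - damping) / n + damping * inbox[vertex]
            for vertex in links
        }
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Every superstep reads the same link structure, so this is exactly the kind of iterative job that benefits from keeping the dataset in memory between passes.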
(Source: CSDN Big Data)
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.