Overview of Big Data Processing Engines: MapReduce, Tez, Spark, and Flink
This article reviews the evolution and characteristics of major big‑data processing engines—from first‑generation Hadoop MapReduce to second‑generation DAG‑based Tez, third‑generation in‑memory Spark, and fourth‑generation real‑time Flink—highlighting their batch and streaming use cases.
The previous article introduced basic big‑data concepts; this follow‑up focuses on the computation engines that power big‑data processing, describing their main features and suitable scenarios without delving into low‑level implementation details.
1. Timeline of Computing Engines
Big‑data processing engines have evolved through four generations. The current mainstream engines are third‑generation Spark and the newer fourth‑generation Flink.
2. Batch Processing – MapReduce, Tez, and Spark
2.1 MapReduce
Hadoop’s MapReduce splits work into a map phase, which turns each input record into intermediate key‑value pairs, and a reduce phase, which aggregates all values that share a key. The data flow can be expressed as:
Map (k1, v1) → list(k2, v2) // Formula 1
Reduce (k2, list(v2)) → list(v2) // Formula 2
The shuffle stage moves map output to reducers, with separate shuffle handling on the mapper and reducer sides. While suitable for simple batch jobs like WordCount, MapReduce suffers from high disk I/O in iterative or multi‑stage workflows.
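The two formulas and the shuffle step can be illustrated with a single-process WordCount. This is a minimal sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and everything runs in memory rather than across mappers and reducers.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map (k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values by key, as the shuffle stage does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list[int]) -> list[int]:
    """Reduce (k2, list(v2)) -> list(v2): collapse the counts into a single total."""
    return [sum(counts)]


def word_count(documents: dict[str, str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a set of documents."""
    mapped = (pair for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text))
    grouped = shuffle(mapped)
    return {word: reduce_phase(word, counts)[0] for word, counts in grouped.items()}


if __name__ == "__main__":
    docs = {"d1": "spark flink spark", "d2": "flink tez"}
    print(word_count(docs))  # {'spark': 2, 'flink': 2, 'tez': 1}
```

In a real MapReduce job the output of each phase is written to disk before the next phase reads it, which is where the I/O cost in multi‑stage workflows comes from.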
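2.2 Tez
To overcome MapReduce’s limitations for multi‑stage and iterative jobs, Apache Tez provides a DAG‑based execution engine that combines dependent jobs into a single DAG. Intermediate results can then pass directly between stages instead of being written to HDFS after every job, which can substantially improve performance for batch workloads.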
2.3 Spark
Spark is an open‑source, in‑memory framework that caches intermediate results in RAM, reducing disk I/O and enabling fast iterative computation. Its DAG scheduler pipelines consecutive operations into stages, and its core abstraction is the Resilient Distributed Dataset (RDD), which supports transformations (lazy operations that define a new RDD) and actions (operations that trigger computation and return a result). Spark excels at batch jobs on data sets smaller than roughly 1 TB, but can run into memory‑related issues at larger scales.
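The difference between transformations, actions, and in‑memory caching can be shown without a cluster. The following is a toy, single-process model in plain Python and is not the Spark API: the class `ToyRDD` and its methods are hypothetical names chosen to mirror the concepts, and the sketch ignores partitioning and fault tolerance entirely.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD; this is not the Spark API."""

    def __init__(self, source: Callable[[], Iterable[Any]]) -> None:
        self._source = source                 # recipe that produces the data on demand
        self._keep = False                    # set by cache()
        self._cached: Optional[list] = None   # materialized data, once computed

    # Transformations only record what to do and return a new ToyRDD.
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "ToyRDD":
        self._keep = True                     # keep the result in memory after first use
        return self

    def _iterate(self) -> Iterator[Any]:
        if self._cached is not None:
            return iter(self._cached)         # reuse the in-memory copy
        data = self._source()
        if self._keep:
            self._cached = list(data)
            return iter(self._cached)
        return iter(data)

    # Actions trigger the actual computation.
    def collect(self) -> list:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


if __name__ == "__main__":
    reads = []

    def load() -> range:
        reads.append("read")                  # record every pass over the source
        return range(10)

    evens = ToyRDD(load).filter(lambda x: x % 2 == 0).cache()
    print(len(reads))                             # 0: no action has run yet
    print(evens.map(lambda x: x * x).collect())   # [0, 4, 16, 36, 64]
    print(evens.count())                          # 5
    print(len(reads))                             # 1: the second action used the cache
```

The source is read only once even though two actions run, which is the effect that makes cached intermediate results attractive for iterative jobs.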
3. Real‑Time Processing – Spark Streaming and Flink
3.1 Spark Streaming
Spark Streaming extends the core Spark API to process continuous streams by dividing incoming data into micro‑batches (e.g., 1‑second intervals). Each micro‑batch is turned into an RDD and processed with the same transformations and actions as batch Spark jobs, providing high throughput with fault tolerance.
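The micro‑batch idea can be sketched in plain Python. This is only an illustration of the batching logic, not Spark Streaming code: `micro_batches` and `count_words` are hypothetical helpers, events are assumed to arrive as `(timestamp_in_seconds, value)` pairs, and intervals that receive no events are simply skipped.

```python
from collections import Counter, defaultdict
from typing import Iterable


def micro_batches(events: Iterable[tuple[float, str]],
                  interval_s: float = 1.0) -> list[list[str]]:
    """Group (timestamp, value) events into consecutive fixed-length intervals."""
    batches: dict[int, list[str]] = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval_s)].append(value)   # index of the interval
    return [batches[i] for i in sorted(batches)]


def count_words(batch: list[str]) -> dict[str, int]:
    """An ordinary batch computation, applied unchanged to every micro-batch."""
    return dict(Counter(batch))


if __name__ == "__main__":
    events = [(0.1, "a"), (0.7, "b"), (1.2, "a"), (1.9, "a"), (2.5, "b")]
    for batch in micro_batches(events):
        print(count_words(batch))
    # {'a': 1, 'b': 1}
    # {'a': 2}
    # {'b': 1}
```

Because a result is produced only when an interval closes, the batch interval sets a lower bound on latency in this model.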
3.2 Flink
Apache Flink is a distributed framework designed primarily for true stream processing, treating batch jobs as a special case of streams. Its DataStream API lets users apply a rich set of operators to distributed streams, offering lower latency and more precise real‑time guarantees than Spark’s micro‑batch model.
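Record‑at‑a‑time processing can be contrasted with micro‑batching using a small generator. This is a conceptual sketch in plain Python and does not use Flink's DataStream API: `running_counts` is a hypothetical operator that keeps per-key state and emits an updated result as soon as each event arrives.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def running_counts(stream: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Update per-key state on every event and emit the new count immediately."""
    counts: dict[str, int] = defaultdict(int)
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


if __name__ == "__main__":
    # A bounded list is just a stream that happens to end, so the same
    # operator covers the batch case without modification.
    for update in running_counts(["a", "b", "a"]):
        print(update)
    # ('a', 1)
    # ('b', 1)
    # ('a', 2)
```

The same function accepts an unbounded generator as input, since it never needs to see the end of the stream before producing output.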
4. Comparison and Conclusion
Hadoop MapReduce remains the workhorse for massive batch jobs, while Spark offers faster, in‑memory processing for moderate‑size data sets and near‑real‑time micro‑batch streaming. Flink provides genuine real‑time stream processing with stronger latency guarantees but is less mature for batch workloads. Choosing the right engine depends on the specific requirements for batch size, latency, and iterative computation.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
