
Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Architects Research Society

Dec 30, 2018

Apache Ignite

Apache Ignite In‑Memory Data Fabric is a distributed memory platform for real‑time computation and processing of massive data sets. It offers distributed key‑value memory storage, SQL capabilities, map‑reduce, various distributed data structures, continuous queries, messaging, Hadoop and Spark integration, and provides Java, .NET, and C++ APIs.

Apache Ignite

Apache Ignite Documentation

Apache MapReduce

MapReduce is a programming model for processing large data sets on a cluster with a parallel, distributed algorithm. Apache MapReduce derives from Google's MapReduce, and the current version runs on the Apache YARN framework. YARN supports more general execution models, so applications that do not follow the MapReduce paradigm can run on the same cluster.
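To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the example. It is plain Python rather than Hadoop code, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster the map and reduce calls would run in parallel on many nodes.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data flows"])
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, a framework is free to distribute both phases across machines.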

Apache MapReduce

Google MapReduce Paper

Writing YARN Applications

Apache Pig

Pig provides an engine for parallel execution of data flows on Hadoop and includes the Pig Latin language for expressing those flows. Pig Latin offers operators for traditional data operations such as join, sort, and filter, and allows users to develop custom functions for reading, processing, and writing data.

Pig compiles Pig Latin scripts into one or more MapReduce jobs and then executes them. Pig Latin has no explicit if‑statements or loops because it describes data flow rather than control flow.
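The data-flow style can be illustrated without Pig itself. The sketch below chains filter, join, and sort steps over in-memory records in plain Python. It only mirrors the shape of a Pig Latin script; the helper names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def join_rows(left, right, key):
    """Inner join of two lists of dicts on a shared key (hash join)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def order_by(rows, field, descending=False):
    """Sort rows by one field."""
    return sorted(rows, key=lambda row: row[field], reverse=descending)


# Hypothetical sample data.
users = [
    {"user": "ann", "age": 31},
    {"user": "bob", "age": 17},
    {"user": "cy", "age": 45},
]
visits = [
    {"user": "ann", "url": "/a"},
    {"user": "cy", "url": "/b"},
    {"user": "bob", "url": "/c"},
    {"user": "ann", "url": "/d"},
]

# Each step consumes the previous relation; there is no branching or looping.
adults = filter_rows(users, lambda row: row["age"] >= 18)
joined = join_rows(adults, visits, "user")
result = order_by(joined, "url", descending=True)
print(result)
```

Every statement names a new relation derived from an earlier one, which is what lets an engine such as Pig plan the whole flow before running it.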

pig.apache.org/

Pig examples by Alan Gates

JAQL

JAQL is a functional declarative language designed for processing large volumes of structured, semi‑structured, and unstructured data, especially JSON documents, though it also supports XML, CSV, and plain files. Its "SQL‑like" feature lets programmers use structured SQL data alongside a flexible JSON data model.

JAQL enables selection, joins, grouping, and filtering of data stored in HDFS, combining capabilities of Pig and Hive. It draws inspiration from Lisp, SQL, XQuery, and Pig.

Created by IBM Research in 2008, JAQL is open‑source under the Apache 2.0 license and is integrated into IBM InfoSphere BigInsights for Hadoop, providing connectors to external data sources and machine‑learning services.

JAQL on Google Code

What is JAQL? by IBM

Apache Spark

Developed at UC Berkeley's AMPLab, Spark is a cluster‑computing framework that fits into the Hadoop ecosystem and can run on top of the Hadoop Distributed File System (HDFS). It offers a faster, in‑memory alternative to Hadoop MapReduce, with performance up to ten times higher for certain applications.

Spark provides a clean API for writing fast distributed programs, supporting interactive query analysis (Shark), large‑scale graph processing (Bagel), and real‑time analytics (Spark Streaming). It offers concise APIs in Scala, Java, and Python, and can be used interactively via shells.

Shark, a Hive‑compatible data warehouse system built on Spark, is reported to run queries up to 100 times faster than Hive.

Apache Spark

Mirror of Spark on GitHub

RDDs – Paper

Spark: Cluster Computing… – Paper

Apache Storm

Storm is a complex event processing (CEP) and distributed real‑time computation system written primarily in Clojure. It processes high‑velocity data streams using a master‑worker architecture coordinated by ZooKeeper.

Storm utilizes ZeroMQ (0MQ), an embeddable networking library that provides message queuing without a dedicated broker.

Originally created by Nathan Marz and the BackType team, Storm was open‑sourced after Twitter’s acquisition of BackType in 2011.

Hortonworks has been developing a Storm‑on‑YARN version, and Twitter released a Hadoop‑Storm hybrid called Summingbird, which combines batch and stream processing.

Storm Project

Storm‑on‑YARN

Apache Flink

Apache Flink (formerly Stratosphere) offers powerful programming abstractions in Java and Scala, a high‑performance runtime, and automatic program optimizations, with native support for iterative and incremental computations.

Flink is a standalone data‑processing system that can operate independently of Hadoop but can also read from HDFS and run on YARN for resource management.

Apache Flink Incubator Page

Stratosphere Site

Apache Apex

Apache Apex is an enterprise‑grade, YARN‑based dynamic platform that unifies stream and batch processing with high scalability, performance, fault tolerance, stateful processing, and security. It provides a simple Java API for building data‑intensive applications.

The Apex‑Malhar library extends Apex with operators for accessing various storage systems (HDFS, S3, NFS, FTP), messaging systems (Kafka, ActiveMQ, RabbitMQ), and databases (MySQL, Cassandra, MongoDB, Redis, HBase, etc.), facilitating rapid application development.

Apex’s core technology underlies DataTorrent’s commercial RTS 3 product and related ingestion tools.

Apache Apex from DataTorrent

Apache Apex Main Page

Apache Apex Proposal

Netflix PigPen

PigPen is a Clojure map‑reduce library that compiles to Apache Pig. It allows Clojure functions, named or anonymous, to be used directly in Pig scripts, and was open‑sourced by Netflix.

PigPen on GitHub

AMPLab SIMR

SIMR enables users to run Apache Spark directly on Hadoop MapReduce v1 clusters without installing Spark or Scala on the nodes, simplifying deployment for environments lacking YARN.

SIMR on GitHub

Facebook Corona

Corona is Facebook’s next‑generation MapReduce implementation built on a custom Hadoop fork. It replaces the single JobTracker architecture with multiple per‑job trackers to improve scalability for very large data sets.

Corona on GitHub

Apache REEF

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications on cluster resource managers such as Hadoop YARN or Apache Mesos. It centralizes control flow, provides an Evaluator runtime for task execution, supports multiple resource managers, and offers .NET and Java APIs.

Apache REEF Website

Apache Twill

Twill abstracts Apache Hadoop YARN to simplify distributed application development using a familiar thread‑based model for Java programmers. It acts as middleware between YARN and applications, handling API interactions and enabling easy construction of multi‑process distributed programs.

Apache Twill Incubator

Damballa (Parkour)

Parkour is a Clojure library that provides deep integration with Hadoop for writing MapReduce programs using standard Clojure functions, allowing full access to the underlying Java Hadoop APIs.

Apache Hama

Apache Hama is a top‑level open‑source project that supports advanced analytics beyond MapReduce, offering a bulk‑synchronous parallel (BSP) model suitable for iterative algorithms such as machine learning and graph processing.
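As a rough illustration of the BSP model, the sketch below simulates supersteps in a single Python process: every peer sends its current value to its neighbours, a barrier makes the messages visible, and each peer then updates its state. The example propagates the global maximum around a ring. This is not Hama's API; the function name and the ring topology are assumptions.

```python
def bsp_max(values, neighbours):
    """Propagate the largest value to every peer using BSP-style supersteps.

    values: initial value held by each peer.
    neighbours: neighbours[i] lists the peers that peer i sends messages to.
    Returns the final values and the number of supersteps executed.
    """
    current = list(values)
    supersteps = 0
    changed = True
    while changed:
        # Communication: every peer sends its current value to its neighbours.
        inbox = [[] for _ in current]
        for peer, value in enumerate(current):
            for target in neighbours[peer]:
                inbox[target].append(value)

        # Barrier: messages are processed only after all peers have sent.
        changed = False
        for peer, messages in enumerate(inbox):
            best = max(messages, default=current[peer])
            if best > current[peer]:
                current[peer] = best
                changed = True
        supersteps += 1
    return current, supersteps


# Four peers in a ring: 0 -> 1 -> 2 -> 3 -> 0.
ring = [[1], [2], [3], [0]]
final_values, steps = bsp_max([3, 9, 1, 4], ring)
print(final_values, steps)
```

Iterative algorithms such as graph processing fit this pattern well because each superstep only needs the messages produced in the previous one.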

Hama Site

Datasalt Pangool

Pangool provides a higher‑level Java API for writing MapReduce jobs, built around a tuple‑based MapReduce paradigm that aims to make data processing more expressive than the standard Hadoop API.

Pangool

Pangool on GitHub

Apache Tez

Tez proposes a generic application framework for building complex DAG‑based data processing tasks that run natively on Apache Hadoop YARN. It offers developers higher performance and flexibility compared to traditional MapReduce, addressing use cases such as low‑latency SQL queries and machine learning workloads.
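The idea of a DAG of processing tasks can be sketched with Python's standard graphlib module (Python 3.9 or later). Each task runs only after the tasks it depends on, and receives their outputs. This shows the general technique, not the Tez API; the task names and the run_dag helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after its upstream tasks, passing upstream outputs along.

    tasks: task name -> function taking a dict {upstream name: upstream output}.
    dependencies: task name -> set of upstream task names.
    Returns all outputs and the order in which tasks ran.
    """
    outputs = {}
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies.get(name, ())}
        outputs[name] = tasks[name](upstream)
        order.append(name)
    return outputs, order


# Hypothetical five-vertex DAG: scan -> filter -> aggregate -> join <- lookup.
dependencies = {
    "scan": set(),
    "filter": {"scan"},
    "aggregate": {"filter"},
    "lookup": set(),
    "join": {"aggregate", "lookup"},
}
tasks = {
    "scan": lambda up: [1, 2, 3, 4, 5, 6],
    "filter": lambda up: [x for x in up["scan"] if x % 2 == 0],
    "aggregate": lambda up: sum(up["filter"]),
    "lookup": lambda up: {"label": "even_total"},
    "join": lambda up: {up["lookup"]["label"]: up["aggregate"]},
}

outputs, order = run_dag(tasks, dependencies)
print(outputs["join"], order)
```

A chain of MapReduce jobs would write intermediate results to storage between each stage, whereas a DAG engine can plan all of the vertices as one job.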

Apache Tez Incubator

Hortonworks Apache Tez Page

Apache DataFu

DataFu provides a collection of higher‑level Hadoop MapReduce jobs and user‑defined functions for data analysis, including statistical tasks, PageRank, sessionization, and set operations. It originated from LinkedIn’s Pig UDFs.
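Sessionization, one of the tasks mentioned above, groups each user's events into sessions separated by a period of inactivity. The following is a minimal plain-Python sketch of that technique, not DataFu's UDF; the sessionize function, the 30-minute timeout, and the session id format are assumptions.

```python
def sessionize(events, timeout):
    """Assign a session id to each (user, timestamp) event.

    A new session starts when a user's gap since their previous event
    exceeds `timeout` (same time unit as the timestamps).
    """
    last_seen = {}
    session_number = {}
    result = []
    for user, timestamp in sorted(events):
        if user not in last_seen or timestamp - last_seen[user] > timeout:
            session_number[user] = session_number.get(user, 0) + 1
        last_seen[user] = timestamp
        result.append((user, timestamp, f"{user}-{session_number[user]}"))
    return result


# Hypothetical events with timestamps in seconds; 1800 s is a 30-minute timeout.
events = [("ann", 600), ("bob", 100), ("ann", 0), ("ann", 4000)]
sessions = sessionize(events, timeout=1800)
print(sessions)
```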

DataFu Apache Incubator

Pydoop

Pydoop is a Python library offering MapReduce and HDFS APIs for Hadoop, built on Hadoop's C++ Pipes interface and the libhdfs C API. It lets full‑featured Python applications access Hadoop's functionality, going beyond what Hadoop Streaming and Jython offer.

Pydoop Site

Pydoop GitHub Project

Kangaroo

Kangaroo, an open‑source project from Conductor, provides a MapReduce job that consumes Kafka data. Unlike solutions that limit each Kafka partition to a single InputSplit, Kangaroo can launch multiple consumers at different offsets within a single partition, increasing throughput and parallelism.
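The splitting idea can be sketched as follows: instead of one input split per partition, a partition's offset range is cut into several smaller ranges that separate consumers can read in parallel. This is an illustration of the concept only, not Kangaroo's actual classes; OffsetSplit, split_partition, and the max_messages_per_split parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetSplit:
    """A contiguous offset range within one Kafka partition (hypothetical type)."""
    partition: int
    start: int  # first offset to read (inclusive)
    end: int    # offset to stop before (exclusive)


def split_partition(partition, start_offset, end_offset, max_messages_per_split):
    """Cut [start_offset, end_offset) into splits of at most the given size."""
    if max_messages_per_split <= 0:
        raise ValueError("max_messages_per_split must be positive")
    return [
        OffsetSplit(partition, low, min(low + max_messages_per_split, end_offset))
        for low in range(start_offset, end_offset, max_messages_per_split)
    ]


splits = split_partition(partition=0, start_offset=100, end_offset=350,
                         max_messages_per_split=100)
for split in splits:
    print(split)
```

With one split per partition, parallelism is capped by the partition count; cutting by offset range removes that cap for backlogged partitions.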

Kangaroo Introduction

Kangaroo GitHub Project

TinkerPop

TinkerPop is a Java‑based graph computing framework that defines a core API for graph system vendors. It supports various graph systems (in‑memory, OLTP, OLAP) and enables queries via the Gremlin traversal language and graph‑processing algorithms.

Apache TinkerPop Proposal

TinkerPop Website

Pachyderm MapReduce

Pachyderm is a new MapReduce engine built on Docker and CoreOS. In Pachyderm MapReduce (PMR), jobs run as Docker containers (micro‑services) that receive data via HTTP, process it, and store results back to the filesystem, automatically constructing a DAG of job dependencies.

Pachyderm Site

Pachyderm Introduction Article

Apache Beam

Apache Beam is an open‑source unified model for defining and executing data‑parallel processing pipelines, providing language‑specific SDKs and runners for various execution environments.

The Beam model originates from several internal Google projects, including MapReduce, FlumeJava, and MillWheel, and was first implemented as Google Cloud Dataflow. In January 2016, Google and partners submitted Beam as an Apache incubation proposal.

Apache Beam Proposal

DataFlow Beam and Spark Comparison
