Overview of Major Apache Big Data Processing Frameworks
This article gives a concise overview of open‑source big data projects, most of them hosted at the Apache Software Foundation, including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam. Together they cover distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.
Apache Ignite
Apache Ignite In‑Memory Data Fabric is a distributed in‑memory platform for real‑time computation on massive data sets. It offers distributed key‑value storage in memory, SQL capabilities, map‑reduce, various distributed data structures, continuous queries, messaging, and Hadoop and Spark integration, with APIs for Java, .NET, and C++.
Apache Ignite
Apache Ignite Documentation
Apache MapReduce
MapReduce is a programming model for processing large data sets on clusters with parallel, distributed algorithms: a map step turns each input record into key‑value pairs, and a reduce step combines all values that share a key. The model originated with Google MapReduce. The current Apache Hadoop MapReduce implementation runs on Apache Hadoop YARN, which supports more general execution models and can therefore also host applications that do not follow the MapReduce paradigm.
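To make the programming model concrete, here is a minimal single‑process sketch of a word count in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only illustrate the map, shuffle, and reduce steps that a real framework would distribute across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block) would be mapped on a different node.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data flows"])
print(counts)
```

Because map_phase looks at one record at a time and reduce_phase at one key at a time, both steps can in principle run in parallel, which is the property the model is designed around.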
Apache MapReduce
Google MapReduce Paper
Writing YARN Applications
Apache Pig
Pig provides an engine for parallel execution of data flows on Hadoop and includes the Pig Latin language for expressing those flows. Pig Latin offers operators for traditional data operations such as join, sort, and filter, and allows users to develop custom functions for reading, processing, and writing data.
Pig compiles Pig Latin scripts into one or more MapReduce jobs, which are then executed. Pig Latin has no if‑statements or loops, because it describes data flow rather than control flow.
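The data‑flow style can be sketched in plain Python as a chain of relational steps with no branching or looping at the script level. This is not Pig Latin and not a Pig API; the helper names (filter_rows, join_rows, sort_rows) and the sample relations are assumptions made purely for illustration.

```python
from operator import itemgetter

# Hypothetical input relations; a Pig script would load these from HDFS.
users = [
    {"user_id": 1, "name": "ana", "age": 34},
    {"user_id": 2, "name": "bo", "age": 17},
    {"user_id": 3, "name": "cy", "age": 52},
]
clicks = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 3, "url": "/docs"},
    {"user_id": 3, "url": "/home"},
    {"user_id": 2, "url": "/home"},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def join_rows(left, right, key):
    """Inner hash join of two relations on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def sort_rows(rows, *keys):
    """Sort a relation by one or more columns."""
    return sorted(rows, key=itemgetter(*keys))


# The "script": each step consumes the relation produced by the previous one.
adults = filter_rows(users, lambda u: u["age"] >= 18)
joined = join_rows(adults, clicks, "user_id")
result = sort_rows(joined, "name", "url")

for row in result:
    print(row["name"], row["url"])
```

Each intermediate relation (adults, joined, result) corresponds to a named step in the flow, which is roughly how a data‑flow engine can plan the steps as separate jobs.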
pig.apache.org/
Pig examples by Alan Gates
JAQL
JAQL is a functional declarative language designed for processing large volumes of structured, semi‑structured, and unstructured data, especially JSON documents, though it also supports XML, CSV, and plain files. Its "SQL‑like" feature lets programmers use structured SQL data alongside a flexible JSON data model.
JAQL enables selection, joins, grouping, and filtering of data stored in HDFS, combining capabilities of Pig and Hive. It draws inspiration from Lisp, SQL, XQuery, and Pig.
Created by IBM Research in 2008, JAQL is open‑source under the Apache 2.0 license and is integrated into IBM InfoSphere BigInsights for Hadoop, providing connectors to external data sources and machine‑learning services.
JAQL on Google Code
What is JAQL? by IBM
Apache Spark
Developed at UC Berkeley's AMPLab, Spark is a cluster‑computing framework that runs on top of the Hadoop Distributed File System (HDFS) and offers a faster, in‑memory alternative to Hadoop MapReduce, with reported speedups of up to tenfold.
Spark provides a clean API for writing fast distributed programs, supporting interactive query analysis (Shark), large‑scale graph processing (Bagel), and real‑time analytics (Spark Streaming). It offers concise APIs in Scala, Java, and Python, and can be used interactively via shells.
Shark, which is built on Spark, is a Hive‑compatible data warehouse system reported to run up to 100 times faster than Hive.
Apache Spark
Mirror of Spark on GitHub
RDDs – Paper
Spark: Cluster Computing… – Paper
Apache Storm
Storm is a complex event processing (CEP) and distributed real‑time computation system written primarily in Clojure. It processes high‑velocity data streams using a master‑worker architecture coordinated by ZooKeeper.
Storm utilizes ZeroMQ (0MQ), an embeddable networking library that provides message queuing without a dedicated broker.
Originally created by Nathan Marz and the BackType team, Storm was open‑sourced after Twitter’s acquisition of BackType in 2011.
Hortonworks has been developing a Storm‑on‑YARN version, and Twitter released a Hadoop‑Storm hybrid called Summingbird, which combines batch and stream processing.
Storm Project
Storm‑on‑YARN
Apache Flink
Apache Flink (formerly Stratosphere) offers powerful programming abstractions in Java and Scala, a high‑performance runtime, and automatic program optimizations, with native support for iterative and incremental computations.
Flink is a standalone data‑processing system that can operate independently of Hadoop but can also read from HDFS and run on YARN for resource management.
Apache Flink Incubator Page
Stratosphere Site
Apache Apex
Apache Apex is an enterprise‑grade, YARN‑based dynamic platform that unifies stream and batch processing with high scalability, performance, fault tolerance, stateful processing, and security. It provides a simple Java API for building data‑intensive applications.
The Apex‑Malhar library extends Apex with operators for accessing various storage systems (HDFS, S3, NFS, FTP), messaging systems (Kafka, ActiveMQ, RabbitMQ), and databases (MySQL, Cassandra, MongoDB, Redis, HBase, etc.), facilitating rapid application development.
Apex’s core technology underlies DataTorrent’s commercial RTS 3 product and related ingestion tools.
Apache Apex from DataTorrent
Apache Apex Main Page
Apache Apex Proposal
Netflix PigPen
PigPen is a Clojure map‑reduce library that compiles to Apache Pig. It allows Clojure functions, named or anonymous, to be used directly in Pig scripts, and was open‑sourced by Netflix.
PigPen on GitHub
AMPLab SIMR
SIMR enables users to run Apache Spark directly on Hadoop MapReduce v1 clusters without installing Spark or Scala on the nodes, simplifying deployment for environments lacking YARN.
SIMR on GitHub
Facebook Corona
Corona is Facebook’s next‑generation MapReduce implementation built on a custom Hadoop fork. It replaces the single JobTracker architecture with multiple per‑job trackers to improve scalability for very large data sets.
Corona on GitHub
Apache REEF
Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications on cluster resource managers such as Hadoop YARN or Apache Mesos. It centralizes control flow, provides an Evaluator runtime for task execution, supports multiple resource managers, and offers .NET and Java APIs.
Apache REEF Website
Apache Twill
Twill abstracts Apache Hadoop YARN to simplify distributed application development using a familiar thread‑based model for Java programmers. It acts as middleware between YARN and applications, handling API interactions and enabling easy construction of multi‑process distributed programs.
Apache Twill Incubator
Damballa (Parkour)
Parkour is a Clojure library that provides deep integration with Hadoop for writing MapReduce programs using standard Clojure functions, allowing full access to the underlying Java Hadoop APIs.
Apache Hama
Apache Hama is a top‑level open‑source project that supports advanced analytics beyond MapReduce, offering a bulk‑synchronous parallel (BSP) model suitable for iterative algorithms such as machine learning and graph processing.
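As a rough illustration of the BSP model, the following plain‑Python sketch simulates peers arranged in a ring that agree on a global maximum. It does not use Hama's API; bsp_max and the ring topology are assumptions chosen for the example. Each superstep has the three BSP phases: local computation, message passing, and a barrier after which the messages become visible.

```python
def bsp_max(values, supersteps=None):
    """Simulate BSP supersteps in which ring-connected peers converge on the max."""
    n = len(values)
    state = list(values)
    inboxes = [[] for _ in range(n)]
    # A value needs at most n - 1 hops to reach every peer in a ring.
    rounds = n - 1 if supersteps is None else supersteps

    for _ in range(rounds):
        # 1. Local computation: fold in messages delivered at the last barrier.
        state = [max([s] + inbox) for s, inbox in zip(state, inboxes)]

        # 2. Communication: every peer sends its value to its right-hand neighbour.
        outboxes = [[] for _ in range(n)]
        for peer, value in enumerate(state):
            outboxes[(peer + 1) % n].append(value)

        # 3. Barrier: messages only become readable once all peers have finished.
        inboxes = outboxes

    # Final local step to consume the messages from the last superstep.
    return [max([s] + inbox) for s, inbox in zip(state, inboxes)]


print(bsp_max([3, 9, 4, 1]))
```

The simulation runs the peers sequentially, but no peer reads a message sent in the same superstep, so the outcome is the same as it would be with truly parallel peers separated by barriers.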
Hama Site
Datasalt Pangool
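Pangool is a Java MapReduce API that aims to replace the standard Hadoop Java MapReduce API. It keeps the low‑level MapReduce model but adds a tuple‑based intermediate schema and simpler job configuration, which makes data processing more expressive.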
Pangool
Pangool on GitHub
Apache Tez
Tez proposes a generic application framework for building complex DAG‑based data processing tasks that run natively on Apache Hadoop YARN. It offers developers higher performance and flexibility compared to traditional MapReduce, addressing use cases such as low‑latency SQL queries and machine learning workloads.
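A DAG of processing tasks can be sketched in plain Python with the standard‑library graphlib module (Python 3.9 or later). This is not the Tez API; run_dag, the task names, and the sample data are hypothetical. The point is only that each task runs after its upstream tasks and receives their outputs directly, instead of being forced into a fixed map‑then‑reduce shape.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run callables in dependency order, passing each one its upstream outputs."""
    results = {}
    # dependencies maps a task name to the list of tasks it depends on.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return results


def total_by_country(joined):
    """Aggregate order amounts per country."""
    totals = {}
    for country, amount in joined:
        totals[country] = totals.get(country, 0) + amount
    return totals


# Hypothetical pipeline: two scans feed a join, which feeds an aggregation.
tasks = {
    "scan_orders": lambda: [("ana", 30), ("bo", 5), ("ana", 12)],
    "scan_users": lambda: {"ana": "DE", "bo": "US"},
    "join": lambda orders, users: [(users[name], amount) for name, amount in orders],
    "aggregate": total_by_country,
}
dependencies = {
    "join": ["scan_orders", "scan_users"],
    "aggregate": ["join"],
}

results = run_dag(tasks, dependencies)
print(results["aggregate"])
```

In a real DAG engine the two scans could run concurrently because neither depends on the other; the sketch runs them one after another for simplicity.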
Apache Tez Incubator
Hortonworks Apache Tez Page
Apache DataFu
DataFu provides a collection of higher‑level Hadoop MapReduce jobs and user‑defined functions for data analysis, including statistical tasks, PageRank, sessionization, and set operations. It originated from LinkedIn’s Pig UDFs.
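Sessionization, one of the tasks mentioned above, groups a user's events into sessions separated by periods of inactivity. The sketch below shows the idea in plain Python; it is not DataFu's UDF, and the sessionize function, the 30‑minute timeout, and the sample events are assumptions for illustration.

```python
from datetime import datetime, timedelta


def sessionize(events, timeout=timedelta(minutes=30)):
    """Label each (user, timestamp) event with a per-user session number.

    A new session starts when the gap since the user's previous event
    is longer than the timeout.
    """
    last_seen = {}
    session_no = {}
    labelled = []
    # Sorting by (user, timestamp) puts each user's events in time order.
    for user, ts in sorted(events):
        previous = last_seen.get(user)
        if previous is None or ts - previous > timeout:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        labelled.append((user, ts, session_no[user]))
    return labelled


t0 = datetime(2024, 1, 1, 9, 0)
events = [
    ("ana", t0),
    ("bo", t0 + timedelta(minutes=5)),
    ("ana", t0 + timedelta(minutes=10)),
    ("ana", t0 + timedelta(minutes=90)),  # 80-minute gap starts a new session
]

sessions = sessionize(events)
for user, ts, number in sessions:
    print(user, ts.time(), number)
```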
DataFu Apache Incubator
Pydoop
Pydoop is a Python library offering MapReduce and HDFS APIs for Hadoop, built on C++ pipes and the libhdfs C API. It enables full‑featured Python applications to access Hadoop’s functionality, surpassing the capabilities of Hadoop Streaming and Jython.
Pydoop Site
Pydoop GitHub Project
Kangaroo
Kangaroo, an open‑source project from Conductor, provides a MapReduce job that consumes Kafka data. Unlike solutions that limit each Kafka partition to a single InputSplit, Kangaroo can launch multiple consumers at different offsets within a single partition, increasing throughput and parallelism.
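The idea of reading one partition with several consumers can be sketched as cutting the partition's offset range into bounded splits, each of which could be handed to its own mapper. This is not Kangaroo's code or API; split_partition and its parameters are hypothetical names used only to illustrate the technique.

```python
def split_partition(start_offset, end_offset, max_split_size):
    """Cut the half-open offset range [start, end) into bounded splits.

    Each (low, high) pair could be consumed independently, so a single
    partition is no longer limited to one reader.
    """
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")
    return [
        (low, min(low + max_split_size, end_offset))
        for low in range(start_offset, end_offset, max_split_size)
    ]


# A partition holding offsets 100..349, read in chunks of at most 100 messages.
splits = split_partition(100, 350, 100)
print(splits)
```

Smaller splits give more parallelism at the cost of more tasks to schedule, so the split size is a tuning knob rather than a fixed constant.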
Kangaroo Introduction
Kangaroo GitHub Project
TinkerPop
TinkerPop is a Java‑based graph computing framework that defines a core API for graph system vendors. It supports various graph systems (in‑memory, OLTP, OLAP) and enables queries via the Gremlin traversal language and graph‑processing algorithms.
Apache TinkerPop Proposal
TinkerPop Website
Pachyderm MapReduce
Pachyderm is a new MapReduce engine built on Docker and CoreOS. In Pachyderm MapReduce (PMR), jobs run as Docker containers (micro‑services) that receive data via HTTP, process it, and store results back to the filesystem, automatically constructing a DAG of job dependencies.
Pachyderm Site
Pachyderm Introduction Article
Apache Beam
Apache Beam is an open‑source unified model for defining and executing data‑parallel processing pipelines, providing language‑specific SDKs and runners for various execution environments.
The Beam model originates from several internal Google projects, including MapReduce, FlumeJava, and MillWheel, and was first implemented as Google Cloud Dataflow. In January 2016, Google and partners submitted Beam as an Apache incubation proposal.
Apache Beam Proposal
DataFlow Beam and Spark Comparison
