Attributes Matrix and Data Flow Models of Apache Streaming Platforms

This article presents a comprehensive attributes matrix and data‑flow model overview for major Apache streaming platforms, comparing versions, sponsors, event handling, fault tolerance, processing order, latency, resource management, APIs, and supported connectors to aid practical technology selection.

Architecture Digest

Mar 14, 2018

Having many options is a benefit, but it can also be confusing. Trying every platform takes too long, so reading the documentation is a quick way to narrow the field before evaluating the remaining candidates in depth against specific scenarios.

No technology is the "best"; there is only the most suitable one. The choice should fit the requirements, the type of project, and the team, balancing practicality against a measure of technical enthusiasm.

Attributes Matrix

In the earlier article "Apache Streaming Projects Overview" I translated Janakiram's work. I later found Ian Hellstrom's summary on InfoQ and converted its graphic into the matrix below, updating entries where necessary.

Table 1: Quality Attributes of Streaming Platforms

| Streaming Platform | Current Version | Main Sponsor | Event Size | Message Delivery Guarantee | State Management |
| --- | --- | --- | --- | --- | --- |
| Flume | 1.8.0 | Apple, Cloudera | single | at least once | transactional update |
| NiFi | 1.5.0 | Hortonworks | single | at least once | local & distributed snapshots |
| Gearpump | 0.8.4 | Intel, Lightbend | single | exactly once (or at least once if fault‑tolerance not needed) | checkpoints |
| Apex | Apex Core 3.6.0, Apex Malhar 3.8.0 | Data Torrent | single | exactly once | checkpoints |
| Kafka Streams | 1.0 | Confluent | single | at least once | local & distributed snapshots |
| Spark Streaming | 2.2.1 | AMPLab, Databricks | micro‑batch | exactly once (or at least once if fault‑tolerance not needed) | checkpoints |
| Storm | 1.1.1 | Backtype, Twitter | single | at least once | record acknowledgements |
| Samza | 0.14.0 | | single | at least once | local snapshots; distributed snapshots support fault‑tolerance |
| Flink | 1.4.0 | dataArtisans | single | exactly once | distributed snapshots |
| Ignite Streaming | 2.3.0 | GridGain | single | at least once | checkpoints |
| Beam | 2.2.0 | Google | single | exactly once | transactional update |
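The difference between the "at least once" and "exactly once" entries in the Message Delivery Guarantee column can be illustrated with a small sketch. This is a toy model in plain Python, not the API of any platform in the table. The names `deliver_at_least_once` and `DedupConsumer`, and the failure model of a "lost acknowledgement", are assumptions made for illustration only.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield (msg_id, amount) pairs; resend once when the ack was lost.

    `lost_acks` is a set of message ids whose first acknowledgement is
    assumed lost, which forces a redelivery (hypothetical failure model).
    """
    for msg_id, amount in messages:
        yield msg_id, amount
        if msg_id in lost_acks:
            # The sender never saw the ack, so it sends the message again.
            yield msg_id, amount


class DedupConsumer:
    """Applies each message id once, so a redelivery has no extra effect."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, amount):
        if msg_id in self.seen:
            return False  # duplicate: skip the side effect
        self.seen.add(msg_id)
        self.total += amount
        return True


messages = [(1, 10), (2, 20), (3, 30)]
deliveries = list(deliver_at_least_once(messages, lost_acks={2}))

# A consumer that trusts every delivery counts message 2 twice.
naive_total = sum(amount for _, amount in deliveries)

consumer = DedupConsumer()
for msg_id, amount in deliveries:
    consumer.handle(msg_id, amount)
```

Here the naive sum is 80 because message 2 arrives twice, while the deduplicating consumer ends at 60. Real platforms reach exactly-once results through mechanisms such as checkpoints or distributed snapshots, as listed in the State Management column; remembering message IDs is only the simplest way to show the idea.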

Table 1 (continued): Quality Attributes of Streaming Platforms

| Streaming Platform | Fault Tolerance | Processing Order | Event Priority | Windowing | Back‑pressure |
| --- | --- | --- | --- | --- | --- |
| Flume | yes (only for file channel) | | | | |
| NiFi | yes | | | | |
| Gearpump | yes | | programmable | time‑based | yes |
| Apex | yes | | programmable | time‑based | yes |
| Kafka Streams | yes | | programmable | time‑based | N/A |
| Spark Streaming | yes | | programmable | time‑based | yes |
| Storm | yes | | programmable | time‑based, count‑based | yes |
| Samza | yes | yes (not in single‑partition case) | programmable | time‑based, count‑based | yes |
| Flink | yes | | programmable | time‑based, count‑based | yes |
| Ignite Streaming | yes | | programmable | time‑based, count‑based | yes |
| Beam | yes | | programmable | time‑based | yes |
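The Windowing column distinguishes time-based from count-based windows. As a rough illustration, the following plain-Python sketch groups the same events both ways. It shows only the simplest variant of each (fixed, non-overlapping windows) and is not taken from any platform's API; the function names `count_windows` and `tumbling_time_windows` and the sample events are hypothetical.

```python
from collections import defaultdict


def count_windows(events, size):
    """Group events into consecutive windows of `size` events each."""
    return [events[i:i + size] for i in range(0, len(events), size)]


def tumbling_time_windows(events, width):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    An event with timestamp t falls into the window that starts at
    (t // width) * width.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // width) * width
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


# (timestamp in seconds, value) pairs; the gaps between events are uneven.
events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (11, "e")]

by_count = count_windows(events, size=2)          # windows of 2, 2, 1 events
by_time = tumbling_time_windows(events, width=5)  # windows start at 0, 5, 10
```

A count-based window closes after a fixed number of events regardless of when they arrive, whereas a time-based window closes on the clock, so its size varies with the arrival rate.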

Table 1 (continued): Quality Attributes of Streaming Platforms

| Streaming Platform | Data Abstraction | Data Flow | Latency | Resource Management | Auto‑scaling |
| --- | --- | --- | --- | --- | --- |
| Flume | Event | agent | low | native | |
| NiFi | FlowFile | flow | configurable | native | |
| Gearpump | Message | streaming application | very low | YARN | |
| Apex | Tuple | streaming application | very low | YARN | yes |
| Kafka Streams | KafkaStream | process topology | very low | YARN, Mesos, Chef, Puppet, Salt, Kubernetes, etc. | yes |
| Spark Streaming | DStream | application | medium | YARN, Mesos | yes |
| Storm | Tuple | topology | very low | YARN, Mesos | |
| Samza | Message | job | low | YARN | |
| Flink | DataStream | streaming dataflow | low (configurable) | YARN | |
| Ignite Streaming | IgniteDataStreamer | job | very low | YARN, Mesos | |
| Beam | PCollection | pipeline | low | integrated | yes |

Table 1 (continued): Quality Attributes of Streaming Platforms

| Streaming Platform | Hot Modification | API | Main Development Language | API Language |
| --- | --- | --- | --- | --- |
| Flume | | declarative | Java | text files, Java |
| NiFi | yes | compositional | Java | REST (GUI) |
| Gearpump | yes | declarative | Scala | Scala, Java |
| Apex | yes | declarative | Java | |
| Kafka Streams | yes | declarative | Java | |
| Spark Streaming | | declarative | Scala | Scala, Java, Python |
| Storm | yes | compositional | Clojure | Scala, Java, Clojure, Python, Ruby |
| Samza | | compositional | Scala | Java |
| Flink | | declarative | Java | Java, Scala, Python |
| Ignite Streaming | | declarative | Java | Java, .NET, C++ |
| Beam | | declarative | Java | |

Data Flow Model

When processing streaming data, one must consume upstream sources and output processed data to storage for later analysis. This end‑to‑end movement is a data flow, and the design elements involved constitute a "Data Flow Model".

Different streaming platforms define their own data flow abstractions. Below is a brief summary for Flume, Flink, Storm, Apex, and NiFi.

Flume

Flume's data flow model consists of Source, Channel, and Sink within an Agent.

Built‑in Sources include: Avro, Thrift, JMS, Taildir, Exec, Spooling Directory, Twitter, Kafka, NetCat, Sequence Generator, Syslog, HTTP.

Built‑in Sinks include: HDFS, Hive, Logger, Avro, Thrift, IRC, File Roll, HBase, Solr, Elasticsearch, Kite Dataset, Kafka, HTTP.

Flume also supports custom Sources, Sinks, and Channels.
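To make the Source, Channel, Sink split concrete, here is a toy model in plain Python. It is a sketch of the idea rather than Flume's actual Java API: the `Channel` class, `run_source`, `run_sink`, and the event shape (a dict with `headers` and `body`) are assumptions chosen for illustration. It shows why the channel sits between the two ends: the sink takes a batch, and the batch leaves the channel for good only after the write succeeds.

```python
from collections import deque


class Channel:
    """Buffer between source and sink with a take/commit/rollback cycle."""

    def __init__(self):
        self._queue = deque()
        self._in_flight = []

    def put(self, event):
        self._queue.append(event)

    def take(self, batch_size):
        while self._queue and len(self._in_flight) < batch_size:
            self._in_flight.append(self._queue.popleft())
        return list(self._in_flight)

    def commit(self):
        self._in_flight.clear()

    def rollback(self):
        # Return the uncommitted batch to the front, preserving order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue)


def run_source(lines, channel):
    """Source: turn each input line into an event and put it on the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def run_sink(channel, write, batch_size=2):
    """Sink: take a batch, write it, then commit; roll back if the write fails."""
    batch = channel.take(batch_size)
    if not batch:
        return 0
    try:
        write(batch)
    except OSError:
        channel.rollback()
        return 0
    channel.commit()
    return len(batch)


stored = []


def failing_write(batch):
    raise OSError("destination unavailable")


def good_write(batch):
    stored.extend(event["body"] for event in batch)


channel = Channel()
run_source(["a", "b", "c"], channel)

failed = run_sink(channel, failing_write)  # rolled back, nothing is lost
first = run_sink(channel, good_write)      # delivers "a" and "b"
second = run_sink(channel, good_write)     # delivers "c"
```

The failed attempt leaves all three events in the channel, so the later successful runs still deliver them in order. This mirrors the "transactional update" entry for Flume in the matrix, under the simplifying assumption of a single in-memory channel.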

Flink

Flink abstracts the data flow model as Connectors, which link Sources and Sinks; some connectors are source‑only or sink‑only.

Supported connectors include: Kafka (Source/Sink), Elasticsearch (Sink), HDFS (Sink), RabbitMQ (Source/Sink), Amazon Kinesis Streams (Source/Sink), Twitter (Source), NiFi (Source/Sink), Cassandra (Sink), Redis, Flume, ActiveMQ (Sink).

Flink also allows user‑defined connectors.

Storm

Storm models data flow with Spouts (sources) and Bolts (processors). It integrates many external systems, providing corresponding Spouts and Bolts.

Integrated external systems include: Kafka, HBase, HDFS, Hive, Solr, Cassandra, JDBC, JMS, Redis, Event Hubs, Elasticsearch, MQTT, MongoDB, OpenTSDB, Kinesis, Druid, Kestrel, etc.

Both Storm and Storm Trident support custom Spouts and Bolts.
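The Spout and Bolt roles can be sketched with Python generators. This is a toy word count under simplifying assumptions (one process, no parallelism, no acknowledgements), not Storm's API; the function names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(["the cow", "The moon"])))
emitted = list(topology)
final_counts = dict(emitted)  # later tuples overwrite earlier counts
```

Each stage only sees tuples from the stage before it, which is the property that lets a real topology place spouts and bolts on different workers.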

Apex

Apex calls its data flow elements Operators, separating them into Input Operators (Sources), Output Operators (Sinks), and Compute Operators (processing).

Apex Malhar supports Input/Output Operators for file systems (HDFS, S3, NFS, local), relational databases (Oracle, MySQL, SQLite), NoSQL databases (HBase, Cassandra, Accumulo, Aerospike, MongoDB, CouchDB), messaging systems (Kafka, JMS, ZeroMQ, RabbitMQ), notification systems (SMTP), in‑memory stores (Memcached, Redis), social media (Twitter), and protocols (HTTP, RSS, Socket, WebSocket, FTP, MQTT).

Apex also allows user‑defined Operators written in Java, JavaScript, Python, R, or Ruby.

NiFi

NiFi's primary abstraction is the Processor, offering a rich set of data sources and destinations.

Common data ingestion processors include: GetFile, GetFTP, GetSFTP, GetJMSQueue, GetJMSTopic, GetHTTP, ListenHTTP, ListenUDP, GetHDFS, ListHDFS/FetchHDFS, FetchS3Object, GetKafka, GetMongo, GetTwitter.

Data output processors include: PutEmail, PutFile, PutFTP, PutSFTP, PutJMS, PutSQL, PutKafka, PutMongo.

NiFi also supports custom Processors by extending the AbstractProcessor class, which can be added to the GUI like built‑in processors.
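A custom Processor in NiFi is written in Java, so the following Python is only a conceptual sketch of what such a processor does: it receives a FlowFile (content plus attributes), does its work, and routes the FlowFile to a named relationship. The `FlowFile` dataclass, the `ValidateJson` class, the `run` helper, and the `validation.error` attribute are all hypothetical stand-ins, not NiFi classes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class ValidateJson:
    """Toy processor: routes each FlowFile to 'success' or 'failure'."""

    relationships = ("success", "failure")

    def on_trigger(self, flow_file):
        try:
            json.loads(flow_file.content.decode("utf-8"))
        except ValueError:
            # Covers both invalid JSON and undecodable bytes.
            flow_file.attributes["validation.error"] = "invalid JSON"
            return "failure"
        flow_file.attributes["mime.type"] = "application/json"
        return "success"


def run(processor, flow_files):
    """Feed FlowFiles through one processor and collect them per relationship."""
    routed = {name: [] for name in processor.relationships}
    for flow_file in flow_files:
        routed[processor.on_trigger(flow_file)].append(flow_file)
    return routed


incoming = [FlowFile(b'{"id": 1}'), FlowFile(b"not json")]
routed = run(ValidateJson(), incoming)
```

In a real flow, each relationship would be connected to the next processor on the canvas, so the routing decision made here is what determines where the data goes next.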

Source: http://zhangyi.xyz/technical-choice-of-streaming-platform/
