Attributes Matrix and Data Flow Models of Apache Streaming Platforms
This article presents a comprehensive attributes matrix and data‑flow model overview for major Apache streaming platforms, comparing versions, sponsors, event handling, fault tolerance, processing order, latency, resource management, APIs, and supported connectors to aid practical technology selection.
Having many options is beneficial but can also be confusing. Trying every platform is time‑consuming, so reading the documentation is a quick way to narrow the choices before a deeper evaluation against specific scenarios.
There is no "best" technology, only the most suitable one; the selection should match the requirements, the project type, and the team, balancing practicality with a touch of technical enthusiasm.
Attributes Matrix
In the article "Apache Streaming Projects Overview" I translated Janakiram's work and, after finding Ian Hellstrom's summary on InfoQ, converted the graphic into the following matrix, updating it where necessary.
Table 1: Quality Attributes of Streaming Platforms

| Streaming Platform | Current Version | Main Sponsor | Event Size | Message Delivery Guarantee | State Management |
| --- | --- | --- | --- | --- | --- |
| Flume | 1.8.0 | Apple, Cloudera | single | at least once | transactional update |
| NiFi | 1.5.0 | Hortonworks | single | at least once | local & distributed snapshots |
| Gearpump | 0.8.4 | Intel, Lightbend | single | exactly once (or at least once if fault‑tolerance not needed) | checkpoints |
| Apex | Apex Core 3.6.0, Apex Malhar 3.8.0 | Data Torrent | single | exactly once | checkpoints |
| Kafka Streams | 1.0 | Confluent | single | at least once | local & distributed snapshots |
| Spark Streaming | 2.2.1 | AMPLab, Databricks | micro‑batch | exactly once (or at least once if fault‑tolerance not needed) | checkpoints |
| Storm | 1.1.1 | Backtype, Twitter | single | at least once | record acknowledgements |
| Samza | 0.14.0 | | single | at least once | local snapshots; distributed snapshots support fault‑tolerance |
| Flink | 1.4.0 | dataArtisans | single | exactly once | distributed snapshots |
| Ignite Streaming | 2.3.0 | GridGain | single | at least once | checkpoints |
| Beam | 2.2.0 | | single | exactly once | transactional update |
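The difference between the "at least once" and "exactly once" guarantees in the table can be illustrated with a small simulation. This is only a sketch under stated assumptions, not any platform's implementation: `deliver_at_least_once`, `IdempotentSink`, and the `lost_acks` set are hypothetical names invented for this example, and deduplicating by event id is just one common way to get exactly-once results on top of at-least-once delivery.

```python
def deliver_at_least_once(events, lost_acks):
    """Yield every event; resend any event whose ack was (hypothetically) lost."""
    for event in events:
        yield event                      # first delivery
        if event[0] in lost_acks:        # the sender never saw the ack ...
            yield event                  # ... so it retries: a duplicate arrives


class IdempotentSink:
    """Hypothetical sink that ignores event ids it has already applied."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def write(self, event):
        event_id, amount = event
        if event_id in self.seen:        # duplicate: applying it again would double count
            return False
        self.seen.add(event_id)
        self.total += amount
        return True


events = [(i, 10) for i in range(5)]     # (event id, amount)
delivered = list(deliver_at_least_once(events, lost_acks={1, 3}))

# A naive consumer counts the two duplicates as well.
naive_total = sum(amount for _, amount in delivered)

sink = IdempotentSink()
for event in delivered:
    sink.write(event)

print(len(delivered), naive_total, sink.total)  # 7 70 50
```

With at-least-once delivery alone, the downstream total is inflated by the retries; the idempotent sink restores the correct result at the cost of remembering which ids it has seen.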
Table 1 (continued): Quality Attributes

| Streaming Platform | Fault Tolerance | Processing Order | Event Priority | Windowing | Back‑pressure |
| --- | --- | --- | --- | --- | --- |
| Flume | yes (only for file channel) | no | no | no | no |
| NiFi | yes | no | yes | no | yes |
| Gearpump | yes | yes | programmable | time‑based | yes |
| Apex | yes | no | programmable | time‑based | yes |
| Kafka Streams | yes | yes | programmable | time‑based | N/A |
| Spark Streaming | yes | no | programmable | time‑based | yes |
| Storm | yes | yes | programmable | time‑based, count‑based | yes |
| Samza | yes | yes (not in single‑partition case) | programmable | time‑based, count‑based | yes |
| Flink | yes | yes | programmable | time‑based, count‑based | yes |
| Ignite Streaming | yes | yes | programmable | time‑based, count‑based | yes |
| Beam | yes | yes | programmable | time‑based | yes |
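The Windowing column distinguishes time‑based from count‑based windows. The following sketch shows the difference on a tiny in-memory stream. It assumes the simplest variants (fixed, non-overlapping windows), and the function names are made up for this illustration; real platforms add concerns such as late events and triggers that are left out here.

```python
from collections import defaultdict


def tumbling_time_windows(events, size):
    """Group (timestamp, value) events into fixed, non-overlapping time windows."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // size) * size   # every event falls into exactly one window
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


def count_windows(events, size):
    """Group events into windows of `size` elements, ignoring timestamps."""
    values = [value for _, value in events]
    # The last window may be partial; it is emitted here for simplicity.
    return [values[i:i + size] for i in range(0, len(values), size)]


events = [(0, "a"), (3, "b"), (4, "c"), (11, "d"), (12, "e"), (25, "f")]

by_time = tumbling_time_windows(events, size=10)
by_count = count_windows(events, size=4)

print(by_time)   # {0: ['a', 'b', 'c'], 10: ['d', 'e'], 20: ['f']}
print(by_count)  # [['a', 'b', 'c', 'd'], ['e', 'f']]
```

The same six events end up grouped differently: time‑based windows follow the event timestamps, while count‑based windows depend only on arrival order.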
Table 1 (continued): Quality Attributes

| Streaming Platform | Data Abstraction | Data Flow | Latency | Resource Management | Auto‑scaling |
| --- | --- | --- | --- | --- | --- |
| Flume | Event | agent | low | native | no |
| NiFi | FlowFile | flow | configurable | native | no |
| Gearpump | Message | streaming application | very low | YARN | no |
| Apex | Tuple | streaming application | very low | YARN | yes |
| Kafka Streams | KafkaStream | process topology | very low | YARN, Mesos, Chef, Puppet, Salt, Kubernetes, etc. | yes |
| Spark Streaming | DStream | application | medium | YARN, Mesos | yes |
| Storm | Tuple | topology | very low | YARN, Mesos | no |
| Samza | Message | job | low | YARN | no |
| Flink | DataStream | streaming dataflow | low (configurable) | YARN | no |
| Ignite Streaming | IgniteDataStreamer | job | very low | YARN, Mesos | no |
| Beam | PCollection | pipeline | low | integrated | yes |
Table 1 (continued): Quality Attributes

| Streaming Platform | Hot Modification | API | Main Development Language | API Language |
| --- | --- | --- | --- | --- |
| Flume | no | declarative | Java | text files, Java |
| NiFi | yes | compositional | Java | REST (GUI) |
| Gearpump | yes | declarative | Scala | Scala, Java |
| Apex | yes | declarative | Java | Java |
| Kafka Streams | yes | declarative | Java | Java |
| Spark Streaming | no | declarative | Scala | Scala, Java, Python |
| Storm | yes | compositional | Clojure | Scala, Java, Clojure, Python, Ruby |
| Samza | no | compositional | Scala | Java |
| Flink | no | declarative | Java | Java, Scala, Python |
| Ignite Streaming | no | declarative | Java | Java, .NET, C++ |
| Beam | no | declarative | Java | Java |
Data Flow Model
When processing streaming data, one must consume upstream sources and output processed data to storage for later analysis. This end‑to‑end movement is a data flow, and the design elements involved constitute a "Data Flow Model".
Different streaming platforms define their own data flow abstractions. Below is a brief summary for Flume, Flink, Storm, Apex, and NiFi.
Flume
Flume's data flow model consists of Source, Channel, and Sink within an Agent.
Built‑in Sources include: Avro, Thrift, JMS, Taildir, Exec, Spooling Directory, Twitter, Kafka, NetCat, Sequence Generator, Syslog, HTTP.
Built‑in Sinks include: HDFS, Hive, Logger, Avro, Thrift, IRC, File Roll, HBase, Solr, Elasticsearch, Kite Dataset, Kafka, HTTP.
Flume also supports custom Sources, Sinks, and Channels.
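The Source, Channel, and Sink roles can be sketched in a few lines of plain Python. This is a conceptual model only and does not use Flume's API: `Channel`, `FlakySink`, `drain`, and `run_agent` are hypothetical names, and the rollback step merely imitates the idea that events stay in the channel until the sink has stored them.

```python
from collections import deque


class Channel:
    """Bounded buffer between source and sink (stand-in for a Flume Channel)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        if len(self.queue) >= self.capacity:
            return False                       # full: the source has to wait
        self.queue.append(event)
        return True

    def take_batch(self, n):
        return [self.queue.popleft() for _ in range(min(n, len(self.queue)))]

    def rollback(self, batch):
        self.queue.extendleft(reversed(batch))  # put the batch back in original order


class FlakySink:
    """Hypothetical sink whose first write fails, to exercise the rollback path."""

    def __init__(self):
        self.stored = []
        self.failed_once = False

    def __call__(self, batch):
        if not self.failed_once:
            self.failed_once = True
            raise IOError("simulated sink failure")
        self.stored.extend(batch)


def drain(channel, sink, batch_size):
    batch = channel.take_batch(batch_size)
    try:
        sink(batch)                            # "commit": events leave the channel for good
    except IOError:
        channel.rollback(batch)                # events remain buffered and are retried


def run_agent(source, channel, sink, batch_size=2):
    for event in source:
        while not channel.put(event):          # channel full: drain before accepting more
            drain(channel, sink, batch_size)
    while channel.queue:
        drain(channel, sink, batch_size)


sink = FlakySink()
run_agent(["e1", "e2", "e3", "e4", "e5"], Channel(capacity=3), sink)
print(sink.stored)  # ['e1', 'e2', 'e3', 'e4', 'e5']
```

Because a failed batch goes back into the channel, no event is lost even though the first write fails, which is the intuition behind the "at least once" entry for Flume in Table 1.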
Flink
Flink abstracts the data flow model as Connectors, which link Sources and Sinks; some connectors are source‑only or sink‑only.
Supported connectors include: Kafka (Source/Sink), Elasticsearch (Sink), HDFS (Sink), RabbitMQ (Source/Sink), Amazon Kinesis Streams (Source/Sink), Twitter (Source), NiFi (Source/Sink), Cassandra (Sink), Redis, Flume, ActiveMQ (Sink).
Flink also allows user‑defined connectors.
Storm
Storm models data flow with Spouts (sources) and Bolts (processors). It integrates with many external systems and provides corresponding Spouts and Bolts for them.
Integrated external systems include: Kafka, HBase, HDFS, Hive, Solr, Cassandra, JDBC, JMS, Redis, Event Hubs, Elasticsearch, MQTT, MongoDB, OpenTSDB, Kinesis, Druid, Kestrel, etc.
Both Storm and Storm Trident support custom Spouts and Bolts.
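To make the Spout and Bolt roles concrete, here is a word-count pipeline modelled with plain Python generators. It is a sketch of the idea rather than Storm code: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical names, everything runs in a single process, and the stream is a simple linear chain, whereas a real topology can be distributed and can branch.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one single-field tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tuples):
    """Processing step: one input tuple can produce several output tuples."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(tuples):
    """Stateful processing step: emits the running count for each word."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(spout, *bolts):
    """Wire the spout to the bolts in order and collect the final stream."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


result = run_topology(
    sentence_spout(["the cow jumped", "the moon"]),
    split_bolt,
    count_bolt,
)
print(result)
# [('the', 1), ('cow', 1), ('jumped', 1), ('the', 2), ('moon', 1)]
```

Each stage only knows about the tuples it receives and emits, which is what makes it straightforward to swap in a different source or add another processing step.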
Apex
Apex calls its data flow elements Operators, separating them into Input Operators (Sources), Output Operators (Sinks), and Compute Operators (processing).
Apex Malhar supports Input/Output Operators for file systems (HDFS, S3, NFS, local), relational databases (Oracle, MySQL, SQLite), NoSQL databases (HBase, Cassandra, Accumulo, Aerospike, MongoDB, CouchDB), messaging systems (Kafka, JMS, ZeroMQ, RabbitMQ), notification systems (SMTP), in‑memory stores (Memcached, Redis), social media (Twitter), and protocols (HTTP, RSS, Socket, WebSocket, FTP, MQTT).
Apex also allows user‑defined Operators written in Java, JavaScript, Python, R, or Ruby.
NiFi
NiFi's primary abstraction is the Processor, and its built‑in processors cover a rich set of data sources and destinations.
Common data ingestion processors include: GetFile, GetFTP, GetSFTP, GetJMSQueue, GetJMSTopic, GetHTTP, ListenHTTP, ListenUDP, GetHDFS, ListHDFS/FetchHDFS, FetchS3Object, GetKafka, GetMongo, GetTwitter.
Data output processors include: PutEmail, PutFile, PutFTP, PutSFTP, PutJMS, PutSQL, PutKafka, PutMongo.
NiFi also supports custom Processors by extending the AbstractProcessor class, which can be added to the GUI like built‑in processors.
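The shape of a custom processor can be sketched as follows. NiFi processors are written in Java, so this Python version is only an analogy under stated assumptions: the `Processor` base class, the simplified `FlowFile`, `UppercaseText`, and `run_flow` are hypothetical stand-ins, and the single-argument `on_trigger` method is a simplification of the callback a real processor implements. What it does preserve is the idea that a processor receives a FlowFile (content plus attributes) and routes it to a named relationship.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Simplified FlowFile: a content payload plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Processor(ABC):
    """Hypothetical stand-in for the base class a custom processor extends."""

    @abstractmethod
    def on_trigger(self, flowfile: FlowFile) -> str:
        """Process one FlowFile and return the relationship to route it to."""


class UppercaseText(Processor):
    """Custom processor: uppercases UTF-8 text, routes undecodable input to failure."""

    def on_trigger(self, flowfile):
        try:
            text = flowfile.content.decode("utf-8")
        except UnicodeDecodeError:
            return "failure"
        flowfile.content = text.upper().encode("utf-8")
        flowfile.attributes["uppercased"] = "true"
        return "success"


def run_flow(processor, flowfiles):
    """Feed FlowFiles through one processor and group them by relationship."""
    routed = {"success": [], "failure": []}
    for flowfile in flowfiles:
        routed[processor.on_trigger(flowfile)].append(flowfile)
    return routed


routed = run_flow(UppercaseText(), [FlowFile(b"hello"), FlowFile(b"\xff\xfe")])
print(routed["success"][0].content)   # b'HELLO'
print(len(routed["failure"]))         # 1
```

Downstream steps would then be attached to the "success" and "failure" relationships separately, which is how a flow can handle bad input without stopping.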
Source: http://zhangyi.xyz/technical-choice-of-streaming-platform/
