Overview of Open-Source Real-Time Stream Processing Systems

This article gives a concise overview of several real-time stream processing and log-collection platforms, most of them open source: S4, Storm, StreamBase, HStreaming, Esper/NEsper, Kafka, Scribe, and Flume. For each, it notes the primary features, the implementation language, and the project link as a starting point for further technical research.

Art of Distributed System Architecture Design

Jun 11, 2016

S4

S4 (Simple Scalable Streaming System) is an open-source stream-processing platform released by Yahoo. It is a general-purpose, distributed, scalable, fault-tolerant, and pluggable system that lets developers build applications in Java for processing continuous, unbounded streams of data.

Project link: http://incubator.apache.org/s4/ (Note: S4 0.5.0 adds TCP connectivity and state recovery features).

Storm

Storm, open-sourced by Twitter, is a distributed real-time computation system. Its simple API lets developers reliably process unbounded streams for use cases such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Storm itself is implemented in Clojure and Java; components written in other languages communicate with it over stdin/stdout using a JSON-based protocol.

Project link: http://storm-project.net
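To make the stdin/stdout JSON exchange mentioned above more concrete, here is a minimal sketch of a non-JVM component that splits sentences into words. It assumes each message is a JSON document terminated by a line containing only "end", and that messages carry "id", "tuple", "command", and "anchors" fields. These details are simplified assumptions for illustration (the handshake and other fields are omitted) and should be checked against Storm's own documentation before use.

```python
import io
import json


def read_message(stream):
    """Collect lines until a line containing only 'end', then parse them as one JSON document."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def write_message(stream, message):
    """Write one JSON document followed by the 'end' delimiter line."""
    stream.write(json.dumps(message) + "\nend\n")
    stream.flush()


def run_split_bolt(stdin, stdout):
    """Split each incoming sentence into words, emit every word, then acknowledge the input."""
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        for word in msg["tuple"][0].split():
            # Field names below are illustrative assumptions, not a verified wire format.
            write_message(stdout, {"command": "emit", "anchors": [msg["id"]], "tuple": [word]})
        write_message(stdout, {"command": "ack", "id": msg["id"]})
```

In a real deployment the two streams would be sys.stdin and sys.stdout; passing them in as parameters keeps the sketch easy to exercise with in-memory buffers such as io.StringIO.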

StreamBase

StreamBase is a Complex Event Processing (CEP) and event‑stream platform. Although it is commercial software, a Developer Edition is available, and applications are written in Java.

Project link: http://www.streambase.com

HStreaming

HStreaming is built on Hadoop and integrates tightly with the Hadoop ecosystem to provide real-time stream processing, so users can analyze streaming data and stored big data within the same environment. It is developed in Java.

Project link: http://www.hstreaming.com

Esper & NEsper

Esper (Java) and NEsper (.NET) are CEP platforms that allow developers to quickly develop and deploy applications handling large volumes of messages and events, both historical and real‑time.

Project link: http://esper.codehaus.org
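As a rough illustration of the kind of rule a CEP engine evaluates over an event stream, the sketch below fires an alert when the average of the last few values exceeds a threshold. It is a toy stand-in written in plain Python and does not use Esper's API or query language; the class name, the alert format, and the parameters are all hypothetical.

```python
from collections import deque


class SlidingAverageRule:
    """Fire an alert when the average of the last `size` values exceeds `threshold`."""

    def __init__(self, size, threshold):
        # deque(maxlen=...) silently discards the oldest value once the window is full.
        self.window = deque(maxlen=size)
        self.threshold = threshold

    def on_event(self, value):
        """Add one value to the window; return an alert dict, or None if the rule does not fire."""
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        window_full = len(self.window) == self.window.maxlen
        if window_full and average > self.threshold:
            return {"alert": "average_above_threshold", "average": average}
        return None


rule = SlidingAverageRule(size=3, threshold=10)
results = [rule.on_event(v) for v in [5, 9, 12, 15]]
# Only the last event fires: the window then holds 9, 12, 15, whose average is 12.0.
```

A real CEP engine lets such rules be declared rather than hand-coded and evaluates many of them concurrently, which is the main reason to use one instead of a loop like this.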

Kafka

Kafka, open-sourced by LinkedIn in December 2010, is a high-throughput, distributed publish-subscribe messaging system used primarily for handling activity stream data. It is written in Scala.

Project link: http://incubator.apache.org/kafka
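The publish-subscribe model can be pictured as an append-only log that several independent subscriber groups read at their own pace. The sketch below is a single-process toy model of that idea, not Kafka's client API; the Topic class and its method names are hypothetical.

```python
class Topic:
    """An append-only list of messages; each subscriber group tracks its own read position."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # group name -> index of the next message to read

    def publish(self, message):
        """Append a message and return its position in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_messages=10):
        """Return up to `max_messages` unread messages for `group` and advance its position."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic()
topic.publish("page_view:/home")
topic.publish("click:/signup")
first = topic.poll("analytics")        # both messages
second = topic.poll("analytics")       # nothing new for this group
audit = topic.poll("audit", max_messages=1)  # a different group starts from the beginning
```

Because messages are never removed when one group reads them, adding a new subscriber does not affect existing ones, which is what makes this pattern convenient for feeding the same stream to several downstream systems.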

Scribe

Scribe is Facebook's open-source log-collection system, written in C++. Because its interface is defined with Thrift, clients can be written in many languages. It aggregates logs from various sources into a central storage system (e.g., NFS or a distributed file system) for centralized analysis. It is frequently paired with Hadoop: Scribe pushes logs to HDFS, and Hadoop processes them with MapReduce.

Project link: http://github.com/facebook/scribe

Flume

Flume, provided by Cloudera, is a distributed, reliable, highly available log‑collection system for gathering, aggregating, and moving large volumes of log data. It is written in Java and allows custom data sources and sinks, as well as simple data transformations before delivery.

Project link: http://incubator.apache.org/flume
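The source, transformation, and sink arrangement described above can be sketched as a small pipeline function. This is a plain-Python illustration of the pattern only, not Flume's configuration format or API; run_pipeline and the sample records are hypothetical.

```python
def run_pipeline(source, transforms, sink):
    """Pull each record from `source`, apply `transforms` in order, and pass the result to `sink`.

    A transform may return None to drop the record. Returns the number of records delivered.
    """
    delivered = 0
    for record in source:
        for transform in transforms:
            record = transform(record)
            if record is None:
                break  # record was filtered out; skip the remaining transforms
        if record is not None:
            sink(record)
            delivered += 1
    return delivered


# A custom source (any iterable), two simple transformations, and a custom sink (any callable).
lines = ["INFO start", "DEBUG noise", "ERROR disk full"]
collected = []
count = run_pipeline(
    iter(lines),
    [lambda r: None if r.startswith("DEBUG") else r, str.lower],
    collected.append,
)
```

Keeping the source and sink as interchangeable parameters mirrors the point made in the text: the collection logic stays the same while the endpoints are swapped for whatever systems produce and store the logs.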


