Overview of Open-Source Real-Time Stream Processing Systems

This article gives a concise overview of several real-time stream processing platforms and related messaging and log collection systems: S4, Storm, StreamBase, HStreaming, Esper/NEsper, Kafka, Scribe, and Flume. For each one it notes the main features, the implementation language, and a project link for further reference.

Art of Distributed System Architecture Design

Sep 23, 2015

S4

S4 (Simple Scalable Streaming System) is an open-source stream computing platform released by Yahoo. It is a general-purpose, distributed, scalable, partially fault-tolerant, and pluggable platform for processing continuous streams of data, and it is developed in Java. Project link: http://incubator.apache.org/s4/ (note: S4 0.5.0 added TCP connections and state recovery).

Storm

Storm, open-sourced by Twitter, is a distributed real-time computation system that lets developers reliably process unbounded streams of data through a simple API. It supports Java and Clojure; components written in other languages can communicate with it over stdin/stdout using a JSON-based protocol. Typical use cases include real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Project link: http://storm-project.net
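To make the stdin/stdout JSON idea concrete, here is a minimal sketch of a non-JVM component that reads JSON messages from one stream and writes JSON replies to another. The framing (each JSON document followed by a line containing only `end`) and the `emit`/`ack` commands are modeled on Storm's multi-language protocol as commonly documented, but this is a simplification: the initial handshake, heartbeats, and error handling are left out, and the function names and the word-splitting logic are hypothetical. In-memory streams stand in for the real pipes so the example runs on its own.

```python
import io
import json


def read_message(stream):
    """Read lines up to a line containing only 'end' and parse them as one JSON document."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def write_message(stream, message):
    """Write one JSON document followed by the 'end' terminator line."""
    stream.write(json.dumps(message) + "\nend\n")
    stream.flush()


def split_sentence_bolt(stdin, stdout):
    """Hypothetical component: emit one tuple per word, then ack the input tuple."""
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        for word in msg["tuple"][0].split():
            write_message(stdout, {"command": "emit", "tuple": [word]})
        write_message(stdout, {"command": "ack", "id": msg["id"]})


# Simulate the parent process with in-memory streams instead of real pipes.
fake_stdin = io.StringIO('{"id": "1", "tuple": ["hello stream world"]}\nend\n')
fake_stdout = io.StringIO()
split_sentence_bolt(fake_stdin, fake_stdout)

# Parse what the component wrote, using the same framing.
replies = []
fake_stdout.seek(0)
while (reply := read_message(fake_stdout)) is not None:
    replies.append(reply)
```

In a real deployment the component would be handed `sys.stdin` and `sys.stdout` by the process that launches it; the framing logic stays the same.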

StreamBase

StreamBase is a commercial complex event processing (CEP) and event-stream processing platform; a free Developer Edition is also available. Applications are developed in Java. Project link: http://www.streambase.com

HStreaming

HStreaming is built on Hadoop and integrates tightly with the Hadoop ecosystem to provide real-time stream computing, so that users can analyze and process big data within the same environment. It is developed in Java. Project link: http://www.hstreaming.com

Esper & NEsper

Esper (Java) and NEsper (.NET) are CEP platforms that simplify developing and deploying applications that process large volumes of historical or real-time messages and events. Project link: http://esper.codehaus.org
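As a rough illustration of the kind of logic a CEP engine evaluates continuously, the sketch below keeps a time-based sliding window over incoming events and raises an alert when the window average crosses a threshold. It is plain Python, not Esper's API or query language; the class name, the 10-second window, the threshold of 115, and the sample events are all assumptions made for the example.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Add an event, evict expired ones, and return the current window average."""
        self._events.append((timestamp, value))
        self._total += value
        # Drop events that have fallen out of the window.
        while self._events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


# Hypothetical event stream: (timestamp in seconds, price).
ticks = [(0, 100.0), (4, 110.0), (8, 120.0), (12, 130.0), (30, 90.0)]

window = SlidingWindowAverage(window_seconds=10)
alerts = []
for ts, price in ticks:
    average = window.add(ts, price)
    if average > 115:  # assumed alert threshold
        alerts.append((ts, average))
```

A CEP platform lets such rules be declared rather than hand-coded, and takes care of windowing and event routing.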

Kafka

Kafka, open-sourced by LinkedIn in December 2010, is a high-throughput, distributed publish-subscribe messaging system used primarily to handle active streaming data. It is written in Scala. Project link: http://incubator.apache.org/kafka
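The publish-subscribe model can be sketched with a toy in-memory broker in which each topic is an append-only log and every subscriber tracks its own read offset. This is only a conceptual model: `ToyBroker`, `ToySubscriber`, and the topic name are hypothetical, nothing here is the Kafka client API, and partitioning, persistence, and replication are ignored.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def fetch(self, topic, offset, max_messages=10):
        """Return up to max_messages messages starting at the given offset."""
        return self._logs[topic][offset:offset + max_messages]


class ToySubscriber:
    """Each subscriber keeps its own offset, so subscribers read independently."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        """Fetch messages not yet seen by this subscriber and advance its offset."""
        batch = self.broker.fetch(self.topic, self.offset)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
analytics = ToySubscriber(broker, "page-views")
archiver = ToySubscriber(broker, "page-views")

broker.publish("page-views", "user=1 url=/home")
broker.publish("page-views", "user=2 url=/search")

first_batch = analytics.poll()    # analytics reads the two messages so far
broker.publish("page-views", "user=1 url=/cart")
second_batch = analytics.poll()   # only the newly published message
archive_batch = archiver.poll()   # archiver has not read yet, so it gets all three
```

Because the broker never deletes a message on delivery in this model, a slow or late subscriber does not affect the others.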

Scribe

Scribe is Facebook's open-source log collection system. It is written in C++ and supports clients in many languages through Thrift. It aggregates logs from many sources into central storage (for example, NFS or a distributed file system) for centralized analysis, and it is often used together with Hadoop for downstream processing. Project link: http://github.com/facebook/scribe

Flume

Flume, provided by Cloudera, is a distributed, reliable, and highly available log collection system for gathering, aggregating, and moving large volumes of log data. It is implemented in Java. It supports custom data sources and sinks, and it can apply simple processing to data before delivering it to various destinations. Project link: http://incubator.apache.org/flume
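The source, simple processing, sink flow described above can be sketched as a small pipeline. This is a conceptual model in plain Python, not Flume's configuration format or Java API; the function and class names, the host name `web01`, and the sample log lines are hypothetical.

```python
def list_source(lines):
    """Hypothetical custom source: yields raw log lines from an in-memory list."""
    for line in lines:
        yield line


def drop_debug(events):
    """Simple processing step: filter out DEBUG-level lines."""
    for event in events:
        if " DEBUG " not in event:
            yield event


def add_host(events, host):
    """Simple processing step: prefix each event with the originating host."""
    for event in events:
        yield f"{host} {event}"


class ListSink:
    """Hypothetical custom sink: collects delivered events in memory."""

    def __init__(self):
        self.delivered = []

    def write(self, event):
        self.delivered.append(event)


def run_pipeline(source, steps, sinks):
    """Pass events from the source through each step, then deliver to every sink."""
    stream = source
    for step in steps:
        stream = step(stream)
    count = 0
    for event in stream:
        for sink in sinks:
            sink.write(event)
        count += 1
    return count


raw_lines = [
    "12:00:01 INFO started",
    "12:00:02 DEBUG cache miss",
    "12:00:03 ERROR disk full",
]
sink = ListSink()
delivered_count = run_pipeline(
    list_source(raw_lines),
    steps=[drop_debug, lambda events: add_host(events, "web01")],
    sinks=[sink],
)
```

Swapping `list_source` for a file tailer or `ListSink` for a file or network writer would not change `run_pipeline`, which is the point of keeping sources and sinks pluggable.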

