Big Data 14 min read

Master Apache Storm: Real‑Time Stream Processing from Basics to Word‑Count and Call‑Log Examples

This tutorial explains Apache Storm’s core principles, architecture, and development workflow, covering its relationship with Hadoop, key concepts such as spouts, bolts, tuples, and topologies, and provides step‑by‑step code examples for a word‑count program and a call‑log analysis application.

dbaplus Community
dbaplus Community
dbaplus Community
Master Apache Storm: Real‑Time Stream Processing from Basics to Word‑Count and Call‑Log Examples

What Is Apache Storm?

Apache Storm is a distributed real‑time stream processing system designed for high reliability, fault tolerance, and scalability, enabling the processing of massive data streams with low latency.

Typical Use Cases

Log processing : Ingest event logs, filter them, and store matching records in a database.

E‑commerce recommendation : Analyze user behavior streams to generate personalized product recommendations.

Storm vs. Hadoop

While Hadoop excels at batch processing, Storm provides real‑time processing capabilities without built‑in storage. The two systems complement each other: Storm handles low‑latency streams, Hadoop handles large‑scale batch jobs.

How Storm Works

Storm programs consist of two main components:

Spout : The data source that reads external streams and emits tuples.

Bolt : The processing unit that receives tuples, performs transformations, aggregations, or database interactions, and optionally emits new tuples.

A topology connects spouts and bolts into a directed graph that defines the data flow.

Storm topology diagram
Storm topology diagram

Multiple bolts can be attached to a spout, creating flexible processing pipelines.

Bolt connections
Bolt connections

Building a Topology

Define where the data comes from (spout).

Define the processing logic and data flow between bolts.

Define where the processed data should be sent.

Example 1 – Word Count

This classic example demonstrates how to count word occurrences in a text stream.

Implementation Idea

Spout reads each line of text and emits it.

Split bolt tokenizes the line into words and emits each word.

Count bolt aggregates the occurrences of each word.

Report bolt outputs the final counts.

The topology wires the spout and bolts together and submits it to Storm.

Project Setup

Create a Maven project named storm-wordcount with a pom.xml that includes Storm dependencies.

Run the following commands in the project root:

mvn package
mvn dependency:tree

Directory structure:

Project directory layout
Project directory layout

Key source files (shown as images in the original article): SpoutWords.java, BoltSplit.java, BoltCount.java, BoltReport.java, and WordCountTopology.java.

After building, run the topology; the console will display log messages followed by periodic word‑count results.

Example 2 – Call‑Log Statistics

This example processes call‑log records to compute call counts and total durations for each caller‑callee pair.

Implementation Idea

Spout reads raw call‑log lines (caller, callee, duration) and emits tuples.

Bolt1 reformats the data into from‑to keys.

Bolt2 aggregates counts and total duration per from‑to pair.

The topology connects the spout and bolts and submits the job.

Project Setup

Create a Maven project named storm-mobile with a pom.xml similar to the first example.

Run mvn package and mvn dependency:tree to install dependencies.

Key source files (shown as images): FakeCallLogReaderSpout.java, CallLogCreatorBolt.java, CallLogCounterBolt.java, and LogAnalyserStorm.java.

Execution produces extensive logs and finally prints aggregated call statistics.

Core Concepts

Spout – Data source : Reads external streams (e.g., Kafka, Twitter) and emits tuples.

Bolt – Processing unit : Receives tuples, applies logic (filter, aggregate, join, etc.), and may emit new tuples.

Tuple – Data unit : The basic record flowing through the topology.

Stream – Sequence of tuples : An unordered collection of tuples.

Topology – Directed graph : Defines the overall data‑flow logic of a Storm application.

Task – Execution of a spout or bolt : The smallest unit of work.

Worker – JVM process : Executes a set of tasks; multiple workers run across the cluster.

Executor – Thread inside a worker : Runs one or more tasks.

Cluster Architecture

A Storm cluster consists of a master node running the Nimbus daemon and multiple worker nodes running Supervisor daemons. Nimbus distributes code, assigns tasks, and monitors failures, while Supervisors launch and manage worker processes. Coordination is handled by a ZooKeeper ensemble.

Storm cluster diagram
Storm cluster diagram

Stream Grouping Strategies

Grouping determines how tuples are routed between tasks. Common strategies include:

Shuffle Grouping : Randomly distributes tuples evenly across target tasks.

Field Grouping : Routes tuples with the same field value to the same task (e.g., all tuples with the same user-id go to one task).

Global Grouping : Sends all tuples to a single target task.

All Grouping : Broadcasts each tuple to every target task.

In the word‑count example, bolt‑split uses shuffle grouping, bolt‑count uses fields grouping on the word field, and bolt‑report uses global grouping to aggregate results.

Key Takeaways

Spouts connect external data sources to the topology; their nextTuple() method continuously emits data.

Bolts implement processing logic in the execute() method.

The topology wires spouts and bolts, defines data flow, and is submitted to the Storm cluster for execution.

Understanding these fundamentals provides a solid foundation for building more complex real‑time analytics with Apache Storm.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataReal-time Processingstream computingApache Stormword countcall log analysis
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.