Master Apache Storm: Real‑Time Stream Processing from Basics to Word‑Count and Call‑Log Examples
This tutorial explains Apache Storm's core principles, architecture, and development workflow. It covers Storm's relationship with Hadoop and key concepts such as spouts, bolts, tuples, and topologies, then walks step by step through a word‑count program and a call‑log analysis application.
What Is Apache Storm?
Apache Storm is a distributed real‑time stream processing system designed for high reliability, fault tolerance, and scalability, enabling the processing of massive data streams with low latency.
Typical Use Cases
Log processing : Ingest event logs, filter them, and store matching records in a database.
E‑commerce recommendation : Analyze user behavior streams to generate personalized product recommendations.
Storm vs. Hadoop
While Hadoop excels at batch processing, Storm provides real‑time processing capabilities without built‑in storage. The two systems complement each other: Storm handles low‑latency streams, Hadoop handles large‑scale batch jobs.
How Storm Works
Storm programs consist of two main components:
Spout : The data source that reads external streams and emits tuples.
Bolt : The processing unit that receives tuples, performs transformations, aggregations, or database interactions, and optionally emits new tuples.
A topology connects spouts and bolts into a directed graph that defines the data flow.
Multiple bolts can be attached to a spout, creating flexible processing pipelines.
Building a Topology
Define where the data comes from (spout).
Define the processing logic and data flow between bolts.
Define where the processed data should be sent.
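The three steps above can be modeled in a few lines of plain Java. This is only a sketch of the data flow, not Storm API code: MiniSpout, MiniBolt, and filterErrors are hypothetical names, and the "ERROR" filter is an assumed example based on the log-processing use case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Plain-Java model of the spout -> bolt -> sink flow. These types are
// hypothetical stand-ins, not classes from the Storm API.
class MiniTopology {

    /** Source: hands out one record per call, or null when nothing is left. */
    interface MiniSpout {
        String nextTuple();
    }

    /** Processing step: receives one record and may emit zero or more records downstream. */
    interface MiniBolt {
        void execute(String tuple, Consumer<String> emit);
    }

    /** Pulls records from the spout and pushes each one through the bolts in order. */
    static void run(MiniSpout spout, List<MiniBolt> bolts) {
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            pass(tuple, bolts, 0);
        }
    }

    private static void pass(String tuple, List<MiniBolt> bolts, int index) {
        if (index == bolts.size()) {
            return; // nothing downstream of the last bolt
        }
        bolts.get(index).execute(tuple, out -> pass(out, bolts, index + 1));
    }

    /** Wires a log source to a filter step and a storage step, then returns what was stored. */
    static List<String> filterErrors(List<String> logLines) {
        Queue<String> pending = new ArrayDeque<>(logLines);
        List<String> stored = new ArrayList<>(); // stands in for a database table

        MiniSpout logSpout = pending::poll;              // 1. where the data comes from
        MiniBolt filterBolt = (line, emit) -> {          // 2. processing logic
            if (line.contains("ERROR")) {
                emit.accept(line);
            }
        };
        MiniBolt storeBolt = (line, emit) -> stored.add(line); // 3. where results go

        run(logSpout, Arrays.asList(filterBolt, storeBolt));
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(filterErrors(Arrays.asList(
                "INFO service started",
                "ERROR disk full",
                "ERROR timeout calling billing")));
    }
}
```

In a real Storm project the same roles are filled by spout and bolt classes that are wired together with a TopologyBuilder and run in parallel across the cluster, whereas this sketch runs everything in one thread.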
Example 1 – Word Count
This classic example demonstrates how to count word occurrences in a text stream.
Implementation Idea
Spout reads each line of text and emits it.
Split bolt tokenizes the line into words and emits each word.
Count bolt aggregates the occurrences of each word.
Report bolt outputs the final counts.
The topology wires the spout and bolts together and submits it to Storm.
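Before looking at the Storm classes, it may help to see the split, count, and report logic on its own. The following is a plain-Java sketch with hypothetical helper names and no Storm dependency. Lowercasing words and splitting on whitespace are assumptions made for illustration, and the original source files may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the split -> count -> report steps.
// Hypothetical helper names; no Storm classes are used here.
class WordCountSketch {

    /** Split step: breaks one line into lowercase words, dropping empty tokens. */
    static List<String> split(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Count step: keeps a running total per word, as a count bolt would in its own state. */
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys keep the report stable
        for (String line : lines) {           // each line is what the spout would emit
            for (String word : split(line)) { // each word is what the split step would emit
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Report step: formats the counts as "word : n" lines. */
    static String report(Map<String, Long> counts) {
        StringBuilder out = new StringBuilder();
        counts.forEach((word, n) -> out.append(word).append(" : ").append(n).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the quick brown fox",
                "the lazy dog",
                "The fox");
        System.out.print(report(count(lines)));
    }
}
```

In the Storm version each of these methods becomes the body of a separate bolt, so the splitting and counting can run in parallel on different tasks.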
Project Setup
Create a Maven project named storm-wordcount with a pom.xml that includes Storm dependencies.
Run the following commands in the project root:
mvn package
mvn dependency:tree
Directory structure (shown as an image in the original article).
Key source files (shown as images in the original article): SpoutWords.java, BoltSplit.java, BoltCount.java, BoltReport.java, and WordCountTopology.java.
After building, run the topology; the console will display log messages followed by periodic word‑count results.
Example 2 – Call‑Log Statistics
This example processes call‑log records to compute call counts and total durations for each caller‑callee pair.
Implementation Idea
Spout reads raw call‑log lines (caller, callee, duration) and emits tuples.
Bolt1 reformats the data into from‑to keys.
Bolt2 aggregates counts and total duration per from‑to pair.
The topology connects the spout and bolts and submits the job.
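The reformatting and aggregation steps can be sketched without Storm as follows. The comma-separated record layout, the "caller - callee" key format, and the names CallLogSketch, Stats, toKey, and aggregate are all assumptions for illustration, since the original source files are not reproduced here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two bolt steps. The record layout
// "caller,callee,durationSeconds" is an assumption for illustration.
class CallLogSketch {

    /** Running totals for one caller-callee pair. */
    static final class Stats {
        long calls;
        long totalDuration;
    }

    /** First step: builds the from-to key for one record. */
    static String toKey(String caller, String callee) {
        return caller + " - " + callee;
    }

    /** Second step: accumulates call count and total duration per from-to key. */
    static Map<String, Stats> aggregate(List<String> lines) {
        Map<String, Stats> byPair = new LinkedHashMap<>(); // keeps first-seen order
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // skip malformed records instead of failing the whole run
            }
            String key = toKey(fields[0].trim(), fields[1].trim());
            long duration = Long.parseLong(fields[2].trim());
            Stats stats = byPair.computeIfAbsent(key, k -> new Stats());
            stats.calls++;
            stats.totalDuration += duration;
        }
        return byPair;
    }

    public static void main(String[] args) {
        Map<String, Stats> result = aggregate(Arrays.asList(
                "1234123401,1234123402,30",
                "1234123401,1234123402,45",
                "1234123403,1234123401,10"));
        result.forEach((pair, s) ->
                System.out.println(pair + " : calls=" + s.calls + ", duration=" + s.totalDuration));
    }
}
```

A non-numeric duration would raise a NumberFormatException in this sketch. A production bolt would need to decide whether to drop, log, or fail such records.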
Project Setup
Create a Maven project named storm-mobile with a pom.xml similar to the first example.
Run mvn package to build the project, then mvn dependency:tree to inspect the resolved dependencies.
Key source files (shown as images): FakeCallLogReaderSpout.java, CallLogCreatorBolt.java, CallLogCounterBolt.java, and LogAnalyserStorm.java.
Execution produces extensive logs and finally prints aggregated call statistics.
Core Concepts
Spout – Data source : Reads external streams (e.g., Kafka, Twitter) and emits tuples.
Bolt – Processing unit : Receives tuples, applies logic (filter, aggregate, join, etc.), and may emit new tuples.
Tuple – Data unit : The basic record flowing through the topology.
Stream – Sequence of tuples : An unbounded sequence of tuples.
Topology – Directed graph : Defines the overall data‑flow logic of a Storm application.
Task – Instance of a spout or bolt : The smallest unit of work; each task runs inside an executor.
Worker – JVM process : Executes a set of tasks; multiple workers run across the cluster.
Executor – Thread inside a worker : Runs one or more tasks.
Cluster Architecture
A Storm cluster consists of a master node running the Nimbus daemon and multiple worker nodes running Supervisor daemons. Nimbus distributes code, assigns tasks, and monitors failures, while Supervisors launch and manage worker processes. Coordination is handled by a ZooKeeper ensemble.
Stream Grouping Strategies
Grouping determines how tuples are routed between tasks. Common strategies include:
Shuffle Grouping : Distributes tuples randomly across target tasks so that each task receives a roughly equal share.
Fields Grouping : Routes tuples with the same field value to the same task (e.g., all tuples with the same user-id go to one task).
Global Grouping : Sends all tuples to a single target task.
All Grouping : Broadcasts each tuple to every target task.
In the word‑count example, bolt‑split uses shuffle grouping, bolt‑count uses fields grouping on the word field, and bolt‑report uses global grouping to aggregate results.
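To make the routing rules concrete, here is a plain-Java model of how each grouping might choose target tasks. It is a sketch under stated assumptions: task indexes starting at 0, hash-modulo selection for fields grouping, and the method names are illustrative, and Storm's internal routing code is not reproduced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java model of how each grouping could pick target tasks.
// Task indexes and method names are hypothetical.
class GroupingSketch {

    /** Shuffle: any task may receive the tuple, chosen at random. */
    static List<Integer> shuffle(int taskCount, Random random) {
        return Collections.singletonList(random.nextInt(taskCount));
    }

    /** Fields: equal field values always map to the same task (hash modulo task count). */
    static List<Integer> fields(Object fieldValue, int taskCount) {
        return Collections.singletonList(Math.floorMod(fieldValue.hashCode(), taskCount));
    }

    /** Global: every tuple goes to one fixed task (index 0 in this sketch). */
    static List<Integer> global() {
        return Collections.singletonList(0);
    }

    /** All: every task receives a copy of the tuple. */
    static List<Integer> all(int taskCount) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < taskCount; task++) {
            targets.add(task);
        }
        return targets;
    }

    public static void main(String[] args) {
        int countTasks = 4;
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            // The same word is always routed to the same count task.
            System.out.println(word + " -> count task " + fields(word, countTasks));
        }
        System.out.println("broadcast targets: " + all(countTasks));
    }
}
```

This is why the count step relies on fields grouping: every occurrence of a word reaches the task that holds that word's running total. When wiring a real topology, the corresponding choices are made with the shuffleGrouping, fieldsGrouping, globalGrouping, and allGrouping methods.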
Key Takeaways
Spouts connect external data sources to the topology; Storm calls their nextTuple() method repeatedly to emit data.
Bolts implement processing logic in the execute() method.
The topology wires spouts and bolts, defines data flow, and is submitted to the Storm cluster for execution.
Understanding these fundamentals provides a solid foundation for building more complex real‑time analytics with Apache Storm.
dbaplus Community
