Mastering Apache Storm: Architecture, Components, and Real‑Time Processing Essentials
This article provides an in‑depth technical overview of Apache Storm, covering its core architecture, key components such as Nimbus, Supervisor, Worker, Executor, and Task, the role of ZooKeeper, high‑availability setup, API interfaces, code examples, grouping strategies, metrics, back‑pressure handling, and essential configuration parameters for building low‑latency stream processing topologies.
Overview
Apache Storm is a first‑generation real‑time stream processing framework that provides ultra‑low latency and high throughput. It is still widely deployed in production data platforms.
Key Characteristics
Simple programming model
Scalable across many nodes
High reliability and fault tolerance
Supports multiple languages (Java, Clojure, Python, etc.)
Local‑mode execution for development
Core Concepts
Nimbus : Global master that receives client submissions, parses topologies, and writes assignments to ZooKeeper.
Supervisor : Stateless daemon on each node that registers workers in ZooKeeper, receives assignments from Nimbus and launches workers.
Worker : Process launched by a supervisor; the smallest resource unit that runs one or more executors.
Executor : Thread inside a worker that runs one or more tasks of a single spout or bolt component. The number of executors is derived from the component's parallelism setting.
Task : Instance of a spout's or bolt's logic; tasks run inside an executor (often 1:1, but one executor may serially run several tasks of the same component).
Spout : Source of streams in a topology.
Bolt : Processing unit that consumes streams, performs transformations, or persists results.
Topology : Directed acyclic graph (DAG) of spouts and bolts.
Architecture Flow
Nimbus parses the submitted topology and stores component assignments in ZooKeeper.
Supervisors register their host information and available worker slots in ZooKeeper and wait for assignments.
When a worker is idle, the supervisor starts it and the worker fetches its assigned executors from ZooKeeper.
Executors instantiate the spout or bolt logic; each worker may host multiple executors.
Tasks run the actual user code.
ZooKeeper Paths
/storm/workerbeats/<topology-id>/... – heartbeat information for each worker.
/storm/storms/<topology-id> – topology metadata (StormBase).
/storm/assignments/<topology-id> – assignment data for each topology.
/storm/supervisors/<supervisor-id> – supervisor registration data.
/storm/errors/<topology-id> – component errors displayed in the UI.
Nimbus High Availability
Multiple Nimbus instances can run concurrently; ZooKeeper-based leader election makes one the active leader and the others standbys. Failover is automatic when the leader dies.
Supervisor Details
Stateless daemon that monitors workers, reports heartbeats, and receives task assignments from Nimbus.
Worker Details
Workers are launched by supervisors, retrieve their assigned executors from ZooKeeper, and run them. Within the same worker, executors exchange tuples through in-memory queues; across workers, tuples are transferred over the network via Netty (ZeroMQ in early Storm versions).
Executor and Task Configuration
Executor count per worker is derived from the total parallelism and the number of workers. Example: 10 workers with a total parallelism of 20 results in 2 executors per worker. By default each executor runs one task; a different task count can be set per component with setNumTasks() when building the topology.
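The distribution arithmetic above can be sketched in plain Java. This is a standalone illustration of the even, round-robin spreading that Storm's default scheduler performs, not the scheduler's actual code:

```java
import java.util.Arrays;

public class ExecutorDistribution {
    // Evenly spread `totalParallelism` executors across `numWorkers` workers,
    // the way Storm's default scheduler balances load.
    static int[] distribute(int totalParallelism, int numWorkers) {
        int[] perWorker = new int[numWorkers];
        for (int i = 0; i < totalParallelism; i++) {
            perWorker[i % numWorkers]++;   // round-robin assignment
        }
        return perWorker;
    }

    public static void main(String[] args) {
        // 10 workers, total parallelism 20 -> 2 executors per worker
        System.out.println(Arrays.toString(distribute(20, 10)));
    }
}
```

With an uneven split (say parallelism 23 over 10 workers), the first three workers simply get one extra executor each.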
API Components
Spout and Bolt interfaces (Java example):
public interface ISpout extends Serializable {
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
}
public interface IBolt extends Serializable {
void prepare(Map conf, TopologyContext context, OutputCollector collector);
void execute(Tuple input);
void cleanup();
}
Note: declareOutputFields(OutputFieldsDeclarer declarer) is declared on the IComponent interface, which the commonly used IRichSpout and IRichBolt interfaces extend; it is not part of IBolt itself.
Message emission from a spout:
_collector.emit(new Values("field1", "field2", 3), msgId);
Anchoring example for tuple tracking:
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(1, 2, 3));
Sample Topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com", 22133, "sentence_queue", new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10).shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20).fieldsGrouping("split", new Fields("word"));
Metrics
emitted – total tuples emitted.
acked – tuples successfully processed.
failed – tuples that timed out or failed.
capacity – fraction of the measurement window the executor spends executing tuples; values approaching 1.0 indicate the component is a bottleneck and needs more parallelism.
execute latency – average time spent processing a tuple.
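The capacity metric above is derived roughly as executed count times average execute latency divided by the window length. A small sketch of that arithmetic (the 10-minute window matches Storm UI's default reporting window; the sample numbers are illustrative):

```java
public class CapacityMetric {
    // capacity ≈ (tuples executed * avg execute latency in ms) / window length in ms.
    // A value near 1.0 means the executor spends almost all of its time executing.
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return (executed * executeLatencyMs) / windowMs;
    }

    public static void main(String[] args) {
        // 500,000 tuples at 1.1 ms each over a 10-minute (600,000 ms) window
        System.out.println(capacity(500_000, 1.1, 600_000)); // ≈ 0.92 -> near saturation
    }
}
```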
Grouping Strategies
shuffleGrouping – random distribution across target tasks.
fieldsGrouping – deterministic routing based on field values.
allGrouping – broadcast to all target tasks.
noneGrouping – same as shuffle (random).
localOrShuffleGrouping – prefers local executors before falling back to random.
globalGrouping – routes all tuples to the task with the lowest ID.
directGrouping – sender specifies the exact target task ID.
customGrouping – user‑defined routing logic.
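The deterministic routing behind fieldsGrouping can be illustrated with a hash-modulo sketch. This mirrors the principle (equal field values always reach the same task), not Storm's exact internal implementation:

```java
public class FieldsGroupingSketch {
    // Route a tuple to a target task index based on the grouping field's hash,
    // so tuples with equal field values always land on the same task.
    static int targetTask(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        // The same word is always routed to the same WordCount task.
        System.out.println(targetTask("storm", 20) == targetTask("storm", 20)); // true
    }
}
```

This is why fieldsGrouping is the natural choice for the word-count bolt in the sample topology: all occurrences of a given word are counted by one task.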
Ack Mechanism
Spouts emit tuples with a message ID. Bolts forward tuples (optionally anchoring them). An internal acker bolt tracks the tree of tuples; when all anchored tuples are acked, the original tuple is considered fully processed.
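Internally, the acker tracks each tuple tree with a single 64-bit XOR: every tuple id is XOR'd in when the tuple is anchored and XOR'd again when it is acked, so the running value returns to zero exactly when the whole tree has been processed. A minimal sketch of that bookkeeping (the ids here are arbitrary sample values):

```java
public class AckerSketch {
    long ackVal = 0;                  // running XOR of outstanding tuple ids

    void emitted(long tupleId) { ackVal ^= tupleId; }  // tuple anchored into the tree
    void acked(long tupleId)   { ackVal ^= tupleId; }  // tuple processed by a bolt

    boolean fullyProcessed()   { return ackVal == 0; } // every id XOR'd exactly twice

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 0x1234L, child1 = 0x5678L, child2 = 0x9abcL;
        acker.emitted(root);                           // spout emits the root tuple
        acker.emitted(child1);                         // bolt anchors two children...
        acker.emitted(child2);
        acker.acked(root);                             // ...and acks the root
        acker.acked(child1);
        acker.acked(child2);
        System.out.println(acker.fullyProcessed());    // true
    }
}
```

The XOR trick is what lets Storm track millions of in-flight tuple trees with constant memory per tree.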
Back‑Pressure
If an executor’s input queue exceeds the high‑water mark, a back‑pressure thread notifies ZooKeeper. All spouts listening to the back‑pressure event throttle tuple emission, preventing overload.
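The high-water-mark mechanics can be sketched with a bounded queue and a throttle flag. This is a simplified stand-in for Storm's executor receive queue and the ZooKeeper back-pressure node; the watermark fractions are illustrative, not Storm's exact defaults:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BackPressureSketch {
    final Deque<Object> queue = new ArrayDeque<>();
    final int highWaterMark, lowWaterMark;
    boolean throttled = false;        // stands in for the ZooKeeper back-pressure flag

    BackPressureSketch(int capacity, double high, double low) {
        this.highWaterMark = (int) (capacity * high);
        this.lowWaterMark = (int) (capacity * low);
    }

    void offer(Object tuple) {
        queue.add(tuple);
        if (queue.size() >= highWaterMark) throttled = true;   // spouts slow down
    }

    Object poll() {
        Object t = queue.poll();
        if (queue.size() <= lowWaterMark) throttled = false;   // spouts resume
        return t;
    }
}
```

The gap between the high and low watermarks provides hysteresis, so spouts are not throttled and released on every tuple.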
Configuration Parameters
Server‑side (storm.yaml)
storm.zookeeper.servers – list of ZooKeeper hosts.
storm.zookeeper.port – ZooKeeper client port.
nimbus.seeds – Nimbus host list for client connections.
storm.zookeeper.connection.timeout – connection timeout (ms).
storm.zookeeper.session.timeout – session timeout (ms).
storm.messaging.netty.client_worker_threads / storm.messaging.netty.server_worker_threads – Netty thread pool sizes.
storm.local.dir – local storage directory.
storm.log.dir – log directory.
supervisor.slots.ports – list of ports used for worker slots.
nimbus.childopts, supervisor.childopts, worker.childopts – JVM options for each process.
Topology‑level
topology.workers – number of worker processes.
topology.max.spout.pending – maximum pending (un-acked) tuples per spout.
topology.backpressure.enable – enable/disable back‑pressure (default true).
topology.acker.executors – number of acker executors.
Component parallelism is not a storm.yaml key; it is the parallelism hint passed to setSpout()/setBolt() when building the topology.
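A minimal storm.yaml combining several of the server-side settings above (host names and paths are placeholders):

```yaml
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
storm.zookeeper.port: 2181
nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:   # four slots -> up to four workers per supervisor node
  - 6700
  - 6701
  - 6702
  - 6703
worker.childopts: "-Xmx1g"
```

The number of entries in supervisor.slots.ports caps how many worker processes each supervisor node can host.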
