Mastering Apache Storm: Architecture, Components, and Real‑Time Processing Essentials
This article provides an in‑depth technical overview of Apache Storm, covering its core architecture, key components such as Nimbus, Supervisor, Worker, Executor, and Task, the role of ZooKeeper, high‑availability setup, API interfaces, code examples, grouping strategies, metrics, back‑pressure handling, and essential configuration parameters for building low‑latency stream processing topologies.
Overview
Apache Storm is a first‑generation real‑time stream processing framework that provides ultra‑low latency and high throughput. It is still widely deployed in production data platforms.
Key Characteristics
Simple programming model
Scalable across many nodes
High reliability and fault tolerance
Supports multiple languages (Java, Clojure, Python, etc.)
Local‑mode execution for development
Core Concepts
Nimbus : Global master that receives client submissions, parses topologies, and writes assignments to ZooKeeper.
Supervisor : Stateless daemon on each node that registers workers in ZooKeeper, receives assignments from Nimbus and launches workers.
Worker : Process launched by a supervisor; the smallest resource unit that runs one or more executors.
Executor : Thread inside a worker that runs one or more tasks of a single spout or bolt component. The number of executors is derived from the component's parallelism setting.
Task : Instance of a spout's or bolt's logic; tasks run inside an executor (often 1:1, but one executor may serially run several tasks of the same component).
Spout : Source of streams in a topology.
Bolt : Processing unit that consumes streams, performs transformations, or persists results.
Topology : Directed acyclic graph (DAG) of spouts and bolts.
Architecture Flow
Nimbus parses the submitted topology and stores component assignments in ZooKeeper.
Supervisors register their host information and available worker slots in ZooKeeper and wait for assignments.
When a worker is idle, the supervisor starts it and the worker fetches its assigned executors from ZooKeeper.
Executors instantiate the spout or bolt logic; each worker may host multiple executors.
Tasks run the actual user code.
ZooKeeper Paths
/storm/workerbeats/<topology-id>/... – heartbeat information for each worker.
/storm/storms/<topology-id> – topology metadata (StormBase).
/storm/assignments/<topology-id> – assignment data for each topology.
/storm/supervisors/<supervisor-id> – supervisor registration data.
/storm/errors/<topology-id> – component errors displayed in the UI.
Nimbus High Availability
Multiple Nimbus instances can run concurrently; ZooKeeper-based leader election makes one the active leader and the others standbys. Failover is automatic when the leader dies.
Supervisor Details
Stateless daemon that monitors workers, reports heartbeats, and receives task assignments from Nimbus.
Worker Details
Workers are launched by supervisors, retrieve their assigned executors from ZooKeeper, and run them. Within the same worker, executors exchange tuples through in-memory queues; across workers, tuples are transferred over the network via Netty (ZeroMQ in early Storm versions).
Executor and Task Configuration
Executor count per worker is derived from the total parallelism and the number of workers. Example: 10 workers with a total parallelism of 20 results in 2 executors per worker. By default each executor runs one task; a different task count can be set per component with setNumTasks() when building the topology.
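The distribution arithmetic above can be sketched in plain Java. This is a standalone illustration of the even, round-robin spreading that Storm's default scheduler performs, not the scheduler's actual code:

```java
import java.util.Arrays;

public class ExecutorDistribution {
    // Evenly spread `totalParallelism` executors across `numWorkers` workers,
    // the way Storm's default scheduler balances load.
    static int[] distribute(int totalParallelism, int numWorkers) {
        int[] perWorker = new int[numWorkers];
        for (int i = 0; i < totalParallelism; i++) {
            perWorker[i % numWorkers]++;   // round-robin assignment
        }
        return perWorker;
    }

    public static void main(String[] args) {
        // 10 workers, total parallelism 20 -> 2 executors per worker
        System.out.println(Arrays.toString(distribute(20, 10)));
    }
}
```

With an uneven split (say parallelism 23 over 10 workers), the first three workers simply get one extra executor each.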
API Components
Spout and Bolt interfaces (Java example):
public interface ISpout extends Serializable {
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
}
public interface IBolt extends Serializable {
void prepare(Map conf, TopologyContext context, OutputCollector collector);
void execute(Tuple input);
void cleanup();
}
Note: declareOutputFields(OutputFieldsDeclarer declarer) is declared on the IComponent interface, which the commonly used IRichSpout and IRichBolt interfaces extend; it is not part of IBolt itself.
Message emission from a spout:
_collector.emit(new Values("field1", "field2", 3), msgId);
Anchoring example for tuple tracking:
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(1, 2, 3));
Sample Topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com", 22133, "sentence_queue", new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10).shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20).fieldsGrouping("split", new Fields("word"));
Metrics
emitted – total tuples emitted.
acked – tuples successfully processed.
failed – tuples that timed out or failed.
capacity – fraction of the measurement window the executor spends executing tuples; values approaching 1.0 indicate the component is a bottleneck and needs more parallelism.
execute latency – average time spent processing a tuple.
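The capacity metric above is derived roughly as executed count times average execute latency divided by the window length. A small sketch of that arithmetic (the 10-minute window matches Storm UI's default reporting window; the sample numbers are illustrative):

```java
public class CapacityMetric {
    // capacity ≈ (tuples executed * avg execute latency in ms) / window length in ms.
    // A value near 1.0 means the executor spends almost all of its time executing.
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return (executed * executeLatencyMs) / windowMs;
    }

    public static void main(String[] args) {
        // 500,000 tuples at 1.1 ms each over a 10-minute (600,000 ms) window
        System.out.println(capacity(500_000, 1.1, 600_000)); // ≈ 0.92 -> near saturation
    }
}
```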
Grouping Strategies
shuffleGrouping – random distribution across target tasks.
fieldsGrouping – deterministic routing based on field values.
allGrouping – broadcast to all target tasks.
noneGrouping – same as shuffle (random).
localOrShuffleGrouping – prefers local executors before falling back to random.
globalGrouping – routes all tuples to the task with the lowest ID.
directGrouping – sender specifies the exact target task ID.
customGrouping – user‑defined routing logic.
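The deterministic routing behind fieldsGrouping can be illustrated with a hash-modulo sketch. This mirrors the principle (equal field values always reach the same task), not Storm's exact internal implementation:

```java
public class FieldsGroupingSketch {
    // Route a tuple to a target task index based on the grouping field's hash,
    // so tuples with equal field values always land on the same task.
    static int targetTask(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        // The same word is always routed to the same WordCount task.
        System.out.println(targetTask("storm", 20) == targetTask("storm", 20)); // true
    }
}
```

This is why fieldsGrouping is the natural choice for the word-count bolt in the sample topology: all occurrences of a given word are counted by one task.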
Ack Mechanism
Spouts emit tuples with a message ID. Bolts forward tuples (optionally anchoring them). An internal acker bolt tracks the tree of tuples; when all anchored tuples are acked, the original tuple is considered fully processed.
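Internally, the acker tracks each tuple tree with a single 64-bit XOR: every tuple id is XOR'd in when the tuple is anchored and XOR'd again when it is acked, so the running value returns to zero exactly when the whole tree has been processed. A minimal sketch of that bookkeeping (the ids here are arbitrary sample values):

```java
public class AckerSketch {
    long ackVal = 0;                  // running XOR of outstanding tuple ids

    void emitted(long tupleId) { ackVal ^= tupleId; }  // tuple anchored into the tree
    void acked(long tupleId)   { ackVal ^= tupleId; }  // tuple processed by a bolt

    boolean fullyProcessed()   { return ackVal == 0; } // every id XOR'd exactly twice

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 0x1234L, child1 = 0x5678L, child2 = 0x9abcL;
        acker.emitted(root);                           // spout emits the root tuple
        acker.emitted(child1);                         // bolt anchors two children...
        acker.emitted(child2);
        acker.acked(root);                             // ...and acks the root
        acker.acked(child1);
        acker.acked(child2);
        System.out.println(acker.fullyProcessed());    // true
    }
}
```

The XOR trick is what lets Storm track millions of in-flight tuple trees with constant memory per tree.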
Back‑Pressure
If an executor’s input queue exceeds the high‑water mark, a back‑pressure thread notifies ZooKeeper. All spouts listening to the back‑pressure event throttle tuple emission, preventing overload.
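The high-water-mark mechanics can be sketched with a bounded queue and a throttle flag. This is a simplified stand-in for Storm's executor receive queue and the ZooKeeper back-pressure node; the watermark fractions are illustrative, not Storm's exact defaults:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BackPressureSketch {
    final Deque<Object> queue = new ArrayDeque<>();
    final int highWaterMark, lowWaterMark;
    boolean throttled = false;        // stands in for the ZooKeeper back-pressure flag

    BackPressureSketch(int capacity, double high, double low) {
        this.highWaterMark = (int) (capacity * high);
        this.lowWaterMark = (int) (capacity * low);
    }

    void offer(Object tuple) {
        queue.add(tuple);
        if (queue.size() >= highWaterMark) throttled = true;   // spouts slow down
    }

    Object poll() {
        Object t = queue.poll();
        if (queue.size() <= lowWaterMark) throttled = false;   // spouts resume
        return t;
    }
}
```

The gap between the high and low watermarks provides hysteresis, so spouts are not throttled and released on every tuple.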
Configuration Parameters
Server‑side (storm.yaml)
storm.zookeeper.servers – list of ZooKeeper hosts.
storm.zookeeper.port – ZooKeeper client port.
nimbus.seeds – Nimbus host list for client connections.
storm.zookeeper.connection.timeout – connection timeout (ms).
storm.zookeeper.session.timeout – session timeout (ms).
storm.messaging.netty.client_worker_threads / storm.messaging.netty.server_worker_threads – Netty thread pool sizes.
storm.local.dir – local storage directory.
storm.log.dir – log directory.
supervisor.slots.ports – list of ports used for worker slots.
nimbus.childopts, supervisor.childopts, worker.childopts – JVM options for each process.
Topology‑level
topology.workers – number of worker processes.
topology.max.spout.pending – maximum pending (un-acked) tuples per spout.
topology.backpressure.enable – enable/disable back‑pressure (default true).
topology.acker.executors – number of acker executors.
Component parallelism is not a storm.yaml key; it is the parallelism hint passed to setSpout()/setBolt() when building the topology.
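A minimal storm.yaml combining several of the server-side settings above (host names and paths are placeholders):

```yaml
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
storm.zookeeper.port: 2181
nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:   # four slots -> up to four workers per supervisor node
  - 6700
  - 6701
  - 6702
  - 6703
worker.childopts: "-Xmx1g"
```

The number of entries in supervisor.slots.ports caps how many worker processes each supervisor node can host.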
