How to Build a Real‑Time Spam Monitoring System with Apache Storm

This article walks through the design, deployment, and code implementation of a real‑time spam detection pipeline using Apache Storm, comparing it with Hadoop, detailing cluster setup, topology components, data flow, and how to package and run the solution on a distributed Storm cluster.

dbaplus Community

Oct 30, 2017

Overview

Apache Storm is an open-source distributed real-time computation system, whereas Hadoop focuses on batch processing via MapReduce on HDFS. Both use a master-worker architecture, but a MapReduce job ends once its input is processed, while a Storm topology keeps running until it is explicitly killed. That makes Storm suitable for low-latency analytics such as spam monitoring.

Storm Fundamentals

Storm introduces the concepts of Stream (an unbounded sequence of tuples), Spout (a source of streams) and Bolt (a processing unit that consumes tuples and may emit new ones). A Topology connects spouts and bolts into a directed graph. Topologies are defined as Thrift structures and submitted to Nimbus, which is itself a Thrift service, so they can be created and submitted from any language.
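To make the spout, bolt and topology vocabulary concrete, here is a dependency-free toy model in plain Java. It is only a sketch: `ToyTopology`, `Spout` and `Bolt` are hypothetical names invented for illustration and are not Storm's interfaces. The toy also drains a finite input and wires the bolts as a simple chain, whereas a real stream is unbounded and a real topology is a graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of a linear topology. These types are NOT Storm's interfaces. */
final class ToyTopology {

    /** Source of tuples; returns null once the (finite) demo input is exhausted. */
    interface Spout {
        String nextTuple();
    }

    /** Processing step: consumes one tuple and may emit zero or more tuples. */
    interface Bolt {
        void execute(String tuple, Consumer<String> emit);
    }

    private ToyTopology() {}

    /** Drains the spout, then passes the tuples through each bolt in order. */
    static List<String> run(Spout spout, List<Bolt> bolts) {
        List<String> current = new ArrayList<>();
        for (String t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            current.add(t);
        }
        for (Bolt bolt : bolts) {
            List<String> next = new ArrayList<>();
            for (String t : current) {
                bolt.execute(t, next::add); // a bolt may emit nothing, which filters the tuple out
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Iterator<String> source = Arrays.asList(" hello ", " spam offer ", " bye ").iterator();
        Spout spout = () -> source.hasNext() ? source.next() : null;
        Bolt trim = (tuple, emit) -> emit.accept(tuple.trim());
        Bolt keepSpam = (tuple, emit) -> {
            if (tuple.contains("spam")) {
                emit.accept(tuple);
            }
        };
        System.out.println(run(spout, Arrays.asList(trim, keepSpam)));
    }
}
```

In Storm itself the equivalent wiring is done through the framework's topology builder, and the runtime distributes spout and bolt instances across worker processes instead of calling them in a single loop.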

Case Study: Real‑Time Spam Monitoring

A telecom company receives files containing suspected spam messages from each province. The legacy approach used a separate serial application per city, leading to backlog. The new design rewrites the pipeline with Storm to ingest files, parse each line, filter messages containing sensitive keywords (e.g., "racketeer", "Bad"), and store matching records in MySQL.

Cluster Deployment

The Storm cluster consists of one Nimbus node (192.168.95.134) and two Supervisor nodes, slave1 (192.168.95.135) and slave2 (192.168.95.136). ZooKeeper runs on the same three machines. After ZooKeeper is started, Nimbus is launched on the master node and a Supervisor on each slave node:

storm nimbus > /dev/null 2>&1 &
storm supervisor > /dev/null 2>&1 &

The UI is started with:

storm ui > /dev/null 2>&1 &

It is then accessed at http://192.168.95.134:8080. Screenshots of the UI and node status are shown below.

Topology Design

The topology includes two spouts (SensitiveFileReader-591 and SensitiveFileReader-592) that read files from province-specific directories. Each line follows the format:

home_city=591&user_id=5911000&msisdn=10000&sms_content=abc-slave1
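A line in this format can be split into its fields with a few lines of plain Java. The sketch below is one possible approach and not the article's analyzer code, which was published only as images. The class name `SensitiveLineParser` is hypothetical, and the sketch assumes that `&` never appears inside a value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits one input line into its key=value fields. */
final class SensitiveLineParser {

    private SensitiveLineParser() {}

    /**
     * Parses a line such as
     * "home_city=591&user_id=5911000&msisdn=10000&sms_content=abc-slave1".
     * Pairs without a key or without '=' are skipped; a value may contain '='.
     */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (line == null || line.trim().isEmpty()) {
            return fields;
        }
        for (String pair : line.trim().split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // malformed pair: empty key or no '=' at all
            }
            fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields =
                parse("home_city=591&user_id=5911000&msisdn=10000&sms_content=abc-slave1");
        System.out.println(fields.get("home_city") + " -> " + fields.get("sms_content"));
    }
}
```

If message bodies can contain `&`, the assumption above no longer holds and the input would need an escaping rule or a different delimiter.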

After the spouts, the SensitiveFileAnalyzer bolt parses the fields, and the SensitiveBatchBolt bolt matches the sms_content against the configured sensitive keywords. Matching records are persisted to MySQL via Hibernate. The overall topology diagram is illustrated below.
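The keyword check performed at the end of this flow can be sketched as follows. This is an illustration under stated assumptions, not the article's SensitiveBatchBolt: the class name `SensitiveKeywordFilter` is hypothetical, and case-insensitive substring matching is an assumption, since the article does not say how keywords are compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical keyword matcher; the matching rule is an assumption. */
final class SensitiveKeywordFilter {

    private final List<String> keywords = new ArrayList<>();

    SensitiveKeywordFilter(List<String> configuredKeywords) {
        // Normalise once so each message only needs a single lower-casing.
        for (String keyword : configuredKeywords) {
            if (keyword != null && !keyword.trim().isEmpty()) {
                keywords.add(keyword.trim().toLowerCase(Locale.ROOT));
            }
        }
    }

    /** Returns true if the message contains any configured keyword. */
    boolean isSensitive(String smsContent) {
        if (smsContent == null) {
            return false;
        }
        String text = smsContent.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SensitiveKeywordFilter filter =
                new SensitiveKeywordFilter(Arrays.asList("racketeer", "Bad"));
        System.out.println(filter.isSensitive("a Bad offer"));
        System.out.println(filter.isSensitive("see you at noon"));
    }
}
```

Substring matching is simple but coarse: the keyword "Bad" also matches "badge". A production filter would likely need word boundaries or a curated phrase list.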

Implementation Details

The project structure is shown in the following image, followed by code snippets (provided as images) for the data model RubbishUsers, the spout implementation, the analyzer bolt, and the batch bolt. Hibernate configuration (hibernate.cfg.xml) and mapping file (rubbish-users.hbm.xml) are also included, as well as the Spring bean definition (jdbc-hibernate-bean.xml) that wires the session factory and a DBCP connection pool.
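Because the original snippets are available only as images, the following is a guess at what a record class for the four fields of the input line might look like. It is not the article's RubbishUsers class: the name `RubbishUserSketch`, the field names and the all-`String` types are assumptions. The no-argument constructor and the getter/setter pairs follow the usual requirements for a class mapped through a Hibernate .hbm.xml file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the data model; fields are assumed from the input format. */
class RubbishUserSketch {

    private String homeCity;
    private String userId;
    private String msisdn;
    private String smsContent;

    /** Hibernate instantiates mapped classes through a no-argument constructor. */
    RubbishUserSketch() {}

    /** Builds a record from parsed key/value fields; missing keys stay null. */
    static RubbishUserSketch fromFields(Map<String, String> fields) {
        RubbishUserSketch user = new RubbishUserSketch();
        user.setHomeCity(fields.get("home_city"));
        user.setUserId(fields.get("user_id"));
        user.setMsisdn(fields.get("msisdn"));
        user.setSmsContent(fields.get("sms_content"));
        return user;
    }

    public String getHomeCity() { return homeCity; }
    public void setHomeCity(String homeCity) { this.homeCity = homeCity; }

    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }

    public String getMsisdn() { return msisdn; }
    public void setMsisdn(String msisdn) { this.msisdn = msisdn; }

    public String getSmsContent() { return smsContent; }
    public void setSmsContent(String smsContent) { this.smsContent = smsContent; }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("home_city", "591");
        fields.put("user_id", "5911000");
        fields.put("msisdn", "10000");
        fields.put("sms_content", "abc-slave1");
        RubbishUserSketch user = fromFields(fields);
        System.out.println(user.getUserId() + " / " + user.getSmsContent());
    }
}
```

The class is deliberately not `final`, since Hibernate cannot generate lazy-loading proxies for final classes.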

Running the Topology

The topology is packaged into a JAR and submitted with:

storm jar /home/tj/install/SensitiveTopology.jar newlandframework.storm.topology.SensitiveTopology

Input files are placed under /home/tj/data/591 and /home/tj/data/592. After submission, the UI shows the spouts and bolts with their executor counts and emitted tuples. Log files on the supervisors confirm that nine sensitive users were detected and successfully inserted into MySQL, matching the bolt’s monitoring output.

The solution demonstrates how adjusting parallelism and adding more worker nodes can scale the pipeline to handle larger data volumes.
