How to Build a Real‑Time Spam Monitoring System with Apache Storm
This article walks through the design, deployment, and code implementation of a real‑time spam detection pipeline using Apache Storm, comparing it with Hadoop, detailing cluster setup, topology components, data flow, and how to package and run the solution on a distributed Storm cluster.
Overview
Apache Storm is an open‑source distributed real‑time computation system, whereas Hadoop focuses on batch processing via MapReduce on HDFS. Both share a master‑worker architecture, but Storm keeps topologies running continuously, making it suitable for low‑latency analytics such as spam monitoring.
Storm Fundamentals
Storm is built around three concepts: a Stream (an unbounded sequence of tuples), a Spout (a source of streams) and a Bolt (a processing unit that consumes streams and may emit new ones). A Topology connects spouts and bolts into a directed graph. Topologies are defined as Thrift structures and submitted to Nimbus, which is itself a Thrift service, so they can be created and submitted from any language.
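To make the spout, bolt and topology vocabulary concrete, here is a minimal plain-Java model of the idea. It is a conceptual sketch only: the `MiniTopology`, `Spout` and `Bolt` names below are hypothetical and are not the Storm API (in Storm itself, spouts and bolts emit tuples through collector objects, and streams never end). The toy spout signals exhaustion with `null` purely so the demo terminates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Conceptual model only: these interfaces are NOT the Storm API.
final class MiniTopology {
    /** A source of tuples; in this toy model, null means the stream is exhausted. */
    interface Spout { List<Object> nextTuple(); }

    /** A processing step: consumes one tuple and emits zero or more tuples. */
    interface Bolt { List<List<Object>> execute(List<Object> tuple); }

    /** Pulls every tuple from the spout and pushes it through the bolts in order. */
    static List<List<Object>> run(Spout spout, List<Bolt> bolts) {
        List<List<Object>> results = new ArrayList<>();
        for (List<Object> t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            List<List<Object>> current = new ArrayList<>();
            current.add(t);
            for (Bolt bolt : bolts) {
                List<List<Object>> next = new ArrayList<>();
                for (List<Object> in : current) {
                    next.addAll(bolt.execute(in));
                }
                current = next;
            }
            results.addAll(current);
        }
        return results;
    }

    /** Demo: a spout over a fixed list, a bolt that upper-cases, a bolt that drops short words. */
    static List<List<Object>> demo() {
        Iterator<String> words = List.of("storm", "is", "streaming").iterator();
        Spout spout = () -> words.hasNext() ? List.<Object>of(words.next()) : null;
        Bolt upper = t -> {
            List<Object> up = List.of(((String) t.get(0)).toUpperCase());
            return List.of(up);
        };
        Bolt longOnly = t -> {
            List<List<Object>> out = new ArrayList<>();
            if (((String) t.get(0)).length() > 2) {
                out.add(t); // keep only words longer than two characters
            }
            return out;
        };
        return run(spout, List.of(upper, longOnly));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The chain of bolts here is strictly linear; a real topology is a graph in which one stream can feed several bolts, each running as many parallel tasks.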
Case Study: Real‑Time Spam Monitoring
A telecom company receives files containing suspected spam messages from each province. The legacy approach used a separate serial application per city, leading to backlog. The new design rewrites the pipeline with Storm to ingest files, parse each line, filter messages containing sensitive keywords (e.g., "racketeer", "Bad"), and store matching records in MySQL.
Cluster Deployment
The Storm cluster consists of one Nimbus node (192.168.95.134) and two Supervisor nodes: slave1 (192.168.95.135) and slave2 (192.168.95.136). ZooKeeper runs on the same three machines. After starting ZooKeeper, Nimbus is launched on the master node and a Supervisor on each slave with:
storm nimbus > /dev/null 2>&1 &
storm supervisor > /dev/null 2>&1 &
The UI is started with:
storm ui > /dev/null 2>&1 &
It is then accessible at http://192.168.95.134:8080. Screenshots of the UI and node status are shown below.
Topology Design
The topology includes two spouts (SensitiveFileReader-591 and SensitiveFileReader-592) that read files from province-specific directories. Each line follows the format:
home_city=591&user_id=5911000&msisdn=10000&sms_content=abc-slave1
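The line format above is a flat list of key=value pairs joined by `&`, so parsing it takes only a few lines. The following is a minimal sketch, not the article's analyzer bolt (whose code was published only as images); the class name `SmsLineParser` is hypothetical, and the sketch assumes that neither keys nor values contain a literal `&`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses one input line into its named fields.
final class SmsLineParser {
    /** Splits "k1=v1&k2=v2..." into an ordered map; fragments without a key are skipped. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.trim().split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // no '=' or an empty key: ignore the fragment
            }
            // Split on the first '=' only, so values may themselves contain '='.
            fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "home_city=591&user_id=5911000&msisdn=10000&sms_content=abc-slave1";
        System.out.println(parse(line));
    }
}
```

If message bodies can contain `&`, this simple split would cut them short, and the parser would need to treat everything after `sms_content=` as a single value.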
After the spouts, the SensitiveFileAnalyzer bolt parses the fields, and the SensitiveBatchBolt bolt matches the sms_content against the configured sensitive keywords. Matching records are persisted to MySQL via Hibernate. The overall topology diagram is illustrated below.
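The keyword check performed on sms_content can be sketched as a substring test against the configured list. This is an illustration under stated assumptions rather than the article's SensitiveBatchBolt: the class name `SensitiveKeywordMatcher` is hypothetical, and case-insensitive matching is an assumption, since the source does not say whether case matters.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper: decides whether a message contains any sensitive keyword.
final class SensitiveKeywordMatcher {
    private final List<String> keywords; // stored lower-cased for comparison

    SensitiveKeywordMatcher(List<String> keywords) {
        this.keywords = keywords.stream()
                .map(k -> k.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Returns true if the text contains any keyword as a substring (assumed case-insensitive). */
    boolean matches(String smsContent) {
        if (smsContent == null) {
            return false;
        }
        String text = smsContent.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SensitiveKeywordMatcher matcher =
                new SensitiveKeywordMatcher(List.of("racketeer", "Bad"));
        System.out.println(matcher.matches("this is a BAD offer"));
        System.out.println(matcher.matches("see you at noon"));
    }
}
```

Plain substring matching also flags words that merely contain a keyword (for example "badge" for "Bad"), so a production filter might match on word boundaries instead.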
Implementation Details
The project structure is shown in the following image, followed by code snippets (provided as images) for the data model RubbishUsers, the spout implementation, the analyzer bolt, and the batch bolt. The Hibernate configuration (hibernate.cfg.xml) and mapping file (rubbish-users.hbm.xml) are also included, as well as the Spring bean definition (jdbc-hibernate-bean.xml) that wires the session factory and a DBCP connection pool.
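Because the original snippets survive only as images, the shape of the RubbishUsers data model can only be guessed at. The sketch below is hypothetical: the field names and types are assumptions derived from the input line format (home_city, user_id, msisdn, sms_content), not the author's actual class. It follows the usual conventions for a Hibernate-mapped class: a no-argument constructor, private fields and public accessors.

```java
import java.io.Serializable;

// Hypothetical sketch: field names and types are assumptions based on the input format.
class RubbishUsers implements Serializable {
    private static final long serialVersionUID = 1L;

    private Integer homeCity;   // from home_city, e.g. 591
    private Integer userId;     // from user_id
    private String msisdn;      // phone number, kept as text
    private String smsContent;  // message body that matched a keyword

    public RubbishUsers() { }   // no-argument constructor, as Hibernate expects

    public Integer getHomeCity() { return homeCity; }
    public void setHomeCity(Integer homeCity) { this.homeCity = homeCity; }

    public Integer getUserId() { return userId; }
    public void setUserId(Integer userId) { this.userId = userId; }

    public String getMsisdn() { return msisdn; }
    public void setMsisdn(String msisdn) { this.msisdn = msisdn; }

    public String getSmsContent() { return smsContent; }
    public void setSmsContent(String smsContent) { this.smsContent = smsContent; }

    @Override
    public String toString() {
        return "RubbishUsers{homeCity=" + homeCity + ", userId=" + userId
                + ", msisdn=" + msisdn + ", smsContent=" + smsContent + "}";
    }
}
```

A mapping file such as rubbish-users.hbm.xml would then bind each property to a column of the MySQL table; the actual column names are not given in the source.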
Running the Topology
The topology is packaged into a JAR and submitted with:
storm jar /home/tj/install/SensitiveTopology.jar newlandframework.storm.topology.SensitiveTopology
Input files are placed under /home/tj/data/591 and /home/tj/data/592. After submission, the UI shows the spouts and bolts with their executor counts and emitted tuples. Log files on the supervisors confirm that nine sensitive users were detected and successfully inserted into MySQL, matching the bolt's monitoring output.
The solution demonstrates how adjusting parallelism and adding more worker nodes can scale the pipeline to handle larger data volumes.
dbaplus Community