Storm vs Spark: Which Real‑Time Analytics Platform Wins for Your Business?
The article compares Apache Storm and Apache Spark, examining their origins, architecture, language support, integration capabilities, and performance characteristics, and offers guidance on selecting the right platform for real‑time business intelligence based on specific workload and infrastructure needs.
Background
Real‑time business intelligence has been discussed since at least 2006. Traditional data warehouses are batch‑oriented, so fresh results come with either high latency or high cost. Open‑source stream‑processing platforms such as Apache Storm and Apache Spark provide lower‑latency alternatives.
Apache Storm
Storm is a distributed stream‑processing framework originally created by BackType. It was open‑sourced by Twitter in 2011 and graduated to a top‑level Apache project in 2014. It is implemented primarily in Clojure on the JVM and supports Java, Python, Ruby, and other languages via multi‑language bindings.
Key architectural concepts:
Topology: a directed acyclic graph (DAG) of spouts (sources) and bolts (processing units) that runs continuously until it is explicitly stopped.
Fault tolerance: with acknowledgements enabled, each tuple is guaranteed “at‑least‑once” processing, meaning a tuple that fails or times out is replayed from its spout; exactly‑once semantics can be achieved with the Trident API.
Scalability: the scheduler distributes tasks across a cluster; failed workers are automatically restarted.
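The topology and at‑least‑once concepts above can be sketched in plain Python. This is an illustrative simulation, not Storm's API: the class names, `run_topology`, and the `lose_ack_for` parameter are all hypothetical. It assumes a failure that happens after a tuple has already been counted (for example, a lost acknowledgement), which is the case where replay produces duplicates.

```python
from collections import Counter, deque


class SentenceSpout:
    """Hypothetical source that replays any tuple reported as failed."""

    def __init__(self, sentences):
        self.pending = deque(enumerate(sentences))  # (tuple_id, sentence)
        self.in_flight = {}

    def next_tuple(self):
        if not self.pending:
            return None
        tuple_id, sentence = self.pending.popleft()
        self.in_flight[tuple_id] = sentence
        return tuple_id, sentence

    def ack(self, tuple_id):
        # Fully processed: forget the tuple.
        del self.in_flight[tuple_id]

    def fail(self, tuple_id):
        # Not confirmed: put the tuple back in the queue for replay.
        self.pending.append((tuple_id, self.in_flight.pop(tuple_id)))


class CountBolt:
    """Hypothetical terminal bolt that keeps a running word count."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1


def run_topology(spout, counter, lose_ack_for=()):
    """Drive spout -> split -> count; ids in lose_ack_for fail once after processing."""
    to_fail = set(lose_ack_for)
    deliveries = 0
    while True:
        item = spout.next_tuple()
        if item is None:
            return deliveries
        tuple_id, sentence = item
        deliveries += 1
        for word in sentence.split():  # the "split" bolt
            counter.execute(word)
        if tuple_id in to_fail:
            to_fail.discard(tuple_id)
            spout.fail(tuple_id)  # work was done, but the ack was lost
        else:
            spout.ack(tuple_id)
```

Running this with one lost acknowledgement delivers that tuple twice, so its words are counted twice. That duplicate is the practical meaning of "at‑least‑once", and it is the situation exactly‑once layers such as Trident are meant to address.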
Storm integrates with many data sources, e.g., the Twitter Streaming API, Apache Kafka, JMS brokers, and HDFS. Components written in non‑JVM languages run as separate subprocesses, so any language can be used as long as it exchanges JSON messages with Storm over STDIN/STDOUT.
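A minimal sketch of what such a non‑JVM component might look like follows. It assumes the framing used by Storm's multi‑language protocol as I understand it: each JSON message is followed by a line containing only `end`. The function names are hypothetical, and the sketch deliberately omits the initial handshake, heartbeats, and the task‑id replies a real component must handle, so treat it as an illustration and check field names against the Storm documentation.

```python
import io
import json


def read_message(stream):
    """Read one JSON message terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def write_message(stream, message):
    """Write one JSON message followed by the 'end' delimiter line."""
    stream.write(json.dumps(message) + "\nend\n")


def run_split_bolt(stdin, stdout):
    """Hypothetical bolt: split each incoming sentence into word tuples."""
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        sentence = msg["tuple"][0]
        for word in sentence.split():
            # Anchor each emitted word to the input tuple so failures can be traced.
            write_message(
                stdout,
                {"command": "emit", "anchors": [msg["id"]], "tuple": [word]},
            )
        write_message(stdout, {"command": "ack", "id": msg["id"]})
```

In a real deployment the two streams would be `sys.stdin` and `sys.stdout`; passing `io.StringIO` objects instead makes the logic easy to exercise without a cluster.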
Apache Spark
Spark originated in UC Berkeley’s AMPLab, entered the Apache incubator in 2013 and became a top‑level project in February 2014. It is written in Scala and provides native APIs for Scala, Java, Python, and R.
Core components:
Unified engine: runs on Hadoop YARN, Apache Mesos, or its own standalone cluster manager.
Storage adapters: can read/write HDFS, Cassandra, HBase, Amazon S3, and other systems.
Modules: Spark SQL, Spark Streaming (micro‑batch DStreams), Structured Streaming (DataFrame‑based; micro‑batch by default, with an experimental continuous processing mode), GraphX, and MLlib.
Interactive shell: spark‑shell (Scala) and pyspark (Python) enable rapid prototyping.
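The micro‑batch idea behind DStreams can be illustrated without Spark itself. The sketch below is a plain‑Python analogy, not Spark code: `micro_batches` and `batching_delay` are hypothetical helpers, and it assumes events arrive as `(timestamp, value)` pairs with timestamps in seconds.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time batches.

    Intervals that contain no events are skipped in the result.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


def batching_delay(timestamp, batch_interval):
    """Time an event waits before its batch closes and can be processed."""
    batch_end = (int(timestamp // batch_interval) + 1) * batch_interval
    return batch_end - timestamp
```

With a one‑second interval, an event arriving at 0.1 s waits about 0.9 s before its batch is even submitted. This is why the batch interval sets a floor on end‑to‑end latency in a micro‑batch system, whereas a tuple‑at‑a‑time system can start work on each event as it arrives.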
Spark’s processing model is based on resilient distributed datasets (RDDs) and DataFrames, which allow batch and streaming workloads to share much of the same code base. Spark has demonstrated large‑scale performance in benchmarks such as the 2014 Daytona GraySort, which sorts 100 TB of data.
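The benefit of a shared code base can be shown with a small analogy in plain Python. Nothing here is Spark API; `count_words`, `batch_job`, and `streaming_job` are hypothetical names. The point is only that one transformation is written once and reused by both a batch path and an incremental path.

```python
from collections import Counter


def count_words(lines):
    """Shared business logic: count the words in a collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def batch_job(dataset):
    """Batch path: apply the logic to the whole dataset at once."""
    return count_words(dataset)


def streaming_job(batches):
    """Streaming path: apply the same logic per batch and merge the results."""
    running = Counter()
    for batch in batches:
        running += count_words(batch)
    return running
```

Both paths produce the same totals for the same input, which is the property that lets a team test logic on historical data and then run it against a live stream.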
Comparison and Selection Guidance
Use Storm when:
Ultra‑low latency (sub‑second) processing is required; Storm handles each tuple as it arrives rather than waiting for a batch to fill.
The workload is primarily event‑driven or requires complex event processing (CEP).
A dedicated cluster can be provisioned and language flexibility (including non‑JVM languages) is important.
Use Spark when:
An existing Hadoop or Mesos environment is available.
You need a single platform for batch, streaming, SQL, graph, and machine‑learning workloads.
Interactive data exploration via the Spark shell (spark‑shell or pyspark) is valuable.
Both systems can be combined with complementary tools such as Apache Kafka (messaging), Hadoop/HDFS (persistent storage), and Apache Flume (data ingestion) to build hybrid pipelines.
Practical Recommendation
Implement a small proof‑of‑concept on each platform, run representative workloads, and measure latency, throughput, and resource utilization before committing to a final architecture.
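A proof‑of‑concept comparison needs a consistent way to record results. The following is a minimal sketch of a measurement harness, assuming the workload can be driven item by item from Python; `measure` and its return keys are hypothetical names. It covers latency and throughput only, so resource utilization would still have to come from cluster or operating‑system monitoring.

```python
import statistics
import time


def measure(process, workload):
    """Run process(item) over the workload and summarize latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for item in workload:
        began = time.perf_counter()
        process(item)
        latencies.append(time.perf_counter() - began)
    elapsed = time.perf_counter() - start

    if not latencies:
        raise ValueError("workload must contain at least one item")

    latencies.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the last element.
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "count": len(latencies),
        "throughput_per_s": len(latencies) / elapsed if elapsed > 0 else float("inf"),
        "p50_s": statistics.median(latencies),
        "p99_s": latencies[p99_index],
    }
```

Calling `measure` with the same representative workload against each candidate pipeline yields comparable median and tail latencies. Tail latency (p99) is often more informative than the average for real‑time use cases.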
Reference URL: http://www.infoworld.com/article/2854894/application-development/spark-and-storm-for-real-time-computation.html
