Design Principles of Real-Time Distributed Streaming Systems: A Comparison of Spark and Storm
This article examines the design considerations of real-time distributed streaming systems. It outlines their background and characteristics, compares the architectures of Spark Streaming and Storm, and discusses primitives, message passing, high availability, storage models, and integration with production environments, with practical insights for architects.
Background: The author is designing a real-time distributed stream-computing system and shares the architectural ideas behind it, using Storm and Spark as points of comparison.
Importance of streaming systems: They are crucial for online and nearline massive data processing in internet companies, requiring low latency and high reliability.
Characteristics of streaming systems: convenient user-defined logic, scale‑out design, no data loss, transparent fault tolerance, data persistence, and appropriate timeout settings.
Primitive design: Spark builds on RDDs organized into a DAG, while Storm builds on a topology of Spouts and Bolts; both provide abstractions for staged processing.
Design of Spark Streaming: It breaks streams into short micro‑batches, converting each batch into an RDD for processing, with examples such as WordCount.
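The micro‑batch idea can be sketched in plain Python. This is only an illustration of the concept and does not use the Spark API; the function names `micro_batches` and `word_count`, the one‑second interval, and the sample events are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, line) events into batches keyed by interval index."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_interval)].append(line)
    # Return the batches in time order as (batch_index, lines) pairs.
    return sorted(batches.items())


def word_count(lines):
    """Count words in one batch; stands in for the per-batch job."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


# Hypothetical input: (arrival time in seconds, text line).
events = [(0.2, "a b"), (0.9, "b"), (1.1, "a a"), (2.5, "c")]
results = [(idx, word_count(lines)) for idx, lines in micro_batches(events, 1.0)]
```

Each batch is processed independently, which is why latency in this model is bounded below by the batch interval.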
Design of Storm: It defines a topology in which Spouts are data sources and Bolts process tuples, with several grouping strategies (shuffle, fields, all, global, none, direct).
Message passing: Spark turns the transformations on RDDs into a DAG and schedules tasks from it, while Storm’s topology explicitly defines how tuples are routed through its grouping modes.
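To make the grouping strategies concrete, here is a minimal sketch of how a tuple might be mapped to target task indices under four of them. It is plain Python, not Storm's API; the `route` function, its parameters, and the use of CRC32 as the hash are assumptions chosen for illustration. Direct grouping is omitted because there the producer chooses the target task itself.

```python
import random
import zlib


def route(grouping, tup, num_tasks, key_field=None, rng=random):
    """Return the list of target task indices for one tuple (illustrative)."""
    if grouping in ("shuffle", "none"):
        # Shuffle spreads tuples randomly; "none" is treated the same way here.
        return [rng.randrange(num_tasks)]
    if grouping == "fields":
        # The same key always lands on the same task. CRC32 is used because
        # it is stable across processes, unlike Python's built-in hash().
        key = str(tup[key_field]).encode("utf-8")
        return [zlib.crc32(key) % num_tasks]
    if grouping == "all":
        # Every task receives a copy of the tuple.
        return list(range(num_tasks))
    if grouping == "global":
        # The whole stream goes to a single task (the lowest index here).
        return [0]
    raise ValueError(f"unsupported grouping: {grouping}")


first = route("fields", {"word": "storm"}, 4, key_field="word")
second = route("fields", {"word": "storm"}, 4, key_field="word")
```

Fields grouping is what makes a per‑key aggregation such as WordCount correct: every occurrence of a word reaches the same counter.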
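The DAG side of this comparison can be illustrated with a small topological ordering of stages. This is a simplified sketch rather than Spark's actual scheduler; the `schedule` function and the stage names are hypothetical, and every stage is assumed to appear as a key in the dependency map.

```python
from collections import deque


def schedule(deps):
    """Return a run order in which each stage follows all of its parents.

    deps maps a stage name to the stages it depends on.
    """
    remaining = {stage: set(parents) for stage, parents in deps.items()}
    # Stages with no parents can run immediately.
    ready = deque(sorted(s for s, parents in remaining.items() if not parents))
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for other in sorted(remaining):
            if stage in remaining[other]:
                remaining[other].discard(stage)
                # A stage becomes runnable once its last parent has finished.
                if not remaining[other]:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("cycle or unknown parent in stage graph")
    return order


# Hypothetical WordCount pipeline expressed as stage dependencies.
deps = {"read": [], "split": ["read"], "count": ["split"], "print": ["count"]}
order = schedule(deps)
```

In Storm the equivalent wiring is declared up front in the topology, so no such ordering step is derived from the user's transformations.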
High availability: Storm’s Nimbus is stateless with metadata in ZooKeeper; Spark achieves HA via ZooKeeper leader election for the master and can restart workers via containers.
Storage model and data loss: Discusses trade‑offs between persistence for reliability and performance, checkpointing, and metadata storage using ZooKeeper.
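The persistence trade‑off can be sketched with a toy checkpointing worker. This is an assumption‑laden illustration, not code from Storm or Spark: the `CheckpointedCounter` class, the `checkpoint_every` and `fail_at` parameters, and the dictionary standing in for durable storage (ZooKeeper or a database in a real system) are all hypothetical.

```python
class CheckpointedCounter:
    """Sums a replayable log, saving offset and state every few records."""

    def __init__(self, store):
        self.store = store  # dict standing in for durable storage
        self.offset = store.get("offset", 0)
        self.total = store.get("total", 0)

    def process(self, log, checkpoint_every=2, fail_at=None):
        while self.offset < len(log):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated crash")
            self.total += log[self.offset]
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                # Offset and state are saved together so they stay consistent.
                self.store["offset"] = self.offset
                self.store["total"] = self.total
        self.store.update(offset=self.offset, total=self.total)
        return self.total


log = [1, 2, 3, 4, 5]
store = {}
try:
    CheckpointedCounter(store).process(log, fail_at=3)
except RuntimeError:
    pass
after_crash = dict(store)  # last checkpoint that survived the crash

# A fresh worker resumes from the checkpoint and replays the lost record.
recovered_total = CheckpointedCounter(store).process(log)
```

A larger `checkpoint_every` means fewer writes to storage but more records to replay after a failure, which is the reliability versus performance trade‑off in miniature.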
Integration with production environments: Highlights challenges of adopting open‑source streaming platforms in large internet companies and the need for custom extensions or building in‑house solutions.
Conclusion: Storm and Spark provide valuable design examples for streaming systems, and the author plans to further explore Spark’s source code in future posts.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
