Comprehensive Overview and Best Practices for Apache Spark Streaming
This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.
Spark Streaming (referred to as streaming) is an extension of Spark Core that offers a highly extensible, high‑throughput, and fault‑tolerant stream processing system.
Streaming receives input data from sources such as Kafka, divides the continuous stream into micro‑batches according to a configured batch interval, and processes each small dataset using Spark Core, SQL, or MLlib before emitting the results.
A DStream (discretized stream) represents a sequence of RDDs; it is the core abstraction for Spark Streaming.
Basic setup : after linking required JARs, a StreamingContext can be created from an existing SparkContext: val ssc = new StreamingContext(sc, Seconds(1)) Once the ssc is created, the typical workflow is:
Define input DStreams (the start point).
Apply transformation and output operations (the processing steps).
Start the streaming job with ssc.start().
Block the driver with ssc.awaitTermination(-1L) until the job finishes or fails.
Stop the context with ssc.stop().
Only one StreamingContext may exist per JVM, and a SparkContext can be reused after the previous ssc is stopped.
Receivers : Input DStreams receive data via Receivers, which store incoming data in Spark’s memory for subsequent transformations. Built‑in sources include basic file and socket streams, advanced sources such as Kafka, Flume, and Kinesis (requiring extra JARs), and custom sources by extending Receiver. Each Receiver occupies a core for its entire lifetime, so the number of cores must exceed the number of Receivers plus processing cores. Local mode should use local[n] with n > #receiver, not local or local[1].
Receivers are classified as reliable (acknowledgment supported) or unreliable.
Transformations on DStreams are lazy operations similar to RDD transformations. Example of aggregating signal values per device: avgSignalDf = eventsDF.groupby("deviceId").avg("signal") Windowed aggregations use event time:
windowedAvgSignalDF1 = eventsDF.groupBy("deviceId", window("eventTime", "5 minute")).count() windowedAvgSignalDF2 = eventsDF.groupBy("deviceId", window("eventTime", "10 minute", "5 minute")).count()Watermarking defines how long late data is retained:
windowedAvgSignalDF4 = eventsDF.withWatermark("eventTime", "10 minutes").groupBy("deviceId", window("eventTime", "10 minute", "5 minute")).count()Output operations push DStream results to external systems (e.g., databases, HDFS). The foreachRDD pattern is common, but creating a connection inside the driver and serializing it to executors causes errors. Instead, a connection should be created per executor partition or managed via a connection pool to avoid costly per‑record connection creation.
Checkpointing provides fault tolerance for long‑running streaming jobs. Two types of data are checkpointed: metadata (configuration, DStream operations, incomplete batches) and RDD data. Enable checkpointing for driver recovery, stateful transformations, and windowed operations, and set the checkpoint interval to roughly 5–10 times the batch interval.
Accumulators and broadcast variables cannot be recovered from checkpoints, but lazy‑instantiated singleton objects can be recreated after a failure.
Fault‑tolerance semantics include at‑most‑once, at‑least‑once, and exactly‑once guarantees. Receivers with acknowledgment and Spark’s write‑ahead log (WAL) enable exactly‑once semantics for input data. Output operations are at‑least‑once by default; achieving exactly‑once requires idempotent writes or transactional updates using unique identifiers.
Deployment considerations involve using a cluster manager, packaging code as a JAR, allocating sufficient executor memory to hold the required state (e.g., window width), configuring checkpoint directories, enabling WAL, and setting receiver rate limits ( spark.streaming.receiver.maxRate, spark.streaming.kafka.maxRatePerPartition) or back‑pressure ( spark.streaming.backpressure.enabled).
Performance tuning focuses on reducing batch processing time by adjusting parallelism in data receiving (multiple receivers, block interval via spark.streaming.blockInterval) and processing (default parallelism via spark.default.parallelism), choosing appropriate serialization (Kryo, disabling serialization for small batches), and minimizing task launch overhead. Selecting an appropriate batch interval (typically 5–10 seconds) and tuning memory usage, GC settings, and persistence levels further improve throughput.
Monitoring uses Spark Streaming UI metrics such as processing time, scheduling delay, total delay, and active/completed batches to ensure processing time stays below the batch interval and to detect bottlenecks.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
