Big Data 6 min read

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays, replaces manual rate limits, and details its architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Spark Streaming Backpressure Mechanism

By default, Spark Streaming receives data via receivers or the Direct approach at the producer's rate; when the batch processing time exceeds the batch interval, data accumulates, potentially causing executor OOM failures.

Before Spark 1.5, rate limiting could be achieved by setting spark.streaming.receiver.maxRate for receiver‑based inputs or spark.streaming.kafka.maxRatePerPartition for Direct Kafka inputs, but this required manual estimation, restarts, and could lead to under‑utilisation.

Spark 1.5 introduced a Backpressure mechanism that dynamically adapts to the cluster's processing capacity by automatically estimating an appropriate ingestion rate.

Spark Streaming architecture before Spark 1.5

Data is continuously received by receivers, stored in the Block Manager, and replicated for fault tolerance.

Receiver Tracker maintains a mapping from timestamps to stored block IDs.

Job Generator creates a JobSet every batch interval.

Job Scheduler executes the generated JobSet.

Spark Streaming architecture after Spark 1.5

A new RateController component (extending StreamingListener) listens to onBatchCompleted events and uses processingDelay, schedulingDelay, record counts, and completion events to estimate a maximum processing rate via a RateEstimator (PID‑based in Spark 2.2).

The calculated rate is stored in the InputDStream's RateController and, after each batch, pushed to ReceiverSupervisorImpl so receivers know how much data to pull next.

If the user also configures spark.streaming.receiver.maxRate or spark.streaming.kafka.maxRatePerPartition, the effective rate is the minimum of the three values.

To enable backpressure, set spark.streaming.backpressure.enabled to true. Additional tunable parameters include:

spark.streaming.backpressure.initialRate : initial maximum rate for the first batch (no default).

spark.streaming.backpressure.rateEstimator : estimator class, default is pid.

spark.streaming.backpressure.pid.proportional : weight for error response, default 1, non‑negative.

spark.streaming.backpressure.pid.integral : weight for accumulated error (damping), default 0.2, non‑negative.

spark.streaming.backpressure.pid.derived : weight for error trend, default 0, non‑negative.

spark.streaming.backpressure.pid.minRate : minimum estimable rate, default 100, non‑negative.

These settings allow Spark Streaming to automatically regulate ingestion rates, preventing data backlog and improving resource utilization without manual intervention.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataStreamingSparkRate Controlbackpressure
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.