
How Spark Structured Streaming’s Real-Time Mode Achieves Millisecond Latency

This article explains Spark Structured Streaming’s new Real-Time Mode introduced in Spark 4.1, detailing how it reduces latency to the millisecond level by redesigning micro‑batch processing, concurrent stage scheduling, streaming shuffle, and checkpointing, and compares it with Flink’s native streaming.

Big Data Technology Tribe

Introduction

When discussing stream processing, Spark and Flink are the usual suspects. Flink processes each record as it arrives, while Spark Structured Streaming historically relied on micro‑batches, leading to higher latency but offering a familiar batch‑oriented API for many users.

With Spark 4.1, a new Structured Streaming Real-Time Mode (RTM) was added, promising millisecond‑level latency and narrowing the gap with Flink.

Spark Cluster Overview

Spark runs on a distributed cluster composed of a driver and multiple executors. In cluster mode, the driver is launched on a physical node and coordinates with the cluster manager to start executors.

Spark Core Concepts

Job – the complete set of work triggered by an action: a full workflow of transformations on the data.

Stage – a segment of a job that can be executed without shuffling; a job is split into stages when a shuffle is required.

DAG – a directed acyclic graph of stages derived from RDD dependencies.

Task – the smallest unit of execution; each task processes one partition of a stage's data.
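How these concepts fit together can be sketched with a toy planner that cuts a job into stages at shuffle boundaries. This is a simplified model of the idea, not Spark's actual planner, and the operator names are illustrative:

```python
# Toy model: a job is a linear chain of operators; a new stage starts
# at every shuffle boundary (wide dependency), mirroring how Spark
# cuts the dependency graph into stages.
WIDE_OPS = {"groupByKey", "reduceByKey", "join"}  # ops that force a shuffle

def split_into_stages(operators):
    """Split a linear operator chain into stages at shuffle boundaries."""
    stages, current = [], []
    for op in operators:
        if op in WIDE_OPS and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

job = ["textFile", "map", "filter", "reduceByKey", "map", "join", "map"]
stages = split_into_stages(job)
# Three stages: the chain is cut before each wide operator
```

Each resulting stage would then be executed as one task per data partition.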

Scheduling

The driver generates an execution plan and schedules work using three internal components:

DAGScheduler – stage‑oriented scheduling.

TaskScheduler – task‑oriented scheduling.

SchedulerBackend – interacts with the cluster manager to provide resources.

DAGScheduler respects the DAG topology, submitting a stage only after all its upstream dependencies finish.
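That invariant can be illustrated with a small topological scheduler over a stage-dependency map (a minimal sketch of the ordering rule, not the DAGScheduler's real implementation):

```python
from collections import deque

def schedule_stages(deps):
    """Return a submission order in which each stage is submitted only
    after all of its upstream (parent) stages have finished.
    deps maps stage -> set of parent stages."""
    remaining = {s: set(p) for s, p in deps.items()}
    ready = deque(sorted(s for s, p in remaining.items() if not p))
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)          # "run" the stage
        for s, parents in remaining.items():
            parents.discard(stage)   # this parent has now finished
            if not parents and s not in order and s not in ready:
                ready.append(s)
    return order

# Diamond DAG: stage 0 feeds stages 1 and 2, which both feed stage 3.
order = schedule_stages({0: set(), 1: {0}, 2: {0}, 3: {1, 2}})
```

Stage 3 is never submitted before both 1 and 2 complete, which is exactly the sequential behavior that Real-Time Mode later relaxes.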

Micro‑batch Structured Streaming

In the classic model, Spark treats a continuous stream as a series of bounded datasets (micro‑batches). Each micro‑batch is processed like a regular batch job: the driver parses logic, creates and optimizes a plan, and then schedules stages.

The processing steps for each micro‑batch are:

Query the source (e.g., fetch the latest Kafka offsets).

Identify new data that arrived since the previous batch.

Treat the new data as a static DataFrame and process it.
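The three steps above can be sketched in miniature as a pure-Python simulation (offset tracking and processing only; no Spark API involved):

```python
def run_micro_batches(arrivals):
    """Simulate the micro-batch loop. arrivals[i] holds the records
    that land in the source before batch i fires."""
    log, last_offset, batches = [], 0, []
    for new in arrivals:
        log.extend(new)                   # records land in the source
        latest = len(log)                 # 1. query the latest offset
        batch = log[last_offset:latest]   # 2. new data since the last batch
        batches.append([x * 2 for x in batch])  # 3. process as a static dataset
        last_offset = latest
    return batches

out = run_micro_batches([[1, 2], [], [3]])
# → [[2, 4], [], [6]]
```

Note that every iteration pays the fixed planning and coordination cost, even batch 1 here, which carries no data at all.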

Latency Limitations of Micro‑batch

Because stages are scheduled sequentially, and because the driver must repeat parsing, planning, coordination, and task submission for every batch, end-to-end latency stays well above what record-at-a-time engines deliver, making Flink the preferred choice for sub-second requirements.

Real‑Time Mode (RTM)

Spark 4.1 introduces RTM to achieve millisecond latency. The key improvements include:

Consuming Fresh Data

Instead of waiting for a trigger interval to fire, the engine continuously reads the latest data from the source as it arrives. Batches still exist as checkpoint boundaries, but with a much longer default interval (5 minutes), so per-batch coordination overhead is paid far less often.
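The shape of this change is a single long-lived task that keeps polling the source, rather than a fresh job per trigger. A minimal pure-Python sketch of that loop (not Spark code):

```python
import time

def continuous_consume(poll, deadline_s, process):
    """One long-lived task: keep polling the source and process each
    record the moment it is seen, instead of waiting for the next
    trigger to launch a new job."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        record = poll()
        if record is not None:
            process(record)

pending = [1, 2, 3]          # stand-in for a source with queued records
seen = []
continuous_consume(
    poll=lambda: pending.pop(0) if pending else None,
    deadline_s=0.05,         # stand-in for the (much longer) batch interval
    process=seen.append,
)
```

Each record is handled with only the polling latency in between, not a full batch-planning cycle.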

Stages Scheduled Concurrently

All stages are scheduled in parallel, eliminating the wait for upstream stages to finish before starting downstream ones, which reduces coordination and planning overhead.

Streaming Shuffle

RTM introduces a streaming shuffle that passes data between stages directly in memory, avoiding the disk‑based shuffle used in micro‑batch mode. This reduces the latency introduced by writing shuffle data to local disks and reading it back.
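The combination of concurrent stage scheduling and an in-memory shuffle can be pictured with two threads joined by a queue. This is a conceptual sketch, with the queue standing in for the streaming shuffle channel, not Spark's actual shuffle implementation:

```python
import queue
import threading

def run_pipeline(records):
    """Launch both stages up front; the downstream stage consumes from
    an in-memory queue while the upstream stage is still producing, so
    neither stage waits for the other to finish and no shuffle data
    touches disk."""
    shuffle = queue.Queue()      # in-memory channel between the stages
    out = []
    DONE = object()              # sentinel marking end of the stream

    def stage1():                # upstream: transform and push downstream
        for r in records:
            shuffle.put(r * 10)
        shuffle.put(DONE)

    def stage2():                # downstream: consume as data arrives
        while True:
            item = shuffle.get()
            if item is DONE:
                break
            out.append(item + 1)

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:            # both stages scheduled concurrently
        t.start()
    for t in threads:
        t.join()
    return out

result = run_pipeline([1, 2, 3])
# → [11, 21, 31]
```

Contrast this with the micro-batch model, where stage 2 could not start until stage 1 had finished and written its output to local disk.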

Checkpointing Changes

In micro‑batch mode, checkpoints are written at the start of each batch because the driver knows the data range in advance. In RTM, checkpoints occur at the end of a batch, since the driver does not know the batch’s end boundary ahead of time. Larger batches reduce checkpoint frequency (lower overhead) but increase recovery work; smaller batches increase checkpoint frequency, improving recovery speed.
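The recovery trade-off can be made concrete with a small simulation of end-of-batch checkpointing (a toy model of the timing, not Spark's checkpoint format):

```python
def process_with_end_checkpoints(log, batch_size, crash_at=None):
    """RTM-style checkpointing: the offset is committed only after a
    batch completes. Returns (committed_offset, number_of_commits).
    If a crash lands mid-batch, that batch's work must be replayed
    from the last committed offset."""
    committed, commits, pos = 0, 0, 0
    while pos < len(log):
        end = min(pos + batch_size, len(log))
        if crash_at is not None and pos <= crash_at < end:
            return committed, commits   # crash before the commit: batch lost
        committed = end                 # checkpoint written at batch end
        commits += 1
        pos = end
    return committed, commits

full = list(range(10))
big = process_with_end_checkpoints(full, batch_size=4, crash_at=7)    # (4, 1)
small = process_with_end_checkpoints(full, batch_size=2, crash_at=7)  # (6, 3)
```

With the same crash point, the larger batch size committed less progress (more to replay) but paid for fewer checkpoints, which is exactly the frequency-versus-recovery trade-off described above.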

Conclusion

The essential change for users is to replace the micro‑batch trigger with RealTimeTrigger. This allows Spark users to achieve millisecond latency without migrating to Flink, preserving existing Spark investments while avoiding the operational complexity of managing a separate streaming engine.
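As a configuration sketch, the switch might look roughly like the following. The exact PySpark trigger spelling shown here is an assumption, not confirmed API; check the Spark 4.1 release documentation for the final signature:

```python
# Hypothetical sketch -- the real-time trigger parameter name below is
# assumed for illustration, not verified against the Spark 4.1 API.
query = (
    df.writeStream
      .format("console")
      # .trigger(processingTime="1 second")   # classic micro-batch trigger
      .trigger(realTime="5 minutes")          # assumed RTM trigger spelling
      .start()
)
```

Everything else in the query (sources, transformations, sinks, checkpoint location) stays as it was, which is the point: the latency improvement comes from the engine, not from rewriting the job.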
