Big Data 42 min read

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data Technology & Architecture

Jan 2, 2020

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

Introduction

With the growing prevalence of real‑time data, enterprises require streaming systems that are scalable, easy to use, and integrable with existing business applications. Structured Streaming offers a highly abstracted, declarative API built on Spark Streaming, differing from other streaming APIs by providing a static‑relational query model (SQL/DataFrames) and supporting end‑to‑end real‑time applications that combine stream, batch, and interactive analytics.

1. Introduction

High‑volume data sources such as sensors, mobile app logs, and IoT devices generate continuous streams. While distributed stream technologies have advanced, production use still faces two major challenges: (1) complex low‑level execution concepts (e.g., at‑least‑once delivery, state storage, trigger semantics) and (2) the need to integrate streaming with batch and interactive workloads, which often requires substantial engineering effort.

Structured Streaming addresses these challenges by borrowing ideas from systems like Google Dataflow, leveraging a relational execution engine for performance, and providing a unified API that is both easy to use and tightly integrated with Apache Spark.

Incremental Query Model

Structured Streaming automatically incrementalizes queries over static datasets using Spark SQL and the DataFrame API, allowing users to write streaming queries with the same mental model as batch queries. The model supports both advanced users who need fine‑grained stateful operators and beginners who can rely on high‑level declarative syntax.

End‑to‑End Application Support

The API and built‑in connectors simplify writing "correct by default" code that interacts with external systems. Sources and sinks follow a simple transactional model and support exactly‑once semantics by default. The engine reuses Spark SQL’s optimizer and runtime code generator, achieving throughput up to twice that of Flink and ninety times that of Apache Kafka.

2. Streaming Challenges

Despite recent progress, distributed streaming remains difficult to develop and operate. The main challenges include complex low‑level APIs, integration into larger applications, operational concerns (failure recovery, code updates, scaling, straggler nodes, monitoring), and performance constraints such as latency and resource efficiency.

2.1 Complex and Low‑Level APIs

Many stream systems require users to manage physical execution details, making them harder to use than batch systems. For example, Google Dataflow’s API forces users to specify windowing, triggering, and physical plans explicitly.

2.2 Integration into End‑to‑End Applications

Streaming tasks often need to interact with batch workloads, update external databases, and support interactive queries, which adds considerable engineering overhead.

2.3 Operational Challenges

Key operational issues include handling failures, updating code without losing state, dynamic scaling, dealing with straggler nodes, and providing visibility into system metrics.

2.4 Performance Challenges

Continuous 24/7 operation can be costly; without dynamic scaling resources are wasted during idle periods. Structured Streaming mitigates this by leveraging Spark SQL’s optimizations and supporting both micro‑batch and continuous processing modes.

3. Structured Streaming Overview

Structured Streaming combines API design with execution engine integration. The core components include input sources, the query planner, state store, write‑ahead log (WAL), and output sinks.

Input and Output

Sources must be replayable (e.g., Kafka, Kinesis) and sinks must support idempotent writes to guarantee exactly‑once semantics.

API

Users write queries using Spark SQL’s batch API (SQL, DataFrames, Datasets). Triggers control execution frequency, event‑time columns can be marked with watermarks, and stateful operators enable custom logic.

Execution

Upon receiving a query, Structured Streaming incrementalizes it and executes it either in micro‑batch mode or in a low‑latency continuous mode.

4. Programming Model

Structured Streaming merges ideas from Google Dataflow, incremental query processing, and Spark Streaming. The following short example demonstrates how to convert a static batch job into a streaming job.

4.1 Short Example

Static batch job (reading JSON, grouping by country, and writing Parquet):

// define a DataFrame to read from static data
val data = spark.read.format("json").load("/in")

// transform it to compute a result
val counts = data.groupBy("country").count()

// write to a static data sink
counts.write.format("parquet").save("/counts")

Streaming version (only the source and sink lines change):

// define a DataFrame to read streaming data
val data = spark.readStream.format("json").load("/in")

// transform it to compute a result
val counts = data.groupBy("country").count()

// write to a streaming data sink
counts.writeStream.format("parquet").outputMode("complete").start("/counts")

The output mode determines how results are written (complete, append, or update).

4.2 Programming Model Semantics

Each source provides an ordered subset of records. Users supply a query that produces a result table; the engine guarantees prefix consistency across all inputs. Triggers dictate when incremental computation occurs, and output modes control how the result table is materialized.

4.3 Stream‑Specific Operators

Two new operators are added to Spark SQL: watermarks (to bound event‑time state) and stateful operators (mapGroupsWithState, flatMapGroupsWithState) that allow custom per‑key state management.

4.3.1 Event‑time Watermarks

Watermarks are set with withWatermark on a timestamp column, enabling the engine to drop old state once the watermark advances.

4.3.2 Stateful Operators

Example of mapGroupsWithState that tracks the number of events per user and times out after 30 minutes:

def updateFunc(key: UserId, newValues: Iterator[Event], state: GroupState[Int]): Int = {
  val totalEvents = state.get() + newValues.size
  state.update(totalEvents)
  state.setTimeoutDuration("30 min")
  totalEvents
}

val lens = events.groupByKey(event => event.userId).mapGroupsWithState(updateFunc)

5. Query Planning

Structured Streaming uses Spark’s Catalyst optimizer with three phases: analysis (validates queries and output modes), incrementalization (converts static queries into incremental form), and optimization (applies standard Spark SQL rules such as predicate push‑down).

5.1 Analysis

The engine checks query validity and ensures the chosen output mode is compatible (e.g., append mode requires monotonic queries).

5.2 Incrementalization

Supported incremental queries include selections, joins (with watermarks), a single aggregation, and stateful operators.

5.3 Query Optimization

All usual Spark SQL optimizations apply, including code generation and Tungsten binary format.

6. Application Execution

Execution consists of state management and two processing modes: micro‑batch and continuous.

6.1 State Management and Recovery

Two external stores are used: a write‑ahead log (WAL) for fault tolerance and a state store (e.g., HDFS or S3) for large‑scale state. Sources must be replayable and sinks must be idempotent.

6.2 Micro‑Batch Mode

Micro‑batch processing inherits Spark Streaming’s advantages: dynamic load balancing, fine‑grained fault recovery, straggler handling, elastic scaling, and high throughput.

6.3 Continuous Processing Mode

Introduced in Spark 2.3, continuous processing runs long‑lived tasks for lower latency (sub‑10 ms) while maintaining similar throughput to micro‑batch.

7. Production Use Cases

Structured Streaming powers hundreds of production pipelines at Databricks, processing over 1 PB of data per month across hundreds of machines. Example use cases include security analytics, real‑time alerting, and ETL pipelines that combine streaming and batch data.

7.1 Information‑Security Platform

A large customer built a security platform that ingests network‑traffic logs, stores them in Delta tables, joins with static reference data, and generates real‑time alerts. Structured Streaming reduced development time from six months (with traditional tools) to two weeks.

8. Performance Evaluation

Benchmarks on Yahoo’s streaming benchmark show Structured Streaming achieving up to 65 M records/sec, roughly twice the throughput of Apache Flink and far exceeding Kafka Streams. Scaling experiments demonstrate near‑linear throughput growth from 11.5 M/sec on a single node to 225 M/sec on 20 nodes.

8.1 Micro‑Batch vs. Continuous

Continuous mode offers lower latency (under 10 ms) with comparable throughput to micro‑batch, which incurs higher latency due to DAG scheduling.

Conclusion

Structured Streaming simplifies the development, operation, and integration of streaming applications by providing a high‑level, declarative API that leverages Spark SQL’s optimizer and execution engine. It delivers strong performance, scalability, and ease of use for a wide range of stateful streaming workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Streaming Spark Structured Streaming

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.