Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation
This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.
Abstract
With the growing prevalence of real‑time data, enterprises require streaming systems that are scalable, easy to use, and easy to integrate with existing business applications. Structured Streaming offers a high‑level, declarative API built on Spark SQL and informed by experience with Spark Streaming. It differs from other streaming APIs in two ways: queries are written as static relational queries (SQL/DataFrames) that the engine incrementalizes automatically, and it targets end‑to‑end real‑time applications that combine stream, batch, and interactive analytics.
1. Introduction
High‑volume data sources such as sensors, mobile app logs, and IoT devices generate continuous streams. While distributed stream technologies have advanced, production use still faces two major challenges: (1) complex low‑level execution concepts (e.g., at‑least‑once delivery, state storage, trigger semantics) and (2) the need to integrate streaming with batch and interactive workloads, which often requires substantial engineering effort.
Structured Streaming addresses these challenges by borrowing ideas from systems like Google Dataflow, leveraging a relational execution engine for performance, and providing a unified API that is both easy to use and tightly integrated with Apache Spark.
Incremental Query Model
Structured Streaming automatically incrementalizes queries over static datasets using Spark SQL and the DataFrame API, allowing users to write streaming queries with the same mental model as batch queries. The model supports both advanced users who need fine‑grained stateful operators and beginners who can rely on high‑level declarative syntax.
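To make the idea of incrementalization concrete, here is a minimal sketch in plain Scala (no Spark involved). It models a grouped count two ways: recomputed from scratch over all data seen so far, and folded incrementally over only the new records. The object and function names are hypothetical and chosen for illustration; this is not how Spark implements it internally.

```scala
object IncrementalCount {
  // Batch semantics: recompute the grouped count over everything seen so far.
  def batchCount(records: Seq[String]): Map[String, Long] =
    records.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  // Incremental semantics: fold only the newly arrived records into the
  // previous result, which plays the role of operator state.
  def incrementalCount(state: Map[String, Long], newRecords: Seq[String]): Map[String, Long] =
    newRecords.foldLeft(state) { (acc, k) => acc.updated(k, acc.getOrElse(k, 0L) + 1L) }
}
```

Feeding the records in two chunks through incrementalCount gives the same table as calling batchCount once on the concatenated input, which is the equivalence the engine is expected to preserve.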
End‑to‑End Application Support
The API and built‑in connectors simplify writing "correct by default" code that interacts with external systems. Sources and sinks follow a simple transactional model and support exactly‑once semantics by default. The engine reuses Spark SQL’s optimizer and runtime code generator, achieving throughput up to twice that of Apache Flink and ninety times that of Kafka Streams.
2. Streaming Challenges
Despite recent progress, distributed streaming remains difficult to develop and operate. The main challenges include complex low‑level APIs, integration into larger applications, operational concerns (failure recovery, code updates, scaling, straggler nodes, monitoring), and performance constraints such as latency and resource efficiency.
2.1 Complex and Low‑Level APIs
Many streaming systems require users to manage physical execution details, making them harder to use than batch systems. For example, the Google Dataflow model has a powerful API but requires users to specify windowing, triggering, and trigger‑refinement modes for each aggregation operator.
2.2 Integration into End‑to‑End Applications
Streaming tasks often need to interact with batch workloads, update external databases, and support interactive queries, which adds considerable engineering overhead.
2.3 Operational Challenges
Key operational issues include handling failures, updating code without losing state, dynamic scaling, dealing with straggler nodes, and providing visibility into system metrics.
2.4 Performance Challenges
Continuous 24/7 operation can be costly: without dynamic scaling, resources are wasted during idle periods. Structured Streaming mitigates this by leveraging Spark SQL’s optimizations and supporting both micro‑batch and continuous processing modes.
3. Structured Streaming Overview
Structured Streaming combines API design with execution engine integration. The core components include input sources, the query planner, state store, write‑ahead log (WAL), and output sinks.
Input and Output
Sources must be replayable (e.g., Kafka, Kinesis) and sinks must support idempotent writes to guarantee exactly‑once semantics.
API
Users write queries using Spark SQL’s batch API (SQL, DataFrames, Datasets). Triggers control execution frequency, event‑time columns can be marked with watermarks, and stateful operators enable custom logic.
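As a hedged illustration of these API pieces together, the sketch below builds a windowed count with a watermark and a processing-time trigger. It assumes Spark's built-in "rate" test source (which emits `timestamp` and `value` columns) and the console sink; the object name, window size, watermark delay, and trigger interval are arbitrary choices made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object ApiSketch {
  def start(spark: SparkSession): StreamingQuery = {
    // Built-in "rate" test source: emits rows with `timestamp` and `value` columns.
    val events: DataFrame = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Mark `timestamp` as the event-time column with a 10-minute watermark,
    // then count events per 5-minute window using the ordinary DataFrame API.
    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    // The trigger controls how often the result is updated.
    counts.writeStream
      .format("console")
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
  }
}
```

Calling ApiSketch.start(spark) from a spark-shell session should start the query; the returned StreamingQuery can be stopped with stop().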
Execution
Upon receiving a query, Structured Streaming incrementalizes it and executes it either in micro‑batch mode or in a low‑latency continuous mode.
4. Programming Model
Structured Streaming merges ideas from Google Dataflow, incremental query processing, and Spark Streaming. The following short example demonstrates how to convert a static batch job into a streaming job.
4.1 Short Example
Static batch job (reading JSON, grouping by country, and writing Parquet):
// define a DataFrame to read from static data
val data = spark.read.format("json").load("/in")
// transform it to compute a result
val counts = data.groupBy("country").count()
// write to a static data sink
counts.write.format("parquet").save("/counts")
Streaming version (only the source and sink lines change):
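// define a DataFrame to read streaming data
// (streaming file sources need an explicit schema unless schema inference is enabled; omitted here)
val data = spark.readStream.format("json").load("/in")
// transform it to compute a result
val counts = data.groupBy("country").count()
// write to a streaming data sink
// (kept as given; current Spark file sinks accept only append mode and need a checkpointLocation)
counts.writeStream.format("parquet").outputMode("complete").start("/counts")
The output mode determines how results are written (complete, append, or update).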
4.2 Programming Model Semantics
Each source provides a partially ordered set of records over time. Users supply a query that produces a result table; the engine guarantees prefix consistency, meaning the result table always corresponds to the query applied to some prefix of each input. Triggers dictate when incremental computation occurs, and output modes control how the result table is materialized.
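The difference between output modes can be sketched in plain Scala by comparing the result table before and after a trigger. This is a simplified model with hypothetical names, not Spark code. Append mode is left out because it additionally needs to know that a row is final (for example, through a watermark).

```scala
object OutputModes {
  type Table = Map[String, Long]

  // Complete mode: the whole result table is written after every trigger.
  def complete(previous: Table, current: Table): Table = current

  // Update mode: only rows that are new or whose value changed since the
  // previous trigger are written.
  def update(previous: Table, current: Table): Table =
    current.filter { case (k, v) => !previous.get(k).contains(v) }
}
```

For a table that goes from (US: 2, CA: 1) to (US: 3, CA: 1, MX: 1), complete mode writes all three rows while update mode writes only the US and MX rows.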
4.3 Stream‑Specific Operators
Two new operators are added to Spark SQL: watermarks (to bound event‑time state) and stateful operators (mapGroupsWithState, flatMapGroupsWithState) that allow custom per‑key state management.
4.3.1 Event‑time Watermarks
Watermarks are set with withWatermark on a timestamp column, enabling the engine to drop old state once the watermark advances.
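How a watermark bounds state can be modeled in a few lines of plain Scala. The sketch below assumes tumbling windows, times in seconds, and a watermark equal to the largest event time seen minus a fixed delay; all names and constants are hypothetical. It also applies the new watermark immediately after each step, which is a simplification of what a real engine does.

```scala
object WatermarkSketch {
  // Per-window counts plus the largest event time seen so far (in seconds).
  final case class State(counts: Map[Long, Long], maxEventTime: Long)

  val windowSize = 60L  // assumed tumbling-window length, in seconds
  val delay = 120L      // assumed allowed lateness, in seconds

  def windowStart(t: Long): Long = t - (t % windowSize)
  def watermark(s: State): Long = s.maxEventTime - delay

  // Process one trigger's worth of event times. Returns the new state and
  // the windows that were finalized (and therefore evicted from state).
  def step(s: State, eventTimes: Seq[Long]): (State, Map[Long, Long]) = {
    val wm = watermark(s)
    // Records whose window already ended at or before the watermark are too late.
    val onTime = eventTimes.filter(t => windowStart(t) + windowSize > wm)
    val counts = onTime.foldLeft(s.counts) { (acc, t) =>
      val w = windowStart(t)
      acc.updated(w, acc.getOrElse(w, 0L) + 1L)
    }
    val newMax = (s.maxEventTime +: eventTimes).max
    // Windows ending at or before the new watermark can no longer change.
    val (open, closed) = counts.partition { case (w, _) => w + windowSize > newMax - delay }
    (State(open, newMax), closed)
  }
}
```

Starting from State(Map.empty, 0L), a late record is still counted while its window is open, but once the watermark has passed the window's end the window leaves the state and later records for it are ignored.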
4.3.2 Stateful Operators
Example of mapGroupsWithState that tracks the number of events per user and times out after 30 minutes:
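import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
// called once per key in each trigger with that key's new events
def updateFunc(key: UserId, newValues: Iterator[Event], state: GroupState[Int]): Int = {
  // state.get throws when no state exists yet, so default to 0
  val totalEvents = state.getOption.getOrElse(0) + newValues.size
  state.update(totalEvents)
  state.setTimeoutDuration("30 minutes")
  totalEvents
}
// a timeout mode must be chosen for setTimeoutDuration to be allowed
val lens = events.groupByKey(event => event.userId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateFunc _)
5. Query Planning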
Structured Streaming uses Spark’s Catalyst optimizer with three phases: analysis (validates queries and output modes), incrementalization (converts static queries into incremental form), and optimization (applies standard Spark SQL rules such as predicate push‑down).
5.1 Analysis
The engine checks query validity and ensures the chosen output mode is compatible (e.g., append mode requires monotonic queries).
5.2 Incrementalization
Supported incremental queries include selections, joins (with watermarks), a single aggregation, and stateful operators.
5.3 Query Optimization
All usual Spark SQL optimizations apply, including code generation and Tungsten binary format.
6. Application Execution
Execution consists of state management and two processing modes: micro‑batch and continuous.
6.1 State Management and Recovery
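Two forms of durable storage are used: a write‑ahead log (WAL) that records which input has been processed and reliably written to the sink, and a state store that holds snapshots of operator state for long‑running stateful operators. Both live on reliable storage such as HDFS or S3. Recovery also relies on sources being replayable and sinks being idempotent.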
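The interplay of a replayable source, a write-ahead log, and an idempotent sink can be sketched in plain Scala. Everything here is a hypothetical in-memory model (class names, the doubling transformation, and integer offsets are assumptions for illustration), not Spark's actual recovery code.

```scala
import scala.collection.mutable

object ExactlyOnceSketch {
  // Replayable source: any offset range can be read again after a failure.
  final class ReplayableSource(initial: Vector[Int]) {
    private var records = initial
    def append(r: Int): Unit = { records = records :+ r }
    def read(from: Int, until: Int): Vector[Int] = records.slice(from, until)
    def latestOffset: Int = records.length
  }

  // Idempotent sink: output is keyed by epoch, so writing an epoch twice
  // overwrites the earlier copy instead of duplicating it.
  final class IdempotentSink {
    private val committed = mutable.Map.empty[Long, Vector[Int]]
    def write(epoch: Long, rows: Vector[Int]): Unit = committed.update(epoch, rows)
    def contents: Vector[Int] = committed.toVector.sortBy(_._1).flatMap(_._2)
  }

  // Write-ahead log: the offset range of an epoch is recorded before processing.
  final class WriteAheadLog {
    private val entries = mutable.Map.empty[Long, (Int, Int)]
    def record(epoch: Long, range: (Int, Int)): Unit = entries.update(epoch, range)
    def lookup(epoch: Long): Option[(Int, Int)] = entries.get(epoch)
  }

  // Run one epoch. If the log already holds a range for it (a recovery),
  // that exact range is replayed; otherwise a new range is logged first.
  def runEpoch(epoch: Long, from: Int, src: ReplayableSource,
               wal: WriteAheadLog, sink: IdempotentSink): Int = {
    val (start, end) = wal.lookup(epoch).getOrElse {
      val range = (from, src.latestOffset)
      wal.record(epoch, range)
      range
    }
    sink.write(epoch, src.read(start, end).map(_ * 2))
    end
  }
}
```

Re-running an epoch after a simulated crash replays the logged offset range, even if new data has arrived in the meantime, and the sink ends up with each output row exactly once.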
6.2 Micro‑Batch Mode
Micro‑batch processing inherits Spark Streaming’s advantages: dynamic load balancing, fine‑grained fault recovery, straggler handling, elastic scaling, and high throughput.
6.3 Continuous Processing Mode
Introduced in Spark 2.3, continuous processing runs long‑lived tasks for lower latency (sub‑10 ms) while maintaining similar throughput to micro‑batch.
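In the public API, the choice between the two modes comes down to the trigger. The sketch below runs the same map-only query both ways, assuming the built-in "rate" test source and the console sink; the object and method names are hypothetical. Spark's documentation describes continuous mode as experimental and limited to map-like operations, so an aggregation would not be accepted here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object ExecutionModes {
  // A map-only query over the built-in "rate" test source.
  private def doubled(spark: SparkSession): DataFrame =
    spark.readStream.format("rate").load().selectExpr("value * 2 AS doubled")

  // Micro-batch mode: a new batch is planned every second.
  def microBatch(spark: SparkSession): StreamingQuery =
    doubled(spark).writeStream.format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()

  // Continuous mode: long-lived tasks; the interval given here is the
  // checkpoint interval, not a batch interval.
  def continuous(spark: SparkSession): StreamingQuery =
    doubled(spark).writeStream.format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()
}
```

Only the trigger line differs between the two methods; the query definition is shared.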
7. Production Use Cases
Structured Streaming powers hundreds of production pipelines on Databricks, the largest of which process over 1 PB of data per month on hundreds of machines. Example use cases include security analytics, real‑time alerting, and ETL pipelines that combine streaming and batch data.
7.1 Information‑Security Platform
A large customer built a security platform that ingests network‑traffic logs, stores them in Delta tables, joins with static reference data, and generates real‑time alerts. Structured Streaming reduced development time from six months (with traditional tools) to two weeks.
8. Performance Evaluation
On the Yahoo! Streaming Benchmark, Structured Streaming reaches up to 65 million records/s, roughly twice the throughput of Apache Flink and about ninety times that of Kafka Streams. Scaling experiments show near‑linear throughput growth, from 11.5 million records/s on a single node to 225 million records/s on 20 nodes.
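The ratios quoted in this article can be checked with a little arithmetic. The sketch below derives the throughput the "twice" and "ninety times" ratios imply for the other systems, plus the scaling efficiency of the 20-node run. The derived values are only what the stated ratios imply, not measured figures, and the names are hypothetical.

```scala
object BenchmarkArithmetic {
  val structuredStreaming = 65.0e6                    // records/s, as quoted
  val impliedFlink = structuredStreaming / 2          // "roughly twice" Flink
  val impliedKafkaStreams = structuredStreaming / 90  // "ninety times" the Kafka figure

  val singleNode = 11.5e6   // records/s on 1 node, as quoted
  val twentyNodes = 225.0e6 // records/s on 20 nodes, as quoted

  // Fraction of ideal linear scaling achieved: 225 / (20 * 11.5)
  val scalingEfficiency = twentyNodes / (20 * singleNode)
}
```

The efficiency works out to about 0.98, which is consistent with the "near-linear" description.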
8.1 Micro‑Batch vs. Continuous
Continuous mode offers lower latency (under 10 ms) with comparable throughput to micro‑batch, which incurs higher latency due to DAG scheduling.
Conclusion
Structured Streaming simplifies the development, operation, and integration of streaming applications by providing a high‑level, declarative API that leverages Spark SQL’s optimizer and execution engine. It delivers strong performance, scalability, and ease of use for a wide range of stateful streaming workloads.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
