Big Data 13 min read

Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure

This article compares Flink and Spark Structured Streaming, detailing their differences in join capabilities, state management, fault‑tolerance mechanisms, exactly‑once semantics, back‑pressure handling, and table registration, while providing code examples and practical insights for real‑time big‑data processing.

Big Data Technology & Architecture

Nov 14, 2019

Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure

Link is a standard real‑time processing engine; Spark Streaming and Structured Streaming are based on micro‑batch processing, with Spark Streaming now stable and development focus shifted to Spark SQL and Structured Streaming.

Join Operations

Structured Streaming does not directly support dimension‑table joins; it can achieve similar functionality using map, flatMap or UDFs, which are synchronous operators and do not support asynchronous I/O. It can join with static, immutable datasets.

Flink supports dimension‑table joins and provides asynchronous I/O operators to improve performance.

State Management

State maintenance is a core concept of stream processing. Flink offers a rich set of state primitives such as ValueState, ListState, ReducingState, FoldingState, and MapState, while Spark Streaming provides only a few built‑in state operators (e.g., updateStateByKey) and often requires users to manage state manually.

Structured Streaming supplies advanced operators like mapGroupsWithState and flatMapGroupsWithState for custom state handling.

Join Types in Flink

Flink supports various join types, including inner equi‑join, outer joins, time‑windowed joins, array expansion, table‑function joins, and temporal table joins. Example SQL snippets:

SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id

SELECT * FROM Orders LEFT JOIN Product ON Orders.productId = Product.id

SELECT * FROM Orders RIGHT JOIN Product ON Orders.productId = Product.id

SELECT * FROM Orders FULL OUTER JOIN Product ON Orders.productId = Product.id

SELECT * FROM Orders o, Shipment s WHERE o.id = s.orderId AND o.orderTime BETWEEN s.shipTime - INTERVAL '4' HOUR AND s.shipTime

SELECT users, tag FROM Orders CROSS JOIN UNNEST(tags) AS t(tag)

SELECT users, tag FROM Orders, LATERAL TABLE(unnest_udtf(tags)) t AS tag

Fault Tolerance and Exactly‑Once Semantics

Spark Streaming achieves at‑least‑once processing via checkpoints; exactly‑once requires the result write and offset commit to be performed within a single transaction. One approach is to repartition to a single partition and commit within a transaction:

Dstream.foreachRDD(rdd => {
   rdd.repartition(1).foreachPartition(partition => {
       // commit data
   })
})

Flink guarantees exactly‑once by using a two‑phase commit protocol with a pre‑commit phase. When checkpointing starts, a barrier is injected, state snapshots are taken, and external sinks (e.g., Kafka) participate in the transaction. Upon successful checkpoint, Flink commits both internal state and external transactions.

Backpressure

Spark Streaming integrates with Kafka using a PID‑based RateController that adjusts the consumption rate based on processing delay, scheduling delay, and batch interval.

Flink measures backpressure by sampling stack traces of each task every 50 ms, calculating a blockage ratio classified as OK (0‑0.10), LOW (0.10‑0.5), or HIGH (0.5‑1).

Table Management

Both Flink and Structured Streaming can register streams as tables and run SQL queries.

df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates")

In Flink, SQL can be executed via sqlQuery or sqlUpdate:

tableEnv.sqlQuery("SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")

tableEnv.sqlUpdate("INSERT INTO RubberOrders SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")

Table sinks can be registered and used to write results, for example:

TableSink csvSink = newCsvTableSink("/path/to/file", ...);
String[] fieldNames = {"product", "amount"};
TypeInformation[] fieldTypes = {Types.STRING, Types.INT};
tableEnv.registerTableSink("RubberOrders", fieldNames, fieldTypes, csvSink);

The article concludes with practical tips and references for further reading.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink State Management JOIN backpressure Spark Structured Streaming

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.