Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure
This article compares Flink and Spark Structured Streaming, detailing their differences in join capabilities, state management, fault‑tolerance mechanisms, exactly‑once semantics, back‑pressure handling, and table registration, while providing code examples and practical insights for real‑time big‑data processing.
Flink is a native real‑time stream‑processing engine, whereas Spark Streaming and Structured Streaming are based on micro‑batch processing. Spark Streaming is now stable, and development focus has shifted to Spark SQL and Structured Streaming.
Join Operations
Structured Streaming does not directly support dimension‑table joins against an external store. Similar functionality can be built with map, flatMap, or UDFs, but these are synchronous operators and do not support asynchronous I/O. It can, however, join a stream with a static, immutable dataset.
Flink supports dimension‑table joins and provides asynchronous I/O operators to improve performance.
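To illustrate why asynchronous I/O helps dimension lookups, here is a minimal sketch in plain Java. It does not use Flink's API; the class name, the in-memory PRODUCT_DIM map, and the lookupProductName helper are hypothetical stand-ins for an external dimension store and its non-blocking client. The point is only that all lookups are issued before any result is awaited, so their latencies overlap instead of adding up.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical sketch: enrich order events with a dimension lookup without blocking per record.
class AsyncDimensionJoin {
    // Stand-in for an external dimension store (database, cache); the data is made up.
    static final Map<Integer, String> PRODUCT_DIM = Map.of(1, "Rubber duck", 2, "Keyboard");

    // Simulates a non-blocking client call that returns a future instead of the value.
    static CompletableFuture<String> lookupProductName(int productId) {
        return CompletableFuture.supplyAsync(
                () -> PRODUCT_DIM.getOrDefault(productId, "UNKNOWN"));
    }

    // Issues every lookup first, then waits, so requests overlap instead of running serially.
    // Results are returned in input order because the futures are joined in list order.
    static List<String> enrich(List<Integer> orderProductIds) {
        List<CompletableFuture<String>> pending = orderProductIds.stream()
                .map(id -> lookupProductName(id).thenApply(name -> id + ":" + name))
                .collect(Collectors.toList());
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

A synchronous map or UDF would instead block on each lookup before moving to the next record, which is the limitation described above for Structured Streaming.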
State Management
State maintenance is a core concept of stream processing. Flink offers a rich set of state primitives such as ValueState, ListState, ReducingState, FoldingState (deprecated in later releases in favor of AggregatingState), and MapState. Spark Streaming, by contrast, provides only a few built‑in state operators (e.g., updateStateByKey and mapWithState) and often requires users to manage state manually.
Structured Streaming supplies advanced operators like mapGroupsWithState and flatMapGroupsWithState for custom state handling.
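The idea shared by Flink's keyed state and Structured Streaming's mapGroupsWithState is one piece of state per key, read and updated as each event arrives. The sketch below shows that pattern in plain Java, assuming a simple running sum. The class and method names are hypothetical, and a real engine would keep this map in a checkpointed state backend rather than on the heap of a single object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of keyed state: one value per key, updated as each event arrives.
class KeyedRunningSum {
    // Plays the role of the state backend; a real engine would checkpoint this.
    private final Map<String, Long> stateByKey = new HashMap<>();

    // Reads the key's current state, applies the update, writes it back,
    // and returns the new total that would be emitted downstream.
    long process(String key, long amount) {
        long updated = stateByKey.getOrDefault(key, 0L) + amount;
        stateByKey.put(key, updated);
        return updated;
    }

    // Drops the state for a key, as an engine might on timeout or TTL expiry.
    void clear(String key) {
        stateByKey.remove(key);
    }
}
```

Each key evolves independently, which is what allows an engine to partition state across parallel tasks by key.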
Join Types in Flink
Flink supports various join types, including inner equi‑join, outer joins, time‑windowed joins, array expansion, table‑function joins, and temporal table joins. Example SQL snippets:
SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders LEFT JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders RIGHT JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders FULL OUTER JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders o, Shipment s WHERE o.id = s.orderId AND o.orderTime BETWEEN s.shipTime - INTERVAL '4' HOUR AND s.shipTime
SELECT users, tag FROM Orders CROSS JOIN UNNEST(tags) AS t(tag)
SELECT users, tag FROM Orders, LATERAL TABLE(unnest_udtf(tags)) t AS tag
Fault Tolerance and Exactly‑Once Semantics
Spark Streaming achieves at‑least‑once processing via checkpoints; exactly‑once requires the result write and offset commit to be performed within a single transaction. One approach is to repartition to a single partition and commit within a transaction:
dstream.foreachRDD(rdd => {
  rdd.repartition(1).foreachPartition(partition => {
    // Open a transaction, write this partition's results together with
    // the consumed offsets, then commit both atomically.
  })
})
Flink guarantees exactly‑once by using a two‑phase commit protocol with a pre‑commit phase. When checkpointing starts, a barrier is injected into the stream, operators snapshot their state as the barrier passes, and external sinks (e.g., Kafka) pre‑commit their open transactions. Once the checkpoint completes successfully, Flink commits both the internal state and the external transactions.
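The two‑phase commit flow can be sketched as a small state machine. This is a simplified, hypothetical model in plain Java, not Flink's sink API: records are buffered in an open transaction, staged when the checkpoint barrier arrives (pre‑commit), and become visible only after the checkpoint is confirmed (commit). All class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a two-phase-commit sink: writes are staged per checkpoint
// and only become visible once the checkpoint is confirmed.
class TwoPhaseCommitSink {
    private List<String> currentTxn = new ArrayList<>();       // open transaction
    private List<String> preCommitted = null;                  // staged, awaiting confirmation
    private final List<String> committed = new ArrayList<>();  // visible to downstream readers

    void write(String record) {
        currentTxn.add(record);
    }

    // Phase 1: on the checkpoint barrier, stage the open transaction and start a new one.
    void preCommit() {
        preCommitted = currentTxn;
        currentTxn = new ArrayList<>();
    }

    // Phase 2: the checkpoint succeeded, so publish the staged records.
    void commit() {
        if (preCommitted != null) {
            committed.addAll(preCommitted);
            preCommitted = null;
        }
    }

    // The checkpoint failed: discard staged and in-flight records.
    // In this model the source is expected to replay them after recovery.
    void abort() {
        preCommitted = null;
        currentTxn = new ArrayList<>();
    }

    // Returns a copy of what downstream readers can currently see.
    List<String> visible() {
        return new ArrayList<>(committed);
    }
}
```

Because nothing reaches the visible list before commit, a failure between the two phases leaves no partial output behind, which is the property that makes the result exactly‑once from the reader's point of view.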
Backpressure
Spark Streaming integrates with Kafka using a PID‑based RateController (enabled with spark.streaming.backpressure.enabled) that adjusts the consumption rate after each batch based on processing delay, scheduling delay, and batch interval.
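As a rough illustration of how a PID controller can turn those three inputs into a new ingestion rate, consider the sketch below. It is a simplified model, not Spark's implementation: the class name, the gain parameters (kp, ki, kd), and the minimum rate are assumptions. The proportional term reacts to the gap between the allowed rate and the observed processing rate, the integral term reacts to backlog implied by scheduling delay, and the derivative term reacts to how fast the gap is changing.

```java
// Simplified sketch loosely modeled on a PID controller for ingestion rate.
// Gains, floor, and all names here are illustrative assumptions.
class PidRateEstimator {
    private final long batchIntervalMs;
    private final double kp, ki, kd, minRate;
    private double latestRate;       // records/second currently allowed
    private double latestError = 0;  // error observed on the previous update

    PidRateEstimator(long batchIntervalMs, double initialRate,
                     double kp, double ki, double kd, double minRate) {
        this.batchIntervalMs = batchIntervalMs;
        this.latestRate = initialRate;
        this.kp = kp;
        this.ki = ki;
        this.kd = kd;
        this.minRate = minRate;
    }

    // Returns the new allowed rate (records/second) after one completed batch.
    double update(long numRecords, long processingDelayMs,
                  long schedulingDelayMs, double secondsSinceLastUpdate) {
        if (numRecords <= 0 || processingDelayMs <= 0 || secondsSinceLastUpdate <= 0) {
            return latestRate; // not enough information to adjust
        }
        // Rate at which the last batch was actually processed.
        double processingRate = numRecords * 1000.0 / processingDelayMs;
        // Proportional: how far the allowed rate is above what was sustained.
        double error = latestRate - processingRate;
        // Integral: backlog implied by the time the batch waited to be scheduled.
        double backlogError = schedulingDelayMs * processingRate / batchIntervalMs;
        // Derivative: how quickly the error is changing.
        double dError = (error - latestError) / secondsSinceLastUpdate;

        double newRate = Math.max(
                latestRate - kp * error - ki * backlogError - kd * dError, minRate);
        latestRate = newRate;
        latestError = error;
        return newRate;
    }
}
```

For example, with a 1000 ms batch interval, an initial rate of 1000 records/second, and gains of 1.0, 0.2, and 0.0, a batch of 500 records that takes 1000 ms to process lowers the allowed rate to 500 records/second.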
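Flink measures backpressure by sampling the stack traces of each task every 50 ms and calculating the ratio of samples in which the task was blocked. The ratio is classified as OK (0‑0.10), LOW (0.10‑0.5), or HIGH (0.5‑1). This describes Flink releases before 1.13; newer releases derive backpressure from task metrics instead.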
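The classification step is simple enough to express directly. The sketch below is a hypothetical helper, not Flink code: it computes the blocked ratio from sample counts and maps it onto the three levels. The text does not say which side of each threshold is inclusive, so treating 0.10 as OK and 0.5 as LOW is an assumption.

```java
// Hypothetical helper that maps a blocked-sample ratio onto the three backpressure levels.
class BackpressureStatus {
    // Fraction of stack-trace samples in which the task was blocked.
    static double ratio(int blockedSamples, int totalSamples) {
        if (totalSamples <= 0 || blockedSamples < 0 || blockedSamples > totalSamples) {
            throw new IllegalArgumentException("invalid sample counts");
        }
        return (double) blockedSamples / totalSamples;
    }

    // Threshold inclusivity is an assumption: 0.10 counts as OK, 0.5 counts as LOW.
    static String classify(double ratio) {
        if (ratio <= 0.10) {
            return "OK";
        }
        if (ratio <= 0.5) {
            return "LOW";
        }
        return "HIGH";
    }
}
```

With 100 samples, 5 blocked samples classify as OK, 30 as LOW, and 80 as HIGH.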
Table Management
Both Flink and Structured Streaming can register streams as tables and run SQL queries.
df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates")
In Flink, SQL can be executed via sqlQuery or sqlUpdate:
tableEnv.sqlQuery("SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")
tableEnv.sqlUpdate("INSERT INTO RubberOrders SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")
Table sinks can be registered and used to write results, for example:
TableSink csvSink = new CsvTableSink("/path/to/file", ...);
String[] fieldNames = {"product", "amount"};
TypeInformation[] fieldTypes = {Types.STRING, Types.INT};
tableEnv.registerTableSink("RubberOrders", fieldNames, fieldTypes, csvSink);
The article concludes with practical tips and references for further reading.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
