Big Data 8 min read

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

This article explains the design and implementation of Spark Structured Streaming's StateStore module, covering its distributed architecture, state sharding, versioning, batch read/write, migration, update/query APIs, maintenance compaction, and fault‑tolerance mechanisms that enable incremental continuous queries with exactly‑once guarantees.

Big Data Technology & Architecture

Jun 19, 2019

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

In Spark Structured Streaming, the continuous query engine StreamExecution drives batch execution, and for stateful operations like count() it requires cross‑batch state support, which is provided by the StateStore module.

The StateStore is a distributed key‑value store that runs on Spark's driver‑executor architecture: the driver acts as a lightweight coordinator, while executors handle the actual read/write of state shards.

State is partitioned by operatorId + partitionId, and each shard stores key‑value pairs where both key and value are UnsafeRow objects, allowing efficient look‑ups and updates.

Because batches are continuously processed, each operator‑partition pair maintains a versioned state; when batch n finishes, all state with version n is persisted.

Batch writes accumulate all modifications of the current version into a log file on HDFS and then commit it atomically, while reads load the appropriate versioned snapshot (or the latest snapshot plus deltas) into memory.

State migration : when a task runs on an executor, it loads the required shard (identified by operator and partition) from the local HDFS replica if available; otherwise it fetches the shard from another node’s replica. Multiple shards can coexist on the same executor, and the same shard may be loaded on different executors simultaneously, with only one executor ultimately committing the changes.

Update and query API (excerpt of the Scala interface):

def get(key: UnsafeRow): Option[UnsafeRow]

def put(key: UnsafeRow, value: UnsafeRow): Unit

def remove(condition: UnsafeRow => Boolean): Unit

def remove(key: UnsafeRow): Unit

def commit(): Long

def abort(): Unit

def iterator(): Iterator[(UnsafeRow, UnsafeRow)]

def updates(): Iterator[StoreUpdate]

Typical usage inside Structured Streaming (simplified):

val store = StateStore.get(StateStoreId(checkpointLocation, operatorId, partitionId), ..., version, ...)
store.put(...)
store.remove(...)
store.commit()
store.iterator()
store.updates()

Maintenance : a background maintainer periodically compacts old snapshots and delta logs (similar to HBase major/minor compaction) by merging old_snapshot + delta_a + delta_b + … → latest_snapshot, reducing the number of log files.

Fault recovery : the authoritative source of truth is HDFS. If a shard fails during an update, uncommitted changes are lost. On recovery, the executor reloads the latest visible snapshot (or the most recent snapshot plus pending deltas) and re‑executes the affected batch, ensuring end‑to‑end exactly‑once semantics.

In summary, the StateStore provides a sharded, versioned, migratable, highly available key‑value store that enables incremental continuous queries and robust fault tolerance in Spark Structured Streaming.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Streaming fault tolerance Spark Structured Streaming StateStore

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.