Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)
This article explores the two toughest problems in distributed stream processing—exactly‑once message handling and ordering—by dissecting the underlying impossibility of perfect failure detectors, the liveness‑vs‑safety trade‑off, zombie processes, and the practical solutions employed by systems such as Flink, Kafka Streams, MillWheel, and Spark.
Preface
The author’s goal is to dig into the fundamental reasons why distributed stream systems must grapple with exactly‑once processing and ordering, and to compare how state‑of‑the‑art frameworks address these challenges.
Key Terminology
End‑to‑End Consistency (E2E Consistency) : In stream middleware this means exactly‑once message processing—each logical message should affect the system exactly once.
EOMP (Effective Once Message Processing) : A realistic variant of exactly‑once; the system may process a message multiple times under failure, but the final outcome must be indistinguishable from a single processing.
Failure Tolerance : Assumes crash failures; a node may disappear completely, and the system must continue correctly.
Idempotent : Repeating the same operation any number of times yields the same effect as a single execution.
Deterministic : Given identical inputs, a computation always produces the same output.
The Impossibility of Perfect Failure Detection
In asynchronous environments it is impossible to distinguish a crashed node from a slow or overloaded one. This leads to the classic FLP impossibility result: consensus cannot be guaranteed with 100% safety and liveness simultaneously.
Liveness vs. Safety Trade‑off
When a node stops responding, a system must either wait (preserving safety) or assume failure and failover (preserving liveness). Both choices have concrete costs, and the balance determines whether a system favors low latency or strong correctness guarantees.
Zombie Processes and Fencing
Even with consensus protocols, a “zombie” node—one that is still alive but mistakenly considered dead—can cause duplicate writes or state corruption. Zombie fencing typically relies on a monotonically increasing epoch number allocated via consensus; newer generations reject writes from older epochs.
Zombie Fencing Techniques
Assign a globally unique, ever‑increasing epoch to each logical process instance.
Reject any request bearing an older epoch.
Use test‑and‑set or compare‑and‑swap on a dedicated “epoch” column in a KV store to enforce fencing.
Three‑Node EOMP Example
Consider a pipeline of upstream → processor → downstream. To guarantee exactly‑once processing, the processor must ensure four conditions: high‑availability of results, downstream deduplication, upstream replayability, and progress tracking.
Store results in a highly available datastore (e.g., DynamoDB, Spanner).
Downstream nodes keep a set of processed message IDs for deduplication.
Upstream must be able to replay the original event if needed.
Record processing progress atomically with the result.
If any step fails, the system may need to recompute, but deterministic computation guarantees identical outcomes.
Stateful Computation and Adding Node State
When a node maintains state (e.g., a counter), the state itself must be checkpointed atomically with the result. Upon failure, the node rolls back its state before reprocessing the event to avoid double counting.
Flow System EOMP Strategies
Two broad approaches exist:
Record‑Every‑Result : Store each intermediate result (used by Google MillWheel/Dataflow and Kafka Streams). This enables exact‑once guarantees even for non‑deterministic jobs.
Recompute‑When‑Needed : Rely on deterministic computation and replay upstream events, avoiding the storage overhead of every intermediate result (used by Spark Structured Streaming).
Google MillWheel (Dataflow)
Each node keeps a set of processed message IDs for deduplication, writes local state to memory, and commits the result, the state, and the input ID atomically to BigTable.
Kafka Streams
Built on Kafka’s transactional messaging. A producer starts a transaction, writes output records, updates the consumer offset, and commits atomically. The transaction guarantees that either all writes and the offset advance are visible or none are.
producer.initTransactions();
producer.beginTransaction();
// send records …
producer.commitTransaction();Spark Streaming
Offers three APIs: the legacy D‑Streaming (being deprecated), Structured Streaming (micro‑batch with checkpointed lineage), and Continuous Processing (low‑latency but only at‑least‑once). Structured Streaming stores lineage information and checkpoints state to enable deterministic replay.
Checkpointing and State Management
Flink uses RocksDB for local state. During a checkpoint, only the current memtable is flushed to disk (blocking the computation briefly). The newly created SST files are then uploaded asynchronously to durable storage (e.g., S3, HDFS), achieving asynchronous incremental checkpointing.
Sinks: Idempotent vs. Two‑Phase Commit (2PC)
External sinks must cooperate with the stream’s exactly‑once guarantees.
Idempotent Sink : Guarantees that repeated writes have the same effect as a single write. Works well for deterministic pipelines but cannot handle non‑deterministic recomputation.
2PC Sink : Flink’s TwoPhaseCommitSinkFunction integrates external transactional resources (databases, message queues) into the global checkpoint. The sink pre‑commits during the checkpoint barrier, then the coordinator decides to commit or abort globally.
Key steps for a 2PC sink:
Start a transaction on the external system.
When the checkpoint barrier arrives, write data and pre‑commit, then acknowledge the barrier.
After all nodes have acknowledged, the coordinator records the global checkpoint (commit decision).
On commit, the sink finalizes the external transaction; on abort, it rolls back.
Because the external transaction is hidden from downstream consumers until the global commit, exactly‑once semantics are preserved even if the sink crashes after pre‑commit.
Latency, Idempotence, and Non‑Determinism
Idempotent sinks provide the lowest end‑to‑end latency because they can write results immediately. However, they cannot guarantee correctness for non‑deterministic jobs where recomputation may produce different keys or partitions, leading to divergent downstream effects.
For non‑deterministic workloads, systems that record every result (Dataflow, Kafka Streams) or that employ a 2PC sink (Flink) are required.
References
Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications
Designing Data‑Intensive Applications
Lightweight Asynchronous Snapshots for Distributed Dataflows
Streaming Systems: The What, Where, When, and How of Large‑Scale Data Processing
Kafka Streams in Action
The Dataflow Model
MillWheel: Fault‑Tolerant Stream Processing at Internet Scale
Distributed Snapshots: Determining Global States of Distributed Systems (Chandy‑Lamport)
State Management in Apache Flink
Discretized Streams: Fault‑Tolerant Streaming Computation at Scale
Continuous Processing in Structured Streaming Design Sketch
Structured Streaming: A Declarative API for Real‑Time Applications in Apache Spark
Spark Streaming Programming Guide
Watermarks, Tables, Event Time, and the Dataflow Model
Kafka Streams’ Take on Watermarks and Triggers
Transactions in Apache Kafka
Consensus on Transaction Commit
Unreliable Failure Detectors for Reliable Distributed Systems
The Weakest Failure Detector for Solving Consensus
Exactly‑Once Delivery and Transactional Messaging in Kafka
Impossibility of Distributed Consensus with One Faulty Process (FLP)
Kubernetes in Action
ZooKeeper: Distributed Process Coordination
An Overview of End‑to‑End Exactly‑Once Processing in Apache Flink
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
