Big Data 81 min read

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

This article explores the two toughest problems in distributed stream processing—exactly‑once message handling and ordering—by dissecting the underlying impossibility of perfect failure detectors, the liveness‑vs‑safety trade‑off, zombie processes, and the practical solutions employed by systems such as Flink, Kafka Streams, MillWheel, and Spark.

dbaplus Community

Sep 10, 2019

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

Preface

The author’s goal is to dig into the fundamental reasons why distributed stream systems must grapple with exactly‑once processing and ordering, and to compare how state‑of‑the‑art frameworks address these challenges.

Key Terminology

End‑to‑End Consistency (E2E Consistency) : In stream middleware this means exactly‑once message processing—each logical message should affect the system exactly once.

EOMP (Effective Once Message Processing) : A realistic variant of exactly‑once; the system may process a message multiple times under failure, but the final outcome must be indistinguishable from a single processing.

Failure Tolerance : Assumes crash failures; a node may disappear completely, and the system must continue correctly.

Idempotent : Repeating the same operation any number of times yields the same effect as a single execution.

Deterministic : Given identical inputs, a computation always produces the same output.

The Impossibility of Perfect Failure Detection

In asynchronous environments it is impossible to distinguish a crashed node from a slow or overloaded one. This leads to the classic FLP impossibility result: consensus cannot be guaranteed with 100% safety and liveness simultaneously.

Liveness vs. Safety Trade‑off

When a node stops responding, a system must either wait (preserving safety) or assume failure and failover (preserving liveness). Both choices have concrete costs, and the balance determines whether a system favors low latency or strong correctness guarantees.

Zombie Processes and Fencing

Even with consensus protocols, a “zombie” node—one that is still alive but mistakenly considered dead—can cause duplicate writes or state corruption. Zombie fencing typically relies on a monotonically increasing epoch number allocated via consensus; newer generations reject writes from older epochs.

Zombie Fencing Techniques

Assign a globally unique, ever‑increasing epoch to each logical process instance.

Reject any request bearing an older epoch.

Use test‑and‑set or compare‑and‑swap on a dedicated “epoch” column in a KV store to enforce fencing.

Three‑Node EOMP Example

Consider a pipeline of upstream → processor → downstream. To guarantee exactly‑once processing, the processor must ensure four conditions: high‑availability of results, downstream deduplication, upstream replayability, and progress tracking.

Store results in a highly available datastore (e.g., DynamoDB, Spanner).

Downstream nodes keep a set of processed message IDs for deduplication.

Upstream must be able to replay the original event if needed.

Record processing progress atomically with the result.

If any step fails, the system may need to recompute, but deterministic computation guarantees identical outcomes.

Stateful Computation and Adding Node State

When a node maintains state (e.g., a counter), the state itself must be checkpointed atomically with the result. Upon failure, the node rolls back its state before reprocessing the event to avoid double counting.

Flow System EOMP Strategies

Two broad approaches exist:

Record‑Every‑Result : Store each intermediate result (used by Google MillWheel/Dataflow and Kafka Streams). This enables exact‑once guarantees even for non‑deterministic jobs.

Recompute‑When‑Needed : Rely on deterministic computation and replay upstream events, avoiding the storage overhead of every intermediate result (used by Spark Structured Streaming).

Google MillWheel (Dataflow)

Each node keeps a set of processed message IDs for deduplication, writes local state to memory, and commits the result, the state, and the input ID atomically to BigTable.

Kafka Streams

Built on Kafka’s transactional messaging. A producer starts a transaction, writes output records, updates the consumer offset, and commits atomically. The transaction guarantees that either all writes and the offset advance are visible or none are.

producer.initTransactions();
producer.beginTransaction();
// send records …
producer.commitTransaction();

Spark Streaming

Offers three APIs: the legacy D‑Streaming (being deprecated), Structured Streaming (micro‑batch with checkpointed lineage), and Continuous Processing (low‑latency but only at‑least‑once). Structured Streaming stores lineage information and checkpoints state to enable deterministic replay.

Checkpointing and State Management

Flink uses RocksDB for local state. During a checkpoint, only the current memtable is flushed to disk (blocking the computation briefly). The newly created SST files are then uploaded asynchronously to durable storage (e.g., S3, HDFS), achieving asynchronous incremental checkpointing.

Sinks: Idempotent vs. Two‑Phase Commit (2PC)

External sinks must cooperate with the stream’s exactly‑once guarantees.

Idempotent Sink : Guarantees that repeated writes have the same effect as a single write. Works well for deterministic pipelines but cannot handle non‑deterministic recomputation.

2PC Sink : Flink’s TwoPhaseCommitSinkFunction integrates external transactional resources (databases, message queues) into the global checkpoint. The sink pre‑commits during the checkpoint barrier, then the coordinator decides to commit or abort globally.

Key steps for a 2PC sink:

Start a transaction on the external system.

When the checkpoint barrier arrives, write data and pre‑commit, then acknowledge the barrier.

After all nodes have acknowledged, the coordinator records the global checkpoint (commit decision).

On commit, the sink finalizes the external transaction; on abort, it rolls back.

Because the external transaction is hidden from downstream consumers until the global commit, exactly‑once semantics are preserved even if the sink crashes after pre‑commit.

Latency, Idempotence, and Non‑Determinism

Idempotent sinks provide the lowest end‑to‑end latency because they can write results immediately. However, they cannot guarantee correctness for non‑deterministic jobs where recomputation may produce different keys or partitions, leading to divergent downstream effects.

For non‑deterministic workloads, systems that record every result (Dataflow, Kafka Streams) or that employ a 2PC sink (Flink) are required.

References

Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Designing Data‑Intensive Applications

Lightweight Asynchronous Snapshots for Distributed Dataflows

Streaming Systems: The What, Where, When, and How of Large‑Scale Data Processing

Kafka Streams in Action

The Dataflow Model

MillWheel: Fault‑Tolerant Stream Processing at Internet Scale

Distributed Snapshots: Determining Global States of Distributed Systems (Chandy‑Lamport)

State Management in Apache Flink

Discretized Streams: Fault‑Tolerant Streaming Computation at Scale

Continuous Processing in Structured Streaming Design Sketch

Structured Streaming: A Declarative API for Real‑Time Applications in Apache Spark

Spark Streaming Programming Guide

Watermarks, Tables, Event Time, and the Dataflow Model

Kafka Streams’ Take on Watermarks and Triggers

Transactions in Apache Kafka

Consensus on Transaction Commit

Unreliable Failure Detectors for Reliable Distributed Systems

The Weakest Failure Detector for Solving Consensus

Exactly‑Once Delivery and Transactional Messaging in Kafka

Impossibility of Distributed Consensus with One Faulty Process (FLP)

Kubernetes in Action

ZooKeeper: Distributed Process Coordination

An Overview of End‑to‑End Exactly‑Once Processing in Apache Flink

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems stream processing fault tolerance Consensus Exactly-once checkpointing

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.