Why “Exactly‑Once” Doesn’t Guarantee Consistency in Stream Processing

This article examines the true meaning of consistency in stream computing, clarifies common misconceptions about exactly‑once semantics, formalizes consistency challenges, and reviews how major stream engines such as Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement end‑to‑end consistency.

Alibaba Cloud Developer

Oct 13, 2021

1 Consistency in Stream Computing

Stream processing is increasingly common in the big‑data field, with engines such as Google Dataflow, Apache Flink, Apache Kafka Streams, and Apache Spark Streaming. Consistency in a stream system is usually expressed via processing semantics; an engine may claim "Exactly‑once" processing, implying it can guarantee data consistency. In fact, "Exactly‑once" does not automatically mean the output data meets consistency requirements, and the term is often misunderstood.

This article starts from the essence of stream computing and focuses on consistency: it gives a simple formal definition of the problem and traces how current stream engines have evolved to solve it, both to deepen understanding and to help with technology selection.

Before defining consistency, we must define streaming precisely. Stream computation processes unbounded data with low latency, whereas batch computation processes bounded data. The two are not mutually exclusive: Spark Streaming’s micro‑batch model, for example, runs a series of small batch jobs over unbounded data.

1 Consistency Definition and Challenges

If we view the streaming pipeline (input, processing, output) as a kind of database master‑slave synchronization, or as the generation of a derived dataset, then consistency in streaming mirrors the Consistency aspect of ACID transactions: after failure recovery, the internal state and the external output should be the same as if the failure had never happened. The challenge is that streaming input is unbounded: arrival latency is unpredictable, records arrive out of order, and the total volume is unknown, which makes consistency far harder than in batch processing. Failures caused by traffic spikes, network jitter, or resource problems make robust fault tolerance harder still.

Output data is consumed in real time, introducing extra consistency concerns, e.g., whether to retract previously emitted data after a failure or negotiate consistency with downstream systems.

2 Demystifying Consistency‑Related Concepts

Understanding the true scope of consistency is crucial for building correct, robust streaming jobs. Below are key concepts:

Exactly‑once ≠ Exactly‑consistent: Many engines use "Exactly‑once" to suggest that processing each message exactly once guarantees consistent output. However, the term is ambiguous: the word that should follow "Exactly‑once" is omitted, so it is unclear what exactly happens once, and different engines mean different things by it.

Example 1: "Exactly‑once Delivery" (message‑level guarantee) vs. "Exactly‑once Process" (application‑level guarantee).

Example 2: "Exactly‑once State Consistency" (internal state persisted once) vs. "Exactly‑once Process Consistency" (end‑to‑end output consistency). The former only ensures state persistence, not output consistency.

Therefore, whenever you encounter an "Exactly‑once XXX" claim, be aware of what the engine actually guarantees.
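The gap between state consistency and output consistency can be made concrete with a small simulation. The sketch below is illustrative only: `CountingJob`, its checkpoint interval, and the in‑memory `sink` list are hypothetical names, not any engine's API. The job checkpoints its count and source offset together, but writes to the sink immediately, outside the checkpoint.

```python
class CountingJob:
    """Toy job: counts events and emits the running count to an external sink."""

    def __init__(self, sink):
        self.count = 0            # operator state
        self.offset = 0           # source offset (next event to read)
        self.sink = sink          # external output, NOT covered by checkpoints
        self.checkpoint = (0, 0)  # last persisted (count, offset)

    def process(self, events, checkpoint_every=3, crash_at=None):
        while self.offset < len(events):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            self.count += 1
            self.offset += 1
            self.sink.append(self.count)  # side effect escapes immediately
            if self.offset % checkpoint_every == 0:
                self.checkpoint = (self.count, self.offset)

    def recover(self):
        # Roll state and offset back to the last checkpoint; the sink is untouched.
        self.count, self.offset = self.checkpoint


events = list("abcdefg")
sink = []
job = CountingJob(sink)
try:
    job.process(events, crash_at=5)
except RuntimeError:
    job.recover()
job.process(events)

print(job.count)  # internal state: each event counted once
print(sink)       # external output: counts 4 and 5 were emitted twice
```

After recovery the internal count is correct, yet the sink contains the results for events 4 and 5 twice. That is "Exactly‑once State Consistency" without end‑to‑end output consistency.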

End‑to‑End Data Consistency

End‑to‑end consistency means that the output data is part of the engine’s consistency design. The internal state, processing, and output must all be consistent throughout the entire streaming application.

2 The Essence of Stream Computing Systems

Having defined consistency, we now formalize the problem to derive generic solutions.

1 Re‑examining Stream Computing

According to the book *Streaming Systems*, any data processing can be seen as transformations between Stream (moving data) and Table (static data):

Stream → Stream: processing without aggregation.

Stream → Table: processing with aggregation.

Table → Stream: triggering output when a table changes.

Table → Table: not applicable in streaming.
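The transformations listed above can be sketched in a few lines. This is a minimal illustration, assuming a made‑up input format of "user,amount" strings; the function names `parse` and `aggregate` are hypothetical.

```python
def parse(stream):
    """Stream -> Stream: element-wise transformation, no aggregation."""
    for line in stream:
        user, amount = line.split(",")
        yield user, int(amount)


def aggregate(stream, table):
    """Stream -> Table, then Table -> Stream.

    Each record is folded into per-key state (the table), and every change
    to the table triggers an output record (back to a stream).
    """
    for user, amount in stream:
        table[user] = table.get(user, 0) + amount  # Stream -> Table
        yield user, table[user]                    # Table -> Stream (trigger on change)


raw = ["ann,5", "bob,3", "ann,2"]
table = {}
emitted = list(aggregate(parse(raw), table))

print(table)    # the table at rest after all input is consumed
print(emitted)  # the stream of updates triggered by table changes
```

Here the table ends as one row per user, while the emitted stream carries one update per change, so "ann" appears twice with her running total.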

Both batch and streaming can be abstracted as "read‑process‑write" stages, represented by a directed acyclic graph where nodes are processing logic and edges are data flow; intermediate state usually needs persistent storage.

2 Deterministic vs. Non‑Deterministic Computing

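Deterministic processing yields the same result for the same input no matter how many times it is executed; non‑deterministic processing (for example, logic that uses random numbers or the system time) can yield a different result on each run, which makes end‑to‑end consistency harder because a replay after a failure may not reproduce the original output.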
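A short sketch of the difference, using hypothetical functions. The random generator is seeded only so the sketch is reproducible; its internal state still advances between calls, much as wall‑clock time would between an original run and a replay.

```python
import random


def deterministic(events):
    """Same input always gives the same output, so a replay is harmless."""
    return [e * 2 for e in events]


def non_deterministic(events, rng):
    """Output depends on a random draw, so a replay may differ."""
    return [e + rng.random() for e in events]


events = [1, 2, 3]
first_run = deterministic(events)
replay = deterministic(events)

rng = random.Random(42)
first_nd = non_deterministic(events, rng)
replay_nd = non_deterministic(events, rng)  # generator state has moved on

print(first_run == replay)    # replay reproduces the original output
print(first_nd == replay_nd)  # replay produces different output
```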

3 Formal Definition of the Consistency Problem

Let the input set at time t be E(t), the output set be Sink(t), and the engine state be State(t), which includes operator state and source offsets:

State(t) = OperatorState(t) + SourceState(t)

The computation function F satisfies:

F(E(t), Sink(t), State(t)) = Sink(t+1) + State(t+1)

Defining O(t) = Sink(t) + State(t), this simplifies to:

F(E(t), O(t)) = O(t+1)

When a failure occurs, a recovery function R should ensure:

R(E(t), O(t)) = O'(t+1), with O'(t+1) = O(t+1)

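It follows that one sufficient condition for end‑to‑end consistency is to store every intermediate result O(t) durably as it is produced, or to store results periodically in transactional batches: recovery can then resume from a stored O(t) instead of recomputing results that downstream consumers may already have seen.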
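The definition can be turned into a small executable sketch. Everything here is an assumption made for illustration: F is a deterministic running sum, O(t) is modelled as a named tuple, and the source is a replayable list indexed by offset.

```python
from typing import NamedTuple, Tuple


class O(NamedTuple):
    sink: Tuple[int, ...]  # Sink(t): everything emitted so far
    total: int             # OperatorState(t): running sum
    offset: int            # SourceState(t): number of events consumed


def F(batch, o):
    """F(E(t), O(t)) = O(t+1) for a deterministic running sum."""
    total = o.total + sum(batch)
    return O(o.sink + (total,), total, o.offset + len(batch))


def R(source, persisted, batch_size):
    """Recovery: reload the persisted O(t), re-read E(t) from the source, re-run F."""
    batch = source[persisted.offset:persisted.offset + batch_size]
    return F(batch, persisted)


source = [1, 2, 3, 4, 5, 6]
o0 = O((), 0, 0)
o1 = F(source[0:2], o0)
o2 = F(source[2:4], o1)          # failure-free O(t+1)
o2_recovered = R(source, o1, 2)  # crash after o1 was persisted

print(o2_recovered == o2)
```

Because F is deterministic and the source can be replayed from the stored offset, the recovered O'(t+1) equals the failure‑free O(t+1).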

3 Generic Solutions for Consistency

1 Deriving the Generic Solution

If a failure occurs between t and t+1, the recovery function must mask the fault’s side effects so that users see the correct O(t+1). The approach is to replay E(t) and reload O(t) from persistent storage, then re‑execute F. However, if the computation contains non‑deterministic logic, re‑execution may produce a different result, so storing each intermediate result (or batching transactionally) is essential.
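For non‑deterministic logic, the idea of storing each result before exposing it can be sketched as follows. `DurableStore` is a hypothetical stand‑in for persistent storage (a plain dict that outlives the simulated crash), and the step number plays the role of a result key.

```python
import random


class DurableStore:
    """Stand-in for persistent storage; survives the simulated crash."""

    def __init__(self):
        self.results = {}


def run_step(step, event, store, rng):
    """Compute a non-deterministic result once; replays reuse the stored value."""
    if step in store.results:      # recovery path: do not recompute
        return store.results[step]
    result = event + rng.random()  # non-deterministic logic
    store.results[step] = result   # persist BEFORE making the result visible
    return result


store = DurableStore()
before_crash = run_step(0, 10, store, random.Random())
# ... crash here; the process restarts with a fresh generator and replays step 0 ...
after_recovery = run_step(0, 10, store, random.Random())

print(before_crash == after_recovery)
```

Re‑executing the random draw would almost certainly give a different number; returning the stored result is what keeps the replay consistent with what was already emitted.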

2 Engineering Implementation of the Generic Solution

Implementing this requires high‑availability, high‑throughput storage for state. When storing each result is infeasible, periodic transactional batch storage (e.g., using 2‑phase commit) can provide the same guarantees while preserving throughput.
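A minimal sketch of a two‑phase commit over one batch (epoch) is shown below, assuming two hypothetical participants: a state store and an output sink. It omits timeouts, coordinator failure, and durable logging, all of which a real implementation needs.

```python
class Participant:
    """Hypothetical 2PC participant holding staged and committed data."""

    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.staged = None
        self.committed = []

    def prepare(self, data):
        """Phase 1: stage the data and vote yes, or vote no."""
        if self.fail_prepare:
            return False
        self.staged = data
        return True

    def commit(self):
        """Phase 2 (success): make the staged data visible."""
        self.committed.append(self.staged)
        self.staged = None

    def abort(self):
        """Phase 2 (failure): discard the staged data."""
        self.staged = None


def commit_epoch(participants, payloads):
    """Commit the epoch only if every participant voted yes in phase 1."""
    votes = [p.prepare(payloads[p.name]) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


payloads = {"state": {"count": 3}, "sink": [1, 2, 3]}

state, sink = Participant("state"), Participant("sink")
ok = commit_epoch([state, sink], payloads)

state2, bad_sink = Participant("state"), Participant("sink", fail_prepare=True)
failed = commit_epoch([state2, bad_sink], payloads)

print(ok, state.committed, sink.committed)
print(failed, state2.committed, bad_sink.committed)
```

Either both the state and the output of an epoch become visible, or neither does, which is the property the batch‑transactional approach relies on.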

4 Engine‑Specific Consistency Implementations

1 Google MillWheel

MillWheel uses a mechanism called "strong productions": each operator’s output is persisted before it is sent downstream, and after a failure the persisted productions are resent rather than recomputed. Every record carries a unique ID, and each receiver keeps a record of the IDs it has already processed so that retried deliveries can be discarded; a Bloom filter provides a fast path for IDs that have definitely not been seen before.
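The deduplication idea can be sketched as follows. This is not MillWheel's code: the `BloomFilter` and `Deduplicator` classes, the filter size, and the in‑memory set standing in for the durable ID record are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Deduplicator:
    def __init__(self):
        self.bloom = BloomFilter()
        self.seen_ids = set()   # stand-in for the durable record of processed IDs
        self.store_lookups = 0  # how often the exact record had to be consulted

    def accept(self, record_id):
        """Return True the first time an ID is seen, False for a retry."""
        if self.bloom.might_contain(record_id):
            self.store_lookups += 1  # possible duplicate: check the exact record
            if record_id in self.seen_ids:
                return False
        self.bloom.add(record_id)
        self.seen_ids.add(record_id)
        return True


dedup = Deduplicator()
results = [dedup.accept(rid) for rid in ["r1", "r2", "r1", "r3", "r2"]]
print(results)
```

A negative answer from the Bloom filter proves the ID is new, so the exact lookup is skipped; a positive answer may be a false positive, so the exact record decides. Correctness therefore never depends on the filter, only the cost of the check does.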

2 Apache Flink

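Flink periodically takes distributed consistent snapshots using a barrier‑based variant of the Chandy‑Lamport algorithm, and sinks that support it commit their output with a two‑phase commit tied to the snapshot. A snapshot covers source offsets and operator state and, for transactional sinks, the pending output. Working state can be held in a state backend such as RocksDB, while completed snapshots are written to durable checkpoint storage. This satisfies the periodic batch‑transactional storage condition, at the cost of some end‑to‑end latency: transactional output becomes visible only when a checkpoint completes, so latency grows with the checkpoint interval.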
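The barrier mechanism behind these snapshots can be illustrated with a heavily simplified sketch of aligned barriers on a two‑input operator. This is not Flink's API: `SummingOperator`, the channel names, and the `BARRIER` marker are hypothetical, and real checkpoints are asynchronous and distributed across many operators.

```python
BARRIER = "BARRIER"


class SummingOperator:
    """Two-input operator that sums records and snapshots on aligned barriers."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.total = 0
        self.blocked = set()  # channels whose barrier has already arrived
        self.buffered = []    # post-barrier records held back during alignment
        self.snapshots = []

    def on_input(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:      # barrier seen on every input
                self.snapshots.append(self.total)  # state covers pre-barrier records only
                self.blocked.clear()
                pending, self.buffered = self.buffered, []
                for value in pending:              # now process what was held back
                    self.total += value
        elif channel in self.blocked:
            self.buffered.append(item)             # belongs to the next epoch
        else:
            self.total += item


op = SummingOperator(["a", "b"])
inputs = [("a", 1), ("b", 10), ("a", BARRIER), ("a", 2),
          ("b", 20), ("b", BARRIER), ("b", 30)]
for channel, item in inputs:
    op.on_input(channel, item)

print(op.snapshots, op.total)
```

The record 2 arrives on channel "a" after that channel's barrier, so it is held back until the barrier on "b" arrives. The snapshot therefore reflects exactly the records that precede the barrier on both inputs.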

3 Apache Kafka Streams

Kafka Streams is a Java library for building stateful real‑time applications on top of Kafka. With exactly‑once processing enabled, it writes source offsets, state‑store changelogs, and records for user‑defined output topics in a single transaction via Kafka’s Transactions API, so that downstream consumers reading with the read_committed isolation level see only committed results.
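The visibility rule can be sketched with a toy log. This is not the Kafka client API: `TransactionalLog`, its methods, and the topic names are hypothetical, and the sketch ignores partitions, ordering guarantees, and producer fencing.

```python
class TransactionalLog:
    """Toy append-only log with per-transaction commit or abort status."""

    def __init__(self):
        self.entries = []  # (txn_id, topic, value)
        self.status = {}   # txn_id -> "committed" or "aborted"

    def append(self, txn_id, topic, value):
        self.entries.append((txn_id, topic, value))

    def commit(self, txn_id):
        self.status[txn_id] = "committed"

    def abort(self, txn_id):
        self.status[txn_id] = "aborted"

    def read(self, topic, read_committed=True):
        """With read_committed, return only records of committed transactions."""
        return [value for txn_id, entry_topic, value in self.entries
                if entry_topic == topic
                and (not read_committed or self.status.get(txn_id) == "committed")]


log = TransactionalLog()

# Transaction 1: offsets, changelog, and output are written together, then committed.
log.append(1, "offsets", 3)
log.append(1, "changelog", {"count": 3})
log.append(1, "output", 3)
log.commit(1)

# Transaction 2: the task fails before committing, so the transaction is aborted.
log.append(2, "offsets", 5)
log.append(2, "output", 5)
log.abort(2)

print(log.read("output"))                        # committed results only
print(log.read("output", read_committed=False))  # includes the aborted write
print(log.read("offsets"))                       # restart resumes from offset 3
```

Because the offset, the changelog entry, and the output record share one transaction, a restart resumes from offset 3 and a read‑committed consumer never observes the result that was computed for the aborted attempt.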

4 Apache Spark Streaming

Classic Spark Streaming uses micro‑batches. It guarantees internal processing consistency but does not provide built‑in end‑to‑end guarantees; achieving them requires persisting each RDD’s state and ensuring transactional output. When the computation is deterministic and the output is idempotent, Spark Streaming can meet end‑to‑end consistency.
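The combination of deterministic computation and idempotent output can be sketched as below. The code does not use Spark: `process_batch`, `write_idempotent`, and the dict used as a sink are hypothetical, and the key scheme (batch ID plus position within the batch) is just one possible choice.

```python
def process_batch(batch_id, records):
    """Deterministic computation: the same batch always yields the same keyed rows."""
    return {(batch_id, i): value * 2 for i, value in enumerate(records)}


def write_idempotent(sink, rows):
    """Upsert by key, so rewriting a batch overwrites rows instead of appending."""
    sink.update(rows)


sink = {}
batch = [1, 2, 3]
write_idempotent(sink, process_batch(7, batch))

# Simulated failure after the write but before batch 7 was marked as done:
# the engine reruns the whole batch.
write_idempotent(sink, process_batch(7, batch))

print(sink)
```

The rerun produces the same keys and values, so the sink ends in the same state as after a single successful run. If the computation were non‑deterministic, or the sink appended rather than upserted, the rerun would leave duplicates or conflicting rows.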

5 Summary of Engine Implementations

Where the engines above achieve end‑to‑end consistency, they do so by satisfying the same core condition: either durable storage of every intermediate result as it is produced, or periodic transactional batch storage. This can be viewed as making each processing unit idempotent: fail‑over replays the inputs, and the stored results prevent duplicated output.

5 Conclusion and Outlook

The article derived a generic solution for achieving end‑to‑end consistency in stream processing and examined how Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement it. While each engine takes a different architectural path, they all rely on persisting state and ensuring idempotent or transactional output. Future work includes handling windowing, late data, and communication fault‑tolerance, which also affect consistency.
