Why “Exactly‑Once” Doesn’t Guarantee Consistency in Stream Processing
This article examines the true meaning of consistency in stream computing, clarifies common misconceptions about exactly‑once semantics, formalizes consistency challenges, and reviews how major stream engines such as Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement end‑to‑end consistency.
1 Consistency in Stream Computing
Stream processing is increasingly common in the big‑data field, with engines such as Google Dataflow, Apache Flink, Apache Kafka Streams, and Apache Spark Streaming. Consistency in a stream system is usually expressed via processing semantics; an engine may claim "Exactly‑once" processing, implying it can guarantee data consistency. In fact, "Exactly‑once" does not automatically mean the output data meets consistency requirements, and the term is often misunderstood.
The article starts from the essence of stream computing, focusing on consistency issues, providing a simple formal definition, and offering a perspective on the evolution of current stream engines to deepen understanding and aid technology selection.
1 Consistency in Stream Computing
Before defining consistency, we must precisely define streaming. Streaming computation processes unbounded data with low latency, whereas batch computation handles bounded data. The two are not mutually exclusive; for example, Spark Streaming’s micro‑batch can implement streaming over unbounded data.
1 Consistency Definition and Challenges
If we view the streaming process (input, processing, output) as a database master‑slave sync or as generating derived datasets, consistency in streaming mirrors the Consistency aspect of ACID transactions: the internal state and external output should remain consistent before and after a failure‑recovery. Challenges arise because streaming inputs are unbounded, leading to unpredictable latency, out‑of‑order arrival, and unknown volume, making consistency far more complex than in batch processing. Additionally, failures such as traffic spikes, network jitter, or resource issues further complicate robust fault‑tolerance.
Output data is consumed in real time, introducing extra consistency concerns, e.g., whether to retract previously emitted data after a failure or negotiate consistency with downstream systems.
2 Demystifying Consistency‑Related Concepts
Understanding the true scope of consistency is crucial for building correct, robust streaming jobs. Below are key concepts:
Exactly‑once ≠ Exactly‑consistent : Many engines use "Exactly‑once" to suggest that processing each message exactly once guarantees consistent output. However, the term is ambiguous because the verb or object after "Exactly‑once" is omitted, leading to different meanings.
Example 1: "Exactly‑once Delivery" (message‑level guarantee) vs. "Exactly‑once Process" (application‑level guarantee).
Example 2: "Exactly‑once State Consistency" (internal state persisted once) vs. "Exactly‑once Process Consistency" (end‑to‑end output consistency). The former only ensures state persistence, not output consistency.
Therefore, whenever you encounter an "Exactly‑once XXX" claim, be aware of what the engine actually guarantees.
End‑to‑End Data Consistency
End‑to‑end consistency means that the output data is part of the engine’s consistency design. The internal state, processing, and output must all be consistent throughout the entire streaming application.
2 The Essence of Stream Computing Systems
Having defined consistency, we now formalize the problem to derive generic solutions.
1 Re‑examining Stream Computing
According to the book *The System Streaming*, any data processing can be seen as transformations between Stream (moving data) and Table (static data):
Stream → Stream: processing without aggregation.
Stream → Table: processing with aggregation.
Table → Stream: triggering output when a table changes.
Table → Table: not applicable in streaming.
Both batch and streaming can be abstracted as "read‑process‑write" stages, represented by a directed acyclic graph where nodes are processing logic and edges are data flow; intermediate state usually needs persistent storage.
2 Deterministic vs. Non‑Deterministic Computing
Deterministic processing yields the same result given the same input, regardless of execution order; non‑deterministic processing (e.g., using random numbers or system time) breaks this property and makes end‑to‑end consistency harder.
3 Formal Definition of the Consistency Problem
Let the input set at time t be E(t), the output set be Sink(t), and the engine state be State(t) (including operator state and source offsets). Then: State(t) = OperatorState(t) + SourceState(t) The computation function F satisfies: F(E(t), Sink(t), State(t)) = Sink(t+1) + State(t+1) Defining O(t) = Sink(t) + State(t), we simplify to: F(E(t), O(t)) = O(t+1) When a failure occurs, a recovery function R should ensure: R(E(t), O(t)) = O'(t+1) and O'(t+1) = O(t+1).
Thus, storing every intermediate result (or periodically storing batches transactionally) is a sufficient condition for end‑to‑end consistency.
3 Generic Solutions for Consistency
1 Deriving the Generic Solution
If a failure occurs between t and t+1, the recovery function must mask the fault’s side effects so that users see the correct O(t+1). The approach is to replay E(t) and reload O(t) from persistent storage, then re‑execute F. However, if the computation contains non‑deterministic logic, re‑execution may produce a different result, so storing each intermediate result (or batching transactionally) is essential.
2 Engineering Implementation of the Generic Solution
Implementing this requires high‑availability, high‑throughput storage for state. When storing each result is infeasible, periodic transactional batch storage (e.g., using 2‑phase commit) can provide the same guarantees while preserving throughput.
4 Engine‑Specific Consistency Implementations
1 Google MillWheel
MillWheel uses a "Strong production" mechanism to persist every operator’s output before sending downstream. Upon failure, persisted results are replayed. It assigns a unique ID to each record and maintains a directory to deduplicate processing, often using a Bloom filter for efficient checks.
2 Apache Flink
Flink periodically creates distributed consistent snapshots (based on the Chandy‑Lamport algorithm) and commits them transactionally using a two‑phase commit. Snapshots include source offsets, operator state, and optionally output results, stored in RocksDB. This satisfies the real‑time or batch‑transactional storage condition, though it introduces some end‑to‑end latency due to epoch intervals.
3 Apache Kafka Streams
Kafka Streams is a Java library that builds stateful real‑time applications on top of Kafka. It persists source offsets, operator changelogs, and user‑defined output topics transactionally via Kafka’s Transactions API, ensuring that downstream consumers only see committed results.
4 Apache Spark Streaming
Classic Spark Streaming uses micro‑batches. It guarantees internal processing consistency but does not provide built‑in end‑to‑end guarantees; achieving them requires persisting each RDD’s state and ensuring transactional output. When the computation is deterministic and the output is idempotent, Spark Streaming can meet end‑to‑end consistency.
5 Summary of Engine Implementations
All major engines satisfy the core condition for end‑to‑end consistency: either real‑time storage of every intermediate result or periodic transactional batch storage. This can be viewed as making each processing unit idempotent, allowing fail‑over to replay inputs while using stored results to avoid duplication.
5 Conclusion and Outlook
The article derived a generic solution for achieving end‑to‑end consistency in stream processing and examined how Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement it. While each engine takes a different architectural path, they all rely on persisting state and ensuring idempotent or transactional output. Future work includes handling windowing, late data, and communication fault‑tolerance, which also affect consistency.
Further reading:
"Streaming System" – T. Akidau, S. Chernyak, R. Lax
"Transactions in Apache Kafka" – Apurva Mehta, Jason Gustafson
"A Survey of State Management in Big Data Processing Systems" – Q.C. To, J. Soto, V. Markl
"MillWheel: fault‑tolerant stream processing at Internet scale" – T. Akidau et al.
"Discretized Streams: Fault‑Tolerant Streaming Computation at Scale" – M. Zaharia et al.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
