
Ensuring Correctness in Stream Computing: Data Integrity Challenges and Engine Solutions

This article explores how stream computing systems achieve correct results by addressing data integrity, distinguishing consistency from correctness, formalizing integrity inference, and comparing implementations across major engines such as Flink, Kafka Streams, MillWheel, and Spark Structured Streaming.

Alibaba Cloud Developer

Building on a previous article about end‑to‑end consistency, this piece examines the deeper challenge of obtaining truly correct results in stream computing, focusing on data integrity as the key to accurate, business‑semantic outputs.

1. Correctness in Stream Computing

Correctness means that the results produced by a stream engine faithfully reflect real‑world objects, e.g., a user who pays three invoices totalling 100 CNY should see a single metric with value 100. Achieving this is difficult because streams are unbounded and unordered, often leading to incomplete or duplicated results.

2. Conditions for Correctness

Two concepts are often confused: correctness (the result matches the physical world) and consistency (all upstream and downstream systems see the same information). Consistency is necessary for correctness but not sufficient: a perfectly consistent pipeline can still emit results computed from incomplete data.

2.1 Consistency ≠ Correctness

Correctness requires that the engine’s output reflects the true state of objects, while consistency ensures that the same data is observed across all components of the pipeline.

2.2 Necessity of Data Integrity

Because stream data is unbounded and unordered, engines must partition the stream into logical windows (time‑based or count‑based) to reason about completeness. Without a notion of data integrity, downstream decisions may be based on partial or stale information.

3. General Solutions for Data Integrity

The article introduces a formal model for integrity inference and derives three common engineering approaches: reordering, low‑watermark, and slack‑time.

3.1 Formal Definition

Stream programs are modeled as directed acyclic graphs in which nodes are stateful operators and edges are data channels. Each event e carries an event time ET(e) and a processing time. An integrity signal Signal(t), issued at processing time t, must satisfy Signal(t) < ET(e) for every event e that may still arrive; it is therefore a lower bound on future event times, and any window whose end does not exceed Signal(t) can be treated as complete.
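To make the definition concrete, here is a minimal Python sketch of the completeness check it implies; the function name and the integer event times are illustrative, not taken from any engine.

```python
# Minimal sketch, assuming event times are plain integers (e.g., epoch ms).

def window_is_complete(signal: int, window_end: int) -> bool:
    """A window [start, end) is complete once the integrity signal
    guarantees every future event e satisfies ET(e) > signal,
    i.e. once signal >= window_end."""
    return signal >= window_end

# Events with ET 3, 5, 9 are still pending; a valid signal must lie below all of them.
pending = [3, 5, 9]
signal = min(pending) - 1               # 2: a lower bound on future event times
print(window_is_complete(signal, 2))    # True  -> window [0, 2) can close
print(window_is_complete(signal, 5))    # False -> events may still arrive
```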

3.2 System Design

Integrity inference consists of three modules: production (generating the signal), propagation (broadcasting it through the topology), and consumption (using the signal to close windows or clean state).
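The three modules can be sketched together in a toy operator; production happens upstream, so the sketch focuses on propagation (taking the minimum across input channels) and consumption (closing ready windows). All names here are hypothetical.

```python
# Hedged sketch of signal propagation and consumption in one operator.

class Operator:
    def __init__(self, num_inputs: int):
        # Latest signal seen on each input channel (produced upstream).
        self.input_signals = [float("-inf")] * num_inputs
        self.pending_windows = {}   # window_end -> accumulated value

    def on_signal(self, channel: int, signal: float) -> float:
        # Propagation: the operator's own signal is the minimum across
        # inputs, since an event may still arrive on the slowest channel.
        self.input_signals[channel] = max(self.input_signals[channel], signal)
        return min(self.input_signals)

    def consume(self, own_signal: float) -> list:
        # Consumption: close and emit every window whose end <= signal.
        ready = sorted(end for end in self.pending_windows if end <= own_signal)
        return [(end, self.pending_windows.pop(end)) for end in ready]

op = Operator(2)
op.pending_windows = {10: "sum_a", 20: "sum_b"}
op.on_signal(0, 15)            # other channel still at -inf, nothing fires
s = op.on_signal(1, 12)        # operator signal becomes min(15, 12) = 12
print(op.consume(s))           # [(10, 'sum_a')]
```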

3.3 Engineering Implementation

Systems fall into two classes: In-Order Processing (IOP), which buffers and reorders events before processing them, and Out-of-Order Processing (OOP), which processes events as they arrive and relies on signals such as punctuations, low-watermarks, slack-time, or heartbeats. IOP offers strong guarantees but adds latency; OOP reduces latency but requires careful signal design.
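The IOP side of this trade-off can be sketched as a reorder buffer: out-of-order events sit in a min-heap until a punctuation guarantees that no earlier event can still arrive. The class and its API are illustrative only.

```python
import heapq

# Hedged sketch of In-Order Processing with a reorder buffer.

class ReorderBuffer:
    def __init__(self):
        self._heap = []  # (event_time, payload), ordered by event time

    def insert(self, event_time, payload):
        heapq.heappush(self._heap, (event_time, payload))

    def release(self, punctuation):
        """Emit, in event-time order, every buffered event with
        ET <= punctuation; later events stay buffered."""
        out = []
        while self._heap and self._heap[0][0] <= punctuation:
            out.append(heapq.heappop(self._heap))
        return out

buf = ReorderBuffer()
for et, p in [(7, "c"), (3, "a"), (5, "b"), (9, "d")]:
    buf.insert(et, p)
print(buf.release(6))   # [(3, 'a'), (5, 'b')] -- ordered despite arrival order
```

The latency cost is visible here: events with ET 7 and 9 are held back until a later punctuation arrives, which is exactly what OOP designs try to avoid.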

4. Engine Implementations

The article compares how four major engines handle integrity inference.

4.1 MillWheel / Cloud Dataflow

Uses a low-watermark that is generated at each operator, persisted locally, and aggregated by a central authority. Persisting the watermark speeds failover but adds latency on the normal processing path.
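The central aggregation step can be sketched as taking the minimum over each operator's persisted watermark, since the slowest operator bounds the progress of the whole pipeline. The operator names and values below are purely illustrative.

```python
# Hedged sketch of centralized low-watermark aggregation in the style
# the article attributes to MillWheel.

def central_low_watermark(persisted: dict) -> float:
    """persisted maps operator id -> its locally persisted low-watermark;
    the pipeline-wide watermark is the minimum across operators."""
    return min(persisted.values())

persisted = {"source": 120, "parse": 115, "aggregate": 103}
print(central_low_watermark(persisted))   # 103: the slowest operator wins
```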

4.2 Apache Flink

Relies on low‑watermark messages that flow with data; watermarks are not persisted, yielding lower latency but longer recovery times.
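The in-band model can be sketched as follows: watermarks travel as special elements interleaved with data, and a multi-input operator advances its own watermark to the minimum of the latest watermark seen on each input, emitting it downstream only when it moves forward. This is a simplified illustration of the model the article describes, not Flink's actual API.

```python
# Hedged sketch of in-band watermark propagation across two inputs.

def propagate(inputs, elements):
    """inputs: names of the input channels; elements: ordered
    (input, kind, t) tuples, where kind is 'event' or 'wm'.
    Returns the watermarks the operator itself emits downstream."""
    latest = {i: float("-inf") for i in inputs}
    emitted = float("-inf")
    out = []
    for inp, kind, t in elements:
        if kind == "wm":
            latest[inp] = max(latest[inp], t)
            combined = min(latest.values())
            if combined > emitted:          # watermarks only move forward
                emitted = combined
                out.append(combined)
    return out

stream = [("a", "event", 4), ("b", "wm", 3), ("a", "wm", 5),
          ("b", "wm", 7), ("a", "wm", 9)]
print(propagate(["a", "b"], stream))   # [3, 5, 7]
```

Because nothing here is persisted, recovery after a failure must wait for fresh watermarks to flow through the topology again, which matches the longer recovery times noted above.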

4.3 Apache Kafka Streams

Adopts a slack‑time approach where each operator adds a fixed delay to event‑time, avoiding explicit watermark messages but risking premature window closure when upstream operators lag.
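The slack-time idea can be sketched directly: instead of explicit watermark messages, an operator treats the maximum observed event time minus a fixed slack as its integrity signal. The numbers below are illustrative.

```python
# Hedged sketch of the slack-time signal the article describes.

def slack_time_signal(observed_event_times, slack):
    """The implied signal is the max event time seen so far, minus slack."""
    return max(observed_event_times) - slack

observed = [10, 14, 12, 20]
signal = slack_time_signal(observed, slack=5)
print(signal)          # 15: windows ending at or before 15 may close
print(signal >= 12)    # True -> window [0, 12) closes
# A straggler with ET 11 arriving after this point is dropped, which is
# the premature-closure risk when upstream operators lag.
```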

4.4 Apache Spark Structured Streaming

Implements a global watermark calculated as MAX(timestamp) - slack per batch; this simplifies state management but limits composability of multiple aggregations.
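A per-batch global watermark can be sketched as follows: after each micro-batch the watermark becomes the maximum event time seen so far minus the slack, and it never regresses. The batch contents and slack value are illustrative.

```python
# Hedged sketch of a per-batch global watermark in the style the article
# attributes to Spark Structured Streaming.

def advance_watermark(current_wm, batch_event_times, slack):
    if not batch_event_times:
        return current_wm
    candidate = max(batch_event_times) - slack
    return max(current_wm, candidate)   # a watermark never moves backwards

wm = float("-inf")
for batch in [[10, 25, 18], [22, 19], [40]]:
    wm = advance_watermark(wm, batch, slack=10)
    print(wm)
# 15, then 15 (batch max 22 - 10 = 12 does not advance it), then 30
```

Because there is a single watermark for the whole query rather than one per operator, chaining several event-time aggregations is hard to express, which is the composability limit noted above.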

4.5 Comparative Overview

A side‑by‑side diagram (omitted) highlights differences in signal generation, propagation, and consumption across the four engines.

5. Summary and Outlook

Correct results require both engine‑level consistency and data‑level integrity. While low‑watermark and slack‑time are the most widely adopted techniques, they remain heuristic and cannot guarantee absolute correctness. Future work aims to provide stronger, user‑friendly integrity guarantees, possibly by moving watermark generation to data sources or by adopting richer progress‑tracking mechanisms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Flink · stream processing · data integrity · Kafka Streams · correctness
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.