
Understanding Watermarks in Real-Time Stream Processing with Apache Flink

This article explains the concept of Watermarks in stream processing: their background, their theoretical foundations in the Dataflow Model, a practical implementation in Apache Flink with code examples, and the trade-offs between latency and accuracy in real-time analytics.

NetEase Game Operations Platform

Watermark Background

Stream processing has rapidly matured and now competes with traditional batch processing. A key challenge is handling out‑of‑order event timestamps, which can cause inaccurate results or excessive latency. Watermarks serve as a bridge between result correctness and processing delay.

Real‑time vs. Batch Computing

Batch jobs wait for the entire dataset to arrive before producing results, while stream jobs must emit results continuously on an unbounded data flow. Early approaches such as micro‑batching (e.g., Spark Streaming) introduced fixed‑interval windows but still suffered from latency and accuracy issues.

The industry therefore adopted the Lambda architecture, running both a low-latency real-time pipeline and a high-accuracy batch pipeline, later evolving toward unified models like Google's Dataflow Model, which frames computation around four questions: what, where, when, and how.

Watermark Principle

A Watermark is a timestamp indicating that all events with an event‑time earlier than the Watermark have been observed. It allows the system to consider a window complete and safely emit results. In practice, Watermarks are derived from observed event times, often using the maximum observed timestamp minus a safety margin.
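As a minimal sketch of this heuristic (plain Java, independent of any Flink API; the class and method names are illustrative), the Watermark can be derived as the maximum observed event time minus a safety margin:

```java
import java.util.List;

public class WatermarkSketch {
    // Safety margin: how far the Watermark trails the newest event seen (ms)
    static final long SAFETY_MARGIN_MS = 60_000L;

    // Watermark = max observed event time minus a safety margin,
    // so slightly out-of-order events are still accepted
    static long watermarkFor(List<Long> observedEventTimes) {
        long maxSeen = observedEventTimes.stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(Long.MIN_VALUE);
        return maxSeen == Long.MIN_VALUE ? Long.MIN_VALUE : maxSeen - SAFETY_MARGIN_MS;
    }
}
```

With events at 100 s, 180 s, and 150 s (in milliseconds) and a 60-second margin, the Watermark sits at 120 s: windows ending at or before that point may be closed.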

Two extreme strategies exist: an aggressive Watermark that closely follows the maximum event time (risking premature, incomplete results) and a conservative Watermark that lags far behind (introducing delay). Most systems choose a heuristic that balances latency and accuracy, sometimes adding an allowed-lateness threshold to re-process slightly late events.

Practical Watermark Implementation in Flink

For a gaming scenario where per‑minute active player counts are needed across servers with differing network delays, the author proposes computing a Watermark per server and then taking the minimum across servers. Within each server, a periodic Watermark generator is used.

Flink provides two interfaces for Watermark emission: AssignerWithPunctuatedWatermarks (event-driven) and AssignerWithPeriodicWatermarks (time-driven, with the emission interval, e.g. 5 seconds, configured via ExecutionConfig#setAutoWatermarkInterval). The example below uses the periodic approach with a configurable event-time skew of 60 seconds.

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import javax.annotation.Nullable;

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class WatermarkProcessor implements AssignerWithPeriodicWatermarks<UserLogin> {

    private static final long ALLOWED_EVENT_TIME_SKEW = 60000L; // 60 seconds

    // Maximum event time observed so far, tracked per server.
    // An instance field (not static), so parallel assigner instances
    // do not share or race on this state.
    private final Map<String, Long> maxTimestampPerServer = new HashMap<>(3);

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        // Take the minimum of the per-server maxima so the slowest
        // server's events are not dropped prematurely
        Optional<Long> minOfMaxTimestamps = maxTimestampPerServer.values().stream()
                .min(Comparator.comparingLong(Long::valueOf));
        if (minOfMaxTimestamps.isPresent()) {
            return new Watermark(minOfMaxTimestamps.get() - ALLOWED_EVENT_TIME_SKEW);
        } else {
            return null; // no events observed yet
        }
    }

    @Override
    public long extractTimestamp(UserLogin userLogin, long previousElementTimestamp) {
        String server = userLogin.getServer();
        long eventTime = userLogin.getEventTime();
        // Update the per-server maximum when a newer timestamp arrives
        if (!maxTimestampPerServer.containsKey(server) || eventTime > maxTimestampPerServer.get(server)) {
            maxTimestampPerServer.put(server, eventTime);
        }
        return eventTime;
    }
}

The code tracks the maximum event time per server, takes the minimum of those maxima across servers, and subtracts the allowed skew to produce the Watermark; the per-server maximum is updated whenever a newer timestamp arrives.
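To see why taking the minimum across servers matters, here is a Flink-free simulation of the same per-server bookkeeping (class and method names are illustrative, not part of any Flink API):

```java
import java.util.HashMap;
import java.util.Map;

public class PerServerWatermark {
    static final long ALLOWED_EVENT_TIME_SKEW = 60_000L;

    // Maximum event time observed so far, per server
    final Map<String, Long> maxTimestampPerServer = new HashMap<>();

    void observe(String server, long eventTime) {
        // Keep the larger of the stored and the incoming timestamp
        maxTimestampPerServer.merge(server, eventTime, Math::max);
    }

    // The Watermark follows the slowest server's newest event, minus the skew
    long currentWatermark() {
        if (maxTimestampPerServer.isEmpty()) {
            return Long.MIN_VALUE; // no events observed yet
        }
        long slowestServerMax = maxTimestampPerServer.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .getAsLong();
        return slowestServerMax - ALLOWED_EVENT_TIME_SKEW;
    }
}
```

If server s1 has reached 250 s but server s2 is still at 150 s, the Watermark is driven by s2 (150 s minus the 60-second skew, i.e. 90 s), so windows stay open until the lagging server has had a chance to deliver its events.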

Balancing Latency and Accuracy

Choosing the skew value is critical: a small skew yields low latency but may drop late events; a larger skew improves completeness at the cost of delayed results. In many gaming analytics scenarios, a default skew of one minute provides a reasonable trade‑off.
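The effect of the skew on late events can be sketched directly (illustrative plain Java, not a Flink API): an event is dropped when its timestamp falls behind the current Watermark, i.e. behind the maximum event time seen so far minus the skew.

```java
public class SkewTradeoff {
    // An event is dropped if it is older than the current Watermark,
    // where the Watermark is the max event time seen so far minus the skew
    static boolean isDropped(long eventTime, long maxSeenEventTime, long skewMs) {
        long watermark = maxSeenEventTime - skewMs;
        return eventTime < watermark;
    }
}
```

For an event at 70 s arriving after the stream has reached 100 s, a 5-second skew discards it (the Watermark is already at 95 s), while a 60-second skew still accepts it (the Watermark is only at 40 s), illustrating the latency-versus-completeness trade-off.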

Conclusion

Watermarks are essential for building low‑latency, high‑accuracy stream processing applications. Their design must consider both processing‑time lag and event‑time skew, and practical implementations—such as the Flink example above—demonstrate how to handle out‑of‑order data across heterogeneous sources.

References

1. How to beat the CAP theorem (http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html)

2. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive Scale, Unbounded, Out of Order Data Processing (https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)

3. Streaming 102: The world beyond batch (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)

4. Tyler Akidau, Slava Chernyak, Reuven Lax. Streaming Systems (2018).

Tags: big data, stream processing, real-time analytics, Watermark, Apache Flink, event time
Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
