Real-Time Stream Computation in Monitoring Systems: Data Streams, Windows, and Watermarks with Apache Flink
This article explains the role of monitoring systems and introduces real-time data stream computation: the characteristics of data streams, Flink's event-time and processing-time concepts, the main window types, the watermark mechanism, and strategies for handling out-of-order and late data.
1. Monitoring System Overview
Monitoring systems play a crucial role in modern technology environments. Operations staff check activity data daily, while developers verify system metrics. A typical monitoring system includes data collection, computation, storage, visualization, and alerting. This article focuses on the data computation component.
2. Real-Time Computation
Real-time stream computation processes and analyzes continuous data streams, allowing enterprises to extract value instantly from logs, sensor data, online transactions, etc. This model is essential for fast‑decision scenarios such as real‑time monitoring, online recommendation, and fraud detection. Apache Flink is a popular framework for implementing real‑time stream computation.
2.1 Data Stream
A data stream is a sequence of continuously generated data elements that may originate from various sources. Streams are typically dynamic, unbounded, and arrive at high speed.
Key characteristics of data streams include:
1. Continuity: Data arrives continuously without a defined start or end.
2. Unboundedness: In theory, a stream can continue indefinitely.
3. Real-time: Processing often needs to be (near) real-time to respond promptly.
4. Variability: The speed, size, and format of the data may change over time.
5. Disorder: In distributed systems, data may arrive out of order due to network latency.
[Data Source] → |Element1| → |Element2| → |Element3| → ... → [Data Processing] → [Data Storage/Output]
2.2 Data Stream Processing
2.2.1 Time and Window in Stream Processing
Time
Event Time
Event time refers to the timestamp when an event is generated on its source device. This timestamp is embedded in the event before it reaches Flink and can be extracted for processing.
With event time, windowed aggregations (e.g., count per minute) become a special grouping on the event‑time column. An element can belong to multiple overlapping windows (as in sliding windows).
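To make the "grouping on the event-time column" intuition concrete, here is a minimal Python sketch (not Flink API code) that counts events per minute using the timestamp embedded in each event; the event shape and function name are illustrative assumptions.

```python
from collections import Counter

def per_minute_counts(events):
    """Group events by their embedded event-time minute and count them.

    Each event is a (timestamp_seconds, payload) tuple; the timestamp is
    the event time assigned at the source, not the processing time.
    """
    counts = Counter()
    for ts, _payload in events:
        minute = ts // 60  # event-time minute bucket
        counts[minute] += 1
    return dict(counts)

events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
print(per_minute_counts(events))  # {0: 2, 1: 1, 2: 1}
```

Because grouping uses the embedded timestamp, the result is the same no matter how late or out of order the events arrive.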
Processing Time
Processing time is the system time of the machine executing the Flink operation.
When a job runs on processing time, all time‑based operations (e.g., windows) use the machine’s clock. In distributed or asynchronous environments, processing time lacks determinism because it is affected by data arrival speed and internal processing delays.
Window
Unbounded streams have no natural boundaries, so computation requires explicit windows. Windows can be defined by time‑driven or data‑driven triggers.
2.2.2 Types of Windows
1. Tumbling Window
Tumbling windows split the stream into non‑overlapping, consecutive intervals of fixed length (e.g., 0‑5 min, 5‑10 min). Each window processes independently, suitable for periodic resets or counts.
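The assignment rule above can be sketched in a few lines of Python (a simulation of the mechanism, not Flink API code): each timestamp maps to exactly one fixed-length window.

```python
def tumbling_window(ts, size):
    """Return the [start, end) bounds of the tumbling window containing ts."""
    start = ts - (ts % size)
    return (start, start + size)

# A 5-minute (300 s) tumbling window: windows never overlap.
print(tumbling_window(120, 300))  # (0, 300)
print(tumbling_window(310, 300))  # (300, 600)
```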
2. Sliding Window
Sliding windows may overlap and are defined by a window length and a slide interval. For a 10‑min length with a 5‑min slide, windows are 0‑10, 5‑15, 10‑20, etc., providing smoother continuous output.
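A sketch of sliding-window assignment (again a simulation, not Flink API code): with length 10 min and slide 5 min, each element belongs to length/slide = 2 overlapping windows.

```python
def sliding_windows(ts, size, slide):
    """Return all [start, end) sliding windows that contain ts."""
    windows = []
    start = ts - (ts % slide)  # last window starting at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# 10-min length (600 s), 5-min slide (300 s): the element at minute 7
# falls into two overlapping windows.
print(sliding_windows(7 * 60, 600, 300))  # [(0, 600), (300, 900)]
```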
3. Session Window
Session windows have dynamic length, opening when activity occurs and closing after a period of inactivity (the timeout). They are useful for user‑behavior analysis such as web‑session tracking.
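The gap-based splitting can be sketched as follows (a simplified batch simulation over sorted timestamps; real session windows merge incrementally as elements arrive):

```python
def session_windows(timestamps, gap):
    """Split sorted event timestamps into sessions separated by >= gap of inactivity."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] >= gap:
            sessions.append(current)  # inactivity gap: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# A 30-second inactivity timeout splits these clicks into two sessions.
print(session_windows([0, 10, 15, 60, 65], 30))  # [[0, 10, 15], [60, 65]]
```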
4. Global Window
A global window never closes based on time; it persists until an external signal or a data‑count threshold triggers processing. It is rarely used because most stream scenarios need time boundaries.
2.2.3 Window Lifecycle
Window Creation
Windows are created on‑demand when the first element that belongs to the window arrives.
Window Computation
The trigger condition depends on the window type. For an event‑time tumbling window, computation fires when the watermark passes the window’s end time; for a count window, it fires when the element count reaches the defined size.
Window Destruction
Typically, when the watermark exceeds the window’s end, the window is computed and its state cleared. Flink only provides a destruction mechanism for time windows; count windows are built on the global window and are not automatically cleared.
2.2.4 Operator Model
Flink operators are divided into sources, transforms, and sinks. Sources ingest data (e.g., text files, message queues), transforms perform aggregations and calculations, and sinks output results to storage or downstream systems.
Watermark Mechanism
In distributed systems, each node maintains its own logical clock. To synchronize event time across nodes, upstream operators must forward the event timestamp downstream.
When different downstream nodes receive different timestamps (e.g., source1 with timestamp 12 and source2 with 13), their logical clocks diverge, causing inconsistencies.
Watermarks are special records that carry a monotonically increasing timestamp. They indicate that all events with timestamps earlier than the watermark have arrived, allowing windows whose end time is less than the watermark to be triggered and closed.
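The trigger rule can be sketched as follows (a simplified simulation; Flink's actual firing condition compares the watermark against the window's maximum timestamp, i.e. end - 1 ms):

```python
def fire_windows(pending, watermark):
    """Fire every buffered window whose end time the watermark has passed.

    `pending` maps (start, end) -> list of elements; returns the fired
    windows and the ones still waiting for later data.
    """
    fired = {w: els for w, els in pending.items() if w[1] <= watermark}
    waiting = {w: els for w, els in pending.items() if w[1] > watermark}
    return fired, waiting

pending = {(0, 300): ["a", "b"], (300, 600): ["c"]}
fired, waiting = fire_windows(pending, watermark=310)
print(fired)    # {(0, 300): ['a', 'b']}
print(waiting)  # {(300, 600): ['c']}
```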
Watermark Propagation
For ordered streams, watermarks and data are forwarded in order; each downstream task updates its logical clock to the watermark timestamp.
In unordered streams, watermarks are generated periodically using the maximum event time seen in the current period.
When operators run in parallel, watermark propagation becomes more complex. Downstream operators must consider the minimum watermark from all upstream partitions before advancing their logical clock.
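The minimum-of-all-inputs rule can be sketched like this (a simulation of the mechanism with hypothetical names, not Flink internals): the operator's event-time clock only advances to the slowest upstream partition.

```python
class WatermarkMerger:
    """Track the watermark of each upstream partition; the operator's own
    event-time clock advances only to the minimum across all inputs."""

    def __init__(self, num_inputs):
        self.inputs = [float("-inf")] * num_inputs

    def advance(self, channel, watermark):
        # Watermarks are monotonically increasing per channel.
        self.inputs[channel] = max(self.inputs[channel], watermark)
        return min(self.inputs)  # the operator's current logical clock

m = WatermarkMerger(2)
print(m.advance(0, 12))  # -inf: channel 1 has not reported yet
print(m.advance(1, 13))  # 12: clock moves to min(12, 13)
print(m.advance(0, 20))  # 13: still held back by channel 1
```

This is why one slow or idle upstream partition can stall event time for the whole downstream operator.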
Late Data Handling
Out‑of‑order or delayed data can cause incorrect results. Watermark delay (e.g., 2 seconds) allows the system to wait briefly for late events before triggering window computation.
By also configuring a window‑close delay (e.g., 5 seconds), windows remain open after computation, allowing severely late data to be processed via side‑output streams.
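The three possible fates of an arriving event can be sketched as a small decision function (an illustrative simplification with hypothetical names; in Flink these behaviors correspond to watermark delay, allowed lateness, and side outputs):

```python
def route_late_event(window_end, watermark, allowed_lateness):
    """Decide what happens to an event targeting a window ending at window_end."""
    if watermark < window_end:
        return "buffer"       # window has not fired yet: just add the element
    if watermark < window_end + allowed_lateness:
        return "update"       # late, but within lateness: recompute the window
    return "side_output"      # too late: divert to a side-output stream

print(route_late_event(window_end=300, watermark=298, allowed_lateness=5))  # buffer
print(route_late_event(window_end=300, watermark=303, allowed_lateness=5))  # update
print(route_late_event(window_end=300, watermark=320, allowed_lateness=5))  # side_output
```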
Setting appropriate watermark and window‑close delays balances the need for timely results with tolerance for late data.
Conclusion
Due to space limits, this article only covers a portion of data computation. Building a fault‑tolerant, precise computation service also requires handling operator failures, load imbalance, and other scenarios. Discussion and contributions are welcome.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.