Big Data 6 min read

Understanding Flink LatencyMarker: End-to-End Delay Measurement and Implementation Details

This article explains the background, source‑code analysis, and practical implementation of Flink's LatencyMarker feature for measuring end‑to‑end job latency, including metric exposure, configuration options, and code snippets illustrating how latency markers are emitted and processed within the streaming pipeline.

Big Data Technology & Architecture

Dec 1, 2019

Understanding Flink LatencyMarker: End-to-End Delay Measurement and Implementation Details

Background – Flink job end‑to‑end latency is a critical metric for assessing overall performance and response time of streaming applications, especially those requiring low latency such as login, order rule checking, and real‑time prediction.

Through a comparison of competing stream‑processing engines, it was observed that most provide full‑chain latency metrics on their monitoring dashboards, prompting the need for a measurable metric in Flink.

Source‑code analysis origin – The analysis is based on Flink community issue FLINK‑3660 and the corresponding pull request (pull‑2386). The PR only covers part of the full‑chain latency implementation, so the article summarizes the relevant components:

Source‑to‑Sink latency marker flow

LatencyMarksEmitter class

LatencyStats (histogram metric) implementation

Overall latency measurement architecture diagram

Tencent Oceanus monitoring reference – Screenshots from Oceanus illustrate the latency metric (highlighted in red) that corresponds to the end‑to‑end delay measured by Flink.

Flink LatencyMarker implementation idea – To expose job latency in the Web UI, Flink initially considered attaching an ingestion‑time timestamp to each record, but this added overhead for users not using the monitoring feature. Instead, Flink emits special events called LatencyMarker at configurable intervals (default 0 ms, configurable via ExecutionConfig#latencyTrackingInterval), similar to watermarks.

LatencyMarkers travel through the job graph like regular records, are randomly forwarded by operators, and are finally processed by the sink, which compares the marker timestamp with the current system time to compute latency. The design assumes synchronized clocks across TaskManagers; otherwise, clock drift will affect measurements. A possible improvement is to use the JobManager as a central timing service.

Relevant source code

public class RecordWriterOutput {
  @Override
  public void emitLatencyMarker(LatencyMarker latencyMarker) {
    serializationDelegate.setInstance(latencyMarker);
    try {
      // internal implementation randomly selects a channel
      recordWriter.randomEmit(serializationDelegate);
    }
    catch (Exception e) {
      throw new RuntimeException(e.getMessage(), e);
    }
  }
}

This method is invoked by StreamSource and AbstractStreamOperator to dispatch latency markers from sources and intermediate operators.

If an operator is a sink, it maintains the last 512 latency markers for each source and aggregates min/ max/ avg/ p50/ p95/ p99 values in a LatencyStats object, which can be exposed via the metrics system (path:

TaskManagerJobMetricGroup/operator_id/operator_subtask_index/latency

Summary

LatencyMarker does not participate in window or MiniBatch buffering; it is forwarded directly by intermediate operators.

Metrics are reported per TaskManager and can be used to pinpoint tasks with high latency.

Configurable emission intervals (e.g., 20 seconds) have minimal impact on data processing performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink stream processing Metrics LatencyMarker End-to-End Latency

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.