
Flink Web UI Monitoring and End‑to‑End Latency Implementation Guide

This article explains the key monitoring items of the Flink Web UI, details task topology, operator and system metrics, checkpoint and log inspection, and provides two practical solutions—custom metrics and distributed tracing—to measure and visualize full‑chain latency in Flink jobs.


1. Overview

Task Basic Information

Task status (RUNNING/FAILING/FINISHED): quickly determine if the job is running normally.

Parallelism: total parallelism and per‑operator parallelism to assess resource allocation.

Data rate: input/output records per second, used to check whether throughput is stable.

Latency Metrics

Event‑time lag: how far event time (typically the current watermark) lags behind processing time when event‑time semantics are enabled.

Processing‑time latency: average time a record spends in an operator, useful for locating bottlenecks.
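
The per‑operator latency figures shown in the Web UI come from Flink's latency markers, which are disabled by default. A minimal sketch of enabling them (the 1000 ms interval is an arbitrary example):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Emit a latency marker from every source every 1000 ms; the resulting "latency"
// histograms then appear under the operator metrics in the Web UI.
env.getConfig().setLatencyTrackingInterval(1000);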

Backpressure Status

Displayed as colored blocks (green/yellow/red) per operator; red indicates data processing blockage that requires immediate investigation.

Checkpoint Status

Shows checkpoint success/failure, duration and size to judge state management health.

2. Job Graph (Task Topology)

Operator Visualization

Graphical view of operators (Source/Map/Window/Sink) and their subtasks; arrows indicate data flow direction.

Color intensity reflects load (red = high load) for quick bottleneck identification.

3. Metrics

Operator‑Level Metrics

numRecordsIn / numRecordsOut: input/output record counts (with the corresponding numRecordsInPerSecond / numRecordsOutPerSecond rates), used to detect data backlog.

latency: the latency distribution reported by latency markers from the sources to this operator (ms); high values may stem from complex logic or insufficient resources.

Buffer usage (e.g., inPoolUsage / outPoolUsage): how full the network buffer pools are; a saturated pool indicates that a slow downstream operator is causing backpressure.
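
Beyond the built-in metrics, user code can register its own counters through the same operator metric group; a minimal sketch, in which CountingMapper and the filteredRecords counter are illustrative names:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class CountingMapper extends RichMapFunction<String, String> {
    private transient Counter filteredRecords; // example custom counter

    @Override
    public void open(Configuration parameters) {
        // Registered under the operator's metric group, so it appears in the
        // Web UI Metrics tab next to numRecordsIn / numRecordsOut.
        filteredRecords = getRuntimeContext().getMetricGroup().counter("filteredRecords");
    }

    @Override
    public String map(String value) {
        if (value.isEmpty()) {
            filteredRecords.inc(); // count records treated as filtered
        }
        return value;
    }
}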

TaskManager Metrics

Resource utilization: CPU usage, heap/non‑heap memory, JVM GC time to locate resource bottlenecks.

Network throughput: network out/in rates to determine if bandwidth limits cause latency.

JobManager Metrics

Checkpoint related: numberOfCompletedCheckpoints, lastCheckpointDuration, numberOfInProgressCheckpoints for checkpoint tuning.

Slot and scheduling metrics: e.g., taskSlotsAvailable / taskSlotsTotal to evaluate resource allocation.

4. Task Manager

Node Status

Shows each TaskManager's IP, port, managed task count and resource usage trends to identify faulty or unevenly loaded nodes.

Backpressure Detection

Lightweight detection via thread‑stack sampling (OK/LOW/HIGH); high backpressure requires downloading thread dumps for deeper analysis.

Thread dump: detailed stack information to pinpoint code‑level blockage such as lock contention or I/O wait.

5. Checkpoint Monitoring

Checkpoint List

Displays status, start time, duration, size and interval of all checkpoints; failed checkpoints need log analysis (e.g., oversized state, storage errors).

Performance Metrics

Alignment time: time operators spend waiting for checkpoint barriers from all input channels; long alignment times can hurt throughput.

State size: checkpoint data volume; continuous growth requires state cleanup or window lifecycle adjustment.
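
Long alignment times and growing state are usually tackled through the checkpoint configuration; a minimal sketch of the relevant knobs (all interval and timeout values are arbitrary examples):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(60_000);                                 // checkpoint every 60 s
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setCheckpointTimeout(120_000);         // fail checkpoints stuck longer than 2 min
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000); // leave room for normal processing
env.getCheckpointConfig().enableUnalignedCheckpoints();          // cut alignment time under backpressure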

Restart Time

Interval and count of automatic restarts after failures to assess job stability.
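
The restarts observed here are governed by the job's restart strategy; a minimal sketch using a fixed-delay strategy (3 attempts, 10 s apart, both values illustrative; env is the StreamExecutionEnvironment as above):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;

// Restart at most 3 times, waiting 10 seconds between attempts; afterwards the job fails permanently.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));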

6. Logs

Real‑time Log Snippets

View the latest TaskManager/JobManager logs directly in the Web UI to quickly locate errors such as OOM or operator exceptions.

Full Log Download

Download log files for deep analysis and correlate with metrics to trace root causes like code bugs or dependency conflicts.

How to Monitor Full‑Chain Latency?

Full‑chain latency measures the time from data entering a Flink job (Source) to leaving it (Sink). The Web UI does not provide this metric directly; two common approaches are presented.

Solution 1: Custom End‑to‑End Latency Metric

Record entry time in the Source and exit time in the Sink, then expose the latency via Flink Metrics.

In the Source, stamp the entry timestamp onto each record so the Sink can compute the end‑to‑end latency.

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class CustomSource extends RichSourceFunction<Event> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        while (running) {
            Event event = fetchNextEvent();                       // simulate fetching the next record
            event.setSourceEntryTime(System.currentTimeMillis()); // stamp the entry time on the record
            ctx.collect(event);
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

In the Sink, compute latency and report it.

import com.codahale.metrics.SlidingWindowReservoir;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CustomSink extends RichSinkFunction<Event> {
    private transient Histogram endToEndLatency; // stores the latency distribution

    @Override
    public void open(Configuration parameters) {
        // Flink's Histogram is an interface; wrap a Dropwizard histogram (flink-metrics-dropwizard dependency)
        endToEndLatency = getRuntimeContext().getMetricGroup().histogram(
                "end_to_end_latency",
                new DropwizardHistogramWrapper(
                        new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
    }

    @Override
    public void invoke(Event event, Context context) {
        long sourceEntryTime = event.getSourceEntryTime();           // entry time stamped by the source
        long latency = System.currentTimeMillis() - sourceEntryTime;
        endToEndLatency.update(latency);                             // report to the Flink metrics system
        // write the record to the external system (DB, Kafka, etc.)
    }
}
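
Both snippets assume an Event POJO along the following lines (field and accessor names are illustrative, chosen to match the calls above; the traceId field is used by Solution 2 below):

public class Event {
    private String value;
    private String traceId;       // used by Solution 2 (distributed tracing)
    private long sourceEntryTime; // stamped by the source, read by the sink

    public Event() {} // no-arg constructor required for Flink POJO serialization

    public Event(String value, String traceId, long sourceEntryTime) {
        this.value = value;
        this.traceId = traceId;
        this.sourceEntryTime = sourceEntryTime;
    }

    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
    public String getTraceId() { return traceId; }
    public void setTraceId(String traceId) { this.traceId = traceId; }
    public long getSourceEntryTime() { return sourceEntryTime; }
    public void setSourceEntryTime(long sourceEntryTime) { this.sourceEntryTime = sourceEntryTime; }
}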

View the metric in the Web UI or external monitoring (Prometheus/Grafana).

# Average end-to-end latency over the last 5 minutes
# (average = sum / count; the exported metric names depend on the Prometheus reporter's scope configuration)
avg_over_time(flink_metrics_end_to_end_latency_sum[5m]) /
avg_over_time(flink_metrics_end_to_end_latency_count[5m])
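
The query also assumes the Prometheus metrics reporter is enabled; a minimal flink-conf.yaml sketch (the port is an example, and in practice the reporter prefixes each metric with its job/task/operator scope, so adjust the metric names above accordingly):

# flink-conf.yaml: expose Flink metrics to Prometheus on each JobManager/TaskManager
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249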

Solution 2: Distributed Tracing Integration (OpenTelemetry / Zipkin / SkyWalking)

Generate a unique Trace ID for each record, propagate it through operators, and report timestamps to a tracing system for visual latency analysis.

Generate Trace ID in the Source.

// kafkaSource is a KafkaSource<String> built via KafkaSource.<String>builder()...build()
DataStream<Event> stream = env
    .fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-source")
    .map(value -> {
        String traceId = UUID.randomUUID().toString();                // unique trace id per record
        return new Event(value, traceId, System.currentTimeMillis()); // carry trace id + entry timestamp
    });

Inject tracing information in operators.

// TracingFunction stands for a user-defined OneInputStreamOperator<Event, Event> that reports
// span timestamps for each record's trace id to the tracing backend.
stream = stream.transform("OperatorName", TypeInformation.of(Event.class), new TracingFunction<>());
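
Since transform() expects a full OneInputStreamOperator, the same effect can also be achieved with a plain RichMapFunction. Assuming the OpenTelemetry Java API is on the classpath, a minimal sketch of such a tracing step (class name, tracer name and attribute key are illustrative):

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class TracingMapFunction extends RichMapFunction<Event, Event> {
    private transient Tracer tracer;

    @Override
    public void open(Configuration parameters) {
        tracer = GlobalOpenTelemetry.getTracer("flink-job"); // instrumentation name is illustrative
    }

    @Override
    public Event map(Event event) {
        // One span per record and operator; the record's trace id is attached as an attribute
        // so the tracing backend can stitch the per-operator spans into one end-to-end trace.
        Span span = tracer.spanBuilder("OperatorName").startSpan();
        span.setAttribute("record.trace_id", event.getTraceId());
        try {
            return event; // the operator's real processing logic would run here
        } finally {
            span.end();
        }
    }
}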

Use the tracing system to view per‑operator latency via the Trace ID, locating bottlenecks.

In production, the preferred approach is to combine custom metrics with Prometheus/Grafana and optionally add distributed tracing for full‑chain visualization.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Monitoring, Big Data, Flink, Metrics, Latency, Distributed Tracing, Web UI
Written by Big Data Technology & Architecture (Wang Zhiwu), a big data expert dedicated to sharing big data technology.
