Flink Web UI Monitoring and End‑to‑End Latency Implementation Guide
This article explains the key monitoring items of the Flink Web UI, details task topology, operator and system metrics, checkpoint and log inspection, and provides two practical solutions—custom metrics and distributed tracing—to measure and visualize full‑chain latency in Flink jobs.
1. Overview
Task Basic Information
Task status (RUNNING/FAILING/FINISHED): quickly determine if the job is running normally.
Parallelism: total parallelism and per‑operator parallelism to assess resource allocation.
Data rate: input/output records per second, indicating stable throughput.
Latency Metrics
Event‑time lag: with event time enabled, the gap between the current processing time and the watermark, showing how far processing trails the incoming data.
Processing‑time latency: average time a record spends in an operator, useful for locating bottlenecks.
Backpressure Status
Displayed as colored blocks (green/yellow/red) per operator; red indicates data processing blockage that requires immediate investigation.
Checkpoint Status
Shows checkpoint success/failure, duration and size to judge state management health.
2. Job Graph (Task Topology)
Operator Visualization
Graphical view of operators (Source/Map/Window/Sink) and their subtasks; arrows indicate data flow direction.
Color intensity reflects load (red = high load) for quick bottleneck identification.
3. Metrics
Operator‑Level Metrics
numRecordsIn / numRecordsOut: cumulative input/output record counts (with numRecordsInPerSecond / numRecordsOutPerSecond rate variants), used to detect data backlog.
latency: average processing latency per record (ms); high values may stem from complex logic or insufficient resources.
inPoolUsage / outPoolUsage: network buffer pool usage; a persistently full output pool on an operator indicates a slow downstream consumer (backpressure).
TaskManager Metrics
Resource utilization: CPU usage, heap/non‑heap memory, JVM GC time to locate resource bottlenecks.
Network throughput: network out/in rates to determine if bandwidth limits cause latency.
JobManager Metrics
Checkpoint related: numberOfCompletedCheckpoints, lastCheckpointDuration and numberOfInProgressCheckpoints for checkpoint tuning.
Scheduling metrics: e.g., scheduledTasks to evaluate resource allocation.
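The per-second rates shown in the Web UI are derived from these cumulative counters. A minimal plain-Java sketch, using made-up snapshot values, of turning two numRecordsIn readings into a records-per-second rate:

```java
public class RateFromCounters {
    // Computes records/second from two cumulative counter snapshots.
    static double ratePerSecond(long countEarlier, long countLater, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return (countLater - countEarlier) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots taken 10 seconds apart.
        long first = 1_200_000L;
        long second = 1_450_000L;
        System.out.println("numRecordsIn rate: "
                + ratePerSecond(first, second, 10_000) + " records/s"); // 25000.0 records/s
    }
}
```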
4. Task Manager
Node Status
Shows each TaskManager's IP, port, managed task count and resource usage trends to identify faulty or unevenly loaded nodes.
Backpressure Detection
Lightweight detection via thread‑stack sampling (OK/LOW/HIGH); high backpressure requires downloading thread dumps for deeper analysis.
Thread dump: detailed stack information to pinpoint code‑level blockage such as lock contention or I/O wait.
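The same stack information a downloaded thread dump contains can also be captured programmatically with the standard ThreadMXBean API, which is handy for automated analysis; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    // Captures a dump of all live threads, including locked monitors and
    // synchronizers, similar in content to a Web UI thread dump.
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The current thread always appears in its own dump.
        System.out.println(dumpAllThreads().contains("main")
                ? "main thread found" : "not found");
    }
}
```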
5. Checkpoint Monitoring
Checkpoint List
Displays status, start time, duration, size and interval of all checkpoints; failed checkpoints need log analysis (e.g., oversized state, storage errors).
Performance Metrics
Alignment time: time operators spend waiting for checkpoint barriers from all input channels; long alignment can hurt throughput.
State size: checkpoint data volume; continuous growth requires state cleanup or window lifecycle adjustment.
Restart Time
Interval and count of automatic restarts after failures to assess job stability.
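The "continuously growing state" pattern mentioned above can be flagged automatically from recent checkpoint sizes; a minimal plain-Java sketch over hypothetical values:

```java
import java.util.List;

public class CheckpointSizeTrend {
    // Flags strictly monotonic growth across recent checkpoint sizes (bytes),
    // the pattern that usually calls for state cleanup or window/TTL tuning.
    static boolean isContinuouslyGrowing(List<Long> sizes) {
        for (int i = 1; i < sizes.size(); i++) {
            if (sizes.get(i) <= sizes.get(i - 1)) {
                return false;
            }
        }
        return sizes.size() >= 2;
    }

    public static void main(String[] args) {
        // Hypothetical sizes of the last five checkpoints.
        List<Long> growing = List.of(100L, 140L, 190L, 260L, 350L);
        List<Long> stable  = List.of(120L, 118L, 121L, 119L, 120L);
        System.out.println(isContinuouslyGrowing(growing)); // true
        System.out.println(isContinuouslyGrowing(stable));  // false
    }
}
```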
6. Logs
Real‑time Log Snippets
View the latest TaskManager/JobManager logs directly in the Web UI to quickly locate errors such as OOM or operator exceptions.
Full Log Download
Download log files for deep analysis and correlate with metrics to trace root causes like code bugs or dependency conflicts.
How to Monitor Full‑Chain Latency?
Full‑chain latency measures the time from data entering a Flink job (Source) to leaving it (Sink). The Web UI does not provide this metric directly; two common approaches are presented.
Solution 1: Custom End‑to‑End Latency Metric
Record entry time in the Source and exit time in the Sink, then expose the latency via Flink Metrics.
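Before wiring this into Flink operators, the core computation can be sketched in plain Java (the Event type here is a hypothetical stand-in for the job's record class): the source stamps an entry time on each event, and the sink computes exit minus entry.

```java
public class LatencySketch {
    // Minimal event carrying the timestamp set at the source.
    static final class Event {
        final String payload;
        final long sourceEntryTime; // millis, stamped when the event enters the job
        Event(String payload, long sourceEntryTime) {
            this.payload = payload;
            this.sourceEntryTime = sourceEntryTime;
        }
    }

    // What the sink does: exit time minus the entry time carried on the event.
    static long endToEndLatencyMillis(Event event, long sinkExitTime) {
        return sinkExitTime - event.sourceEntryTime;
    }

    public static void main(String[] args) {
        Event e = new Event("order-42", 1_000L); // entered the job at t = 1000 ms
        System.out.println(endToEndLatencyMillis(e, 1_250L)); // 250
    }
}
```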
In the Source, stamp the entry timestamp on each event so the Sink can read it back (sketch; fetchNextEvent and the Event setter are assumed helpers).
public class CustomSource extends RichSourceFunction<Event> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        while (running) {
            Event event = fetchNextEvent(); // simulate data fetch
            event.setSourceEntryTime(System.currentTimeMillis()); // record entry time on the event
            ctx.collect(event);
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
In the Sink, compute latency and report it.
public class CustomSink extends RichSinkFunction<Event> {
    private transient Histogram endToEndLatency; // stores the latency distribution

    @Override
    public void open(Configuration parameters) {
        // Histogram is an interface; DescriptiveStatisticsHistogram is one
        // implementation shipped with Flink (a Dropwizard wrapper also works).
        endToEndLatency = getRuntimeContext().getMetricGroup()
                .histogram("end_to_end_latency", new DescriptiveStatisticsHistogram(10_000));
    }

    @Override
    public void invoke(Event event, Context context) {
        long sourceEntryTime = event.getSourceEntryTime(); // stamped by the Source
        long sinkExitTime = System.currentTimeMillis();
        long latency = sinkExitTime - sourceEntryTime;
        endToEndLatency.update(latency); // report to Metrics
        // write the event to the external system (DB, Kafka, etc.)
    }
}
View the metric in the Web UI or in external monitoring (Prometheus/Grafana).
# 95th-percentile end-to-end latency; the exact metric name depends on the
# reporter's scope configuration
flink_taskmanager_job_task_operator_end_to_end_latency{quantile="0.95"}
Solution 2: Distributed Tracing Integration (OpenTelemetry / Zipkin / SkyWalking)
Generate a unique Trace ID for each record, propagate it through operators, and report timestamps to a tracing system for visual latency analysis.
Generate Trace ID in the Source.
DataStream<Event> stream = env
        .fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-source")
        .map(record -> {
            String traceId = UUID.randomUUID().toString(); // unique per record
            return new Event(record.getValue(), traceId, System.currentTimeMillis());
        });
Inject tracing information in operators (TracingFunction stands for a custom operator that reports a timestamped span for each record).
stream.transform("OperatorName", TypeInformation.of(Event.class), new TracingFunction<>());
Use the tracing system to view per-operator latency by Trace ID and locate bottlenecks.
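What the tracing backend reconstructs from the reported timestamps can be sketched in plain Java (TraceTimeline and its methods are illustrative, not a Flink or tracing-system API): each hop appends an (operator, timestamp) pair under the record's Trace ID, and per-operator latency is the difference between consecutive timestamps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TraceTimeline {
    record Hop(String operator, long timestampMillis) {}

    private final Map<String, List<Hop>> hopsByTraceId = new LinkedHashMap<>();

    // Called by each operator as the record passes through.
    void report(String traceId, String operator, long timestampMillis) {
        hopsByTraceId.computeIfAbsent(traceId, id -> new ArrayList<>())
                     .add(new Hop(operator, timestampMillis));
    }

    // Latency between consecutive hops for one Trace ID.
    List<String> perHopLatencies(String traceId) {
        List<Hop> hops = hopsByTraceId.getOrDefault(traceId, List.of());
        List<String> out = new ArrayList<>();
        for (int i = 1; i < hops.size(); i++) {
            long delta = hops.get(i).timestampMillis() - hops.get(i - 1).timestampMillis();
            out.add(hops.get(i - 1).operator() + " -> " + hops.get(i).operator()
                    + ": " + delta + " ms");
        }
        return out;
    }

    public static void main(String[] args) {
        TraceTimeline timeline = new TraceTimeline();
        String traceId = "trace-1"; // would be a UUID in the job
        timeline.report(traceId, "Source", 1_000L);
        timeline.report(traceId, "Map", 1_030L);
        timeline.report(traceId, "Sink", 1_110L);
        timeline.perHopLatencies(traceId).forEach(System.out::println);
        // Source -> Map: 30 ms
        // Map -> Sink: 80 ms
    }
}
```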
In production, the preferred approach is to combine custom metrics with Prometheus/Grafana and optionally add distributed tracing for full‑chain visualization.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
