Understanding Flink Metrics System: Core Concepts, Elastic Design, and Practical Usage
This article explains Flink's metrics architecture: core concepts, reporter interfaces, built-in and custom metric types, the elastic plugin design, and scheduled reporting, illustrated with a consumption-latency example. It also shows how Didi uses these metrics for real-time UI curves, alerting, and intelligent task diagnosis.
Flink metrics are essential for observing the health of Flink jobs, acting as the eyes of a Flink task. On Didi's data development platform they back the data curves of the real-time operations view and drive alerts such as consumption latency and checkpoint failure.
The article first introduces the core concepts of the Flink metrics system. MetricReporters are the interfaces used to export metric data; they are configured via flink-conf.yaml and have built-in implementations such as Prometheus and Datadog. Didi does not use the community reporters; instead it built a custom flink-metrics-kafka reporter that pushes metrics to a Kafka topic.
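A reporter is wired in through flink-conf.yaml. A sketch for the built-in Prometheus reporter follows; the reporter name "prom", the port, and the interval are illustrative values, not settings from the article:

```yaml
# Register a reporter named "prom" backed by the built-in Prometheus factory
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249
# Reporting interval, honored for reporters that implement Scheduled
metrics.reporter.prom.interval: 60 SECONDS
```

A custom reporter such as Didi's flink-metrics-kafka would be registered the same way, with its own factory class and jar placed on the cluster's classpath or plugins directory.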
Four main metric types are defined by Flink: Counter, Gauge, Histogram, and Meter. Didi also created a custom type called Metered, which records the average, maximum, minimum, and count of a value over a time window. The View interface gives a metric periodic-refresh capability, and a Scope acts as a namespace prefix that distinguishes metric sources.
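The semantics of these types can be sketched in a few lines. Flink's real types are Java interfaces; the Python classes below, including the `Metered` window statistics, are simplified stand-ins, and the reset-on-snapshot behavior is an assumption about how a windowed type would report:

```python
class Counter:
    """Monotonically increasing count, e.g. numRecordsIn."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """Instantaneous value supplied by a callback, e.g. JVM heap used."""
    def __init__(self, fn):
        self.fn = fn

    def get_value(self):
        return self.fn()


class Meter:
    """Event rate over time; backed by a counter of marked events."""
    def __init__(self):
        self.counter = Counter()

    def mark_event(self, n=1):
        self.counter.inc(n)


class Metered:
    """Sketch of Didi's custom type: avg/max/min/count over a time window."""
    def __init__(self):
        self.values = []

    def update(self, v):
        self.values.append(v)

    def snapshot(self):
        vs = self.values or [0]
        stats = {"avg": sum(vs) / len(vs), "max": max(vs),
                 "min": min(vs), "count": len(self.values)}
        self.values.clear()  # assumed: window resets after each report
        return stats
```

Histogram is omitted above; it would additionally track value distribution (quantiles), which the simpler `Metered` avoids by keeping only four aggregates.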
The elastic design of the Flink metrics system is explained next. The MetricReporter interface defines open and close lifecycle methods as well as notifyOfAddedMetric and notifyOfRemovedMetric. Reporters are instantiated via a MetricReporterFactory loaded through Java SPI (or a custom PluginManager). The ReporterSetup class loads the factories, creates reporter instances, and registers them with the MetricRegistry.
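The lifecycle and wiring described above can be sketched as follows. This is a conceptual Python rendering of the Java interfaces: `LogReporter`, the `FACTORIES` dict (standing in for SPI discovery), and the simplified `MetricRegistry` are all hypothetical:

```python
class MetricReporter:
    """Simplified shape of Flink's MetricReporter interface."""
    def open(self, config): ...
    def close(self): ...
    def notify_of_added_metric(self, metric, name, group): ...
    def notify_of_removed_metric(self, metric, name, group): ...


class LogReporter(MetricReporter):
    """Toy reporter that just remembers the metrics it was told about."""
    def __init__(self):
        self.metrics = {}
        self.opened = False

    def open(self, config):
        self.opened = True

    def notify_of_added_metric(self, metric, name, group):
        self.metrics[f"{group}.{name}"] = metric

    def notify_of_removed_metric(self, metric, name, group):
        self.metrics.pop(f"{group}.{name}", None)


# Stand-in for MetricReporterFactory discovery: Flink uses Java SPI or a
# PluginManager to find factories; here a plain dict maps names to them.
FACTORIES = {"log": LogReporter}


class MetricRegistry:
    """Holds reporters and forwards metric (un)registration to each one."""
    def __init__(self):
        self.reporters = []

    def add_reporter(self, name, config=None):
        reporter = FACTORIES[name]()     # the ReporterSetup role
        reporter.open(config or {})
        self.reporters.append(reporter)
        return reporter

    def register(self, metric, name, group):
        for r in self.reporters:
            r.notify_of_added_metric(metric, name, group)
```

The elasticity comes from the indirection: adding a new backend means shipping a new factory on the classpath, with no change to the registry, which is the open-closed principle the article's conclusion points to.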
Periodic reporting is achieved by reporters that implement the Scheduled interface. The MetricRegistry maintains a ScheduledExecutorService that triggers a ReporterTask for each such reporter at the configured reporting interval (e.g., 10 seconds or 1 minute).
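This scheduling loop can be sketched deterministically. Real Flink drives a ScheduledExecutorService on background threads; the sketch below advances time manually instead, and `KafkaLikeReporter` and `ReporterScheduler` are hypothetical names:

```python
class Scheduled:
    """Reporters implementing this are polled at a fixed interval."""
    def report(self): ...


class KafkaLikeReporter(Scheduled):
    """Toy periodic reporter; a real one would push a snapshot to Kafka."""
    def __init__(self):
        self.reports = []

    def report(self):
        self.reports.append("snapshot")


class ReporterScheduler:
    """Stand-in for the registry's ScheduledExecutorService: fires each
    reporter's ReporterTask whenever its interval has elapsed."""
    def __init__(self):
        self.tasks = []  # each task: [reporter, interval_s, next_due_s]

    def schedule(self, reporter, interval_s):
        self.tasks.append([reporter, interval_s, interval_s])

    def advance(self, now_s):
        # Time is driven manually here so the sketch stays deterministic.
        for task in self.tasks:
            while task[2] <= now_s:
                task[0].report()
                task[2] += task[1]


sched = ReporterScheduler()
rep = KafkaLikeReporter()
sched.schedule(rep, interval_s=10)
sched.advance(60)  # one minute of simulated time: 6 reports at 10 s each
```

A reporter that does not implement Scheduled (for example one exposing a pull endpoint, as Prometheus does) simply never enters this loop.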
An example is given for the consumption-latency metric. The metric is registered by the ConnectorSourceIOMetricGroup and calculated as the maximum of (source-machine timestamp − message-queue timestamp) across consumed messages. The calculation is performed by a custom MeteredView, which keeps two circular buffers of size 13 to store recent latency values and report counts, and refreshes the 1-minute average every 5 seconds.
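The ring-buffer arithmetic can be sketched as follows, under the assumption that the two buffers hold cumulative latency sums and cumulative record counts, so that the 1-minute average is a difference of snapshots taken 12 five-second ticks apart (the 13th slot is what makes the newest and oldest snapshots sit exactly one minute apart). The class is a conceptual stand-in, not Flink or Didi code:

```python
WINDOW_SLOTS = 13  # 12 five-second intervals span one minute, plus current slot


class MeteredView:
    """Sketch of the custom view: two size-13 ring buffers of cumulative
    latency sums and record counts; every 5-second tick advances the ring,
    and the 1-minute average is (sum delta) / (count delta) across it."""
    def __init__(self):
        self.sums = [0.0] * WINDOW_SLOTS
        self.counts = [0] * WINDOW_SLOTS
        self.idx = 0
        self.cur_sum = 0.0
        self.cur_count = 0

    def record(self, latency_ms):
        """Accumulate one observed latency value."""
        self.cur_sum += latency_ms
        self.cur_count += 1

    def tick(self):
        """Called every 5 seconds (the view's update interval)."""
        self.idx = (self.idx + 1) % WINDOW_SLOTS
        self.sums[self.idx] = self.cur_sum
        self.counts[self.idx] = self.cur_count

    def one_minute_avg(self):
        oldest = (self.idx + 1) % WINDOW_SLOTS  # slot written 60 s ago
        n = self.counts[self.idx] - self.counts[oldest]
        if n <= 0:
            return 0.0
        return (self.sums[self.idx] - self.sums[oldest]) / n
```

Storing cumulative snapshots rather than per-interval values keeps `tick` O(1) and avoids re-summing the window on every refresh.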
Finally, the article describes three main usage scenarios of Flink metrics inside Didi: (1) data curves displayed in the real‑time monitoring UI, (2) monitoring and alerting (e.g., consumption‑latency and checkpoint‑failure alerts), and (3) intelligent diagnosis of real‑time tasks, where decision trees rely on a large set of Flink metrics.
The author concludes that studying Flink’s metrics architecture and its practical applications provides deep insights into both framework design (open‑closed principle, plugin mechanism) and operational monitoring of large‑scale streaming jobs.
Didi Tech
Official Didi technology account