Building an Enterprise‑Level Flink Monitoring System with Prometheus, Grafana and Pushgateway
This article explains how to use the Cloud Native Prometheus ecosystem—including Prometheus Server, exporters, Pushgateway, Alertmanager and Grafana—to collect, store, query and visualize Flink job metrics, providing a complete monitoring solution for production clusters.
Before diving into the tutorial, the author asks how companies currently monitor data‑sync, real‑time computation, or scheduling jobs on production clusters. Typical answers are custom‑built solutions, ELK, or Zabbix; the article proposes the Prometheus stack as a better alternative.
Prometheus, created in 2012 by former Google engineers working at SoundCloud, joined the CNCF in 2016 and graduated in 2018. It has become the de‑facto monitoring and alerting system for Kubernetes and the broader Cloud Native ecosystem.
Key advantages of Prometheus include a flexible data model with labels, powerful PromQL query language, a rich ecosystem of exporters and client libraries, strong performance (up to 100k samples per second per instance), a pull‑based architecture that simplifies service discovery, and a vibrant open‑source community.
The Prometheus architecture consists of a Prometheus Server that scrapes metrics from targets or Pushgateway, stores them in a built‑in TSDB, evaluates rules, and forwards alerts to Alertmanager; optional components such as exporters, client libraries, Pushgateway and various visualization tools (e.g., Grafana) extend its capabilities.
Metrics are stored as time series identified by a metric name and a set of labels; each sample consists of a millisecond‑precision timestamp and a float64 value. Prometheus defines four core metric types: Counter, Gauge, Histogram and Summary. Flink’s own metric types (Counter, Gauge, Histogram and Meter) are mapped onto these when the metrics are exported.
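To make the data model concrete, the following minimal Python sketch renders one sample as a line of the Prometheus text exposition format (metric name, labels, value, optional millisecond timestamp). The function names are hypothetical helpers written for this illustration, and the metric and label names in the usage note are examples only, not values taken from a real cluster.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value, timestamp_ms=None):
    """Render one sample as a single line of the text exposition format."""
    # Labels are sorted only to make the output deterministic.
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    line = f"{name}{{{label_str}}}" if label_str else name
    # Sample values are float64 in Prometheus.
    line += f" {float(value)!r}"
    if timestamp_ms is not None:
        line += f" {int(timestamp_ms)}"
    return line
```

For example, `format_sample("flink_jobmanager_numRunningJobs", {"job": "flinkjobs", "host": "node1"}, 3)` yields one line with the two labels in braces followed by the value. Every distinct combination of metric name and label values is a separate time series.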
Installation steps:
tar xvfz prometheus-*.tar.gz
cd prometheus-*
# view version
./prometheus --version
# start server
./prometheus --config.file=prometheus.yml
Grafana can be installed via RPM and started with:
rpm -ivh grafana-6.5.2-1.x86_64.rpm
service grafana-server start
Node Exporter and Pushgateway are installed similarly, then added to the Prometheus scrape configuration.
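Once the components are running, a quick reachability check can confirm that each one answers over HTTP. The sketch below is one possible way to do that in Python. It assumes the default ports (9090, 9091, 9100, 3000) and the readiness or health paths that current releases of each component expose; both are assumptions to verify against the installed versions, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed default ports and health paths; adjust to match your installation.
DEFAULT_ENDPOINTS = {
    "prometheus": (9090, "/-/ready"),
    "pushgateway": (9091, "/-/ready"),
    "node_exporter": (9100, "/metrics"),
    "grafana": (3000, "/api/health"),
}


def endpoint_urls(host="localhost"):
    """Build the URL to probe for each component on the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in DEFAULT_ENDPOINTS.items()
    }


def is_up(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or a non-2xx response.
        return False
```

Looping over `endpoint_urls("node1").items()` and calling `is_up(url)` on each entry gives a simple up/down report before any dashboards are built.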
Configuration examples:
# flink-conf.yaml
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: node1
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flinkjobs
metrics.reporter.promgateway.randomJobNameSuffix: false
metrics.reporter.promgateway.deleteOnShutdown: true
# prometheus.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus'
  - job_name: 'linux'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'localhost'
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: 'pushgateway'
After launching Flink, Node Exporter, Pushgateway, Prometheus and Grafana, the Grafana dashboard can query Prometheus for Flink job metrics such as JobManager status, checkpoint progress, TaskManager memory, and operator traffic, enabling detailed performance analysis and back‑pressure detection.
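The same queries that Grafana issues can be run directly against the Prometheus HTTP API, which is handy for checking that Flink metrics have actually arrived. The Python sketch below builds an instant query URL for `/api/v1/query` and parses a vector result. The helper names are hypothetical, and any Flink metric name passed in (for example `flink_jobmanager_job_lastCheckpointDuration`) is an assumption that should be checked against what the Pushgateway really exposes on its `/metrics` page.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(base_url, promql, timeout=5.0):
    """Run the query against a live server and return the parsed samples."""
    url = instant_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

Calling `instant_query("http://localhost:9090", "up")` against the configuration above should return one sample per scrape job, which is a quick way to confirm that the `pushgateway` target is being scraped before building dashboards.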
An industry case from Tongcheng shows a production‑grade monitoring stack where a custom Go agent pushes metrics to Pushgateway, multiple Prometheus instances scrape data at 10‑second intervals, Alertmanager handles notifications, and Grafana visualizes the data; a single physical server handles 90‑100k samples per second with room for scaling.
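The push path used by such an agent can be sketched in a few lines. The Tongcheng agent is written in Go and its code is not shown in the source, so the Python below is not that implementation. It only illustrates the Pushgateway HTTP convention of sending text‑format metrics to `/metrics/job/<job>/<label>/<value>`. The function name, host, job and metric names are all hypothetical.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway, job, grouping, body, replace=True):
    """Build (but do not send) a push request for a Pushgateway.

    PUT replaces every metric in the group; POST replaces only the
    metrics whose names appear in the body.
    """
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for label, value in grouping.items():
        if "/" in value:
            # Values containing slashes need the gateway's base64 path
            # encoding, which this sketch deliberately leaves out.
            raise ValueError(f"label value with '/' not supported: {value!r}")
        path += f"/{urllib.parse.quote(label, safe='')}/{urllib.parse.quote(value, safe='')}"
    if not body.endswith("\n"):
        body += "\n"  # the body must end with a line feed
    return urllib.request.Request(
        gateway.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT" if replace else "POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

Passing the returned request to `urllib.request.urlopen` performs the push. Because Prometheus then scrapes the Pushgateway on its own schedule, the effective resolution of pushed metrics is bounded by the scrape interval (10 seconds in the case described above), not by how often the agent pushes.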
Overall, the guide demonstrates how to assemble a Cloud Native, operation‑focused monitoring solution for Flink workloads using Prometheus and its ecosystem.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
