Big Data 22 min read

How HuoLala Built a Real‑Time Metrics Monitoring Platform for Flink

This article explains how HuoLala’s real‑time R&D platform redesigns Flink metric collection, routing, and alerting using a custom Kafka‑based pipeline, flexible dashboards, and multi‑level metric governance to improve observability, reduce latency, and ensure data quality.

Huolala Tech
Huolala Tech
Huolala Tech
How HuoLala Built a Real‑Time Metrics Monitoring Platform for Flink

Background

Metrics are a key component of system observability, stability, and performance optimization. Flink provides a native metric system, but to be effective it must be combined with highly customizable visualization dashboards and alerting functions. The original Flink real‑time platform relied on open‑source Prometheus and Grafana, which limited metric utility because alerting and operational issues were not well supported. This article introduces a solution built on the company’s alerting system lala‑monitor that enables rapid metric ingestion, flexible control, quality assurance, visualization, configurability, extensibility, and isolation.

Link Evolution

Original pipeline

The original pipeline used Flink’s native Prometheus reporter to push metrics to a gateway node, from which Prometheus scraped the data and Grafana visualized it. Problems included frequent gateway memory exhaustion, blocked metric flow when memory filled, duplicate metric storage across task instances, and unnecessary label overhead.

New pipeline

The new design routes metrics through the company’s monitoring platform, solving storage, visualization, and alert configuration issues. Metrics are reported to Kafka; a consumer process on each project machine reads the Kafka stream and exposes data to lala‑monitor . This approach isolates metrics per project, supports task‑level granularity, and enables downstream reuse of Kafka data for other scenarios such as dimensionality reduction.

Overall Design

2.1 metrics‑collect module

The module gathers metrics from four sources: native Flink task metrics, Flink‑SQL job embedded metrics, user‑defined metrics from Flink JAR jobs, and additional system metrics not provided by Flink.

2.2 Kafka module

Three Kafka topics are used: custom‑collected metrics, Flink native metrics, and pre‑processed metrics.

2.3 metrics‑report module

This module consumes Kafka data, performs preprocessing and control, and exposes two kinds of metrics: business metrics from tasks and process metrics of the metrics‑report collector itself.

2.4 Metrics‑alarm / Metrics‑display module

Provides dashboards and alerting for both end‑users (task‑level views) and administrators (system‑level views).

Metrics‑collect Implementation

3.1 Kafka‑report

A custom Kafka reporter was added because Flink does not ship one. It records metrics as JSON and adds two fields: torrent_task_id (to identify task instances with minimal payload) and report_timestamp (event time for end‑to‑end latency tracking). Records are split per task+metric key to avoid large single Kafka messages and to enable fine‑grained partitioning.

3.2 Metric control

Control logic reduces unnecessary data volume, enables graceful degradation, and prevents uncontrolled metric duplication. It decides which metrics are sent, to which topic, and whether a task should report at all.

Metrics‑report Implementation

4.1 Pre‑processing

Pre‑processing discards irrelevant dimensions (e.g., keeping flink_task_name over flink_task_id) and normalizes label values to reduce label explosion. For example, long task names are replaced with fixed‑length task IDs, and source/sink operators are renamed to their table names.

4.2 Metric control

Metrics‑report repeats the control logic of metrics‑collect as a safety net, allowing dynamic enable/disable without restarting Flink jobs.

4.3 Data‑quality guarantees

Four quality levels are defined: no guarantee, no‑loss, no‑duplicate, and both no‑loss & no‑duplicate. Mechanisms such as WAL, lag markers, and TTL cleanup ensure that metrics are neither lost nor duplicated across rebalances, task restarts, or process crashes.

Dashboard and Alerting

5.1 Dashboards

Multiple dashboards are offered: user‑focused task‑level views with aggregated curves, ratio‑based metrics (e.g., memory usage rate), and time‑based conversions; and admin‑focused system‑level views with detailed process metrics.

5.2 Alert types

Three alert categories are defined—task alerts, metrics‑report alerts, and system alerts—covering retry counts, failures, GC spikes, checkpoint issues, lag, consumption recovery time, CPU utilization, and more. Alerts are sent to dynamic subscription groups based on task tags, ensuring that responsible parties receive relevant notifications.

Deployment Characteristics

metrics‑report follows a stateless, horizontally scalable architecture.

Kafka partitions can be expanded; metrics‑report automatically discovers new partitions.

Additional topics can be added for isolation and future extensions.

Outcomes

The new monitoring and alerting framework solved storage, visualization, and configuration problems. Alert latency improved from minutes to 10 seconds, enabling faster detection of task and system anomalies, better resource cost governance, remote operation tasks, and statistical reporting of alert counts and recovery times.

Future Plans

Further work includes enhancing real‑time alert accuracy, integrating metric data with logs for root‑cause analysis, and developing automated operations such as resource tuning based on metric insights.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Real-TimeFlinkKafka
Huolala Tech
Written by

Huolala Tech

Technology reshapes logistics

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.