Scaling Cloud‑Native Metric Monitoring: VictoriaMetrics, Flink, and Prometheus in Action
The article details how NetEase Cloud Music redesigned its APM stack by adopting VictoriaMetrics as a Prometheus‑compatible storage, adding Flink‑based pre‑aggregation, a query‑proxy for seamless Metric‑Trace correlation, and Grafana enhancements to achieve low‑cost, high‑performance observability at massive scale.
Background
Traces, metrics, and logs are the three pillars of an APM system. Cloud Music’s original metric monitoring and APM systems were isolated from each other, which made correlating them costly and weakened end-to-end analysis.
Challenges
Application‑level metric observability was weak because monitoring focused on machine‑level metrics.
Metrics could not be directly linked to traces for root‑cause analysis.
The legacy Prometheus‑style storage incurred high cost and did not scale to the required data volume.
High‑dimensional aggregation queries took seconds, slowing down troubleshooting.
Grafana and Prometheus UI lacked flexible comparative visualizations such as period‑over‑period and multi‑instance comparison.
Solution Overview
A cloud‑native monitoring system (Pylon APM) was built using VictoriaMetrics (VM) as the storage backend for Prometheus metrics. The design provides metric‑to‑trace correlation, rich multi‑dimensional Grafana dashboards, a high‑performance low‑cost data collection and storage pipeline, and millisecond‑level aggregation via a custom Recording Rules service and a query‑proxy.
Architecture
Collection chain
Exporter : Prometheus SDK embedded in each service, exposing a /metrics endpoint.
vmagent : Scrapes exporters and forwards data.
Nacos : Service‑discovery registry for exporters and vmagent.
Recording Rules : Custom Flink job that consumes raw metrics from Kafka, performs windowed pre‑aggregation, and writes aggregated series back to VM.
vminsert : Writes both raw and pre‑aggregated data to VM storage.
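As an illustration of the first hop in the collection chain, the sketch below renders a counter in the Prometheus text exposition format, which is what a /metrics endpoint returns to a scraper such as vmagent. It is a minimal stand-in written with the standard library only, not the Prometheus SDK the article refers to; the label names `code` and `instance` and the helper names are assumptions made for the example.

```python
def escape_label_value(value: str) -> str:
    # The text exposition format escapes backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family as exposition-format text.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical samples for the gateway counter named in the article.
    print(render_counter(
        "gateway_call_code_total",
        "Gateway calls by status code.",
        {(("code", "200"), ("instance", "a")): 5},
    ))
```

A real service would normally let the SDK maintain and expose these counters; the point here is only the shape of the data that enters the pipeline.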
Query chain
Grafana : Visualizes metrics; extended to support period‑over‑period and multi‑instance comparisons.
Proxy : Custom query proxy that parses and optimizes PromQL, rewrites queries to target pre‑aggregated series, and routes them to VM.
vmselect : Retrieves data from VM storage.
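The rewriting step in the proxy can be pictured with the sketch below, which swaps a raw metric name for its pre-aggregated counterpart. It is deliberately simplified: the article says the proxy parses PromQL, whereas this example uses a regular expression over identifiers, so it would also rewrite a quoted label value that happens to equal a metric name. The mapping table and function name are hypothetical; only the metric name pair comes from the article.

```python
import re

# Hypothetical lookup table: raw metric name -> pre-aggregated series name.
AGGREGATED = {"gateway_call_code_total": "cluster_gateway_call_code_total"}

# An identifier that is not the tail of a longer identifier or a duration.
IDENTIFIER = re.compile(r"(?<![A-Za-z0-9_:])[A-Za-z_:][A-Za-z0-9_:]*")


def rewrite_query(promql: str, mapping: dict = AGGREGATED) -> str:
    """Replace every known raw metric name with its aggregated name."""
    def swap(match):
        name = match.group(0)
        return mapping.get(name, name)  # unknown identifiers pass through
    return IDENTIFIER.sub(swap, promql)


if __name__ == "__main__":
    print(rewrite_query('sum(rate(gateway_call_code_total{code="500"}[1m])) by (code)'))
```

A production proxy would also need to confirm that the labels used in the query still exist on the aggregated series before rewriting; that check is omitted here.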
Data pre‑aggregation with Flink
Metrics are continuous time‑series. vmagent writes raw data to VM and also to Kafka. The Flink Recording Rules job consumes the Kafka stream, applies tumbling windows (e.g., 1‑minute) to sum counters and compute rates, then writes the aggregated series to VM. This reduces typical aggregation query latency from seconds to milliseconds.
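The windowed pre-aggregation can be sketched in plain Python as follows. This is not Flink code and not the team's job; it only mimics what a 1-minute tumbling window does to counter samples: within each window, take the latest value of every raw series, then sum across the series that share the labels being kept. Dropping an `instance` label is an assumption for the example.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling window, as in the text


def window_start(ts_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Start of the tumbling window that contains ts_ms."""
    return ts_ms - ts_ms % size_ms


def pre_aggregate(samples, drop_labels=("instance",), size_ms=WINDOW_MS):
    """Aggregate (labels, ts_ms, value) samples per window.

    Returns {(window_start, kept_labels): sum of latest values}.
    """
    latest = {}  # (window, full series labels) -> (ts_ms, value)
    for labels, ts_ms, value in samples:
        key = (window_start(ts_ms, size_ms), tuple(sorted(labels.items())))
        if key not in latest or ts_ms >= latest[key][0]:
            latest[key] = (ts_ms, value)

    out = defaultdict(float)
    for (win, series), (_, value) in latest.items():
        kept = tuple((k, v) for k, v in series if k not in drop_labels)
        out[(win, kept)] += value
    return dict(out)


if __name__ == "__main__":
    raw = [
        ({"code": "200", "instance": "a"}, 1_000, 10),
        ({"code": "200", "instance": "a"}, 31_000, 15),
        ({"code": "200", "instance": "b"}, 2_000, 7),
        ({"code": "200", "instance": "a"}, 61_000, 20),
    ]
    for key, value in sorted(pre_aggregate(raw).items()):
        print(key, value)
```

Queries then read one aggregated series per label combination instead of fanning out over every instance, which is where the latency reduction comes from.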
Query proxy for correctness
The proxy rewrites user PromQL to reference the aggregated series (e.g., cluster_gateway_call_code_total instead of gateway_call_code_total) and detects counter‑reset spikes. When a counter reset would cause an artificially large increase value, the proxy adjusts the calculation to avoid false‑positive spikes.
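To see why a counter reset can produce a false spike on an aggregated series, consider the sketch below. The first function follows the usual Prometheus convention of treating any drop as a restart from zero (extrapolation is left out for brevity); the second is one possible adjustment, assumed here for illustration, in which a drop on a summed series contributes nothing. The article does not say how the proxy actually corrects the value, so this should be read as one plausible approach rather than its implementation.

```python
def increase_prometheus_style(values):
    """Increase over consecutive samples; a drop counts as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def increase_reset_aware(values):
    """Assumed adjustment for summed series: ignore the step with a drop."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
    return total


if __name__ == "__main__":
    # A summed counter dips because one instance restarted.
    series = [1000, 1010, 920, 930]
    print(increase_prometheus_style(series))
    print(increase_reset_aware(series))
```

With the sample series, the conventional calculation adds the whole post-reset value of 920 as if it were new traffic, while the adjusted one counts only the two genuine increments of 10.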
Flink task optimizations
Failover handling : On task restart, stored Kafka offsets are rewound by two aggregation intervals; duplicate data is deduplicated by VM storage.
Serialization improvement : Replaced generic JavaBean/Kryo serialization with native Flink Tuple types, cutting serialization overhead from 54% to 15% and raising the supported QPS more than tenfold.
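The failover handling described above relies on two ideas: replay a little more data than strictly necessary, and make writes idempotent so the replay is harmless. The sketch below expresses both in plain Python. It works on timestamps rather than real Kafka offsets, on the assumption that the restart position can be derived from a time; `DedupStore` is a hypothetical stand-in for storage-side deduplication, not VictoriaMetrics' actual mechanism.

```python
def rewind_position(checkpoint_ts_ms: int, interval_ms: int, intervals: int = 2) -> int:
    """Timestamp to resume from: `intervals` full windows before the
    window that contains the checkpoint, never below zero."""
    aligned = checkpoint_ts_ms - checkpoint_ts_ms % interval_ms
    return max(0, aligned - intervals * interval_ms)


class DedupStore:
    """Keeps one value per (series, timestamp), so replays overwrite."""

    def __init__(self):
        self.points = {}

    def write(self, series: str, ts_ms: int, value: float) -> None:
        self.points[(series, ts_ms)] = value


if __name__ == "__main__":
    print(rewind_position(185_000, 60_000))
    store = DedupStore()
    store.write("cluster_gateway_call_code_total", 60_000, 22.0)
    store.write("cluster_gateway_call_code_total", 60_000, 22.0)  # replayed
    print(len(store.points))
```

Rewinding by whole windows means every window that was open at the time of the failure is recomputed from complete input.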
Metric‑Trace correlation
A dedicated correlation table stores metric labels, timestamps, and associated Trace IDs. In the APM UI metric values appear as clickable buttons that fetch related traces for deeper analysis.
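One way to picture such a correlation table is as a map keyed by metric name, label set, and time bucket, holding a small sample of Trace IDs. The sketch below is an in-memory illustration only; the class name, the 1-minute bucket, and the cap on stored traces are all assumptions, and the article does not describe the real schema or storage engine.

```python
from collections import defaultdict


class CorrelationTable:
    """Maps (metric, labels, time bucket) to a few sample Trace IDs."""

    def __init__(self, bucket_ms: int = 60_000, max_traces: int = 3):
        self.bucket_ms = bucket_ms
        self.max_traces = max_traces
        self._rows = defaultdict(list)

    def _key(self, metric: str, labels: dict, ts_ms: int):
        bucket = ts_ms - ts_ms % self.bucket_ms
        return (metric, tuple(sorted(labels.items())), bucket)

    def record(self, metric: str, labels: dict, ts_ms: int, trace_id: str) -> None:
        rows = self._rows[self._key(metric, labels, ts_ms)]
        # Keep only a bounded sample of distinct traces per data point.
        if len(rows) < self.max_traces and trace_id not in rows:
            rows.append(trace_id)

    def lookup(self, metric: str, labels: dict, ts_ms: int) -> list:
        return list(self._rows.get(self._key(metric, labels, ts_ms), []))


if __name__ == "__main__":
    table = CorrelationTable()
    table.record("gateway_call_code_total", {"code": "500"}, 61_000, "trace-1")
    table.record("gateway_call_code_total", {"code": "500"}, 95_000, "trace-2")
    print(table.lookup("gateway_call_code_total", {"code": "500"}, 70_000))
```

Clicking a metric value in the UI would then correspond to a `lookup` with that point's labels and timestamp.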
Enhanced visualization
Grafana dashboards were extended to support:
Period‑over‑period analysis.
Multi‑instance comparison with sorting and Top‑K views.
Automatic metric‑level indicator calculations.
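The arithmetic behind the first two views is simple and can be sketched as below. In PromQL the previous period is typically fetched with the `offset` modifier; here the two values are assumed to be already available, and the function names and instance names are hypothetical.

```python
import heapq


def period_over_period(current: float, previous: float):
    """Relative change against the previous period; None if undefined."""
    if previous == 0:
        return None
    return (current - previous) / previous


def top_k(values_by_instance: dict, k: int) -> list:
    """The k instances with the largest values, largest first."""
    return heapq.nlargest(k, values_by_instance.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    print(period_over_period(120, 100))
    print(top_k({"a": 3, "b": 9, "c": 5}, 2))
```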
Results
The VM‑based solution now monitors nearly 700 million active time‑series across Cloud Music business lines. Benefits include:
Breaking information silos by linking metrics to traces.
Significant improvement in application‑level observability and P99 reporting.
Low‑cost, high‑performance storage that uses approximately one‑third the resources of comparable solutions such as M3DB.
Scalable Grafana visualizations for large‑scale services.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
