Building Cloud Music's APM Metric Monitoring System Based on VictoriaMetrics
Cloud Music's middleware team built the Pylon APM monitoring system on VictoriaMetrics. The collection path combines exporters, vmagent, Nacos, Flink-based pre-aggregation (Recording Rules), and vminsert; the query path combines Grafana, a custom Proxy, and vmselect. The result is millisecond-level query latency for aggregation queries, metric-trace correlation, improved stability, and cost-effective storage for nearly 700 million active time series.
This article details how Cloud Music's middleware team built a new application monitoring system (Pylon APM) based on VictoriaMetrics. The system addresses several key challenges: weak application-layer metric observability, difficulty correlating metrics with traces, performance and cost issues with the old monitoring system, heavy aggregation query burden, and limited visualization capabilities.
The architecture consists of collection and query pipelines. The collection pipeline includes: Exporter (embedded in business services), vmagent (data collection), Nacos (service discovery), Recording Rules (Flink-based streaming pre-aggregation), and vminsert. The query pipeline includes: Grafana (visualization with custom development for YoY and multi-instance comparison), self-developed Proxy (query optimization), and vmselect.
To solve performance problems with high-dimensional aggregation queries, the team implemented a pre-aggregation Recording Rules service using Flink. Raw data is written to storage and, in parallel, to Kafka. Flink consumes the Kafka stream, aggregates it in tumbling windows, and writes the aggregated series back to vmstorage. Because the expensive aggregation is done ahead of time instead of at query time, query latency for these aggregations dropped from seconds to milliseconds.
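The tumbling-window step can be illustrated with a minimal sketch in plain Python. This is not the team's Flink job: the window size, the label names, and the `pre_aggregate` function are assumptions made for illustration, and the aggregation shown is a simple sum.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size; the article does not state one


def pre_aggregate(samples, keep_labels, window_seconds=WINDOW_SECONDS):
    """Sum sample values per tumbling window and per reduced label set.

    samples: iterable of (timestamp_seconds, labels_dict, value)
    keep_labels: label names preserved in the aggregated series
    """
    buckets = defaultdict(float)
    for ts, labels, value in samples:
        # Tumbling windows are fixed-size and non-overlapping, so each
        # sample belongs to exactly one window, identified by its start.
        window_start = ts - (ts % window_seconds)
        key = (window_start, tuple((k, labels[k]) for k in keep_labels))
        buckets[key] += value
    return [
        {"timestamp": ws, "labels": dict(lbls), "value": total}
        for (ws, lbls), total in sorted(buckets.items())
    ]


# Hypothetical samples: two instances of one app reporting the same URI.
samples = [
    (0, {"app": "music", "instance": "a", "uri": "/play"}, 3.0),
    (30, {"app": "music", "instance": "b", "uri": "/play"}, 5.0),
    (61, {"app": "music", "instance": "a", "uri": "/play"}, 2.0),
]

# Dropping the "instance" label collapses per-instance series into one
# series per (app, uri), which is what makes later queries cheap.
aggregated = pre_aggregate(samples, keep_labels=("app", "uri"))
```

Here the first two samples fall into the window starting at 0 and the third into the window starting at 60, so three raw points become two aggregated points with the high-cardinality `instance` label removed.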
The article also covers four further topics. For Flink task stability, restarts are handled by recording Kafka offsets and resetting the consumer position. For serialization, the team switched from Kryo serialization to Flink Tuple types, reducing serialization overhead from 54% to 15%. For data correctness, Counter resets that caused spikes in the increase function are handled by correctness detection in the Proxy. For troubleshooting, Metric-Trace correlation is implemented through a dedicated association table.
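The Counter reset problem can be made concrete with a small sketch. The summary does not describe the Proxy's actual detection rule, so the heuristic below (`suspicious_drops`) is purely hypothetical. The first function mimics, in simplified form and without extrapolation, how a Prometheus-style increase treats any drop in a counter as a reset to zero.

```python
def increase_with_resets(values):
    """Simplified Prometheus-style increase: any drop counts as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value is counted as growth.
        total += cur if cur < prev else cur - prev
    return total


def suspicious_drops(values):
    """Hypothetical check: flag a drop if a later sample is back at or above
    the pre-drop level, which suggests a bad sample rather than a real reset."""
    flagged = []
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            before = values[i - 1]
            if any(v >= before for v in values[i + 1:]):
                flagged.append(i)
    return flagged


def increase_ignoring(values, bad_indexes):
    """Compute the increase after discarding the flagged samples."""
    skip = set(bad_indexes)
    cleaned = [v for i, v in enumerate(values) if i not in skip]
    return increase_with_resets(cleaned)


# Hypothetical aggregated counter: the third sample is too low, for example
# because part of the input was missing when the aggregate was produced.
series = [1000.0, 1010.0, 400.0, 1020.0]

naive = increase_with_resets(series)        # 10 + 400 + 620
bad = suspicious_drops(series)              # index of the low sample
corrected = increase_ignoring(series, bad)  # 10 + 10
```

In this example the true growth is 20, but the naive calculation reports 1030 because the dip to 400 is read as a reset. A genuine reset followed by very fast growth would also be flagged by this heuristic, so a production check would need additional signals.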
The system now supports nearly 700 million active time series, providing Metric-Trace correlation for troubleshooting, enhanced application-layer monitoring, low-cost Grafana visualization, and cost-effective large-scale time-series data storage.
NetEase Cloud Music Tech Team
Official account of NetEase Cloud Music Tech Team