
How NetEase Cloud Music Built Pylon APM: A Deep Dive into Tracing, Metrics, and Automated Diagnosis

This article details the design and implementation of the Pylon APM monitoring platform for NetEase Cloud Music, covering background challenges, the choice of Pinpoint, extensions to trace models, tail‑based exception sampling, Prometheus integration, automated JStack collection, and the resulting APM product features.

NetEase Cloud Music Tech Team

Background

The original server‑side monitoring system at NetEase Cloud Music suffered from four major issues:

Trace integrity loss due to SDK‑based instrumentation, which dropped asynchronous context and caused conflicting trace data.

Separate trace and metric stores, making it difficult to locate slow requests or SQL queries.

Logs not linked to request traces, a problem that was especially acute under low‑sampling configurations.

Version upgrades required changes to the SDK in every service, slowing feature iteration.

After evaluating open‑source tracing solutions, Pinpoint was selected over SkyWalking because its trace model matched the service’s needs and its plugin development was more developer‑friendly.

Project Idea and Solution

Overall Architecture

The platform, codenamed Pylon APM, consists of two core components:

Agent: a Java Agent that generates and propagates traces, records Prometheus metrics, and injects bytecode via a plugin framework. Plugins add trace/metric capabilities for HTTP, RPC, messaging, database, and cache components.

Console: a backend that collects, stores, analyzes, and visualizes trace, metric, and log data, enabling rapid issue diagnosis.

Enhancements to the Pinpoint Java Agent

Extended Trace Data Model: Built on Pinpoint’s Trace‑Span‑SpanEvent structure, additional association and pass‑through fields were added to support multi‑downstream links, asynchronous downstream links, and callback links. The model propagates fields both intra‑process and inter‑process, with a dedicated SDK for downstream services.
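The idea of association and pass‑through fields can be sketched as follows. This is an illustrative model only, assuming hypothetical class and field names rather than Pinpoint's actual Trace‑Span‑SpanEvent classes: each span records which span it descends from and whether the link is asynchronous, and copies pass‑through fields to every downstream span it creates.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of an extended span model (names are hypothetical):
// association fields link a span back to its caller and mark async links,
// while pass-through fields are propagated to all downstream spans.
class ExtendedSpan {
    final String traceId;
    final String spanId;
    final String parentSpanId;      // association field: the caller span
    final boolean asyncDownstream;  // marks an asynchronous downstream link
    final Map<String, String> passThrough = new HashMap<>();

    ExtendedSpan(String traceId, String spanId, String parentSpanId,
                 boolean asyncDownstream) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.asyncDownstream = asyncDownstream;
    }

    // Create a downstream span that inherits trace identity and
    // pass-through fields, so context survives async hops and callbacks.
    ExtendedSpan newDownstream(String spanId, boolean async) {
        ExtendedSpan child = new ExtendedSpan(traceId, spanId, this.spanId, async);
        child.passThrough.putAll(this.passThrough);
        return child;
    }
}
```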

Tail‑Based Exception Trace Sampling: Each service writes full traces to a temporary log. When an exception occurs, the Agent stores the trace ID in a centralized cache and tags the trace context. A background thread scans the temporary log every 30 seconds to 1 minute, extracts traces whose IDs appear in the cache, and writes them to the final collection log, guaranteeing complete traces for exception paths.
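One pass of that background scan can be sketched like this. The log‑line format ("traceId|payload") and the in‑memory set standing in for the centralized exception‑ID cache are assumptions for illustration; the article does not specify the Agent's actual formats or cache API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of one periodic pass of tail-based exception sampling:
// keep only temporary-log lines whose trace ID was flagged in the
// exception-ID cache; these would be appended to the final collection log.
class ExceptionTraceScanner {
    static List<String> scan(List<String> tempLogLines, Set<String> exceptionIds) {
        List<String> collected = new ArrayList<>();
        for (String line : tempLogLines) {
            int sep = line.indexOf('|');  // assumed "traceId|payload" format
            if (sep > 0 && exceptionIds.contains(line.substring(0, sep))) {
                collected.add(line);
            }
        }
        return collected;
    }
}
```

Because the scan happens after the request completes, the decision to keep a trace can use information (the exception) that head‑based sampling never sees.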

Prometheus Integration: The Agent embeds a Prometheus SDK. Metrics are exposed via the standard /metrics endpoint; the server pulls them periodically, aggregates them, and stores them in VM storage. Every metric entry is tagged with the current Trace ID, enabling direct correlation between monitoring data and trace information.
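The trace‑ID tagging amounts to attaching the current trace ID as a label on each sample in the Prometheus text exposition format. The sketch below uses a toy in‑memory registry with illustrative metric and label names; the real Agent embeds the Prometheus SDK rather than hand‑formatting samples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of trace-ID-tagged metrics: each sample carries the current trace
// ID as a label, so the server scraping /metrics can jump from an abnormal
// data point straight to the corresponding trace. Names are illustrative.
class TraceTaggedMetrics {
    private final Map<String, Double> samples = new LinkedHashMap<>();

    void record(String metric, String traceId, double value) {
        // Prometheus text exposition format: name{label="value"} sample
        samples.put(metric + "{trace_id=\"" + traceId + "\"}", value);
    }

    // Render the body a /metrics endpoint would return on scrape.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : samples.entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```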

Automatic JStack Collection: When a method execution exceeds a configurable latency threshold, an asynchronous listener triggers a JStack dump, saves the stack trace, and associates it with the Trace ID and method metrics. Users can also request JStack dumps manually from the UI.
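The trigger logic can be sketched with the JDK's ThreadMXBean, which produces the same thread‑dump data that jstack prints. The threshold value and the way the dump is keyed to the trace ID are assumptions; the real Agent fires this from an asynchronous listener rather than inline.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch of the automatic JStack trigger: if an invocation ran longer than
// the configured threshold, capture a full in-process thread dump via
// ThreadMXBean and tag it with the trace ID (threshold is illustrative;
// the real Agent triggers this asynchronously).
class SlowCallJstack {
    static final long THRESHOLD_MS = 500;

    static String maybeDump(String traceId, long elapsedMs) {
        if (elapsedMs < THRESHOLD_MS) {
            return null;  // fast call: no dump taken
        }
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder dump = new StringBuilder("traceId=" + traceId + "\n");
        for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
            dump.append(ti.toString());  // thread name, state, top stack frames
        }
        return dump.toString();
    }
}
```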

Custom APM Platform Design

Trace Detail Diagnosis: Displays the full call topology from entry to downstream nodes, including latency distribution, transmitted fields, and request parameters. Users can view the call stack from a global or per‑process perspective to locate missing context fields.

Application Monitoring Dashboard: Dashboards for HTTP, RPC, messaging, database, and cache layers are built on Grafana with added features such as period‑over‑period comparison and multi‑instance analysis.

Exception and Long‑Latency Correlation: By linking metrics with Trace IDs, users can search for abnormal traces directly from the dashboard, drill down to detailed trace pages, and quickly pinpoint problematic links.

Long‑Latency Request JStack Tracing: When a JStack dump is captured automatically, the UI shows a prompt and presents the captured stack, the associated trace context, and thread‑pool information. Manual JStack requests are also supported.

Arthas Online Diagnosis: The platform integrates high‑frequency diagnostic tools such as Arthas and JStack. Through the Agent, users can invoke these tools from the UI, collect service information, and view results in a user‑friendly format.

Conclusion

The development of Pylon APM established a methodology for online issue localization, created reusable tools for trace‑metric‑log correlation, and delivered a product that enables both junior and senior engineers to diagnose and resolve service problems efficiently. Future work will detail additional sub‑platforms such as business‑log analysis, alert governance, and scenario‑event handling.

Tags: backend, APM, Metrics, Prometheus, Tracing, java-agent, service monitoring