
Client-Side APM Monitoring System Implementation for NetEase Cloud Music

This article describes NetEase Cloud Music's custom client-side APM system, built to combat scroll stutter, device heating, UI freezes, and crashes. It covers binary-tree stack aggregation that cuts stack storage by more than half, window-based CPU analysis, run-loop jank detection, ping-based ANR monitoring, and malloc_logger-based memory tracking with automatic dump thresholds.

NetEase Cloud Music Tech Team

This article introduces an in-house APM (Application Performance Monitoring) system for the NetEase Cloud Music client applications. The system targets the quality issues that hurt user experience most: scroll stutter, device heating, UI freezes, and unexpected crashes.

Stack Aggregation Technology: The system stores captured stacks in aggregated form using a binary tree data structure, inspired by Apple's ips file format. Because frames near the bottom of a stack change far less often than frames near the top, stacks that share a common bottom can share storage, which reduces storage space by over 50%. The implementation includes algorithms for matching stack frames at different depths, identifying the key stacks with the highest hit rates, and filtering out invalid stacks that contain only system frames.
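
The sharing idea can be illustrated with a minimal sketch in portable C. It assumes the "binary tree" is a first-child/next-sibling encoding of a call tree and that stacks are merged from the bottom frame upward; the article does not spell out the node layout, so FrameNode and the stack_tree_* functions are illustrative names, not the team's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One aggregated frame. The child/sibling pair is a binary-tree
 * encoding of an n-ary call tree (assumed layout). */
typedef struct FrameNode {
    uintptr_t address;          /* frame address */
    unsigned hits;              /* sampled stacks passing through this frame */
    struct FrameNode *child;    /* first frame seen one level closer to the top */
    struct FrameNode *sibling;  /* next alternative frame at the same depth */
} FrameNode;

/* Find the node for `address` in the sibling list at *head, or append one. */
static FrameNode *find_or_add(FrameNode **head, uintptr_t address) {
    FrameNode **slot = head;
    while (*slot != NULL) {
        if ((*slot)->address == address) return *slot;
        slot = &(*slot)->sibling;
    }
    *slot = calloc(1, sizeof **slot);
    assert(*slot != NULL);
    (*slot)->address = address;
    return *slot;
}

/* Merge one stack into the tree. frames[0] is the top of the stack and
 * frames[count - 1] the bottom, so the walk starts from the last element. */
void stack_tree_insert(FrameNode **root, const uintptr_t *frames, size_t count) {
    FrameNode **level = root;
    for (size_t i = count; i-- > 0; ) {
        FrameNode *node = find_or_add(level, frames[i]);
        node->hits++;
        level = &node->child;
    }
}

/* Number of stored frames; compare with the total frames inserted. */
size_t stack_tree_node_count(const FrameNode *node) {
    if (node == NULL) return 0;
    return 1 + stack_tree_node_count(node->child)
             + stack_tree_node_count(node->sibling);
}

void stack_tree_free(FrameNode *node) {
    while (node != NULL) {
        FrameNode *next = node->sibling;
        stack_tree_free(node->child);
        free(node);
        node = next;
    }
}
```

In this sketch, two three-frame stacks that differ only in their top frame occupy four nodes instead of six, and the per-node hit counts are the kind of data a "key stack" search could rank on.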

CPU Monitoring: Uses a window scanning mechanism to detect sustained high CPU usage rather than short spikes. Issues are classified into info, warn, and error levels based on the average CPU usage over the window. Each reported issue is attributed to a single thread, and the thread name is included in the report.
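
A window scan of this kind might look like the following sketch. The thresholds and the window length are hypothetical placeholders, since the article does not state the real values, and cpu_classify and cpu_scan_windows are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CPU_OK = 0, CPU_INFO, CPU_WARN, CPU_ERROR } CpuLevel;

/* Hypothetical thresholds in percent; the article does not give the real ones. */
#define CPU_INFO_THRESHOLD  60.0
#define CPU_WARN_THRESHOLD  80.0
#define CPU_ERROR_THRESHOLD 95.0

/* Map an average usage over one window to a severity level. */
CpuLevel cpu_classify(double average) {
    if (average >= CPU_ERROR_THRESHOLD) return CPU_ERROR;
    if (average >= CPU_WARN_THRESHOLD)  return CPU_WARN;
    if (average >= CPU_INFO_THRESHOLD)  return CPU_INFO;
    return CPU_OK;
}

/* Slide a fixed-size window over one thread's usage samples and return the
 * worst level seen. A short spike is diluted by the window average. */
CpuLevel cpu_scan_windows(const double *samples, size_t count, size_t window) {
    assert(window > 0);
    if (count < window) return CPU_OK;

    double sum = 0.0;
    for (size_t i = 0; i < window; i++) sum += samples[i];
    CpuLevel worst = cpu_classify(sum / (double)window);

    for (size_t i = window; i < count; i++) {
        sum += samples[i] - samples[i - window];  /* advance the window by one */
        CpuLevel level = cpu_classify(sum / (double)window);
        if (level > worst) worst = level;
    }
    return worst;
}
```

With a three-sample window, a single 100% sample surrounded by 10% samples averages 40% and is ignored, while three consecutive 90% samples are reported at the warn level.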

Jank Detection: A dedicated background thread polls the main run loop and flags any iteration whose execution time exceeds 50 ms (3 frames at 60 fps). To limit overhead, frequency control captures a stack only on the 1st, 3rd, 5th, 10th, 15th, 20th... jank events.

ANR Monitoring: Uses a ping mechanism that submits a task to main_queue and then checks whether the task has modified an ack value; an unmodified ack means the UI thread is unresponsive. During an ANR, stacks of all threads are captured at the 4th, 8th, and 16th seconds, and the main thread stack is captured at the 2nd, 3rd, 4th, 5th, 6th... seconds.

Memory Monitoring: Uses the system's malloc_logger callback to capture the stacks of memory allocations. A dump is triggered when app memory exceeds 500 MB, and again for every additional 300 MB of growth. The monitor covers OOM events, large memory objects, and large numbers of small object allocations.

The following snippet shows one implementation detail, locating the address of the main function through the Mach-O LC_MAIN load command:

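// Get the address of main() from the LC_MAIN load command.
// macho_search_command, image, header and main_func_addr come from the surrounding code.
struct load_command * cmd = (struct load_command *)macho_search_command(image, LC_MAIN);
if (cmd != NULL) {
    // LC_MAIN is described by struct entry_point_command (not uuid_command).
    struct entry_point_command * entry_pt = (struct entry_point_command *)cmd;
    Dl_info info = {0};
    // dladdr returns non-zero on success; only use info when the lookup worked.
    if (dladdr((const void *)header, &info) != 0) {
        // entryoff is the offset of main(); cast to uintptr_t to avoid void * arithmetic.
        main_func_addr = (void *)((uintptr_t)info.dli_saddr + entry_pt->entryoff);
    }
}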