How NetEase Cloud Music Built a Scalable Full‑Link Tracing System for Real‑Time Service Diagnosis
This article details the design, implementation, and evolution of NetEase Cloud Music's full‑link tracing platform, covering its motivations, architecture, low‑overhead data collection, multi‑dimensional analysis, service grooming, automated diagnosis, and future plans for AI‑driven anomaly detection and big‑data processing.
Introduction
Three years ago the team began building a full‑link tracing system: service fragmentation had made call relationships too complex to track, logs were scattered across hosts, and diagnosing issues was painful.
The system has two main goals: a global view of service dependencies and the ability to query any request's call chain.
Chapter 1: Call Stack and Monitoring
1. Open‑source APM Comparison
We evaluated major open‑source APM solutions, found each with pros and cons, and ultimately decided to develop our own.
2. Goals – Low Overhead, Transparent, Real‑time, Multi‑dimensional
To keep impact minimal we write logs locally and let a collector ship them to Kafka, avoiding direct client uploads. We use a private protocol (with future OpenTracing support) and store traces in Elasticsearch, using Flink for real‑time aggregation.
3. System Architecture
In the overall architecture, data flows from in‑process agents to local collectors, then through Kafka to Elasticsearch for storage and Flink for real‑time aggregation.
4. Dimensional Analysis
We provide multi‑granularity time windows (5 s, 1 min, 5 min, 1 h) that can be combined to query arbitrary intervals, enabling efficient aggregation and low‑latency queries.
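One way such pre-aggregated windows can be combined is a greedy tiling: cover an arbitrary interval with the coarsest aligned buckets that fit, so a long query touches few buckets. This is a sketch of the idea, not the platform's actual query planner; the greedy strategy and the alignment rule are assumptions.

```python
GRANULARITIES = [3600, 300, 60, 5]  # pre-aggregated windows, in seconds

def cover(start, end):
    """Tile [start, end) with the coarsest pre-aggregated buckets that
    fit and are aligned, returning (bucket_start, granularity) pairs."""
    buckets, t = [], start
    while t < end:
        for g in GRANULARITIES:
            # A bucket is usable only if it is aligned to its granularity
            # and lies entirely inside the queried interval.
            if t % g == 0 and t + g <= end:
                buckets.append((t, g))
                t += g
                break
        else:
            raise ValueError("start/end must align to the 5 s base window")
    return buckets
```

For example, a query spanning just over an hour resolves to one 1 h bucket plus a handful of finer ones, instead of hundreds of 5 s buckets.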
Basic monitoring, trace‑ID lookup, and visualizations of metrics such as histograms, harmonic mean, percentiles, and amplitude are provided.
Chapter 2: Service Grooming & Metric Strengthening
1. Service Grooming
We map direct and indirect service dependencies, building a complete topology that reveals each service’s position and relationships.
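Recovering indirect dependencies from direct ones is a graph reachability problem; a breadth-first walk over the direct-call edges is one straightforward sketch (the service names and edge representation here are illustrative, not the real topology model).

```python
from collections import deque

def transitive_deps(direct, service):
    """Walk the direct-dependency edges breadth-first to find every
    service the given one reaches, directly or indirectly."""
    seen, queue = set(), deque([service])
    while queue:
        s = queue.popleft()
        for dep in direct.get(s, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```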
2. Statistical Analysis
Harmonic mean: reduces the impact of extreme values.
Histogram: shows the response-time distribution.
Scatter plot: highlights unstable links.
Percentile: identifies outliers.
Amplitude: describes the min-to-max range.
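Three of these metrics can be computed directly; the definitions below are standard textbook forms (nearest-rank percentile is one common convention), not necessarily the exact formulas the platform uses.

```python
def harmonic_mean(xs):
    # Harmonic mean dampens the influence of a few very large latencies:
    # for [10, 10, 100] ms it is ~14.3 vs an arithmetic mean of 40.
    return len(xs) / sum(1.0 / x for x in xs)

def percentile(xs, p):
    # Nearest-rank percentile over a sorted copy of the samples.
    xs = sorted(xs)
    k = max(0, min(len(xs) - 1, int(round(p / 100.0 * len(xs))) - 1))
    return xs[k]

def amplitude(xs):
    # Min-to-max spread of the samples within a window.
    return max(xs) - min(xs)
```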
3. Service Relationship
We integrated internal systems (Sentinel, CMDB, config center, RPC metadata) to enrich each request with host, service, and SLO information.
4. Feature Highlights
Business group view (CMDB‑based).
Global search (app, link, host, trace‑ID).
Slow‑response analysis (top 200 slow requests).
Color‑coded health view (blue = healthy, red = unhealthy).
These features are available on dedicated analysis pages: Application, Link, Host, and Service.
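Keeping only the N slowest requests per window is a bounded top-N selection; a heap-based sketch (the request shape and field name `durationMs` are assumptions) keeps memory O(N) regardless of traffic volume.

```python
import heapq

def top_slow(requests, n=200):
    """Return the n slowest requests by duration, descending.
    heapq.nlargest keeps only n candidates in memory at a time."""
    return heapq.nlargest(n, requests, key=lambda r: r["durationMs"])
```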
Chapter 3: Global Quality & Root‑Cause Localization
1. Goal – Automated Service Diagnosis & Efficient Fault Localization
We classify services by health: healthy, sub‑healthy, alarm, and high‑risk, based on anomaly percentages and jitter analysis.
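A classification like this reduces to threshold rules over the anomaly percentage and a jitter measure. The sketch below shows the shape of such a rule; the threshold values are illustrative placeholders, not the ones used in production.

```python
def classify(anomaly_pct, jitter_score):
    """Map a service's anomaly percentage and jitter score to one of
    four health classes. Thresholds here are assumed for illustration."""
    if anomaly_pct >= 5.0:
        return "high-risk"
    if anomaly_pct >= 1.0:
        return "alarm"
    if jitter_score > 0.5:
        return "sub-healthy"
    return "healthy"
```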
2. Service Diagnosis
All exception logs are fully collected, de‑duplicated, and categorized by similar stack traces. Tags are applied, and a scoring mechanism ranks the most critical anomalies.
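Grouping by similar stack traces usually means fingerprinting the top frames so that exceptions from the same code path collapse into one group. A minimal sketch, assuming a hash over the top frames and a count-based ranking (the real scoring mechanism is not described in detail):

```python
import hashlib

def fingerprint(stack_lines, depth=5):
    """De-duplicate exceptions by hashing the top `depth` stack frames;
    deeper, more variable frames are ignored."""
    key = "\n".join(stack_lines[:depth])
    return hashlib.md5(key.encode()).hexdigest()

def rank(groups):
    """Rank de-duplicated groups, here simply by occurrence count;
    the production score also weighs tags and severity."""
    return sorted(groups.items(), key=lambda kv: kv[1], reverse=True)
```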
3. Global Diagnosis
We aggregate health indicators, thread‑pool alerts, throttling, jitter, and optimized anomaly topologies into a unified dashboard.
Chapter 4: Platform, Business & Data
1. Goal – Multi‑Platform Integration & Data Sharing
We provide real‑time jitter alerts and classification reports for business units.
2. Data Scale
400+ applications, 6000+ APIs.
94 billion records written daily.
170 k QPS, 4.5 TB storage.
Even with low sampling rates, the system delivers comprehensive observability.
Chapter 5: Development & Future
Future work includes AI‑driven anomaly classification, time‑series prediction, support for standard OpenTracing, broader component coverage (HTTP, HBase, thread pools), and intelligent storage that discards low‑value data.
This article has been distilled and summarized from source material, then republished for learning and reference.
Yanxuan Tech Team