
How NetEase Cloud Music Built a Scalable Full‑Link Tracing System for Real‑Time Service Diagnosis

This article details the design, implementation, and evolution of NetEase Cloud Music's full‑link tracing platform, covering its motivations, architecture, low‑overhead data collection, multi‑dimensional analysis, service grooming, automated diagnosis, and future plans for AI‑driven anomaly detection and big‑data processing.

Yanxuan Tech Team

Introduction

Three years ago, the team set out to build a full‑link tracing system: service fragmentation had made call relationships too complex to track, logs were scattered across individual hosts, and diagnosing issues was painful.

The system has two main goals: a global view of service dependencies and the ability to query any request's call chain.

Chapter 1: Call Stack and Monitoring

1. Open‑source APM Comparison

We evaluated the major open‑source APM solutions; each came with its own trade‑offs, so we ultimately decided to develop our own.

2. Goals – Low Overhead, Transparent, Real‑time, Multi‑dimensional

To keep the runtime impact minimal, the agent writes trace logs to local disk and a separate collector ships them to Kafka, avoiding synchronous uploads from the instrumented process. We use a private protocol (with OpenTracing support planned), store traces in Elasticsearch, and use Flink for real‑time aggregation.
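As a rough illustration of that write path (the class names and line‑based format here are assumptions, not the platform's actual code): the instrumented process only appends to an in‑memory queue, and a background thread drains it to the local log that the collector tails.

```java
import java.io.BufferedWriter;
import java.nio.file.*;
import java.util.concurrent.*;

// Illustrative agent-side writer: spans are queued in memory and flushed
// to a local log file by a background thread, so the request path never
// blocks on disk or network I/O. A separate collector daemon would tail
// this file and ship batches to Kafka.
public class LocalSpanLogger {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public LocalSpanLogger(Path logFile) {
        Thread flusher = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(
                    logFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                while (true) {
                    out.write(queue.take()); // blocks until a span is available
                    out.newLine();
                    if (queue.isEmpty()) out.flush(); // flush once drained
                }
            } catch (Exception e) {
                // in a real agent: swallow and keep the application unaffected
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    /** Non-blocking: drop the span if the buffer is full rather than stall the caller. */
    public boolean report(String encodedSpan) {
        return queue.offer(encodedSpan);
    }
}
```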

3. System Architecture

Data flows from in‑process agents to local log files, which collectors ship to Kafka; from there, traces are indexed into Elasticsearch for querying while Flink jobs perform real‑time aggregation.

4. Dimensional Analysis

We provide multi‑granularity time windows (5 s, 1 min, 5 min, 1 h) that can be combined to query arbitrary intervals, enabling efficient aggregation and low‑latency queries.
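The idea can be sketched as follows: pre‑aggregated buckets exist at each granularity, and an arbitrary interval is covered greedily with the largest aligned window that fits. The window sizes come from the article; the planner itself is an illustration and assumes query bounds are rounded to the 5 s grid.

```java
import java.util.*;

// Illustrative decomposition of an arbitrary query interval into
// pre-aggregated windows (5 s, 1 min, 5 min, 1 h), largest-first.
// Timestamps are epoch seconds, assumed rounded to the 5 s grid.
public class WindowPlanner {
    private static final long[] WINDOWS = {3600, 300, 60, 5}; // seconds, descending

    /** Returns [bucketStart, windowSize] pairs covering [start, end). */
    public static List<long[]> plan(long start, long end) {
        List<long[]> buckets = new ArrayList<>();
        long cursor = start;
        while (cursor < end) {
            long chosen = 5; // the 5 s window always fits on an aligned grid
            for (long w : WINDOWS) {
                // usable only if the cursor is aligned and the window fits
                if (cursor % w == 0 && cursor + w <= end) { chosen = w; break; }
            }
            buckets.add(new long[] {cursor, chosen});
            cursor += chosen;
        }
        return buckets;
    }

    public static void main(String[] args) {
        // a 1 h 2 min 10 s range decomposes into 1x1h + 2x1min + 2x5s buckets
        for (long[] b : plan(7200, 7200 + 3730)) {
            System.out.println("start=" + b[0] + " window=" + b[1] + "s");
        }
    }
}
```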

Basic monitoring, trace‑ID lookup, and visualizations of metrics such as histograms, harmonic mean, percentiles, and amplitude are provided.

Chapter 2: Service Grooming & Metric Strengthening

1. Service Grooming

We map direct and indirect service dependencies, building a complete topology that reveals each service’s position and relationships.
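One way such a topology can be derived (a sketch, not the platform's actual code) is from the traces themselves: each parent/child span pair contributes a caller‑to‑callee edge, aggregating edges across traces yields the direct dependencies, and transitive ones fall out of a graph traversal.

```java
import java.util.*;

// Illustrative topology builder: each trace yields (caller -> callee)
// edges from parent/child spans; aggregating them across traces gives
// the global dependency graph.
public class ServiceTopology {
    // caller -> (callee -> call count)
    private final Map<String, Map<String, Long>> edges = new HashMap<>();

    public void record(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new HashMap<>())
             .merge(callee, 1L, Long::sum);
    }

    /** Direct downstream dependencies of a service. */
    public Set<String> downstream(String service) {
        return edges.getOrDefault(service, Map.of()).keySet();
    }

    /** Direct and indirect (transitive) dependencies, via BFS. */
    public Set<String> allDownstream(String service) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(downstream(service));
        while (!queue.isEmpty()) {
            String next = queue.poll();
            if (seen.add(next)) queue.addAll(downstream(next));
        }
        return seen;
    }
}
```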

2. Statistical Analysis

- Harmonic mean: reduces the impact of extreme values.
- Histogram: shows the response-time distribution.
- Scatter plot: highlights unstable links.
- Percentile: identifies outliers.
- Amplitude: describes the min-to-max range.
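For concreteness, here is a minimal sketch of these statistics over one window of response times; the formulas are standard, while the platform itself computes such aggregates in real time via Flink.

```java
import java.util.Arrays;

// Minimal sketch of the per-window statistics named above, computed
// over a batch of response times in milliseconds.
public class LatencyStats {
    public static double harmonicMean(double[] samples) {
        double reciprocalSum = 0;
        for (double s : samples) reciprocalSum += 1.0 / s;
        // slow outliers contribute tiny reciprocals, so they barely move the result
        return samples.length / reciprocalSum;
    }

    public static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static double amplitude(double[] samples) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double s : samples) { min = Math.min(min, s); max = Math.max(max, s); }
        return max - min; // min-to-max spread in the window
    }

    public static void main(String[] args) {
        double[] rt = {12, 15, 11, 14, 980, 13}; // one slow outlier
        System.out.printf("harmonic=%.1f p99=%.1f amplitude=%.1f%n",
                harmonicMean(rt), percentile(rt, 99), amplitude(rt));
    }
}
```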

3. Service Relationship

We integrated internal systems (Sentinel, CMDB, config center, RPC metadata) to enrich each request with host, service, and SLO information.
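A hypothetical enrichment step might look like the following; every interface and field name here is an illustrative placeholder, since the internal APIs are not public.

```java
import java.util.Map;

// Hypothetical enrichment step: before indexing, each span is joined
// with metadata from internal systems (CMDB host info, service
// ownership, SLO targets). All interfaces and fields below are
// illustrative placeholders, not the real internal APIs.
public class SpanEnricher {
    interface Cmdb { Map<String, String> hostInfo(String ip); }
    interface SloRegistry { double latencySloMs(String service); }

    private final Cmdb cmdb;
    private final SloRegistry slos;

    public SpanEnricher(Cmdb cmdb, SloRegistry slos) {
        this.cmdb = cmdb;
        this.slos = slos;
    }

    public Map<String, Object> enrich(Map<String, Object> span) {
        String ip = (String) span.get("host.ip");
        String service = (String) span.get("service.name");
        span.putAll(cmdb.hostInfo(ip));                  // e.g. rack, business group
        span.put("slo.latency.ms", slos.latencySloMs(service));
        return span;
    }
}
```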

4. Feature Highlights

- Business group view (CMDB-based).
- Global search (by application, link, host, or trace ID).
- Slow-response analysis (top 200 slow requests; see the sketch below).
- Health chromatic view (blue = healthy, red = unhealthy).

These features are available on dedicated analysis pages: Application, Link, Host, and Service.
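The slow‑response ranking can be kept memory‑bounded with a min‑heap of fixed capacity; N = 200 follows the article, while the types and names are a sketch.

```java
import java.util.PriorityQueue;

// Sketch of the slow-response ranking: keep the N slowest requests per
// window with a bounded min-heap, so memory stays O(N) regardless of
// traffic volume.
public class TopSlowRequests {
    record Request(String traceId, long latencyMs) {}

    private final int capacity;
    // min-heap ordered by latency: the root is the fastest of the kept set
    private final PriorityQueue<Request> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.latencyMs(), b.latencyMs()));

    public TopSlowRequests(int capacity) { this.capacity = capacity; }

    public void offer(Request r) {
        if (heap.size() < capacity) {
            heap.add(r);
        } else if (r.latencyMs() > heap.peek().latencyMs()) {
            heap.poll();   // evict the fastest kept request
            heap.add(r);
        }
    }
}
```

A window's ranking would then be driven by `new TopSlowRequests(200)` fed with every completed request.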

Chapter 3: Global Quality & Root‑Cause Localization

1. Goal – Automated Service Diagnosis & Efficient Fault Localization

We classify services by health: healthy, sub‑healthy, alarm, and high‑risk, based on anomaly percentages and jitter analysis.
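A minimal classifier along these lines might look as follows; the four levels come from the article, but the thresholds are placeholders, since the source does not publish its exact cut‑offs.

```java
// Illustrative health classification. The four levels are from the
// article; the threshold values are made-up placeholders.
public class HealthClassifier {
    enum Health { HEALTHY, SUB_HEALTHY, ALARM, HIGH_RISK }

    public static Health classify(double anomalyRatio, boolean jitterDetected) {
        if (anomalyRatio >= 0.10) return Health.HIGH_RISK;   // placeholder threshold
        if (anomalyRatio >= 0.03) return Health.ALARM;       // placeholder threshold
        if (anomalyRatio > 0 || jitterDetected) return Health.SUB_HEALTHY;
        return Health.HEALTHY;
    }
}
```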

2. Service Diagnosis

All exception logs are fully collected, de‑duplicated, and categorized by similar stack traces. Tags are applied, and a scoring mechanism ranks the most critical anomalies.
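One common way to group exceptions by similar stack traces (a sketch under stated assumptions, not the platform's published algorithm) is to key each exception on its type plus its top frames, normalized to ignore line numbers:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Sketch of stack-trace grouping: exceptions whose top frames match are
// treated as the same anomaly, so millions of log lines collapse into a
// handful of categories. The frame depth and normalization rules here
// are assumptions.
public class StackTraceGrouper {
    private static final int FRAMES = 5; // compare only the top frames

    public static String groupKey(Throwable t) {
        String frames = Arrays.stream(t.getStackTrace())
                .limit(FRAMES)
                // drop line numbers so minor code changes don't split groups
                .map(f -> f.getClassName() + "#" + f.getMethodName())
                .collect(Collectors.joining("|"));
        return t.getClass().getName() + "|" + frames;
    }
}
```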

3. Global Diagnosis

We aggregate health indicators, thread‑pool alerts, throttling, jitter, and optimized anomaly topologies into a unified dashboard.

Chapter 4: Platform, Business & Data

1. Goal – Multi‑Platform Integration & Data Sharing

We provide real‑time jitter alerts and classification reports for business units.

2. Data Scale

- 400+ applications, 6,000+ APIs.
- 94 billion records written daily.
- 170k QPS, 4.5 TB storage.

Even with low sampling rates, the system delivers comprehensive observability.

Chapter 5: Development & Future

Future work includes AI‑driven anomaly classification, time‑series prediction, support for standard OpenTracing, broader component coverage (HTTP, HBase, thread pools), and intelligent storage that discards low‑value data.


Tags: distributed systems, observability, tracing, service monitoring
Written by Yanxuan Tech Team

NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.
