
How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

NetEase Yanxuan Technology Product Team

Introduction

Effective monitoring and alerting are essential for service governance in micro‑service architectures. The platform must make service metrics observable, predict anomalies before they escalate, and support emergency handling when they do; it must also enable cross‑domain collaboration and avoid the siloed, tool‑specific systems that drive up operational cost.

Monitoring Standards

Unified Monitoring Platform – a single system that aggregates metrics from all services and eliminates tool‑specific silos.

Service Performance Metric Measurement – capture latency of every component interaction (disk I/O, RPC, API calls) to assess user‑experience impact.

Metric 2.0 Event Standard – enrich metrics with context (tags, multiple fields) and encode them as CloudEvent for seamless cross‑platform correlation.
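As a sketch of the Metric 2.0 idea, the snippet below wraps a tagged, multi‑field metric in a CloudEvents 1.0 envelope. The envelope attribute names follow the CloudEvents specification; the `source` path, event `type`, and payload schema are illustrative assumptions, not Yanxuan's actual format:

```python
import json
import uuid
from datetime import datetime, timezone

def make_metric_event(metric, tags, fields):
    """Wrap a multi-dimensional metric in a CloudEvents 1.0 envelope.

    `tags` give queryable dimensions, `fields` carry the measured values.
    The envelope attributes follow the CloudEvents spec; the payload
    schema is purely illustrative.
    """
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "/apm/agent",           # hypothetical producing component
        "type": "com.example.metric.v2",  # hypothetical event type
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": {"metric": metric, "tags": tags, "fields": fields},
    }

event = make_metric_event(
    "http_request",
    tags={"service": "order", "endpoint": "/checkout", "status": "200"},
    fields={"count": 1, "latency_ms": 42.5},
)
print(json.dumps(event, indent=2))
```

Because every platform produces and consumes the same envelope, an alerting system or event bus can correlate these events without parser‑per‑tool glue code.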

Secondary Mining of Monitoring Data – store raw time‑series data for risk‑prediction models and scenario replay.

Full‑Link Monitoring Platform Architecture

The platform addresses three core challenges: mapping service call relationships, collecting fine‑grained performance data, and rapid fault detection. After evaluating several open‑source solutions, Pinpoint was selected as the foundation because of its low intrusiveness, extensibility, and strong community support.

Collection Layer

Agent (JavaAgent) – bytecode enhancement for low‑overhead data capture.

SDK (hard‑coded probes) – high‑precision instrumentation where needed.

Log‑based collection – non‑intrusive, requires standardized log output.

Analysis Layer

Real‑time parsing of client‑reported data, applying business formulas (count, sum) to produce minute‑level metrics. Supports stream processing frameworks such as Flink or Spark.
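A minimal sketch of the minute‑level roll‑up the analysis layer performs. The tuple shape and bucket formulas here are illustrative; in production this logic would run inside a stream‑processing framework such as Flink or Spark rather than in a single loop:

```python
from collections import defaultdict

def aggregate_minute(samples):
    """Roll raw (timestamp_s, service, latency_ms) samples up into
    per-minute count/sum/avg buckets, mimicking the analysis layer's
    business formulas."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for ts, service, latency in samples:
        key = (ts // 60 * 60, service)   # truncate timestamp to the minute
        bucket = buckets[key]
        bucket["count"] += 1
        bucket["sum"] += latency
    # derive the average once each bucket is complete
    return {k: {**v, "avg": v["sum"] / v["count"]} for k, v in buckets.items()}

samples = [
    (120, "order", 30.0),   # falls in the 120s minute bucket
    (150, "order", 50.0),   # same bucket
    (185, "order", 20.0),   # next minute bucket
]
minute_metrics = aggregate_minute(samples)
print(minute_metrics)
```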

Storage Layer

Massive time‑series data is stored in NTSDB (an InfluxDB‑based TSDB) which offers schema‑less multi‑dimensional tags, multiple fields, and low‑latency aggregation—overcoming the limitations of the original HBase storage.
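For illustration, a point with schema‑less tags and multiple fields can be serialized in the InfluxDB line protocol that an InfluxDB‑derived TSDB understands. Escaping rules are omitted for brevity, and NTSDB's actual ingest API is not documented here, so treat this only as a shape sketch:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Serialize one point as InfluxDB line protocol:
    measurement,tag=v,... field=v,... timestamp_ns
    (string fields are quoted; tag/field escaping is omitted)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_part} {field_part} {ts_ns}"

line = to_line_protocol(
    "rpc_latency",
    tags={"service": "order", "host": "node-1"},   # queryable dimensions
    fields={"p99": 87.5, "count": 1200},           # measured values
    ts_ns=1700000000000000000,
)
print(line)
```

New tags can be added to later points without a schema migration, which is what makes the multi‑dimensional Metric 2.0 queries practical.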

Visualization Layer

Interactive dashboards display real‑time metrics with support for comparative views (YoY, MoM). Four primary performance views (TPS, RT, CPU, Load) replace the generic ServerMap, and an independent application view separates cluster and single‑node perspectives for faster issue localization.

Alert Layer

Both fixed‑threshold and dynamic‑threshold detection are provided. Dynamic thresholds use historical data to predict anomalies, aiming to minimise false positives and missed alerts.
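The article does not detail the prediction model, so as a conceptual stand‑in the sketch below derives an anomaly band from historical samples and flags values more than k standard deviations from the mean:

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Derive a (low, high) anomaly band from historical samples:
    mean +/- k standard deviations. A simple stand-in for the
    platform's prediction-based thresholds."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return mean - k * std, mean + k * std

def is_anomalous(value, history, k=3.0):
    low, high = dynamic_threshold(history, k)
    return not (low <= value <= high)

history = [100, 102, 98, 101, 99, 103, 97, 100]   # recent per-minute TPS
print(is_anomalous(100, history))  # → False: inside the learned band
print(is_anomalous(160, history))  # → True: far outside the band
```

A fixed threshold would have to be tuned per metric and re‑tuned as traffic grows; deriving the band from history is what reduces both false positives and missed alerts.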

Platform Evolution

Phase 1 – Introduction & Promotion

Added a routing layer between the JVM and Pinpoint agent; version routing is controlled via the ApolloY configuration centre, allowing agent upgrades without JVM restarts.

Implemented dynamic sampling rate adjustment in the Caesar platform, enabling on‑the‑fly changes without service restarts.
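Conceptually, runtime‑adjustable sampling can look like the sketch below: the rate is mutable state updated by a config‑push callback, and hashing the trace ID keeps the per‑trace decision deterministic. Class and method names are hypothetical, not Caesar's actual API:

```python
import hashlib

class DynamicSampler:
    """Sampling decision whose rate can be changed at runtime (e.g. via a
    config-centre push) without restarting the service."""

    def __init__(self, rate=0.01):
        self.rate = rate  # fraction of traces to keep, 0.0-1.0

    def update_rate(self, rate):
        # called from a config-push callback; takes effect immediately
        self.rate = rate

    def should_sample(self, trace_id):
        # hash the trace ID into [0, 1) so the same trace always gets
        # the same decision across services
        digest = hashlib.md5(trace_id.encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return bucket < self.rate

sampler = DynamicSampler(rate=0.0)
print(sampler.should_sample("trace-123"))  # → False while rate is 0.0
sampler.update_rate(1.0)                   # hot update, no restart needed
print(sampler.should_sample("trace-123"))  # → True at rate 1.0
```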

Integrated with internal authentication (NetEase OpenID), alert delivery channels (POP‑O, Yixin, email, SMS), and the global deployment system for agent distribution.

Developed custom plugins: DDB (distributed database), WZP (private TCP protocol), and FastJson support.

Phase 2 – Full‑Link Completion

Refined Pinpoint’s Trace and Metric models into a multi‑dimensional Metric 2.0 schema stored in NTSDB, enabling flexible tag‑based queries and real‑time aggregation.

Extended tracing to client, mobile, and edge components, achieving end‑to‑end latency visibility.

Introduced a large‑front‑end dashboard exposing TPS, RT, CPU, Load and custom metrics such as startup indicators, H5 popup errors, and anti‑hijack monitoring.

Phase 3 – Fine‑Grained Construction

Decoupled sampling for key metrics: trace IDs are generated for all requests, while interface, cache, and SQL metrics are collected independently of Pinpoint sampling.
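The decoupling can be sketched as follows: every request is issued a trace ID and always contributes a metric point, while full trace detail is recorded only when the sampler says so. All names here are illustrative, not Pinpoint internals:

```python
import uuid

def handle_request(endpoint, latency_ms, sampler, metrics, traces):
    """Every request gets a trace ID and a metric point unconditionally;
    detailed trace data is stored only for sampled requests."""
    trace_id = uuid.uuid4().hex                          # issued for ALL requests
    metrics.setdefault(endpoint, []).append(latency_ms)  # unsampled metrics
    if sampler(trace_id):                                # trace detail only if sampled
        traces[trace_id] = {"endpoint": endpoint, "latency_ms": latency_ms}
    return trace_id

metrics, traces = {}, {}
never_sample = lambda tid: False   # simulate a 0% trace-sampling rate

for latency in (12.0, 18.0, 25.0):
    handle_request("/checkout", latency, never_sample, metrics, traces)

print(len(metrics["/checkout"]))  # → 3 metric points despite zero traces
print(len(traces))                # → 0
```

This is why the second‑level dashboards in this phase can be exact: interface, cache, and SQL metrics no longer inherit the trace sampler's blind spots.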

Replaced the agent’s BlockingQueue with the lock‑free Disruptor framework and added optional Kafka reporting to improve data‑ingestion throughput.

Created precise second‑level dashboards (performance, error, cache, dependency) based on unsampled data.

Implemented a problem‑localisation module that correlates trace and SQL information to pinpoint root causes.

Phase 4 – Systematic Fusion

Built a fault‑driven alert convergence mechanism that follows the principle “comprehensive monitoring → rapid response → damage control → timely repair → unified management → hierarchical handling”.

Introduced an event‑bus based on the CloudEvent standard to standardise data exchange across platforms, enabling alarm convergence and collaborative fault handling.

Added APM proactive risk identification (pre‑configured risk rules) and online diagnosis support via Arthas‑based white‑box debugging.

Conclusion

The full‑link monitoring platform demonstrates how a low‑intrusive, real‑time, and extensible architecture can evolve from basic capability coverage to a systematic, fault‑driven ecosystem. By standardising metrics with CloudEvent, decoupling sampling, and unifying alert handling, the platform provides robust service continuity while remaining adaptable to future business requirements.

Tags: Monitoring, microservices, operations, observability, alerting, Full‑Link Tracing, event bus