How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices
This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.
Introduction
Effective monitoring and alerting are essential for service governance in micro‑service architectures. The platform must make service health perceivable, support anomaly prediction and emergency response, enable cross‑domain collaboration, and avoid siloed tools that drive up operational cost.
Monitoring Standards
Unified Monitoring Platform – a single system that aggregates metrics from all services and eliminates tool‑specific silos.
Service Performance Metric Measurement – capture latency of every component interaction (disk I/O, RPC, API calls) to assess user‑experience impact.
Metric 2.0 Event Standard – enrich metrics with context (tags, multiple fields) and encode them as CloudEvents for seamless cross‑platform correlation.
Secondary Mining of Monitoring Data – store raw time‑series data for risk‑prediction models and scenario replay.
Full‑Link Monitoring Platform Architecture
The platform addresses three core challenges: mapping service call relationships, collecting fine‑grained performance data, and rapid fault detection. After evaluating several open‑source solutions, Pinpoint was selected as the foundation because of its low intrusiveness, extensibility, and strong community support.
Collection Layer
Agent (JavaAgent) – bytecode enhancement for low‑overhead data capture; see the sketch after this list.
SDK (hard‑coded probes) – high‑precision instrumentation where needed.
Log‑based collection – non‑intrusive, requires standardized log output.
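To make the agent approach concrete, here is a minimal sketch of a Java agent entry point that registers a class‑file transformer at load time. The class and package names are hypothetical, and a production agent such as Pinpoint rewrites the bytecode (with ASM or Byte Buddy) rather than returning it unchanged.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical agent entry point: the JVM is started with
// -javaagent:monitor-agent.jar and calls premain() before main().
public final class MonitorAgent {

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Only look at application classes; returning null leaves the bytecode untouched.
                if (className == null || !className.startsWith("com/yanxuan/")) {
                    return null;
                }
                // A real agent rewrites classfileBuffer here to weave timing and
                // trace-context propagation around RPC, DAO and cache calls.
                return null;
            }
        });
    }
}
```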
Analysis Layer
Real‑time parsing of client‑reported data, applying business formulas (count, sum) to produce minute‑level metrics. Supports stream processing frameworks such as Flink or Spark.
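As a rough sketch of this layer, the Flink job below buckets simplified call records into one‑minute windows and applies the count/sum formulas. The record shape, key format, and in‑memory source are illustrative assumptions, not the platform's actual job.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MinuteMetricJob {

    /** One reported call record (hypothetical shape of agent-reported data). */
    public static class CallRecord {
        public String service;
        public String endpoint;
        public long rtMillis;
        public long timestamp;
        public CallRecord() {}
        public CallRecord(String service, String endpoint, long rtMillis, long timestamp) {
            this.service = service; this.endpoint = endpoint;
            this.rtMillis = rtMillis; this.timestamp = timestamp;
        }
    }

    /** Minute-level aggregate: request count and total RT (avg RT = rtSum / count). */
    public static class MinuteMetric {
        public String key;
        public long count;
        public long rtSum;
        public MinuteMetric() {}
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new CallRecord("order", "/order/create", 35, System.currentTimeMillis()),
                new CallRecord("order", "/order/create", 70, System.currentTimeMillis()))
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<CallRecord>forMonotonousTimestamps()
                    .withTimestampAssigner((r, ts) -> r.timestamp))
           .keyBy(r -> r.service + "|" + r.endpoint)
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .aggregate(new AggregateFunction<CallRecord, MinuteMetric, MinuteMetric>() {
                @Override public MinuteMetric createAccumulator() { return new MinuteMetric(); }
                @Override public MinuteMetric add(CallRecord r, MinuteMetric acc) {
                    acc.key = r.service + "|" + r.endpoint;
                    acc.count += 1;          // business formula: count
                    acc.rtSum += r.rtMillis; // business formula: sum
                    return acc;
                }
                @Override public MinuteMetric getResult(MinuteMetric acc) { return acc; }
                @Override public MinuteMetric merge(MinuteMetric a, MinuteMetric b) {
                    a.count += b.count; a.rtSum += b.rtSum; return a;
                }
            })
           .print();

        env.execute("minute-level-metrics");
    }
}
```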
Storage Layer
Massive time‑series data is stored in NTSDB (an InfluxDB‑based TSDB) which offers schema‑less multi‑dimensional tags, multiple fields, and low‑latency aggregation—overcoming the limitations of the original HBase storage.
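Because NTSDB's client is internal, the sketch below uses the open‑source influxdb‑java client to show the same data shape: indexed tags for dimensions plus multiple value fields per point. The endpoint URL, measurement, tag, and field names are assumptions.

```java
import java.util.concurrent.TimeUnit;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

// Illustrative write path: NTSDB is InfluxDB-based, so the open-source client
// shows the equivalent tags-plus-fields point model.
public class MetricWriteExample {
    public static void main(String[] args) {
        InfluxDB db = InfluxDBFactory.connect("http://ntsdb.example.internal:8086", "user", "pass");
        db.setDatabase("apm");

        // One minute-level point: tags are indexed dimensions, fields carry values.
        Point p = Point.measurement("rpc_metrics")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .tag("app", "order-service")          // dimension: application
                .tag("endpoint", "/order/create")     // dimension: interface
                .tag("cluster", "hz-prod")            // dimension: cluster
                .addField("count", 1200L)             // field: request count
                .addField("rt_avg", 42.5)             // field: average latency (ms)
                .addField("error", 3L)                // field: error count
                .build();

        db.write(p);
        db.close();
    }
}
```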
Visualization Layer
Interactive dashboards display real‑time metrics with support for period‑over‑period comparisons (e.g., against the same time in a previous day or week). Four primary performance views (TPS, RT, CPU, Load) replace the generic ServerMap, and an independent application view separates cluster and single‑node perspectives for faster issue localization.
Alert Layer
Both fixed‑threshold and dynamic‑threshold detection are provided. Dynamic thresholds use historical data to predict anomalies, aiming to minimise false positives and missed alerts.
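A minimal sketch of the two detection modes follows. The dynamic variant derives its limit from historical samples with a mean‑plus‑k‑standard‑deviations band, which is one common technique and not necessarily the exact model used by the platform.

```java
import java.util.List;

// Minimal sketch of fixed vs. dynamic threshold checks; production rules are
// richer (windowing, consecutive-breach counts, per-metric tuning).
public final class ThresholdDetector {

    /** Fixed threshold: alert when the latest value crosses a configured limit. */
    public static boolean fixedBreach(double latest, double limit) {
        return latest > limit;
    }

    /**
     * Dynamic threshold: derive the limit from history, illustrated here with a
     * simple mean + k * stddev band over comparable historical samples.
     */
    public static boolean dynamicBreach(double latest, List<Double> history, double k) {
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = history.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0);
        return latest > mean + k * Math.sqrt(var);
    }

    public static void main(String[] args) {
        System.out.println(fixedBreach(1200, 1000));                              // true
        System.out.println(dynamicBreach(1200, List.of(800.0, 850.0, 820.0), 3)); // true
    }
}
```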
Platform Evolution
Phase 1 – Introduction & Promotion
Added a routing layer between the JVM and Pinpoint agent; version routing is controlled via the ApolloY configuration centre, allowing agent upgrades without JVM restarts.
Implemented dynamic sampling rate adjustment in the Caesar platform, enabling on‑the‑fly changes without service restarts (see the sketch after this list).
Integrated with internal authentication (NetEase OpenID), alert delivery channels (POP‑O, Yixin, email, SMS), and the global deployment system for agent distribution.
Developed custom plugins: DDB (distributed database), WZP (private TCP protocol), and FastJson support.
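A minimal sketch of the restart‑free sampling idea referenced above, assuming the rate is pushed into a volatile field by a configuration‑centre change listener; the permille‑based decision is illustrative rather than the agent's exact algorithm.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of restart-free sampling: the rate lives in a volatile field that a
// config-centre listener (ApolloY in this article) can overwrite at runtime.
public final class DynamicSampler {

    private volatile int samplePermille = 10; // 10/1000 = 1% of requests traced

    /** Called by the config-centre change listener when the rate is updated. */
    public void updateRate(int newPermille) {
        this.samplePermille = newPermille;
    }

    /** Decide per request whether to record a full trace. */
    public boolean shouldSample() {
        return ThreadLocalRandom.current().nextInt(1000) < samplePermille;
    }
}
```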
Phase 2 – Full‑Link Completion
Refined Pinpoint's Trace and Metric models into a multi‑dimensional Metric 2.0 schema stored in NTSDB, enabling flexible tag‑based queries and real‑time aggregation (see the query sketch after this list).
Extended tracing to client, mobile, and edge components, achieving end‑to‑end latency visibility.
Introduced a large‑front‑end dashboard exposing TPS, RT, CPU, Load and custom metrics such as startup indicators, H5 popup errors, and anti‑hijack monitoring.
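The query sketch referenced above shows the kind of tag‑filtered, time‑bucketed aggregation the Metric 2.0 schema makes possible, again written against the open‑source InfluxDB client because NTSDB's own interface is not public; the measurement and tag names follow the earlier write‑side sketch and are assumptions.

```java
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Query;
import org.influxdb.dto.QueryResult;

// Illustrative read path: average latency per endpoint for one application over
// the last hour, bucketed per minute -- the kind of slice the dashboards render.
public class MetricQueryExample {
    public static void main(String[] args) {
        InfluxDB db = InfluxDBFactory.connect("http://ntsdb.example.internal:8086", "user", "pass");

        QueryResult result = db.query(new Query(
                "SELECT MEAN(rt_avg) FROM rpc_metrics "
              + "WHERE app = 'order-service' AND time > now() - 1h "
              + "GROUP BY time(1m), endpoint", "apm"));

        System.out.println(result);
        db.close();
    }
}
```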
Phase 3 – Fine‑Grained Construction
Decoupled sampling for key metrics: trace IDs are generated for all requests, while interface, cache, and SQL metrics are collected independently of Pinpoint sampling.
Replaced the agent's BlockingQueue with the lock‑free Disruptor framework and added optional Kafka reporting to improve data‑ingestion throughput (see the sketch after this list).
Created precise second‑level dashboards (performance, error, cache, dependency) based on unsampled data.
Implemented a problem‑localisation module that correlates trace and SQL information to pinpoint root causes.
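The sketch referenced above illustrates the BlockingQueue‑to‑Disruptor swap on the reporting path. The event type and the console consumer are stand‑ins for the real batching consumer that ships data to the collector or, when enabled, to Kafka.

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class SpanReportPipeline {

    /** Mutable ring-buffer slot, reused across events to avoid allocation. */
    public static class SpanEvent {
        String traceId;
        long rtMillis;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of two.
        Disruptor<SpanEvent> disruptor = new Disruptor<>(
                SpanEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // Consumer: in the real agent this batches events and forwards them
        // to the collector (or a Kafka topic when that path is enabled).
        disruptor.handleEventsWith((EventHandler<SpanEvent>) (event, sequence, endOfBatch) ->
                System.out.printf("report %s rt=%dms%n", event.traceId, event.rtMillis));
        disruptor.start();

        // Producer side: lock-free publish from request threads.
        RingBuffer<SpanEvent> ring = disruptor.getRingBuffer();
        ring.publishEvent((event, sequence) -> {
            event.traceId = "trace-0001";
            event.rtMillis = 42;
        });

        disruptor.shutdown();
    }
}
```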
Phase 4 – Systematic Fusion
Built a fault‑driven alert convergence mechanism that follows the principle “comprehensive monitoring → rapid response → damage control → timely repair → unified management → hierarchical handling”.
Introduced an event‑bus based on the CloudEvents standard to standardise data exchange across platforms, enabling alarm convergence and collaborative fault handling (see the sketch after this list).
Added APM proactive risk identification (pre‑configured risk rules) and online diagnosis support via Arthas‑based white‑box debugging.
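As a sketch of the event‑bus envelope referenced above, the code below wraps an alert payload in a CloudEvents object using the CloudEvents Java SDK. The source, type, and payload values are illustrative, and the transport (for example Kafka) is out of scope here.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.util.UUID;

import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;

public class AlertEventPublisher {
    public static void main(String[] args) {
        String alertJson = "{\"app\":\"order-service\",\"metric\":\"rt_avg\",\"value\":950}";

        CloudEvent event = CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())
                .withSource(URI.create("/apm/alert-center"))
                .withType("com.yanxuan.apm.alert.triggered")
                .withTime(OffsetDateTime.now())
                .withDataContentType("application/json")
                .withData(alertJson.getBytes(StandardCharsets.UTF_8))
                .build();

        // Downstream platforms consume the same envelope, so correlation keys
        // (id, source, type, time) stay uniform across monitoring, alerting,
        // and fault-management systems.
        System.out.println(event);
    }
}
```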
Conclusion
The full‑link monitoring platform demonstrates how a low‑intrusive, real‑time, and extensible architecture can evolve from basic capability coverage to a systematic, fault‑driven ecosystem. By standardising metrics with CloudEvent, decoupling sampling, and unifying alert handling, the platform provides robust service continuity while remaining adaptable to future business requirements.
NetEase Yanxuan Technology Product Team
The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.