Design and Architecture of Corona: NetEase Cloud Music Multi‑Platform Front‑End Monitoring System
Corona is NetEase Cloud Music’s unified, cross‑platform front‑end monitoring system. It ingests logs from Web, React Native, Node.js, Android, iOS, Flutter and Windows CEF, enriches them, routes them through real‑time anomaly and performance pipelines, and persists the raw logs in HBase. On top of that it offers customizable alerts, de‑obfuscation, AI‑assisted analysis and extensible reporting to support rapid fault detection and remediation across the organization.
NetEase Cloud Music’s multi‑platform front‑end monitoring product (code‑named Corona) supports Web, React Native, Node.js, Flutter, Android, iOS and Windows CEF applications. It has been integrated into dozens of business units across NetEase, providing anomaly detection, performance monitoring, issue tracing and real‑time alerts.
Background
The Corona anomaly‑monitoring component started development in early 2020. At that time, different teams within the front‑end organization used a variety of open‑source or commercial monitoring solutions.
The front‑end team deployed a private instance of Sentry; the Android team used NetEase’s internal “cloud‑catch” and Tencent’s Bugly; the iOS team relied on Google’s Firebase Crashlytics; the Windows‑CEF team had no platform‑level product and used Breakpad to parse crash logs.
These fragmented solutions caused several problems:
Many logs are not true bugs (e.g., third‑party script injection, network failures on weak connections) and therefore pollute the developers’ debugging workflow.
Alerting mechanisms differ across products, making it impossible to define a unified alert‑closure strategy.
Process‑level closure points become bottlenecks because the monitoring capabilities are uneven across teams.
Query and analysis capabilities are missing; developers cannot obtain precise event counts, affected‑user metrics, or dimensional distributions.
Custom reporting is weak; business‑specific runtime states cannot be reported with trend or alerting support.
Complex issues require cross‑platform context (business‑level events, network logs) that existing tools do not provide.
Business‑level quality and experience metrics need multi‑platform log aggregation for big‑data analysis.
When the Cloud Music team began adopting React Native, role boundaries blurred and the fragmented monitoring stack hindered collaboration and process convergence. Hence a unified, cross‑role, cross‑platform monitoring product owned by the team was conceived.
Non‑Functional Requirements
Corona must deliver rapid anomaly discovery, fast issue localization and assistance with fault remediation, while remaining highly customizable for business‑specific data. The non‑functional goals are:
High real‑time performance – the value of information decays quickly.
Full data capture – no anomaly log should be dropped.
High availability – the monitoring system must be more stable than the applications it observes.
High throughput – capable of processing massive log volumes.
Fault tolerance – the SDK and services must not impact business operation.
Extensibility – service nodes are pluggable and replaceable.
Architecture Overview
The system is composed of several logical components, identified in the diagram as ①–⑥:
① Log ingestion, reception and collection. The log‑ingestion service integrates with internal middle‑platform services to enrich logs with business attributes (e.g., user‑id derived from cookies). The node is replaceable to allow other business units to inject additional attributes.
② Traffic splitting. This service was added after several high‑traffic incidents. It classifies logs into anomaly, performance and traffic streams and routes each stream to a dedicated consumer service, which isolates the streams from one another and lets resources be allocated cost‑effectively.
③ Anomaly‑log consumer. Implemented as a near‑real‑time batch task that processes logs in 30‑second windows or 3,000‑record batches (both configurable). Batching reduces database write TPS and improves stability under traffic spikes.
④ Centralized log‑filtering. Provides emergency throttling during traffic spikes and allows users to discard noisy data. Filtering can be performed early in the pipeline to protect downstream services.
⑤ Asynchronous auxiliary tasks (e.g., alert detection, error‑type analysis, de‑obfuscation) run as scheduled jobs, keeping the main processing path lightweight.
⑥ Raw logs are persisted to HBase as a data‑asset backup, enabling self‑service analysis and downstream user‑journey analytics.
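The batching behaviour described for the anomaly‑log consumer (③) can be sketched roughly as below. This is a minimal illustration, not Corona's implementation: the class name `MicroBatcher`, the `flush` callback and the injected `clock` are hypothetical, and reading "30‑second windows or 3,000‑record batches" as "flush on whichever limit is hit first" is an assumption.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffers records and flushes when a size limit or an age limit is reached."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_records: int = 3000,          # batch size from the article
        max_age_seconds: float = 30.0,    # window length from the article
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[dict] = []
        self._opened_at: Optional[float] = None  # arrival time of the oldest buffered record

    def add(self, record: dict) -> None:
        if not self._buffer:
            self._opened_at = self._clock()
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush_now()

    def tick(self) -> None:
        """Call periodically; flushes if the oldest buffered record is too old."""
        if self._buffer and self._clock() - self._opened_at >= self._max_age:
            self.flush_now()

    def flush_now(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._opened_at = None
        # One bulk write per batch instead of one write per record.
        self._flush(batch)
```

With this shape, the number of database writes is bounded by the number of batches rather than the number of logs, which is the effect on write TPS that the article attributes to the design.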
The core differentiators from existing open‑source or commercial solutions are:
Enhanced storage layer with multiple engines and redundancy to support complex application‑level features.
Lightweight data collection with sophisticated routing, independent consumption pipelines, filtering and feature extraction, allowing new features to be added without SDK upgrades.
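To make the routing idea concrete, here is a small sketch of a stream router with independent consumers. All names (`StreamRouter`, the `kind` field, the fallback to the traffic stream) are illustrative assumptions; the article does not describe how Corona actually classifies logs.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]

# Stream names taken from the article's description of the traffic-splitting service.
STREAMS = ("anomaly", "performance", "traffic")


class StreamRouter:
    """Classifies each log and hands it to the consumers of exactly one stream."""

    def __init__(self, default_stream: str = "traffic") -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self._default = default_stream

    def subscribe(self, stream: str, handler: Handler) -> None:
        if stream not in STREAMS:
            raise ValueError(f"unknown stream: {stream}")
        self._handlers[stream].append(handler)

    def classify(self, log: dict) -> str:
        # Hypothetical rule: the SDK tags each log with a "kind" field.
        kind = log.get("kind")
        return kind if kind in STREAMS else self._default

    def route(self, log: dict) -> str:
        stream = self.classify(log)
        for handler in self._handlers[stream]:
            handler(log)
        return stream
```

Because consumers are attached on the server side with `subscribe`, a new processing step can be added to a stream without changing what the SDK sends, which is the property the paragraph above highlights.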
Cross‑Platform Log Protocol
Corona defines an extensible log protocol that all SDKs (Web, React Native, Node.js, Android, iOS, etc.) follow. The protocol separates three logical parts:
Exception object: core unit containing exception type, description and stack trace. Supports precise and fuzzy search.
Feature dimensions: environment information (timestamp, device model, app version, OS) plus stack‑specific attributes such as React Native bundle version, Android root status, Node.js version, etc. Enables precise search and distribution statistics.
Contextual extension data: optional data like user actions before a Web exception, system metrics for Android, or system logs for iOS. This data is only displayed; no fuzzy search is provided.
The protocol’s extensibility, combined with a polymorphic software architecture, ensures consistent user experience while allowing platform‑specific implementations.
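One way to picture the three‑part protocol is as a single record type shared by every platform, with platform‑specific data living in open‑ended maps. The field names below (`platform`, `dimensions`, `context`, `bundle_version`, and so on) are assumptions made for illustration; the actual Corona wire format is not given in the article.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ExceptionObject:
    """Core unit: searchable precisely and fuzzily."""
    type: str
    description: str
    stack: str


@dataclass
class LogRecord:
    platform: str
    exception: ExceptionObject
    # Feature dimensions: flat key/value pairs used for precise search and distributions.
    dimensions: Dict[str, str] = field(default_factory=dict)
    # Contextual extension data: free-form, display only.
    context: Dict[str, Any] = field(default_factory=dict)

    def to_wire(self) -> Dict[str, Any]:
        return asdict(self)

    def matches(self, **wanted: str) -> bool:
        """Precise search: every requested dimension must match exactly."""
        return all(self.dimensions.get(k) == v for k, v in wanted.items())


rn_log = LogRecord(
    platform="react-native",
    exception=ExceptionObject("TypeError", "undefined is not a function", "at render (index.bundle:1:2048)"),
    dimensions={"app_version": "8.9.0", "os": "iOS 16", "bundle_version": "42"},
    context={"recent_actions": ["open_playlist", "tap_play"]},
)
```

A Web or Android SDK would emit the same shape with different keys in `dimensions` and `context`, so the server can aggregate and search all platforms uniformly while each platform adds only the attributes it needs.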
Source‑map based stack de‑obfuscation works as follows:
Web: SourceMap files are uploaded to a CDN. The SDK captures the stack; the consumer then downloads the JS file and its SourceMap and resolves the original source line.
Node.js: The SDK parses the stack, reads the source file directly, extracts the relevant snippet and attaches it to the log.
React Native: As with Web, the consumer downloads the bundle and resolves the source snippet.
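The Node.js path (parse the stack, read the source, extract a snippet) is simple enough to sketch. The example below is written in Python for consistency with the other sketches, although a real Node.js SDK would be JavaScript. It assumes V8‑style frames such as `at fn (/app/a.js:10:5)`; the function names and the two‑line context window are hypothetical choices.

```python
import re
from typing import List, Optional, Tuple

# Matches V8-style frames: "at fn (/app/a.js:10:5)" or "at /app/a.js:10:5".
_FRAME = re.compile(
    r"^\s*at (?:(?P<fn>.+?) \()?(?P<file>.+?):(?P<line>\d+):(?P<col>\d+)\)?$"
)


def parse_frame(frame: str) -> Optional[Tuple[str, int, int]]:
    """Returns (file, line, column) for one stack frame, or None if it does not parse."""
    m = _FRAME.match(frame)
    if m is None:
        return None
    return m.group("file"), int(m.group("line")), int(m.group("col"))


def snippet(source: str, line: int, context: int = 2) -> List[str]:
    """Returns the failing line plus `context` lines on each side, with line numbers."""
    lines = source.splitlines()
    first = max(line - context, 1)
    last = min(line + context, len(lines))
    out = []
    for n in range(first, last + 1):
        marker = ">" if n == line else " "
        out.append(f"{marker} {n:4d} | {lines[n - 1]}")
    return out


source_text = "const fs = require('fs');\nfunction readConfig() {\n  return JSON.parse(raw);\n}\nreadConfig();"
parsed = parse_frame("    at readConfig (/app/src/config.js:3:10)")
if parsed is not None:
    _file, line_no, _col = parsed
    # In an SDK the source would be read from _file; here it is inlined.
    excerpt = snippet(source_text, line_no, context=1)
```

Attaching the excerpt at capture time means the consumer needs no SourceMap lookup for Node.js logs, which matches the difference between the Node.js and Web paths described above.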
Feature Design
Corona provides ten major functional modules:
Real‑time quality metrics and trends (≈1 minute latency).
Precise exception aggregation with automatic trend, time‑range, version‑range and affected‑user calculations.
Centralized data de‑noising via custom filter rules applied at the log level, without redeploying applications.
Feature extraction and distribution statistics to surface common characteristics among large log volumes.
Precise search across multiple feature dimensions and fuzzy stack search.
Stack parsing with source‑map, de‑obfuscation and symbolization support.
Multi‑dimensional, multi‑channel alerting (threshold, surge, baseline models) with email, popo and SMS channels, plus high‑frequency and gradient suppression.
Custom reporting for business‑specific metrics and auxiliary event tracking.
Issue workflow integration (GitLab‑style issue assignment, markdown comments) and optional ticket‑system linkage.
AI‑assisted analysis (e.g., ChatGPT integration) for faster stack interpretation and information retrieval.
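As a rough illustration of the alerting module, the sketch below combines a threshold rule, a surge rule and a cooldown. It is an assumption‑laden simplification: the names `AlertRule` and `AlertEvaluator` are hypothetical, the surge rule compares the current window with the mean of recent windows, and a fixed cooldown stands in for the high‑frequency suppression the article mentions. Baseline models and gradient suppression are not covered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    threshold: int           # fire when the current window count reaches this value
    surge_ratio: float       # fire when current / mean(recent windows) reaches this value
    cooldown_seconds: float  # suppress repeat alerts for this long after one fires


class AlertEvaluator:
    def __init__(self, rule: AlertRule) -> None:
        self.rule = rule
        self._last_fired: Optional[float] = None

    def evaluate(self, history: List[int], current: int, now: float) -> Optional[str]:
        """Returns "threshold", "surge", or None (no alert, or alert suppressed)."""
        reason: Optional[str] = None
        if current >= self.rule.threshold:
            reason = "threshold"
        elif history:
            baseline = sum(history) / len(history)
            if baseline > 0 and current / baseline >= self.rule.surge_ratio:
                reason = "surge"
        if reason is None:
            return None
        if self._last_fired is not None and now - self._last_fired < self.rule.cooldown_seconds:
            return None  # an alert fired recently; stay quiet
        self._last_fired = now
        return reason
```

The surge rule catches a jump from 10 to 40 errors per window that a fixed threshold of 100 would miss, while the cooldown keeps a sustained incident from sending a notification every window.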
Conclusion
Corona has been in production for three years, continuously evolving its feature set and serving multiple NetEase business units. This article offers a high‑level overview of the product; future posts will dive into individual components.
NetEase Cloud Music Tech Team