
How NetEase Cloud Communication Builds a Real-Time Service Monitoring Platform

NetEase Cloud Communication’s service monitoring platform leverages data collection, preprocessing, alerting, and visualization pipelines—using HTTP APIs, Kafka, custom scripts, and NTSDB—to provide real-time insights, ensure stability, and support scalable, high‑throughput audio‑video services.

NetEase Smart Enterprise Tech+

Data is essential for many businesses, and NetEase Cloud Communication uses it to improve its services and drive continuous growth. Its service monitoring platform acts like the dashboard of a high‑performance car: it shows speed, fuel, and RPM so the driver can decide when to accelerate or brake, which in turn keeps the communication services stable and reliable.

System Architecture

The platform processes audio‑video data primarily from client and server logs. The overall data collection chain is critical for data validity and timeliness.

Data Collection

Data enters the platform via two channels: an HTTP API for near‑real‑time reporting and Kafka for high‑throughput scenarios. A preprocessing module filters out invalid data and performs an early split of the records before forwarding them to the processing service.
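
The filter-and-split step can be pictured with a minimal Python sketch. The article does not describe the real payload schema, so the field names (`app_id`, `timestamp`, `events`) and the `preprocess` function are hypothetical, chosen only to illustrate dropping invalid payloads and splitting a batch into one record per event.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator

# Hypothetical envelope fields; the real schema is not given in the article.
REQUIRED_FIELDS = ("app_id", "timestamp", "events")


def preprocess(raw_messages: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Drop invalid payloads and split each batch into one record per event."""
    for raw in raw_messages:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: filtered out
        if not isinstance(payload, dict):
            continue  # valid JSON but not an object
        if any(name not in payload for name in REQUIRED_FIELDS):
            continue  # missing a required envelope field
        events = payload["events"]
        if not isinstance(events, list):
            continue
        for event in events:
            if not isinstance(event, dict):
                continue
            # Early split: every event becomes its own record and
            # carries the shared envelope fields with it.
            yield {
                "app_id": payload["app_id"],
                "timestamp": payload["timestamp"],
                **event,
            }
```

Because the function is a generator, the same code could sit behind either an HTTP handler or a Kafka consumer loop and hand records downstream one at a time.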

Data Processing

After collection, data passes through task scheduling, consumer queues, and processing units. The platform supports generic rules (JSON conversion, field extraction) that cover about 80% of cases, and custom scripts for complex calculations such as multi‑field correlation, regular‑expression matching, and stream joins. Dimension tables handle high‑volume, high‑concurrency scenarios using local and third‑party caches. Processed data is written to NTSDB, a clustered time‑series database based on InfluxDB that offers high availability, compression, and high concurrency.
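
As an illustration of a generic field-extraction rule and a locally cached dimension lookup, here is a small Python sketch. The rule format (output field mapped to a dotted JSON path), the `RULE` contents, and the `DimensionCache` class are all assumptions made for this example; the article does not show the platform's actual rule syntax or cache design.

```python
from __future__ import annotations

import json
import time
from typing import Any, Callable

# Hypothetical rule: output field name -> dotted path in the source JSON.
RULE = {"user_id": "user.id", "codec": "media.audio.codec", "rtt_ms": "stats.rtt"}


def extract_path(doc: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts, returning default if absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def apply_rule(raw: str, rule: dict[str, str]) -> dict[str, Any]:
    """JSON conversion followed by field extraction, driven by the rule."""
    doc = json.loads(raw)
    return {out: extract_path(doc, path) for out, path in rule.items()}


class DimensionCache:
    """Local TTL cache in front of a slower dimension-table lookup."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # e.g. a call to a remote cache or database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh local entry, no remote call
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Keeping the rule as plain data is what lets one code path serve most tasks; only the cases the rule cannot express would need a custom script.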

Monitoring & Alerting

The alerting stage consists of a metric aggregation module and an alert module. Aggregation supports field grouping, flexible windows, filtering, and a rich set of operators (sum, min/max, first/last, avg, count, distinct, TP90/TP95/TP99, period‑over‑period change, standard deviation) as well as composite metrics. To mitigate skew caused by hot keys, aggregation runs in two stages: a pre‑aggregation step that shuffles records across partitions, followed by a final aggregation that merges the partial results.
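
One common way to implement this kind of two-stage aggregation is key salting, sketched below in Python. This is an assumption about the technique, not the platform's confirmed implementation; `Partial`, `two_stage_aggregate`, and `num_salts` are names invented for the example. It covers only operators whose partial results merge exactly (count, sum, min, max, and avg derived from sum and count).

```python
from __future__ import annotations

import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Partial:
    """Mergeable partial aggregate for one group."""

    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Partial") -> None:
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def avg(self) -> float:
        return self.total / self.count if self.count else 0.0


def two_stage_aggregate(
    records: Iterable[tuple[str, float]],
    num_salts: int = 4,
    rng: random.Random | None = None,
) -> dict[str, Partial]:
    rng = rng or random.Random()
    # Stage 1: spread each key over num_salts sub-keys so a hot key
    # is pre-aggregated by several workers instead of one.
    stage1: dict[tuple[str, int], Partial] = defaultdict(Partial)
    for key, value in records:
        stage1[(key, rng.randrange(num_salts))].add(value)
    # Stage 2: drop the salt and merge the partials per original key.
    final: dict[str, Partial] = defaultdict(Partial)
    for (key, _salt), partial in stage1.items():
        final[key].merge(partial)
    return dict(final)
```

Exact percentiles (TP90/TP95/TP99) and distinct counts cannot be merged from such simple partials; they would need the raw values or a mergeable sketch structure in the first stage.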

The alert module is decoupled from aggregation. It handles rule validation, rate limiting, message packaging, and delivery via internal IM, SMS, or phone channels.
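
The four alerting steps can be sketched in Python as follows. Everything here is hypothetical: the `AlertRule` fields, the threshold comparison, the per-rule minimum interval used for rate limiting, and the sender callables standing in for the IM, SMS, and phone channels.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    channels: tuple[str, ...]
    min_interval_s: float = 300.0  # assumed rate limit: one alert per window


class Alerter:
    def __init__(self, senders: dict[str, Callable[[str], None]]) -> None:
        self._senders = senders  # channel name -> delivery function
        self._last_sent: dict[str, float] = {}

    def handle(self, rule: AlertRule, value: float, now: float) -> bool:
        """Return True if an alert was delivered for this metric value."""
        # 1. Rule validation: does the aggregated value breach the rule?
        if value <= rule.threshold:
            return False
        # 2. Rate limiting: suppress repeats inside the minimum interval.
        last = self._last_sent.get(rule.name)
        if last is not None and now - last < rule.min_interval_s:
            return False
        # 3. Message packaging.
        message = (
            f"[{rule.name}] {rule.metric}={value} "
            f"exceeds threshold {rule.threshold}"
        )
        # 4. Delivery on every channel configured for the rule.
        for channel in rule.channels:
            self._senders[channel](message)
        self._last_sent[rule.name] = now
        return True
```

Since `handle` only takes an already aggregated value, the aggregation module can change its windows or operators without touching the alerting code, which is the practical benefit of the decoupling.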

Data Applications

Processed data feeds several downstream platforms:

Visualization: Grafana visualizes metrics stored in NTSDB; custom plugins extend Grafana for specialized dashboards.

Quality Service Platform: Provides real‑time problem‑locating tools for customers.

ELK Log Platform: Uses Logstash, Elasticsearch, and Kibana for detailed log search.

Online/Offline Analysis: Kafka streams data to Flink for slicing and archiving, then syncs to an offline warehouse for further mining.

Conclusion

Since its launch in early 2020, the platform has grown from a dozen collection tasks to over 300. It now covers more than 100 key user behaviors and system events and more than 300 core audio‑video metrics, and it processes millions of rows a day, amounting to terabytes of data. As concurrency and throughput keep growing, the demands on stability and scalability rise with them, and the platform continues to work toward meeting those demands while delivering higher‑quality services to customers.
