
Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture

The 58 Intelligent Monitoring System provides flexible, 24/7, multi‑dimensional monitoring across the network, server, system, application, and business layers. It incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis, and self‑healing, and offers both PC and WeChat interfaces for operators.

58 Tech

The 58 Intelligent Monitoring System aims to deliver a flexible, easy‑to‑use monitoring product for all business lines of the group. It provides 24/7 real‑time monitoring without blind spots by covering the network, server, system, application, and business layers.

Core Functions

Data collection (e.g., server resource usage, service status)

Configurable alarm policies

Accurate, low‑volume alarm delivery via multiple channels

Multi‑dimensional data visualization

The system acts as the guardian of online services. It helps operations, development, and testing teams quickly detect and troubleshoot faults and visualize operational data, and it provides intelligent insights such as alarm correlation, root‑cause analysis, and optimization suggestions.

Three‑Dimensional Monitoring Architecture

Vertical coverage includes:

Network layer – device status, bandwidth, QoS, etc.

Server layer – downtime, login failures, hardware faults

System layer – CPU, memory, disk, network usage

Application layer – port/process status, QPS

Business layer – PV, UV, order volume, revenue

Horizontal coverage includes:

User side – page performance, DNS hijacking, errors, timeouts

Data‑center network exit – VIP connectivity, page and interface monitoring

Traffic ingress – total traffic and per‑client (APP, mobile, PC) traffic, Nginx‑level metrics

Business cluster – single‑machine and cluster‑level monitoring of availability and response time

Cluster‑Based Monitoring Model

Nodes providing the same service are grouped into a cluster; all monitoring configuration (node list, templates, alarm recipients) is associated with the cluster, enabling easy scaling, node removal, and alarm rule updates without touching other settings.
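The cluster‑centric model can be pictured as a single record that owns the node list, templates, and alarm recipients. The following is a minimal sketch under that assumption; the class name, field names, and sample values are hypothetical and are not taken from the 58 system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster record: every monitoring setting hangs off it."""
    name: str
    nodes: set[str] = field(default_factory=set)         # node IPs or hostnames
    templates: list[str] = field(default_factory=list)   # monitoring templates
    recipients: list[str] = field(default_factory=list)  # alarm recipients

    def add_node(self, node: str) -> None:
        # Scaling out: the new node picks up the cluster's templates and recipients.
        self.nodes.add(node)

    def remove_node(self, node: str) -> None:
        # Removing a node leaves templates and recipients untouched.
        self.nodes.discard(node)

    def effective_config(self) -> dict[str, dict[str, list[str]]]:
        # Expand the cluster-level settings into a per-node view.
        return {
            node: {
                "templates": list(self.templates),
                "recipients": list(self.recipients),
            }
            for node in sorted(self.nodes)
        }
```

Because nodes carry no settings of their own in this sketch, adding or removing a node is a one‑line change, and editing the recipients once updates every node in the cluster.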

User Experience

The PC UI consists of three areas: menu, service tree, and business display. Selecting a node in the service tree defines the scope of data shown in the display area.

A lightweight WeChat version provides alarm details, metric views, alarm silencing, and progress remarks for collaborative handling.

Multi‑Dimensional Monitoring Methods

Basic monitoring – server downtime, resource usage, network quality

Service monitoring – port and process status

Custom monitoring – user‑defined metrics

Functional monitoring – page and interface checks

Availability monitoring – cluster and domain level availability, response time

Business‑level intelligent monitoring – prediction and anomaly detection of key business metrics

Implementation details:

Data is collected by agents on each server, stored, and evaluated for anomalies before being visualized.

Page and interface monitoring validates DNS resolution, connectivity, HTTP status, response time, content length, and specific keywords.

Cluster‑level probing detects server‑level issues even when Nginx retries mask them from end users.

Cluster and domain availability are derived from real‑time Nginx log aggregation using a Storm cluster.
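To make the log‑based availability calculation concrete, here is a small batch sketch in plain Python rather than Storm. The record layout, the 60‑second window, and the rule that a 5xx status counts as a failure are all assumptions made for illustration; the article does not state how availability is defined.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical parsed Nginx log record:
# (epoch_seconds, domain, http_status, response_time_ms)
LogRecord = Tuple[int, str, int, float]


def aggregate_availability(
    records: Iterable[LogRecord], window_s: int = 60
) -> Dict[Tuple[str, int], Dict[str, float]]:
    """Group records into (domain, window_start) buckets and compute
    availability and mean response time for each bucket.

    Assumption: a request counts as failed when its status is 5xx.
    """
    buckets = defaultdict(lambda: {"total": 0, "failed": 0, "rt_sum": 0.0})
    for ts, domain, status, rt_ms in records:
        window_start = ts - ts % window_s  # align to the start of the window
        bucket = buckets[(domain, window_start)]
        bucket["total"] += 1
        bucket["failed"] += 1 if status >= 500 else 0
        bucket["rt_sum"] += rt_ms
    return {
        key: {
            "availability": 1 - b["failed"] / b["total"],
            "avg_rt_ms": b["rt_sum"] / b["total"],
        }
        for key, b in buckets.items()
    }
```

A streaming system would keep the same per‑window counters in memory and emit them as each window closes; keying by cluster instead of domain follows the same pattern.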

Intelligent Monitoring – Machine‑Learning Workflow

The workflow follows four steps: problem definition, data processing, model training, and model deployment. Regression models predict daily traffic trends; classification models detect real‑time anomalies.
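As an illustration of the regression step, the sketch below fits an ordinary least‑squares line that maps each time slot's traffic on one day to the same slot on the next day, then applies it to the most recent day. The article does not name the model or its features, so the single‑feature linear model and the function names here are assumptions.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_next_day(history: Sequence[Sequence[float]]) -> List[float]:
    """history[d][s] is the traffic on day d at time slot s.

    Training pairs are (value on day d, value on day d + 1) for every slot;
    the fitted line is then applied to the last day in the history.
    """
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    xs = [v for day in history[:-1] for v in day]
    ys = [v for day in history[1:] for v in day]
    intercept, slope = fit_line(xs, ys)
    return [intercept + slope * v for v in history[-1]]
```

A production model would likely add features such as day of week or holidays; the point here is only the shape of the train‑then‑predict loop.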

The predicted values closely track the actual data. Anomaly detection classifies each data point as normal, severe, or abrupt, which allows alarms to be routed through different channels: voice calls for abrupt anomalies, and SMS or WeChat messages for severe ones.
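The routing of anomaly classes to alarm channels can be sketched as follows. The channel mapping comes from the text, but the classifier is replaced here by a simple rule‑based stand‑in: the two thresholds and the definitions of "severe" and "abrupt" are assumptions for illustration, not the trained classification model the article refers to.

```python
from typing import Dict, List

# Channel routing as described in the text; normal points raise no alarm.
CHANNELS: Dict[str, List[str]] = {
    "normal": [],
    "severe": ["sms", "wechat"],
    "abrupt": ["voice"],
}


def classify_point(
    actual: float,
    predicted: float,
    previous: float,
    severe_dev: float = 0.3,   # hypothetical threshold
    abrupt_jump: float = 0.5,  # hypothetical threshold
) -> str:
    """Rule-based stand-in for the classification model.

    abrupt: the value changed by more than `abrupt_jump` (relative)
            compared with the previous point
    severe: the value deviates from the prediction by more than
            `severe_dev` (relative)
    """
    jump = abs(actual - previous) / max(abs(previous), 1e-9)
    deviation = abs(actual - predicted) / max(abs(predicted), 1e-9)
    if jump > abrupt_jump:
        return "abrupt"
    if deviation > severe_dev:
        return "severe"
    return "normal"


def channels_for(label: str) -> List[str]:
    """Return the alarm channels for a given anomaly class."""
    return CHANNELS[label]
```

For example, `channels_for(classify_point(400, 1000, 980))` selects the voice channel, because the value dropped by more than half relative to the previous point.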

Smart Alarm Merging

To avoid alarm flooding, alarms are merged within a 1‑minute window based on user, status, channel, and dimension (cluster, IP, subnet, exception type, host/VM relationship). A custom Gini‑value‑based algorithm iteratively selects merging dimensions and partitions the dataset until a stop condition is met.
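The dimension‑selection idea can be illustrated with a short sketch. It assumes the alarms have already been bucketed into one window for one recipient, uses the standard Gini impurity (one minus the sum of squared value frequencies) to find the dimension whose values are most concentrated, peels off the largest group, and repeats. The stop condition and the output format are hypothetical; the custom algorithm used in the 58 system is not described in enough detail to reproduce.

```python
from collections import Counter
from typing import Dict, List, Sequence

Alarm = Dict[str, str]


def gini_impurity(values: Sequence[str]) -> float:
    """1 - sum(p_i^2): 0 when all values are equal, higher when they differ."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def merge_alarms(alarms: List[Alarm], dimensions: Sequence[str]) -> List[dict]:
    """Iteratively pick a merging dimension and split off one merged group."""
    remaining = list(alarms)
    merged: List[dict] = []
    while remaining:
        # Select the dimension whose values are most concentrated.
        best_dim = min(
            dimensions,
            key=lambda d: gini_impurity([a[d] for a in remaining]),
        )
        value, count = Counter(a[best_dim] for a in remaining).most_common(1)[0]
        if count == 1:
            # Stop condition (assumed): no two remaining alarms share a value
            # on any dimension, so the rest are delivered unmerged.
            merged.extend(
                {"dimension": None, "value": None, "alarms": [a]}
                for a in remaining
            )
            break
        group = [a for a in remaining if a[best_dim] == value]
        merged.append({"dimension": best_dim, "value": value, "alarms": group})
        remaining = [a for a in remaining if a[best_dim] != value]
    return merged
```

Each merged entry can then be rendered as a single message such as "3 alarms in cluster order", with the individual alarms available as detail.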

After merging, alarm volume fell by 76.65% while merge quality remained high, so operators receive concise, aggregated information for rapid decision‑making.

Smart Alarm Correlation Analysis

Correlation analysis computes Pearson correlation coefficients across large numbers of metrics and automatically presents root‑cause analysis results and visualized correlation graphs in WeChat alerts.
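A minimal sketch of the Pearson step is shown below: it scores every candidate metric against the alarming metric and keeps the strongly correlated ones. The function names, the 0.8 cut‑off, and the convention of returning 0.0 for a constant series are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Convention chosen here: a constant series carries no signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_related(
    target: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    min_abs_r: float = 0.8,  # hypothetical cut-off
) -> List[Tuple[str, float]]:
    """Return (metric name, r) pairs with |r| >= min_abs_r, strongest first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    strong = [(name, r) for name, r in scored if abs(r) >= min_abs_r]
    return sorted(strong, key=lambda item: abs(item[1]), reverse=True)
```

A high coefficient only flags a metric as moving together with the alarming one; it narrows the search for a root cause rather than proving causation.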

Traditional vs. Intelligent Monitoring

Traditional monitoring relies on static thresholds and manual analysis, whereas intelligent monitoring adds automation, three‑dimensional coverage, productization for better UX, and AI‑driven features such as prediction, anomaly detection, alarm merging, and self‑healing.

Summary

The system evolved through four stages: automation (auto‑sync from CMDB, template binding), three‑dimensional coverage (vertical and horizontal layers), productization (enhanced UI for internal users), and intelligence (integration of AI techniques to achieve predictive, self‑healing monitoring).
