How 58’s Intelligent Monitoring System Guarantees 24/7 Service Stability

This article details the design, architecture, and AI‑driven features of 58’s intelligent monitoring platform, explaining how multi‑dimensional data collection, predictive analytics, and smart alarm merging ensure continuous, automated observability across network, server, application, and business layers.

Efficient Ops

Monitoring System Overview

The 58 Intelligent Monitoring System aims to provide flexible, easy‑to‑use monitoring products for all business lines, covering network, server, system, application, and business layers to achieve 24/7 real‑time monitoring and ensure stable operation of all products.

Beyond traditional data collection, storage, alerting, and visualization, it also supports intelligent prediction of key metrics, anomaly detection, alarm merging, alarm correlation analysis, self‑healing, fault warning, automated monitor addition, and customizable monitoring.

Core Functions of the Monitoring System

Websites can encounter various access anomalies, and keeping services stable depends on an intelligent monitoring system with four core functions:

Data collection: gather metric data such as server resource usage and service status.

Alarm strategy configuration: flexible alarm policies.

Alarm delivery: multiple delivery channels, with alarms kept accurate and low in volume.

Data visualization: multi‑dimensional monitoring data view.

The system acts as a guardian for online services, helping operations, development, and testing teams quickly detect and troubleshoot faults, while quantifying and visualizing operational data for optimization.

It also incorporates intelligence to provide valuable conclusions, such as alarm correlation, root‑cause analysis, and automated optimization suggestions.

Three‑Dimensional Monitoring Architecture

Based on a typical large‑scale website architecture, a three‑dimensional monitoring system is built.

Vertical coverage:

1. Network layer – device failures, resource usage, traffic, QoS, dedicated lines.

2. Server layer – downtime, login failures, hardware faults.

3. System layer – CPU, memory, disk, network usage.

4. Application layer – port, process, API status, QPS.

5. Business layer – PV, UV, order volume, revenue.

Horizontal coverage:

1. User side – key page metrics, DNS hijacking, page errors, timeouts.

2. Data‑center edge – VIP connectivity, page monitoring, API monitoring.

3. Traffic ingress – total network traffic, APP/M/PC traffic, domain‑level and cluster‑level statistics from Nginx.

4. Business cluster – single‑machine monitoring (the vertical layers above) and cluster monitoring (page, API, Nginx logs, availability, response time).

Monitoring Business Model

Because internet companies often manage tens of thousands of servers, a cluster‑based monitoring model is used. A group of nodes providing the same service shares a common monitoring configuration (node list, template, alarm recipients).

This model allows easy updates: adding or removing nodes for scaling or fault isolation requires only changes to the node list; alarm policies can be modified independently; user subscription changes affect only the alarm group.
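The cluster model can be pictured as a small data structure. The sketch below is only an illustration: the class name `ClusterMonitor`, its fields, and the sample hosts and metric name are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterMonitor:
    """One monitoring unit: every node in the cluster shares the same settings."""
    name: str
    nodes: set[str] = field(default_factory=set)              # node list
    template: dict[str, float] = field(default_factory=dict)  # metric -> alarm threshold
    alarm_group: set[str] = field(default_factory=set)        # alarm recipients

    def add_node(self, host: str) -> None:
        # Scaling out touches only the node list.
        self.nodes.add(host)

    def remove_node(self, host: str) -> None:
        # Scaling in or isolating a faulty host also touches only the node list.
        self.nodes.discard(host)

    def effective_config(self) -> dict[str, dict[str, float]]:
        # Every node inherits a copy of the shared template.
        return {host: dict(self.template) for host in sorted(self.nodes)}


# Hypothetical cluster used only for illustration.
cluster = ClusterMonitor(
    name="web-frontend",
    nodes={"10.0.0.1", "10.0.0.2"},
    template={"cpu.busy": 90.0},
    alarm_group={"alice"},
)
cluster.add_node("10.0.0.3")         # scale out
cluster.template["cpu.busy"] = 85.0  # tighten the policy for all nodes at once
cluster.alarm_group.add("bob")       # a new subscriber; nodes and policy untouched
```

Each of the three changes at the end edits exactly one field, which is the independence the model is meant to provide.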

Improving User Experience

The PC version UI is divided into three areas: menu, service tree, and business display. The menu selects functions, the service tree defines the business scope, and the display area shows corresponding data and features.

The WeChat version provides mobile access, showing alarm details, related metric views, and allowing users to mute, comment, or track alarm handling progress.

Multi‑Dimensional Monitoring Methods

To detect anomalies across all dimensions, the system adopts six monitoring layers:

Basic monitoring – server downtime, resource usage, network quality.

Service monitoring – port and process status.

Custom monitoring – user‑defined metrics.

Functional monitoring – page and API checks.

Availability monitoring – cluster and domain availability, response time.

Intelligent business metric monitoring – predictive and anomaly detection for macro business data.

Basic, Service, and Custom Monitoring

These data types are collected by agents deployed on servers, then stored, evaluated for anomalies, visualized, and alerted.
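As a rough illustration of the "evaluated for anomalies" step, the sketch below fires only when several consecutive samples breach a threshold, which avoids alarming on a single noisy point. The class name `ThresholdRule` and the sample values are assumptions; the article does not describe the actual rule format.

```python
from collections import deque


class ThresholdRule:
    """Fires when the last `count` samples all exceed `threshold`."""

    def __init__(self, threshold: float, count: int) -> None:
        self.threshold = threshold
        self.window = deque(maxlen=count)  # keeps only the newest `count` samples

    def observe(self, value: float) -> bool:
        self.window.append(value)
        full = len(self.window) == self.window.maxlen
        return full and all(v > self.threshold for v in self.window)


# Hypothetical CPU usage samples reported by an agent.
rule = ThresholdRule(threshold=90.0, count=3)
results = [rule.observe(v) for v in [95.0, 97.0, 80.0, 92.0, 93.0, 99.0]]
```

Only the final sample triggers the rule, because it is the first point at which three consecutive values are above 90.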

Page and API Monitoring

Key pages (home, list, detail) and API endpoints directly affect user experience. The system probes VIP‑resolved domains from external networks, checks DNS resolution, connection, HTTP status, response time, content length, and keyword presence for pages, and validates return codes and field lengths for APIs.
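The page checks listed above can be expressed as a single evaluation function. This is a minimal sketch that assumes the probe itself has already run and produced a result; `ProbeResult`, `PagePolicy`, and the sample limits are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    dns_ok: bool       # did the domain resolve?
    connected: bool    # did the TCP connection succeed?
    status: int        # HTTP status code
    elapsed_ms: float  # total response time
    body: str          # response body


@dataclass
class PagePolicy:
    max_elapsed_ms: float
    min_length: int
    keyword: str


def check_page(result: ProbeResult, policy: PagePolicy) -> list[str]:
    """Return the names of the failed checks; an empty list means healthy."""
    # DNS and connection failures make the remaining checks meaningless.
    if not result.dns_ok:
        return ["dns"]
    if not result.connected:
        return ["connect"]
    failures = []
    if result.status != 200:
        failures.append("http_status")
    if result.elapsed_ms > policy.max_elapsed_ms:
        failures.append("response_time")
    if len(result.body) < policy.min_length:
        failures.append("content_length")
    if policy.keyword not in result.body:
        failures.append("keyword")
    return failures


policy = PagePolicy(max_elapsed_ms=500, min_length=20, keyword="</html>")
good = ProbeResult(True, True, 200, 120.0, "<html><body>listing</body></html>")
truncated = ProbeResult(True, True, 200, 800.0, "<html><body>")
```

The truncated page still returns HTTP 200, so only the length and keyword checks reveal that the content is incomplete.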

Because services are often deployed in clusters with retry mechanisms, the system also performs server‑level probing to catch issues that may not be visible to end users.

Cluster and Domain Availability Monitoring

All traffic passes through load‑balancers and Nginx clusters before reaching backend services. Real‑time Nginx logs are streamed to a Storm cluster, which computes status‑code distribution, response times, and other metrics for both cluster and domain dimensions, enabling fault pre‑warning.
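The per‑domain aggregation can be sketched in a few lines. The example below is a batch version over already‑parsed log records, not the Storm topology itself, and it assumes availability is defined as the share of non‑5xx responses, which the article does not specify.

```python
from collections import Counter, defaultdict
from statistics import mean


def aggregate(records):
    """records: iterable of (domain, status_code, response_seconds) tuples."""
    codes = defaultdict(Counter)
    times = defaultdict(list)
    for domain, status, seconds in records:
        codes[domain][f"{status // 100}xx"] += 1  # bucket into 2xx, 4xx, 5xx, ...
        times[domain].append(seconds)
    summary = {}
    for domain, classes in codes.items():
        total = sum(classes.values())
        summary[domain] = {
            "requests": total,
            # Assumed definition: fraction of responses that are not 5xx.
            "availability": 1 - classes["5xx"] / total,
            "avg_response_s": mean(times[domain]),
            "status_classes": dict(classes),
        }
    return summary


# Hypothetical records for one time window.
records = [
    ("a.example", 200, 0.10),
    ("a.example", 200, 0.30),
    ("a.example", 502, 1.00),
    ("a.example", 404, 0.20),
]
summary = aggregate(records)
```

Running the same aggregation keyed by cluster instead of domain gives the cluster dimension.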

Intelligent Business Metric Monitoring

Macro business indicators (e.g., traffic, orders, revenue) are processed with machine‑learning techniques for forecasting and anomaly detection, as described later in the intelligent monitoring section.

Summary of Multi‑Dimensional Monitoring

Overall Architecture

The foundation is built on open‑falcon, forming the basic monitoring system.

On top of it, 58 made extensive optimizations and upgrades (highlighted in yellow in the original architecture diagram) and added numerous intelligent‑monitoring modules (highlighted in red).

Intelligent Monitoring Practices

Overall Intelligent Monitoring Plan

The goal is for intelligence to cover the entire monitoring process.

Key capabilities include fault pre‑warning, graded alarms, alarm merging, root‑cause analysis, and automated self‑healing.

Intelligent Prediction and Anomaly Detection for Key Metrics

Macro business metrics such as data‑center traffic, visit volume, order count, and revenue are forecast and checked for anomalies using machine‑learning models.

Requirements:

The metrics show strong periodic patterns with short‑term fluctuations, so static thresholds are unsuitable.

The approach must apply to traffic, cluster and domain visits, and macro business data.

It must provide daily forecasts and real‑time anomaly detection.

Solution: regression for forecasting and classification for anomaly detection.
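To see why a periodic metric defeats a static threshold, consider the sketch below. It is deliberately simpler than the regression and classification models the article refers to: it forecasts each time slot as the average of the same slot on previous days and flags points that deviate too far from that forecast. The function names, the 30% tolerance, and the sample data are all assumptions.

```python
from statistics import mean


def seasonal_forecast(history, slots_per_day):
    """Forecast the next day by averaging each time slot over the past days."""
    if len(history) % slots_per_day != 0:
        raise ValueError("history must contain whole days")
    days = [history[i:i + slots_per_day]
            for i in range(0, len(history), slots_per_day)]
    return [mean(day[slot] for day in days) for slot in range(slots_per_day)]


def is_anomalous(actual, predicted, tolerance=0.3):
    """Flag a point whose relative deviation from the forecast exceeds tolerance."""
    return abs(actual - predicted) > tolerance * predicted


# Hypothetical traffic with four slots per day: low, rising, peak, low.
history = [100, 200, 300, 100,   # day 1
           110, 210, 310, 110,   # day 2
           120, 220, 320, 120]   # day 3
forecast = seasonal_forecast(history, slots_per_day=4)

at_night = is_anomalous(300, forecast[0])  # 300 when about 110 is expected
at_peak = is_anomalous(300, forecast[2])   # 300 when about 310 is expected
```

The same value of 300 is anomalous in the low slot and normal at the peak, which no single fixed threshold can express.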

Applying Machine Learning Methods

The workflow consists of four steps: problem definition, data processing, model training, and model deployment, integrated with business monitoring.

Traffic Prediction and Anomaly Detection Framework

The framework separates offline and online components.

Prediction results closely match actual data, demonstrating good accuracy.

Anomaly detection classifies each data point as normal, a severe anomaly, or an abrupt spike; the original article illustrates each class with charts.

Severe anomalies, such as bandwidth attacks, trigger high‑priority alerts via SMS or WeChat, while minor spikes generate low‑priority notifications.
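Grading and routing might look like the following sketch. The deviation thresholds and the low‑priority `digest` channel are hypothetical; the article names only SMS and WeChat for high‑priority alerts.

```python
def grade(actual, predicted, minor=0.3, severe=1.0):
    """Grade a point by its relative deviation from the forecast."""
    deviation = abs(actual - predicted) / predicted
    if deviation >= severe:
        return "severe"
    if deviation >= minor:
        return "minor"
    return "normal"


# Assumed routing table: severe alarms interrupt people, minor ones do not.
ROUTES = {
    "severe": ["sms", "wechat"],
    "minor": ["digest"],
    "normal": [],
}


def route(actual, predicted):
    return ROUTES[grade(actual, predicted)]


attack = route(900, 300)  # traffic at three times the forecast
bump = route(420, 300)    # 40% above the forecast
quiet = route(310, 300)   # within normal variation
```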

The model works across different data scales, patterns, and business domains.

Intelligent Alarm Merging

To avoid alarm overload, the system merges alarms in two ways. Alarms raised within a 1‑minute window are merged by user, status, and channel. Alarms are also aggregated by cluster, IP, subnet, anomaly type, and host‑to‑virtual‑machine relationship, using a Gini‑based algorithm.

Algorithm steps: enumerate the candidate dimensions, select the dimension to merge on, partition the alarm set by that dimension, and repeat until a stop condition is met.
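One way to read these steps is as a decision‑tree‑style split. The sketch below is an interpretation, not the production algorithm: it assumes the dimension with the lowest Gini impurity is selected (the one on which alarms are most concentrated) and that splitting stops when no remaining dimension is concentrated enough. The function names, the 0.45 cutoff, and the sample alarms are hypothetical, and the 1‑minute windowing is left out.

```python
from collections import Counter, defaultdict


def gini(values):
    """Gini impurity: 0 when all values are equal, near 1 when all differ."""
    total = len(values)
    return 1.0 - sum((n / total) ** 2 for n in Counter(values).values())


def merge(alarms, dimensions, max_impurity=0.45, label=None):
    """Return a list of (shared attributes, alarms) groups."""
    label = label or {}
    if len(alarms) <= 1 or not dimensions:
        return [(label, alarms)]
    # Enumerate candidate dimensions and select the most concentrated one.
    best = min(dimensions, key=lambda d: gini([a[d] for a in alarms]))
    # Assumed stop condition: splitting further would only fragment the group.
    if gini([a[best] for a in alarms]) > max_impurity:
        return [(label, alarms)]
    # Partition the alarm set by the selected dimension.
    parts = defaultdict(list)
    for alarm in alarms:
        parts[alarm[best]].append(alarm)
    rest = [d for d in dimensions if d != best]
    merged = []
    for value, part in parts.items():
        merged.extend(merge(part, rest, max_impurity, {**label, best: value}))
    return merged


alarms = [
    {"cluster": "web", "anomaly": "cpu", "ip": "10.0.0.1"},
    {"cluster": "web", "anomaly": "cpu", "ip": "10.0.0.2"},
    {"cluster": "web", "anomaly": "cpu", "ip": "10.0.0.3"},
    {"cluster": "web", "anomaly": "cpu", "ip": "10.0.0.4"},
    {"cluster": "db", "anomaly": "disk", "ip": "10.0.1.1"},
    {"cluster": "db", "anomaly": "disk", "ip": "10.0.1.2"},
]
groups = merge(alarms, ["cluster", "anomaly", "ip"])
```

In this example six alarms collapse into two merged messages, one per cluster and anomaly type, while the IP dimension is never split on because every alarm has a different IP.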

Effect of Intelligent Alarm Merging

Post‑deployment, alarm volume dropped by 76.65% while maintaining high merging quality, providing concise aggregated information for rapid decision‑making.

Intelligent Alarm Correlation Analysis

Complex service dependencies make fault isolation difficult; correlation analysis leverages temporal proximity, service call relationships, and change events.

Example: simultaneous traffic spikes on 58 and APP sides, with corresponding cluster traffic changes.

Pearson correlation computes similarity across many metrics, quickly revealing related alarms.
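The Pearson coefficient is simple to compute directly. The sketch below scores candidate metric series against the series that raised the alarm and keeps the strongly correlated ones; the metric names, the sample values, and the 0.9 cutoff are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(vx * vy)


def related_metrics(target, candidates, threshold=0.9):
    """Names of candidate series strongly correlated with target, strongest first."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    strong = [name for name, score in scores.items() if abs(score) >= threshold]
    return sorted(strong, key=lambda name: -abs(scores[name]))


# Hypothetical series sampled at the same five timestamps.
site_traffic = [10, 12, 30, 55, 20]
candidates = {
    "cluster_a.traffic": [100, 120, 300, 550, 200],  # moves with site traffic
    "cluster_b.cpu": [5, 5, 5, 5, 5],                # flat
    "cluster_c.errors": [3, 1, 4, 1, 5],             # unrelated
}
related = related_metrics(site_traffic, candidates)
```

Using the absolute value of the coefficient means a metric that drops whenever the target rises is also reported as related.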

Correlation results are displayed in the WeChat alarm interface, with root‑cause analysis and visual graphs.

Differences Between Traditional and Intelligent Monitoring

The 58 system evolved through four stages: automation, three‑dimensional coverage, productization, and intelligence.

Automation: automatic synchronization of cluster names, owners, and node lists from CMDB, and automatic association of monitoring templates.

Three‑dimensional: comprehensive coverage across horizontal and vertical dimensions.

Productization: enhanced user experience to lower the usage barrier for internal users.

Intelligence: incorporation of AI techniques to handle growing scale and complexity, moving toward fully intelligent operations.

Author: Gong Cheng, 58 TEG, focusing on intelligent operations and automation. Source: 58 Architects public account.