
Scaling Monitoring to Millions of Metrics with Open‑Source and AIOps

This talk shares China Mobile Online Service's journey of building a nationwide, software‑defined monitoring platform, detailing the shift from legacy PBX systems to open‑source tools, the challenges of scaling to millions of metrics, and how AIOps techniques are used to automate alerting, compress alert volume, and surface anomalies in massive operational data.


Speaker Introduction

Wang Manxue, Technical Manager, China Mobile Online Service Co., Ltd.

1. Background – Nationwide Centralized Maintenance, the Largest Globally

China Mobile operates the 10086 customer service system, serving hundreds of millions of users across phone, WeChat, Weibo, and the 10086 app. China Mobile Online Service was founded in October 2014 and runs a centralized, professional operation platform used by 31 subsidiaries.

Traditional call‑center hardware (PBX, dedicated voice boards, etc.) suffers from inflexibility, high cost, long deployment cycles, poor multi‑channel support, and slow response to business needs.

The new software‑defined call platform features:

Pure software: full‑media CTI, IVR, internet gateway, softswitch, media acceleration, user terminals.

Rich media: voice, text, image, video, short voice, WeChat, Weibo.

Intelligence: AI and big‑data driven IVR, robot answering, quality inspection, outbound calls.

Centralization: unified connection, CRM, analytics, quality inspection, traffic monitoring.

Operational challenges include massive user scale (≈900 million users, 20 000+ servers, 50 000+ Tomcat instances), rapid business changes, and telecom‑grade reliability requirements (99.999% uptime, 15‑second answer time, 24/7 support).

Monitoring became the first bottleneck; the new system required agility, centralization, automation, and intelligence.

2. Path Forward – Choosing Open‑Source

To rebuild the monitoring system, the team adopted open‑source tools with custom development, achieving cross‑domain, cross‑vendor, and cross‑layer monitoring, plus flexible data visualization.

The unified platform replaced 31 independent monitoring stacks, providing one‑click agent deployment for subsidiaries and delivering over 80 ready‑made templates for hosts, OS, network, and applications.

Within six months the platform covered 99% of baseline monitoring for the headquarters and 31 subsidiaries: 20 000 hosts, 2 million metrics, and 300 000 daily alerts.

The system now supports major operating systems, middleware, and network devices, and allows minute‑level configuration for diverse business scenarios.

Key Zabbix optimizations include database SSD usage, kernel and TCP tuning, disabling unnecessary discovery, API‑driven agent management, and process‑level tuning (pollers, Java pollers, pingers, trappers, etc.).
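The process-level tuning mentioned above maps to parameters in `zabbix_server.conf`. A hedged sketch follows; these are real Zabbix server parameters, but the values are illustrative assumptions, not the speaker's actual settings:

```ini
# Illustrative zabbix_server.conf tuning for a large fleet.
# Values are assumptions; size them against self-monitoring data.
StartPollers=200          # raise while poller busy% stays high
StartJavaPollers=20       # JMX collection for the Tomcat fleet
StartPingers=50           # ICMP availability checks
StartTrappers=100         # active agents report through trappers
CacheSize=4G              # configuration cache; scale with host/item count
HistoryCacheSize=1G
ValueCacheSize=2G
```

Raising process counts blindly wastes memory and database connections; the talk's approach of pairing each change with self-monitoring alerts is what makes this tuning loop safe.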

Two real‑world cases illustrate these optimizations:

When metrics exceeded 1.5 million, the preprocessing manager saturated; the solution involved adding a dedicated proxy, redistributing agents, reducing poller counts, and adding self‑monitoring alerts.

At 2 million metrics, message queues stalled due to a short `max_execution_time` setting; extending this timeout and adding alerting for long‑running queries resolved the issue.
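The self-monitoring alerts mentioned in both cases can be built from Zabbix's own internal items. A sketch follows; the item keys are real Zabbix internal checks, but the hostnames, thresholds, and pre-5.4 trigger syntax are illustrative assumptions:

```
# Internal items for monitoring the monitor itself:
#   zabbix[preprocessing_queue]          - values waiting for preprocessing
#   zabbix[process,poller,avg,busy]      - average poller utilisation, %

# Example triggers (thresholds are assumptions, tune per deployment):
{Zabbix server:zabbix[preprocessing_queue].min(10m)}>10000
{Zabbix server:zabbix[process,poller,avg,busy].min(10m)}>75
```

Alerting on the preprocessing queue would have caught the 1.5‑million‑metric saturation before it became an outage, which is exactly the lesson the two cases draw.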

3. Transformation – Remaining Issues

Even after optimization, the platform still missed many business‑impacting incidents because monitoring focused on infrastructure rather than user experience.

The team shifted to an end‑to‑end, business‑centric monitoring model, emphasizing user‑experience metrics, full‑stack visibility, and SRE‑style golden signals.
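The golden signals the model centers on (latency, traffic, errors, saturation) can be derived directly from request records. A minimal stdlib-only sketch, assuming a hypothetical `(status, latency_ms)` record format rather than the platform's actual schema:

```python
from statistics import quantiles

def golden_signals(requests, window_seconds=60):
    """Derive traffic, error rate, and tail latency from one window of
    (status, latency_ms) request records. Record shape is an assumption."""
    latencies = [lat for _, lat in requests]
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        # p99 latency; needs at least two samples to interpolate
        "latency_p99_ms": quantiles(latencies, n=100)[98] if len(latencies) > 1 else None,
    }

reqs = [(200, 35), (200, 40), (500, 900), (200, 38)]
sig = golden_signals(reqs)
```

Computing these per business service, rather than per host, is what moves alerting from "a disk is full" to "users cannot place calls".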

Log volume reached 22 billion entries per day; a unified log platform using real‑time big‑data analytics now extracts anomalies and feeds them into monitoring.
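At 22 billion entries per day, individual log lines cannot be inspected; the "extract anomalies" step reduces logs to per-interval counts and flags statistical outliers. A minimal sketch of that idea, where the input shape (a list of per-minute error counts) and the thresholds are illustrative assumptions:

```python
from statistics import mean, pstdev

def anomalous_minutes(counts, baseline_window=30, z_threshold=3.0):
    """Flag minutes whose error count deviates sharply (z-score above
    threshold) from the trailing baseline window."""
    anomalies = []
    for i in range(baseline_window, len(counts)):
        window = counts[i - baseline_window:i]
        mu, sigma = mean(window), pstdev(window)
        if sigma > 0 and (counts[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

counts = [10, 12, 9, 11, 10] * 6 + [95]   # 30 normal minutes, then a spike
spikes = anomalous_minutes(counts)
```

In production this aggregation would run in the streaming layer; only the flagged intervals are forwarded into the monitoring platform as events.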

For inter‑company interfaces (≈3 000 APIs per subsidiary), Zabbix would require ~24.8 million metrics; the team therefore adopted Prometheus for its flexible labeling and efficient metric collection.
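Prometheus avoids the combinatorial blow-up because dimensions such as subsidiary and API become labels on one metric family instead of millions of discrete items. A hedged scrape-config sketch; the job name, targets, and label names are illustrative assumptions:

```yaml
# prometheus.yml fragment (names are assumptions, not the actual deployment)
scrape_configs:
  - job_name: "interface-probes"
    scrape_interval: 30s
    static_configs:
      - targets: ["probe-bj:9100"]
        labels:
          subsidiary: "beijing"
      - targets: ["probe-gd:9100"]
        labels:
          subsidiary: "guangdong"
```

A single query can then slice across any dimension, e.g. a rate over a hypothetical `api_request_errors_total{subsidiary="beijing"}` series, rather than configuring one item per subsidiary-API pair.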

Data from multiple open‑source monitors are consolidated into an operations data mart, providing real‑time and batch processing layers for downstream applications.

Automation and self‑service reduced the staff needed to handle ~100 daily requests per subsidiary from 15 vendor engineers to 4 in‑house engineers.

Standardization of monitoring templates proved essential; early ad‑hoc deployments caused rework, highlighting the need for proper requirement analysis and governance.

Current scale: 6.14 million monitoring items, ~500 billion log entries per day, and 1.98 million alerts daily.

4. Evolution – AIOps Experiments in Alerting

With 1.98 million daily alerts, the team faced alert fatigue. Analysis showed that 80% of false alerts stemmed from unreasonable thresholds, and 70% of missed alerts were due to missing thresholds.

Threshold setting evolved from static expert‑defined values, to data‑driven baselines derived from three months of history, to AI‑driven dynamic thresholds.

An LSTM model trained on historical metric data predicts future trends; significant deviations trigger alerts, offering higher precision than static thresholds.
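The talk's model is an LSTM trained in TensorFlow; as a dependency-free stand-in, a rolling mean with a k-sigma band can play the "prediction" role and make the predict-and-compare loop concrete. All parameter values below are illustrative assumptions:

```python
from statistics import mean, pstdev

def dynamic_threshold_alerts(series, window=20, k=3.0):
    """Alert when the observed value leaves the band around a prediction.
    Here the 'prediction' is a rolling mean; the talk's system would use
    an LSTM forecast in its place."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        predicted = mean(history)            # LSTM forecast stands here
        band = k * pstdev(history)
        if band > 0 and abs(series[i] - predicted) > band:
            alerts.append((i, series[i], predicted))
    return alerts

series = [100, 102, 98, 101, 99] * 4 + [180]   # steady load, then a jump
alerts = dynamic_threshold_alerts(series)
```

The key property, shared by the real LSTM version, is that the threshold adapts to each metric's own history instead of being a fixed expert-chosen number.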

Human operators can label alerts to continuously improve the model, creating a feedback loop that refines accuracy.

The AIOps approach leverages abundant operational data, mature AI libraries such as TensorFlow, and modest compute resources (standard VMs) to deliver lightweight, intelligent monitoring.

Current AI‑enabled initiatives include log anomaly detection, alert correlation, rule mining, root‑cause analysis, and capacity management, inviting further collaboration.

Note: This content is based on Wang Manxue’s presentation at GOPS 2019 Shenzhen.

Tags: Monitoring, Big Data, Operations, Prometheus, Open Source, AIOps, Zabbix
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
