How China Mobile Built a Scalable AIOps Platform to Cut Incident Resolution Time

This article recounts the China Mobile IT Center's four-year journey of designing, deploying, and refining a centralized AIOps platform that automates anomaly detection, fault diagnosis, and remediation. The platform cut complaint-ticket handling time from ten hours to six while scaling to more than one billion AI model calls per month.

Efficient Ops

China Mobile Centralized AIOps Overview

China Mobile's IT Center built a centralized AIOps platform to support digital transformation, aiming to help operators detect faults quickly and free them from repetitive tasks.

Guided by the AIOps white paper, the team defined five quality dimensions (quality assurance, efficiency improvement, cost management, and others) and eight capability dimensions covering 33 scenarios, then narrowed its focus to nine production-relevant scenarios for CRM/BOSS systems.

All scenarios were built into the IT Intelligent Decision Platform, which serves as a second workbench for operators, providing real-time status awareness and automated workflows for fault discovery, diagnosis, and healing.

After years of practice, the platform processes over one billion model calls per month, hosts more than 50,000 AI models, and analyzes over 40 TB of logs daily, delivering measurable improvements in fault detection, localization, recovery, and service quality.

Four‑Year Development Stages

Stage 1 (late 2019): An enthusiastic kickoff while AIOps was a hot topic and many vendors wanted to collaborate.

Stage 2 (2020-2021): The team recognized the gap between expectations and results and worked through numerous technical and managerial challenges one by one.

Stage 3 (2022-present): The team arrived at a clear definition of AIOps and a three-to-five-year roadmap.

Challenges from Anomaly Detection

The team started with anomaly detection as a generic, fast‑to‑market use case in early 2020, mapping it to four of the nine standard ML pipeline stages: problem definition, data collection, basic model training, and continuous learning.

Scaling models from one province to others proved difficult; models could not be directly reused across regions, requiring extensive expert‑in‑the‑loop refinement.
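
To make the detection step concrete, here is a minimal sketch of one common baseline technique, a rolling z-score detector. It is an illustration only and not the model the team used; the function name, the `window` and `threshold` parameters, and the sample latencies are all assumptions.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical per-minute latency samples (milliseconds).
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100]
print(rolling_zscore_anomalies(latency_ms))  # -> [6]
```

Even in this toy form, `window` and `threshold` have to be tuned to each metric's normal behavior, which hints at why a model fitted to one province's traffic may not transfer directly to another.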

Operational Optimization Practices

Standardized processes are a prerequisite for automation and subsequent intelligence. The workflow includes ten steps, with data‑ingestion assessment, precision‑recall filtering, and negative‑sample feedback requiring expert input, while the remaining steps are fully AI‑driven.
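
The precision-recall filtering step could look roughly like the sketch below, which scores a model's alerts against expert-reviewed labels and only passes models that clear a quality gate. The function names and the gate values (`min_precision`, `min_recall`) are hypothetical; the article does not state the thresholds the team applies.

```python
def precision_recall(reviewed):
    """reviewed: list of (model_flagged, expert_confirmed) boolean pairs."""
    tp = sum(1 for flagged, real in reviewed if flagged and real)
    fp = sum(1 for flagged, real in reviewed if flagged and not real)
    fn = sum(1 for flagged, real in reviewed if not flagged and real)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def passes_gate(reviewed, min_precision=0.8, min_recall=0.7):
    """Return True if the model meets both (assumed) quality thresholds."""
    precision, recall = precision_recall(reviewed)
    return precision >= min_precision and recall >= min_recall

# Hypothetical review batch: 8 true alerts, 2 false alarms, 2 missed faults,
# and 10 correctly ignored points.
reviewed = ([(True, True)] * 8 + [(True, False)] * 2
            + [(False, True)] * 2 + [(False, False)] * 10)
print(precision_recall(reviewed))  # -> (0.8, 0.8)
print(passes_gate(reviewed))       # -> True
```

The false alarms and missed faults counted here are the same cases an expert would feed back as negative samples in the later feedback step.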

Data quality is critical; poor data cannot be compensated by AI alone. The team enforces strict data governance for both periodic and non‑periodic streams.
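
As an illustration of what a governance check on a periodic stream might involve, the sketch below finds gaps in a sorted list of sample timestamps and reports a completeness ratio. The function names, the 60-second interval, and the sample data are assumptions for the example, not details from the platform.

```python
def find_gaps(timestamps, expected_interval_s, tolerance_s=0):
    """Return (previous, current) pairs where consecutive samples are
    further apart than the expected interval plus tolerance.
    Assumes `timestamps` is sorted in ascending order."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > expected_interval_s + tolerance_s:
            gaps.append((prev, curr))
    return gaps

def completeness(timestamps, expected_interval_s):
    """Fraction of expected samples actually present between the first
    and last timestamp (1.0 means no missing points)."""
    if not timestamps:
        return 0.0
    span = timestamps[-1] - timestamps[0]
    expected = span // expected_interval_s + 1
    return min(1.0, len(timestamps) / expected)

# Hypothetical metric sampled every 60 seconds, with two points missing.
samples = [0, 60, 120, 300, 360]
print(find_gaps(samples, 60))     # -> [(120, 300)]
print(completeness(samples, 60))  # 5 of 7 expected points
```

A stream that fails such a check would be fixed at the source rather than handed to a model, in line with the point that AI cannot compensate for poor data.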

Rather than chasing high detection accuracy (which often yields diminishing returns), the focus is on system stability and rapid recovery. The platform combines anomaly detection with “boundary diagnosis” to pinpoint affected modules, though it cannot always explain the root cause.
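
The article does not describe how boundary diagnosis is implemented. One possible reading, sketched below under that assumption, is to walk a service dependency map and keep only the anomalous modules that have no anomalous downstream dependency. The module names and the `dependencies` map are invented for the example.

```python
def boundary_modules(dependencies, anomalous):
    """Return anomalous modules none of whose direct dependencies are
    anomalous, i.e. the deepest points where the fault is still visible."""
    return sorted(
        module for module in anomalous
        if not any(dep in anomalous for dep in dependencies.get(module, []))
    )

# Hypothetical call chain: each module lists the modules it depends on.
dependencies = {
    "web_portal": ["order_service"],
    "order_service": ["billing_service", "customer_db"],
    "billing_service": ["external_gateway"],
}
anomalous = {"web_portal", "order_service", "billing_service"}
print(boundary_modules(dependencies, anomalous))  # -> ['billing_service']
```

This narrows the search to `billing_service` without saying why it is failing, which matches the limitation noted above: the boundary is located, but the root cause may still need an expert.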

When an anomaly is detected, the system automatically executes predefined remediation strategies such as disk cleanup, service restart, or failover, ensuring continuity even if false alarms occur.
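
A predefined strategy table of this kind can be sketched as a simple dispatch map from anomaly type to action. Everything named here (`PLAYBOOK`, the anomaly type strings, the action functions) is hypothetical, and the actions only return a description instead of touching a real host.

```python
def clean_disk(host):
    # Placeholder: a real action would remove temporary files on the host.
    return f"cleaned temporary files on {host}"

def restart_service(host):
    # Placeholder: a real action would restart the affected service.
    return f"restarted service on {host}"

def fail_over(host):
    # Placeholder: a real action would shift traffic to a standby node.
    return f"failed over traffic away from {host}"

# Assumed mapping from detected anomaly type to remediation strategy.
PLAYBOOK = {
    "disk_full": clean_disk,
    "service_hung": restart_service,
    "node_down": fail_over,
}

def remediate(anomaly_type, host):
    """Run the predefined strategy, or escalate if none is defined."""
    action = PLAYBOOK.get(anomaly_type)
    if action is None:
        return f"no automated strategy for {anomaly_type}; escalating {host} to an operator"
    return action(host)

print(remediate("disk_full", "host-01"))
print(remediate("memory_leak", "host-02"))
```

Restricting the table to actions that are safe to repeat is one way to keep a false alarm from doing harm, since the worst case is an unnecessary cleanup or restart.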

Human‑in‑the‑Loop and Feedback Loops

After deployment, operators (first‑line, second‑line, and support) use the platform, report issues, and provide feedback, enabling continuous model improvement.

Standardized templates and incident‑handling procedures integrate AIOps capabilities into daily operations, and real‑world fault cases validate each capability.

Case Study: July 2022 Incident

An overload of timeout errors triggered an alarm storm; AIOps quickly correlated alarms, identified a failing service node, and traced the root cause to an external platform outage, reducing detection time to 2 minutes and diagnosis time to under 3 minutes.
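
The correlation step in a case like this can be illustrated with a minimal sketch that groups alarms into bursts by time proximity and then ranks nodes by alarm count within a burst. This is a simplified stand-in for whatever the platform actually does; the alarm fields (`ts`, `node`), the 60-second window, and the sample alarms are assumptions.

```python
from collections import Counter

def correlate(alarms, window_s=60):
    """Group alarms into bursts: an alarm joins the current burst if it
    arrives within `window_s` seconds of the previous alarm."""
    bursts = []
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        if bursts and alarm["ts"] - bursts[-1][-1]["ts"] <= window_s:
            bursts[-1].append(alarm)
        else:
            bursts.append([alarm])
    return bursts

def top_node(burst):
    """Return the node that raised the most alarms in a burst."""
    return Counter(alarm["node"] for alarm in burst).most_common(1)[0][0]

# Hypothetical alarm storm followed by an unrelated later alarm.
alarms = [
    {"ts": 0, "node": "node-a"},
    {"ts": 10, "node": "node-b"},
    {"ts": 20, "node": "node-b"},
    {"ts": 30, "node": "node-b"},
    {"ts": 500, "node": "node-c"},
]
bursts = correlate(alarms)
print(len(bursts))          # -> 2
print(top_node(bursts[0]))  # -> node-b
```

Collapsing a storm into a few bursts with a leading suspect node is what lets an operator, or a downstream diagnosis step, start from one candidate instead of hundreds of individual alarms.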

Compared with expert investigation (≈30 minutes), the AI-assisted workflow brought detection and diagnosis to under 5 minutes combined (2 for detection plus under 3 for diagnosis), roughly a sixfold reduction that demonstrates the value of multi-scenario coordination.

Insights and Reflections

The team learned that AI models excel when combined with expert rules, that not all scenarios are mature enough for full automation, and that focusing on a handful of well‑engineered use cases yields the greatest operational impact.

Key takeaways include the need for abundant data and domain knowledge, clear metric definitions, predictable patterns, and a focus on stable, repeatable processes rather than chasing perfect accuracy.

Ultimately, a cohesive multi‑scenario AIOps platform can automate fault discovery, diagnosis, and healing, freeing operators from repetitive tasks and enhancing overall system stability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
