How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale
This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.
Baidu Noah Monitoring and Alarm System
Monitoring alarms are a crucial part of fault discovery and represent one of Baidu's earliest entry points into AIOps. Baidu AIOps has achieved notable results in two scenarios: intelligent anomaly detection and intelligent alarm merging.
System Overview
The Noah system provides comprehensive, three‑dimensional monitoring and alarm capabilities for Baidu’s internal platforms, covering all product lines and serving both public‑cloud and private‑cloud customers. It processes tens of millions of data points per second, hosts millions of monitoring configurations, and generates tens of millions of alarm events daily while maintaining sub‑second alarm latency.
Standard Fault‑Handling Process (7 Steps)
Fault Occurrence : e.g., a core switch fails, causing network outage and traffic loss.
Fault Detection : monitoring system detects abnormal traffic.
Fault Notification : alerts are sent via SMS, phone, etc.
Fault Mitigation : operators execute loss‑prevention actions, such as traffic cut‑over.
Fault Localization : operators and developers pinpoint the root cause.
Fault Recovery : repair actions restore all services.
Fault Summary : post‑mortem analysis and improvement planning.
The monitoring system is responsible for steps 2–5 (detection through localization), while the alarm subsystem covers steps 2 and 3: detection and notification.
Business Model
Internally, Noah serves Baidu’s own services; externally, it powers the NoahEE product for cloud customers. The platform also underpins Baidu’s AIOps offerings, including intelligent anomaly detection, fault localization, and alarm merging, which have been deployed in finance, transportation, and internet sectors.
Core Workflow Example
Assume a product line’s traffic metric (PV) should trigger an alarm when it falls below 100. The system periodically evaluates the latest data point; a transition from normal to abnormal creates an alarm event, which generates one or more alarm messages rendered as human‑readable text and delivered via downstream channels (SMS, phone, etc.).
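The evaluation loop described above can be sketched as follows. This is a minimal illustration, not Baidu's actual implementation; the function and variable names are invented for the example. The key detail is that an alarm event is created only on the transition from normal to abnormal, not on every abnormal point.

```python
def evaluate(points, threshold=100):
    """Classify each PV data point against a static threshold and emit an
    alarm event only on the normal -> abnormal transition, mirroring the
    workflow described above. `points` is a list of (timestamp, value)."""
    events = []
    prev_abnormal = False
    for ts, value in points:
        abnormal = value < threshold
        if abnormal and not prev_abnormal:
            events.append((ts, value))  # alarm event created on transition
        prev_abnormal = abnormal
    return events

# Only the first point of each sustained abnormal run creates an event:
points = [(1, 150), (2, 90), (3, 80), (4, 120), (5, 95)]
print(evaluate(points))  # [(2, 90), (5, 95)]
```

Each event would then be rendered into one or more human‑readable alarm messages and handed to the downstream delivery channels.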
Subsystem Decomposition
Anomaly Detection System : periodically evaluates data against static rules or AIOps algorithms, emitting normal/abnormal results.
Event Management System : handles alarm events, providing debounce filtering, claim, escalation, and silencing features.
Notification Sending System : merges, renders, and dispatches alarm messages, with quota and flow‑control to protect downstream gateways.
Splitting the system yields clear functional boundaries, extensible architecture, and flexible commercial delivery—each component can be independently upgraded or offered as a standalone service.
Challenges Encountered
1. Lengthy AIOps Algorithm Iteration
PV traffic exhibits diurnal and weekly patterns, making static thresholds ineffective. Baidu developed an unsupervised robust‑regression algorithm that detects sudden spikes without preset thresholds. However, different metrics require different models, models must be retrained as data evolves, and CPU requirements vary widely—from lightweight tasks fitting on a single core to deep‑learning RNNs demanding dedicated resources.
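To make the threshold-free idea concrete, here is one simple robust, unsupervised scheme: compare each point against the median and MAD (median absolute deviation) of the same phase in previous periods, so diurnal patterns are modeled implicitly. This is a generic sketch under that assumption, not Baidu's actual robust‑regression algorithm.

```python
import statistics

def mad_anomalies(series, period, k=5.0):
    """Flag points that deviate sharply from the same time-of-day in
    previous periods, using robust median/MAD statistics instead of a
    preset static threshold. `period` is the seasonal length in samples."""
    anomalies = []
    for i, value in enumerate(series):
        # History of the same phase (e.g. same hour) in earlier periods.
        history = series[i % period:i:period]
        if len(history) < 3:
            continue  # not enough seasonal history yet
        med = statistics.median(history)
        mad = statistics.median(abs(x - med) for x in history) or 1e-9
        if abs(value - med) / mad > k:
            anomalies.append(i)
    return anomalies
```

Because the median and MAD are resistant to outliers, a single past spike does not distort the baseline the way a mean/standard-deviation model would.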
The deployment workflow involves strategy engineers writing Python/Matlab scripts, developers porting them to C++/Java for production, testing engineers performing regression tests, and operations engineers releasing the modules. This exposes three pain points: high skill requirements across roles, long iteration cycles, and model updates tightly coupled to code releases.
2. Complex Alarm Management Requirements
Real‑world incidents demand features such as debounce filtering for short‑lived spikes, repeated alerting when acknowledgments are missing, claim functionality to stop duplicate alerts, escalation to senior engineers for prolonged issues, and callback mechanisms for automated remediation. Additional needs include silence periods and flow control to prevent gateway overload.
3. Alarm Storms Overwhelm Core Alerts
Layered monitoring creates massive alert volumes during a single fault. Examples include:
Machine failure triggering hardware health alerts, instance health alerts, and upstream application alerts—resulting in dozens of messages.
Application module failure causing alerts from all instances and upstream modules—potentially hundreds of messages.
Data‑center outage producing network, machine, DNS, application, and business‑level alerts—reaching tens of thousands of messages.
Such storms flood on‑call engineers, making it difficult to identify root causes quickly.
Conclusion
The article introduced the functionality and business model of Baidu’s Noah monitoring and alarm system, then analyzed the practical challenges faced when applying AIOps at scale. A follow‑up article will discuss solutions to these challenges.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.