
How Xiami’s SRE Team Revamped Monitoring to Cut Alert Noise by 90%

Xiami’s SRE team overhauled its monitoring system by categorizing alerts into fault, generic, and basic monitoring, optimizing alert paths with stream processing, and leveraging Alibaba’s traffic scheduling platform, dramatically reducing daily noise from thousands of alerts to a manageable number of critical notifications.


Background

Monitoring is a key means for backend services to understand application status. Alibaba's Xiami service now runs over 100 Java applications, with nearly 50 core‑business apps, each having varied monitoring configurations. Over time, many monitoring items became outdated, leading to excessive, often irrelevant alerts that overwhelmed teams.

Alarm Cause Analysis

Previous configurations focused on overall RT, QPS, and some business logs, making it hard to pinpoint the exact problem when an alarm fired. Typical causes for sudden alerts after stable operation include:

Program bugs such as null pointers or frequent Full GC.

Upstream dependency failures causing timeouts or call errors.

Single‑machine failures like sudden load or CPU spikes.

Middleware faults (Cache, DB, etc.) that increase RT or timeouts.
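The four cause categories above can be sketched as a simple triage function. This is an illustrative assumption, not Xiami's actual code; the symptom names are hypothetical labels for the signals described in the list.

```python
# Hypothetical triage sketch: map observed alert symptoms to the four
# probable-cause categories from the analysis above.
def classify_alarm(symptoms: set) -> str:
    """Return the most likely cause category for a firing alarm."""
    if {"null_pointer", "frequent_full_gc"} & symptoms:
        return "program bug"
    if {"upstream_timeout", "upstream_call_error"} & symptoms:
        return "upstream dependency failure"
    if {"host_load_spike", "host_cpu_spike"} & symptoms:
        return "single-machine failure"
    if {"cache_rt_increase", "db_timeout"} & symptoms:
        return "middleware fault"
    return "unknown"
```

A classifier like this is only as good as its symptom taxonomy, which is why the team redesigned the monitoring categories themselves, as described next.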

Monitoring Optimization

After analyzing alarm causes, the team redesigned monitoring into three categories: fault monitoring, generic monitoring, and basic monitoring.

Fault Monitoring

Fault monitoring tracks external factors that affect an application’s interface RT and success rate. Core interface metrics—success rate, RT (using the 75th percentile), and error codes—are monitored. When these thresholds are breached, it indicates a user‑visible performance issue.
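A minimal sketch of such a fault-monitor check, assuming a nearest-rank percentile and illustrative thresholds (the real thresholds and metric plumbing are not described in the source):

```python
# Illustrative fault-monitor check on core-interface metrics: success rate
# and 75th-percentile RT. Thresholds are assumptions, not Xiami's values.
import math

def p75(values):
    """75th percentile of a list, using the nearest-rank method."""
    s = sorted(values)
    rank = math.ceil(0.75 * len(s))
    return s[rank - 1]

def check_interface(calls, rt_threshold_ms=200, min_success_rate=0.995):
    """calls: list of (rt_ms, ok) tuples. Returns the breached metrics."""
    breaches = []
    success_rate = sum(ok for _, ok in calls) / len(calls)
    if success_rate < min_success_rate:
        breaches.append("success_rate")
    if p75([rt for rt, _ in calls]) > rt_threshold_ms:
        breaches.append("rt_p75")
    return breaches
```

Using the 75th percentile rather than the mean keeps the check sensitive to a broad slowdown while ignoring a handful of slow outlier calls.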

Additional fault monitors capture application‑level exceptions, errors, and message anomalies to quickly detect program bugs and trigger possible rollbacks.

Generic Monitoring

Because most issues stem from single‑machine failures, generic monitoring highlights abnormal metrics on a per‑machine basis, helping pinpoint the problematic host. The team also added HSF (Dubbo) thread‑pool‑full and timeout monitors to detect high load or CPU problems.
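One common way to surface a per-machine anomaly is to compare each host against the cluster median; this is an assumed approach for illustration, not the technique the source names:

```python
# Sketch: flag hosts whose metric (e.g. RT or error count) far exceeds the
# cluster median, so the alert points at a machine instead of the whole app.
import statistics

def outlier_hosts(metric_by_host: dict, ratio: float = 2.0) -> list:
    """Return hosts whose metric exceeds `ratio` times the cluster median."""
    median = statistics.median(metric_by_host.values())
    return [host for host, value in metric_by_host.items()
            if median > 0 and value > ratio * median]
```

A ratio against the cluster median is robust to a uniform load increase (which shifts the median too) while still catching a single misbehaving host.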

Basic Monitoring

Basic monitoring uses Sunfire to track middleware metrics such as CPU, load, JVM, HSF, and MetaQ. Significant alerts here indicate middleware faults.
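A basic-monitoring layer like this usually reduces to a threshold table over host and middleware metrics. The metric names and limits below are illustrative assumptions in the spirit of the Sunfire setup, not actual configuration:

```python
# Hypothetical basic-monitoring thresholds; names and limits are assumptions.
BASIC_THRESHOLDS = {
    "cpu_pct": 85,             # sustained CPU usage
    "load_1m": 8,              # 1-minute load average
    "jvm_full_gc_per_min": 2,  # Full GC frequency
    "hsf_timeout_per_min": 50, # HSF call timeouts
    "metaq_lag": 10000,        # MetaQ message backlog
}

def breached_basics(sample: dict) -> list:
    """Return the basic metrics in `sample` that exceed their thresholds."""
    return [name for name, limit in BASIC_THRESHOLDS.items()
            if sample.get(name, 0) > limit]
```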

Alert Path Optimization

Each application now has 30‑50 alert items. Sending all alerts to a single group would be chaotic, so the SRE platform aggregates alerts using stream processing, applies severity levels, and routes them to the relevant owners via DingTalk bots. This reduces daily alerts from ~5,000 to 50‑100 critical messages.
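The aggregation idea can be sketched as windowed deduplication keyed by application and alert item, keeping a count and the highest severity per window. The windowing scheme and field names are assumptions; the real pipeline runs on a stream processor and delivers via DingTalk bots:

```python
# Sketch: collapse raw alerts within a time window by (app, alert item),
# keeping a count and the maximum severity, so one summary message per
# window replaces a flood of duplicates.
from collections import defaultdict

def aggregate(alerts, window_s=60):
    """alerts: list of (ts, app, item, severity) tuples.
    Returns one summary dict per (app, item) per time window."""
    buckets = defaultdict(lambda: {"count": 0, "severity": 0})
    for ts, app, item, severity in alerts:
        bucket = buckets[(ts // window_s, app, item)]
        bucket["count"] += 1
        bucket["severity"] = max(bucket["severity"], severity)
    return [{"app": app, "item": item, **bucket}
            for (_, app, item), bucket in buckets.items()]
```

Routing then only has to look at each summary's app (to find the owning DingTalk group) and severity (to decide whether to page).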

Leveraging Traffic Scheduling

To handle single‑machine failures more efficiently, the team uses Alibaba Cloud’s traffic scheduling platform (AHAS) to automatically divert traffic away from problematic machines and restore it once they recover. The platform also pre‑warms traffic during releases, which prevents load‑induced timeouts. In practice, this:

Avoids release‑induced RT and load spikes.

Mitigates local machine overload, host‑related slow calls, and HSF thread‑pool saturation.
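The divert-and-restore behavior amounts to a small control loop over each host's health. The class and method names below are hypothetical; the actual diversion is performed by AHAS, not application code:

```python
# Simplified sketch of the divert/restore loop: remove an unhealthy host
# from rotation, and put it back once it reports healthy again.
class TrafficScheduler:
    def __init__(self, hosts):
        self.in_rotation = set(hosts)
        self.diverted = set()

    def report_health(self, host: str, healthy: bool):
        """Divert unhealthy hosts; restore diverted hosts that recover."""
        if not healthy and host in self.in_rotation:
            self.in_rotation.discard(host)
            self.diverted.add(host)
        elif healthy and host in self.diverted:
            self.diverted.discard(host)
            self.in_rotation.add(host)
```

At the reported scale of 1,000+ switches per week, automating this loop is what removes the corresponding single-machine alerts from human view.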

Currently, about 40 applications have integrated with the traffic scheduling platform, performing over 1,000 traffic switches weekly, greatly reducing alert noise from single‑machine issues.

Tags: Alibaba, monitoring, Operations, SRE, traffic scheduling, alert optimization
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
