How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples
This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and precision, details rule-based, blacklist/whitelist, ratio-based, and intelligent noise-reduction techniques, shares Java code examples, and shows measurable results from the governance process.
Background
Since May, Ant Group has been strengthening monitoring governance to achieve a five‑minute detection and thirty‑minute resolution target while filtering out noisy alerts that cause alert fatigue. The growing number of services and monitoring configurations increased the proportion of unhealthy alerts, prompting a dedicated noise‑reduction effort in June.
Why Noise Reduction Is Required
Avoid alert fatigue and improve efficiency – Excessive noisy alerts lead engineers to ignore warnings, increase operational workload, and reduce system reliability.
Save resources and keep systems stable – Frequent alerts consume CPU, memory, and network bandwidth, degrading performance. Reducing noise frees resources for the business workload.
Alert Governance
How to View Noise
Ant Group’s Aurora alert dashboard displays the total noise count, fault count, five‑minute detection rate, and thirty‑minute resolution rate. Users can label alerts as noise, filter by label, and trace each noisy alert back to its monitoring ID for targeted remediation.
Noise‑Reduction Methodology
Avoid Single‑Dimension Rules
Rules that rely on a single metric (e.g., success count alone) fire spuriously during traffic peaks, when absolute counts swing widely even though the service is healthy. Multi-dimensional conditions improve precision. Typical dimension combinations include:
Success volume + success rate, or failure volume + success rate
Success volume + success rate + total volume
Success rate + failure count
Adjust the collection period according to business volume so that brief network jitter does not produce spurious spikes. The sketch below illustrates one multi-dimensional combination.
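To make the idea concrete, here is a minimal Java sketch of a multi-dimensional condition. The MetricWindow type, field names, and every threshold are illustrative assumptions, not Aurora's actual rule API.

// Illustrative multi-dimensional alert condition. MetricWindow and all
// thresholds are hypothetical stand-ins for the rule engine's inputs.
public class MultiDimensionRule {

    // Aggregated metrics for one collection window.
    public static class MetricWindow {
        final long totalCount;
        final long failureCount;

        MetricWindow(long totalCount, long failureCount) {
            this.totalCount = totalCount;
            this.failureCount = failureCount;
        }

        double successRate() {
            return totalCount == 0 ? 1.0 : 1.0 - (double) failureCount / totalCount;
        }
    }

    // Fire only when several dimensions agree: enough traffic to be
    // meaningful, a depressed success rate, AND an absolute failure
    // floor. A single-dimension rule (failure count alone) would fire
    // on every traffic peak.
    public boolean shouldAlert(MetricWindow w) {
        return w.totalCount >= 1000       // ignore low-volume windows
            && w.successRate() < 0.95     // success rate degraded
            && w.failureCount >= 50;      // absolute failures, not ratio noise
    }
}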
Use Black‑ and White‑Lists
Interfaces that frequently trigger alerts because of business characteristics can be added to a blacklist (suppress alerts) or a whitelist (lower sensitivity) after they are labeled as noise.
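A minimal sketch of such a filter follows; the interface names and the whitelist threshold multiplier are assumptions for illustration.

import java.util.Map;
import java.util.Set;

// Hypothetical list-based filter; interface names and the whitelist
// multiplier are assumptions, not real configuration.
public class ListBasedFilter {

    // Interfaces whose alerts are suppressed entirely (labeled noise).
    private final Set<String> blacklist = Set.of("mock.query.interface");

    // Interfaces alerted at reduced sensitivity via a threshold multiplier.
    private final Map<String, Double> whitelist = Map.of("batch.settle.interface", 2.0);

    public boolean shouldAlert(String iface, long failureCount, long baseThreshold) {
        if (blacklist.contains(iface)) {
            return false;                                  // suppressed outright
        }
        double multiplier = whitelist.getOrDefault(iface, 1.0);
        return failureCount > baseThreshold * multiplier;  // lowered sensitivity
    }
}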
Leverage Ratio Comparisons
Period-over-period and day-over-day comparisons detect abnormal changes in total volume, success rate, or error counts. Comparing the current five-minute window against the previous five-minute window highlights sudden spikes, while comparing against the same window one day earlier filters out normal daily variation.
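The following sketch shows both comparisons; the 50% and 30% thresholds and the method names are illustrative assumptions.

// Sketch of period-over-period and day-over-day checks; both
// thresholds are illustrative assumptions.
public class RatioComparator {

    // Relative change of the current window against a reference window.
    static double change(double current, double reference) {
        if (reference == 0) {
            return current == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return (current - reference) / reference;
    }

    // Period-over-period: current five-minute window vs. the previous
    // one, to catch sudden spikes.
    public boolean spikeVsPreviousWindow(double current, double previous) {
        return change(current, previous) > 0.5;            // >50% jump
    }

    // Day-over-day: same five-minute window yesterday, to filter out
    // normal intraday patterns (lunch peaks, nightly batches, etc.).
    public boolean abnormalVsYesterday(double current, double sameWindowYesterday) {
        return Math.abs(change(current, sameWindowYesterday)) > 0.3;  // over 30% deviation
    }
}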
Intelligent Noise Reduction
Alert suppression: define a suppression period (e.g., five minutes) during which repeated alerts from the same rule are muted, taming alert storms.
Short-cycle jitter: detect brief spikes caused by network jitter and suppress alerts whose breach lasts less than a configurable fraction of the collection period.
Spike-then-drop: suppress alerts when a metric spikes but returns to baseline within a configured time window (see the sketch after this list).
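Here is a sketch of the three strategies; the suppression period, jitter fraction, and spike ratios are chosen for illustration rather than taken from Aurora's defaults.

import java.time.Duration;
import java.time.Instant;

// Illustrative implementation of the three intelligent strategies;
// all constants are assumptions.
public class IntelligentSuppressor {

    private static final Duration SUPPRESSION_PERIOD = Duration.ofMinutes(5);
    private Instant lastFired = Instant.EPOCH;

    // Alert suppression: at most one notification per suppression period.
    public synchronized boolean allowNotification(Instant now) {
        if (Duration.between(lastFired, now).compareTo(SUPPRESSION_PERIOD) < 0) {
            return false;                       // still inside the mute window
        }
        lastFired = now;
        return true;
    }

    // Short-cycle jitter: the breach lasted less than a configurable
    // fraction (here 20%) of the collection period, so treat it as jitter.
    public boolean isShortCycleJitter(Duration breachDuration, Duration collectionPeriod) {
        long limitMillis = (long) (collectionPeriod.toMillis() * 0.2);
        return breachDuration.toMillis() < limitMillis;
    }

    // Spike-then-drop: an earlier window breached the threshold but the
    // latest window is back near baseline, so suppress the alert.
    public boolean isSpikeThenDrop(double spikeValue, double latestValue, double baseline) {
        return spikeValue > baseline * 2 && latestValue <= baseline * 1.1;
    }
}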
CDO Alert Noise Handling
Complex Event Detection and Optimization (CDO) alerts require custom business exceptions. The following Java snippets illustrate a BizException class and a handling template that logs controllable business errors as warnings and unexpected failures as errors; supporting types such as ResultCodeEnum, LogUtil, Profiler, ResultUtil, EnvUtil, and HandleCallback are internal framework utilities.
public class BizException extends RuntimeException {

    private static final long serialVersionUID = 5840651686530819567L;

    // Business result code; defaults to "unknown exception".
    private ResultCodeEnum code = ResultCodeEnum.UN_KNOWN_EXCEPTION;

    public BizException(ResultCodeEnum code) {
        super(code.getMsg());
        this.code = code;
    }

    public BizException(ResultCodeEnum code, String message) {
        super(message);
        this.code = code;
    }
}

public class HandleTemplate {

    public void execute(final Response response, final HandleCallback action) {
        Profiler.enter("entering handle template");
        try {
            action.doPreAction();
            action.doAction();
            action.doPostAction();
        } catch (BizException be) {
            // Controllable business exceptions are logged at ERROR only in
            // development; elsewhere they are logged at WARN, so they do
            // not feed error-log alert rules.
            if (EnvEnum.DEV.getType().equals(EnvUtil.getExactEnv())) {
                LogUtil.error(LOGGER, be, "business exception:");
            } else {
                LogUtil.warn(LOGGER, be, "business exception:");
            }
            ResultUtil.generateResult(response, be);
        } catch (IntegrationException ie) {
            // Downstream query failures are always logged as errors.
            LogUtil.error(LOGGER, ie, "query business exception: " + ie.getMessage());
            ResultUtil.generateResult(response, ie);
        } catch (Throwable e) {
            // Anything unexpected is treated as a system error.
            LogUtil.error(LOGGER, e, "system exception: " + e.getMessage());
            ResultUtil.generateResult(response, ResultCodeEnum.SYSTEM_ERROR, e.getMessage());
        }
    }
}

Additional Practical Measures
Exclude pre‑release environments from alerting.
Configure active time windows that match business peak hours and limit alerts to one per N minutes.
Suppress duplicate or unnecessary alerts and fine‑tune subscription filters.
Route alerts to appropriate channels (SMS, email) based on severity to avoid notification overload; a sketch combining throttling and severity-based routing follows.
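The following hypothetical dispatcher sketches the last two measures: it throttles each rule to one alert per interval and routes by severity. The severity labels, interval, and channel methods are assumptions.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical dispatcher combining "one alert per N minutes" with
// severity-based channel routing; all names are illustrative.
public class AlertDispatcher {

    private final Duration minInterval = Duration.ofMinutes(10);  // the "N minutes"
    private final Map<String, Instant> lastSent = new ConcurrentHashMap<>();

    public void dispatch(String ruleId, String severity, Instant now) {
        Instant previous = lastSent.get(ruleId);
        if (previous != null && Duration.between(previous, now).compareTo(minInterval) < 0) {
            return;                              // throttled: one alert per interval
        }
        lastSent.put(ruleId, now);
        switch (severity) {                      // route by severity
            case "CRITICAL" -> sendSms(ruleId);  // page someone immediately
            case "WARNING"  -> sendEmail(ruleId);
            default         -> log(ruleId);      // low priority never pages
        }
    }

    private void sendSms(String ruleId)   { System.out.println("SMS: "   + ruleId); }
    private void sendEmail(String ruleId) { System.out.println("Email: " + ruleId); }
    private void log(String ruleId)       { System.out.println("Log: "   + ruleId); }
}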
Results
After the June governance effort, noise alerts dropped from 61.70 % of total events to 12.80 % while fault‑related alerts remained stable, indicating that recall stayed near 100 % and precision improved markedly.
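As a rough consistency check, assume every non-noise alert corresponds to a real issue, so precision is simply one minus the noise rate:

precision = fault-related alerts / total alerts = 1 - noise rate
before: 1 - 0.6170 = 0.3830 (about 38.3%)
after:  1 - 0.1280 = 0.8720 (about 87.2%)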
Month-by-month noise rates tracked over the governance period confirm the steady decline.
Continuous measurement of recall and precision after any rule adjustment is essential; otherwise, critical issues may be silently suppressed.
