How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples
This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and precision, details rule-based, blacklist/whitelist, ratio-based, and intelligent noise-reduction techniques, shares Java code examples, and shows measurable results from the governance process.
Background
Since May, Ant Group has been strengthening monitoring governance to achieve a five‑minute detection and thirty‑minute resolution target while filtering out noisy alerts that cause alert fatigue. The growing number of services and monitoring configurations increased the proportion of unhealthy alerts, prompting a dedicated noise‑reduction effort in June.
Why Noise Reduction Is Required
Avoid alert fatigue and improve efficiency – Excessive noisy alerts lead engineers to ignore warnings, increase operational workload, and reduce system reliability.
Save resources and keep systems stable – Frequent alerts consume CPU, memory, and network bandwidth, degrading performance. Reducing noise frees resources for the business workload.
Alert Governance
How to View Noise
Ant Group’s Aurora alert dashboard displays the total noise count, fault count, five‑minute detection rate, and thirty‑minute resolution rate. Users can label alerts as noise, filter by label, and trace each noisy alert back to its monitoring ID for targeted remediation.
Noise‑Reduction Methodology
Avoid Single‑Dimension Rules
Rules that rely on a single metric (e.g., success count alone) fire spuriously during traffic peaks, when absolute counts swing widely even though the service is healthy. Multi-dimensional conditions improve precision. Typical dimension combinations include:
Success volume + success rate, or failure volume + success rate
Success volume + success rate + total volume
Success rate + failure count
Adjust the collection period according to business volume so that brief network jitter does not produce spurious spikes. The sketch below illustrates one multi-dimensional combination.
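To make the idea concrete, here is a minimal Java sketch of a multi-dimensional condition. The MetricWindow type, field names, and every threshold are illustrative assumptions, not Aurora's actual rule API.

// Illustrative multi-dimensional alert condition. MetricWindow and all
// thresholds are hypothetical stand-ins for the rule engine's inputs.
public class MultiDimensionRule {

    // Aggregated metrics for one collection window.
    public static class MetricWindow {
        final long totalCount;
        final long failureCount;

        MetricWindow(long totalCount, long failureCount) {
            this.totalCount = totalCount;
            this.failureCount = failureCount;
        }

        double successRate() {
            return totalCount == 0 ? 1.0 : 1.0 - (double) failureCount / totalCount;
        }
    }

    // Fire only when several dimensions agree: enough traffic to be
    // meaningful, a depressed success rate, AND an absolute failure
    // floor. A single-dimension rule (failure count alone) would fire
    // on every traffic peak.
    public boolean shouldAlert(MetricWindow w) {
        return w.totalCount >= 1000       // ignore low-volume windows
            && w.successRate() < 0.95     // success rate degraded
            && w.failureCount >= 50;      // absolute failures, not ratio noise
    }
}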
Use Black‑ and White‑Lists
Interfaces that frequently trigger alerts because of business characteristics can be added to a blacklist (suppress alerts) or a whitelist (lower sensitivity) after they are labeled as noise.
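A minimal sketch of such a filter follows; the interface names and the whitelist threshold multiplier are assumptions for illustration.

import java.util.Map;
import java.util.Set;

// Hypothetical list-based filter; interface names and the whitelist
// multiplier are assumptions, not real configuration.
public class ListBasedFilter {

    // Interfaces whose alerts are suppressed entirely (labeled noise).
    private final Set<String> blacklist = Set.of("mock.query.interface");

    // Interfaces alerted at reduced sensitivity via a threshold multiplier.
    private final Map<String, Double> whitelist = Map.of("batch.settle.interface", 2.0);

    public boolean shouldAlert(String iface, long failureCount, long baseThreshold) {
        if (blacklist.contains(iface)) {
            return false;                                  // suppressed outright
        }
        double multiplier = whitelist.getOrDefault(iface, 1.0);
        return failureCount > baseThreshold * multiplier;  // lowered sensitivity
    }
}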
Leverage Ratio Comparisons
Period-over-period and day-over-day comparisons detect abnormal changes in total volume, success rate, or error counts. Comparing the current five-minute window against the previous five-minute window highlights sudden spikes, while comparing against the same window one day earlier filters out normal daily variation.
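The following sketch shows both comparisons; the 50% and 30% thresholds and the method names are illustrative assumptions.

// Sketch of period-over-period and day-over-day checks; both
// thresholds are illustrative assumptions.
public class RatioComparator {

    // Relative change of the current window against a reference window.
    static double change(double current, double reference) {
        if (reference == 0) {
            return current == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return (current - reference) / reference;
    }

    // Period-over-period: current five-minute window vs. the previous
    // one, to catch sudden spikes.
    public boolean spikeVsPreviousWindow(double current, double previous) {
        return change(current, previous) > 0.5;            // >50% jump
    }

    // Day-over-day: same five-minute window yesterday, to filter out
    // normal intraday patterns (lunch peaks, nightly batches, etc.).
    public boolean abnormalVsYesterday(double current, double sameWindowYesterday) {
        return Math.abs(change(current, sameWindowYesterday)) > 0.3;  // over 30% deviation
    }
}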
Intelligent Noise Reduction
Alert suppression: define a suppression period (e.g., five minutes) during which repeated alerts from the same rule are muted, taming alert storms.
Short-cycle jitter: detect brief spikes caused by network jitter and suppress alerts whose breach lasts less than a configurable fraction of the collection period.
Spike-then-drop: suppress alerts when a metric spikes but returns to baseline within a configured time window (see the sketch after this list).
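Here is a sketch of the three strategies; the suppression period, jitter fraction, and spike ratios are chosen for illustration rather than taken from Aurora's defaults.

import java.time.Duration;
import java.time.Instant;

// Illustrative implementation of the three intelligent strategies;
// all constants are assumptions.
public class IntelligentSuppressor {

    private static final Duration SUPPRESSION_PERIOD = Duration.ofMinutes(5);
    private Instant lastFired = Instant.EPOCH;

    // Alert suppression: at most one notification per suppression period.
    public synchronized boolean allowNotification(Instant now) {
        if (Duration.between(lastFired, now).compareTo(SUPPRESSION_PERIOD) < 0) {
            return false;                       // still inside the mute window
        }
        lastFired = now;
        return true;
    }

    // Short-cycle jitter: the breach lasted less than a configurable
    // fraction (here 20%) of the collection period, so treat it as jitter.
    public boolean isShortCycleJitter(Duration breachDuration, Duration collectionPeriod) {
        long limitMillis = (long) (collectionPeriod.toMillis() * 0.2);
        return breachDuration.toMillis() < limitMillis;
    }

    // Spike-then-drop: an earlier window breached the threshold but the
    // latest window is back near baseline, so suppress the alert.
    public boolean isSpikeThenDrop(double spikeValue, double latestValue, double baseline) {
        return spikeValue > baseline * 2 && latestValue <= baseline * 1.1;
    }
}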
CDO Alert Noise Handling
Complex Event Detection and Optimization (CDO) alerts require custom business exceptions. The following Java snippets illustrate a BizException class and a handling template that logs controllable business errors as warnings and unexpected failures as errors; supporting types such as ResultCodeEnum, LogUtil, Profiler, ResultUtil, EnvUtil, and HandleCallback are internal framework utilities.
public class BizException extends RuntimeException {

    private static final long serialVersionUID = 5840651686530819567L;

    // Business result code; defaults to "unknown exception".
    private ResultCodeEnum code = ResultCodeEnum.UN_KNOWN_EXCEPTION;

    public BizException(ResultCodeEnum code) {
        super(code.getMsg());
        this.code = code;
    }

    public BizException(ResultCodeEnum code, String message) {
        super(message);
        this.code = code;
    }
}

public class HandleTemplate {

    public void execute(final Response response, final HandleCallback action) {
        Profiler.enter("entering handle template");
        try {
            action.doPreAction();
            action.doAction();
            action.doPostAction();
        } catch (BizException be) {
            // Controllable business exceptions are logged at ERROR only in
            // development; elsewhere they are logged at WARN, so they do
            // not feed error-log alert rules.
            if (EnvEnum.DEV.getType().equals(EnvUtil.getExactEnv())) {
                LogUtil.error(LOGGER, be, "business exception:");
            } else {
                LogUtil.warn(LOGGER, be, "business exception:");
            }
            ResultUtil.generateResult(response, be);
        } catch (IntegrationException ie) {
            // Downstream query failures are always logged as errors.
            LogUtil.error(LOGGER, ie, "query business exception: " + ie.getMessage());
            ResultUtil.generateResult(response, ie);
        } catch (Throwable e) {
            // Anything unexpected is treated as a system error.
            LogUtil.error(LOGGER, e, "system exception: " + e.getMessage());
            ResultUtil.generateResult(response, ResultCodeEnum.SYSTEM_ERROR, e.getMessage());
        }
    }
}

Additional Practical Measures
Exclude pre‑release environments from alerting.
Configure active time windows that match business peak hours and limit alerts to one per N minutes.
Suppress duplicate or unnecessary alerts and fine‑tune subscription filters.
Route alerts to appropriate channels (SMS, email) based on severity to avoid notification overload; a sketch combining throttling and severity-based routing follows.
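The following hypothetical dispatcher sketches the last two measures: it throttles each rule to one alert per interval and routes by severity. The severity labels, interval, and channel methods are assumptions.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical dispatcher combining "one alert per N minutes" with
// severity-based channel routing; all names are illustrative.
public class AlertDispatcher {

    private final Duration minInterval = Duration.ofMinutes(10);  // the "N minutes"
    private final Map<String, Instant> lastSent = new ConcurrentHashMap<>();

    public void dispatch(String ruleId, String severity, Instant now) {
        Instant previous = lastSent.get(ruleId);
        if (previous != null && Duration.between(previous, now).compareTo(minInterval) < 0) {
            return;                              // throttled: one alert per interval
        }
        lastSent.put(ruleId, now);
        switch (severity) {                      // route by severity
            case "CRITICAL" -> sendSms(ruleId);  // page someone immediately
            case "WARNING"  -> sendEmail(ruleId);
            default         -> log(ruleId);      // low priority never pages
        }
    }

    private void sendSms(String ruleId)   { System.out.println("SMS: "   + ruleId); }
    private void sendEmail(String ruleId) { System.out.println("Email: " + ruleId); }
    private void log(String ruleId)       { System.out.println("Log: "   + ruleId); }
}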
Results
After the June governance effort, noise alerts dropped from 61.70 % of total events to 12.80 % while fault‑related alerts remained stable, indicating that recall stayed near 100 % and precision improved markedly.
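As a rough consistency check, assume every non-noise alert corresponds to a real issue, so precision is simply one minus the noise rate:

precision = fault-related alerts / total alerts = 1 - noise rate
before: 1 - 0.6170 = 0.3830 (about 38.3%)
after:  1 - 0.1280 = 0.8720 (about 87.2%)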
Month-by-month noise rates tracked over the governance period confirm the steady decline.
Continuous measurement of recall and precision after any rule adjustment is essential; otherwise, critical issues may be silently suppressed.
