How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance
This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting. It covers why alerts are hard to get right, how to design segmented error codes, how to build unified alert policies, how to drive the team toward alert silence, how to measure progress, and which tools make the process sustainable.
Why Alert Governance Is a Never‑Ending Problem
In large‑scale backend systems, alerts are the first line of defense, yet teams often become desensitized because alerts either miss real failures or flood the channel with noise. The article starts by describing three typical scenarios where alerts fail to surface problems in time.
From Logic to Reality: The "Left" and "Right" Alerts
Ideally, a quality issue should be a sufficient and necessary condition for an alert, but in practice alerts are either missing (left) or overly sensitive (right). The author explains how both extremes arise in micro‑service environments.
Designing Precise, Segmented Error Codes
To make alerts both accurate and actionable, the team introduced a three‑segment error‑code scheme: module ID + error class + specific code. The module ID (two digits) identifies the service, the class (A‑D) indicates the nature of the error (user error, system error, downstream service failure, or component failure), and the specific code pinpoints the exact problem.
Examples include:
A‑class (1xxxx) – user‑behavior errors such as missing parameters.
B‑class (2xxxx) – internal system faults like resource exhaustion.
C‑class (3xxxx) – downstream service timeouts or failures.
D‑class (4xxxx) – infrastructure component errors (MySQL, Redis, Kafka, etc.).
All services share a global error‑code registry to avoid duplication and to enable quick root‑cause identification.
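To make the three segments concrete, here is a minimal Go sketch that splits a full code into module ID, class, and specific code. The 7‑digit layout (two module digits, one class digit, four specific digits) is an assumption inferred from the description above, and the names `Segments`, `Class`, and `Split` are hypothetical; the article does not show the team's actual parser.

```go
package main

import (
	"errors"
	"fmt"
)

// Class is the error nature encoded in the class digit.
type Class int

const (
	ClassUser       Class = 1 // A: user-behavior errors
	ClassSystem     Class = 2 // B: internal system faults
	ClassDownstream Class = 3 // C: downstream service failures
	ClassComponent  Class = 4 // D: infrastructure component errors
)

// Segments holds the three parts of a segmented error code.
type Segments struct {
	ModuleID int   // two-digit service identifier
	Class    Class // error nature, 1 through 4
	Specific int   // four-digit code within the class
}

// Split decomposes a 7-digit code laid out as MMCSSSS.
// The layout is an assumption made for this sketch; it also means
// module IDs must be in the range 10..99.
func Split(code int) (Segments, error) {
	if code < 1000000 || code > 9999999 {
		return Segments{}, errors.New("code must have exactly 7 digits")
	}
	s := Segments{
		ModuleID: code / 100000,
		Class:    Class(code / 10000 % 10),
		Specific: code % 10000,
	}
	if s.Class < ClassUser || s.Class > ClassComponent {
		return Segments{}, fmt.Errorf("unknown class digit %d", s.Class)
	}
	return s, nil
}

func main() {
	s, err := Split(1240101) // hypothetical module 12, class D, code 0101
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("module=%d class=%d specific=%04d\n", s.ModuleID, s.Class, s.Specific)
}
```

Because the class always sits in the same position, an alert rule or a dashboard can group errors by nature without knowing every individual code.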
Alert Policy Design Based on Error Codes
With the error‑code taxonomy, the team can filter out noisy user‑error alerts in a single rule (e.g., ignore all codes where the third digit is 1). The primary alert strategy becomes a high‑threshold success‑rate rule (e.g., 99.9% success rate) applied uniformly across services, while auxiliary rules catch spikes in specific error codes or global success‑rate drops.
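The primary rule can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the team's implementation: codes use the 7‑digit layout in which the third digit is the class, code `0` stands for a successful request, user errors (class A) are simply not counted as failures, and the function names are hypothetical.

```go
package main

import "fmt"

// isUserError reports whether a 7-digit code belongs to class A,
// i.e. its third digit is 1 (assumed MMCSSSS layout).
func isUserError(code int) bool {
	return code/10000%10 == 1
}

// SuccessRate returns the share of requests that did not fail for a
// non-user reason. counts maps an error code to its number of
// occurrences in the evaluation window; code 0 means success.
func SuccessRate(counts map[int]int) float64 {
	total, failed := 0, 0
	for code, n := range counts {
		total += n
		if code != 0 && !isUserError(code) {
			failed += n
		}
	}
	if total == 0 {
		return 1.0 // no traffic in the window: nothing to alert on
	}
	return float64(total-failed) / float64(total)
}

// ShouldAlert applies the primary rule: fire when the success rate
// drops below the threshold (for example 0.999).
func ShouldAlert(counts map[int]int, threshold float64) bool {
	return SuccessRate(counts) < threshold
}

func main() {
	// Sample data invented for the example.
	window := map[int]int{
		0:       99800, // successes
		1210001: 150,   // class A: user error, not counted as a failure
		1230002: 50,    // class C: downstream timeout
	}
	fmt.Printf("rate=%.4f alert=%v\n", SuccessRate(window), ShouldAlert(window, 0.999))
}
```

With this sample window, the 150 user errors do not pull the rate down, so only the 50 downstream failures count against the 99.9% threshold.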
Driving Silence: Team Practices and Metrics
The article outlines a three‑stage process for quieting alerts:
Stage 1 – Eliminate the biggest noise sources: Lower thresholds on low‑impact services, raise thresholds on critical services, and focus on the most frequent alert categories.
Stage 2 – Track each alert’s lifecycle: Weekly on‑call reviews, explicit ownership, and a “who‑is‑looking” metric to ensure alerts are acknowledged.
Stage 3 – Measure and optimize: Weekly alert count, proactive interception rate, handling time, and closure rate become KPIs, visualized in dashboards.
Tools such as a custom “fire‑sentry” card in the alert channel record who claimed an alert, the analysis steps, and the final resolution, turning the alert flow into a traceable process.
Results and Ongoing Work
After six months, the team reduced total alerts from over 2,700 per month to just 71, a 97.4% drop, surpassing the original 85% goal. However, the author stresses that alert governance is an endless journey because new features continuously introduce new bugs.
Key takeaways include:
Plan error codes strategically to enable long‑term governance.
Focus alert policies on success‑rate monitoring, using error‑code filters as supplements.
Establish clear on‑call processes and data‑review loops to keep alerts visible.
Quantify progress with metrics and automate the workflow.
Overall, the article provides a concrete, repeatable framework for backend teams to transform noisy alerts into meaningful, actionable signals.
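// Go: the error is handed back to the caller unchanged, with no error code attached.
if err != nil {
	return err
}

// Java: the exception is wrapped in a RuntimeException and rethrown.
try {
    callRpc();
} catch (Exception ex) {
    throw new RuntimeException(ex);
}

// Go: regError registers an error code with its default detail message.
// It panics if the code is already registered, so a duplicate definition
// cannot go unnoticed.
func regError(code int32, defaultErrDetail string) Error {
	_, exists := codeDetails[code]
	if exists {
		panic(fmt.Sprintf("error code %d already used, please redefine a new one!!", code))
	}
	codeDetails[code] = defaultErrDetail
	return Error{Code: code, ErrDetail: defaultErrDetail}
}

// D-class (4xxxx) codes: infrastructure component errors.
var (
	DCodeDefault                  = regError(40000, "error calling base component")
	DCodeMySQLError               = regError(40100, "error calling MySQL")
	DCodeMySQLRecordNotFoundError = regError(40101, "MySQL record not found")
	// ... other error definitions omitted for brevity
)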
dbaplus Community
