Tagged articles
17 articles
Raymond Ops
Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on-call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step-by-step response flow, and details war-room roles, escalation paths, handoff checklists, and post-mortem procedures. It also includes dozens of ready-to-use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations
0 likes · 27 min read
MaGe Linux Operations
Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete SRE on-call process, tested over two years, that defines alert severity tiers, response requirements, escalation paths, war-room roles, handoff schedules, and post-mortem procedures, and provides ready-to-use configuration snippets, checklists, and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations
0 likes · 26 min read
Old Meng AI Explorer
Nov 26, 2025 · Operations

How Alertmanager Turns Chaos into Calm: Mastering Alert Management for DevOps

Alertmanager, the official Prometheus alert manager, consolidates redundant alerts and supports silencing, inhibition, multi-channel routing, and high-availability clustering. With simple YAML configuration and command-line tools, it helps DevOps teams pinpoint critical issues quickly, reduce noise, and streamline incident response across large server fleets.

Alert Management · Alertmanager · DevOps
0 likes · 10 min read
dbaplus Community
Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year of hands-on experience improving backend alert governance at Tencent Meeting, covering why alerts are hard to manage, how to design segmented error codes, how to build unified alert policies, how to get teams to converge their alerts, how to measure progress, and which tools make the process sustainable.

Alert Management · backend operations · error code design
0 likes · 42 min read
Bilibili Tech
Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

Alert Management · DevOps · SRE
0 likes · 19 min read
DataFunSummit
Apr 15, 2023 · Operations

Observability and Intelligent Alert Management Practices

This presentation outlines the observability ecosystem, the role and value of alerts within it, core functionalities of an intelligent alarm management platform, best‑practice recommendations, and a real‑world case study of deploying a unified observability solution for a large state‑owned investment group.

Alert Management · IT Operations · aiops
0 likes · 11 min read
Alibaba Cloud Native
Dec 12, 2022 · Cloud Native

How ACK One Enables Multi‑Cluster GitOps and Unified Alert Management

ACK One is a distributed cloud‑native container platform that unifies management of Kubernetes clusters across hybrid‑cloud, edge, and on‑prem environments, offering GitOps‑based multi‑cluster application distribution with ArgoCD integration and a centralized alert‑management system.

Alert Management · ArgoCD · GitOps
0 likes · 9 min read
Efficient Ops
Sep 28, 2022 · Operations

How Event‑Driven Alert Centers Revolutionize Intelligent Operations

This article presents a comprehensive overview of an event‑centric intelligent alert analysis platform, covering its evolution, core challenges, the concept of alert events, AI‑driven correlation techniques, and the MC‑Stack platform that powers modern operations.

Alert Management · aiops · event-driven monitoring
0 likes · 13 min read
NetEase Yanxuan Technology Product Team
Sep 26, 2022 · Operations

How to Tame Alert Storms: Building a Systematic Monitoring and Alerting Framework for Microservices

This article analyzes the challenges of alert overload in large‑scale microservice environments and presents a systematic approach—including timeliness metrics, a maturity model, lifecycle tracking, feedback loops, downgrade mechanisms, and cross‑service aggregation—to improve alert effectiveness and reduce noise.

Alert Management · MTTR · Microservices
0 likes · 16 min read
Efficient Ops
Jun 1, 2022 · Operations

What Can Aircraft Monitoring Teach Us About Building Effective IT Operations Monitoring?

The article explores how aviation-grade monitoring concepts, such as multi-level alarm classification, diverse alert delivery methods, and comprehensive sensor coverage, can inspire centralized, data-driven IT operations monitoring architectures that reduce missed alerts and false positives and improve response times.

Alert Management · Digital Twin · aiops
0 likes · 33 min read
DeWu Technology
May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three-tier NOC SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS-linked workflow. The changes boosted detection frequency, cut critical response times to under five minutes, and improved overall system reliability; the article also stresses that baselines and rules need ongoing maintenance.

Alert Management · NOC · Operations
0 likes · 13 min read
Efficient Ops
Jun 15, 2021 · Operations

Mastering IT Monitoring: Strategies, Challenges, and Best Practices

This article explores the fundamentals of IT monitoring, examines common challenges such as scalability, reliability, and alert fatigue, compares four implementation approaches—from open‑source to fully custom solutions—and presents practical techniques like alert convergence, suppression, and automation to build a robust, adaptable monitoring platform.

Alert Management · Automation · Operations
0 likes · 19 min read
Sohu Tech Products
Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations
0 likes · 15 min read
Efficient Ops
Aug 12, 2019 · Operations

Mastering Alert Storms: The 5‑Level Maturity Model for Modern Ops

As cloud, container, and micro‑service architectures increase system complexity, this article explains why alert overload occurs, introduces a five‑level alert‑management maturity model, and shows how AIOps‑driven automation can transform chaotic notifications into efficient, self‑healing operations.

Alert Management · aiops
0 likes · 11 min read
Efficient Ops
Jul 11, 2016 · Operations

How Tencent's Intelligent Monitoring Transforms Ops Automation

Leveraging Tencent's extensive experience in social platform operations, this talk explores intelligent monitoring practices—covering active, passive, and side‑channel techniques, full‑link observability, data processing pipelines, and alert convergence—to enhance reliability, availability, and user experience while reducing noise for ops teams.

Alert Management · Automation · Big Data
0 likes · 22 min read