
Essential SRE Monitoring and Alerting Standards: From Metrics to Incident Response

This guide outlines comprehensive SRE monitoring and alerting standards, covering core principles, log instrumentation, health‑check requirements, baseline resource and application metrics, alarm severity tiers, response SLAs, on‑call rotation, continuous optimization, and noise‑reduction mechanisms to ensure reliable service operation.


Core Management Principles

No monitoring, no release: New services or core features must have monitoring configured before they are deployed.

Owner responsibility: The developer who writes a module is accountable for the monitoring coverage and alarm handling of that module.

Alarm closed‑loop: Every active alarm must be resolved and documented; “report‑only” alerts are prohibited.

R&D Phase Instrumentation

Log and Metric Guidelines

ERROR level usage (mandatory)

Do not log business‑logic validation failures (e.g., "wrong password", "out of stock") at ERROR level. ERROR is reserved for system exceptions (NPE, DB disconnection) or fatal errors that break core business flows.

Misuse leads to false‑positive P0/P1 alerts and will be counted in quality assessments.
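To make the distinction concrete, here is a minimal sketch assuming SLF4J; the LoginService class and its passwordMatches/createSession helpers are hypothetical names:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoginService {
    private static final Logger log = LoggerFactory.getLogger(LoginService.class);

    public void login(String userId, String password) {
        if (!passwordMatches(userId, password)) {
            // Business-rule failure: expected, user-correctable, not an incident.
            // Logging this at ERROR would pollute the P0/P1 alert stream.
            log.warn("Login rejected, userId={}, reason=wrong password", userId);
            return;
        }
        try {
            createSession(userId);
        } catch (RuntimeException e) {
            // System failure: unexpected and breaks the core login flow.
            log.error("Session creation failed, userId={}", userId, e);
            throw e;
        }
    }

    private boolean passwordMatches(String userId, String password) { return false; } // stub
    private void createSession(String userId) { }                                     // stub
}
```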

Critical‑path logging (mandatory)

Key actions such as payment, order receipt, and login must log a business identifier (e.g., orderId) for traceability.

All logs must be shipped to the company‑wide log platform to enable end‑to‑end tracing.
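One common way to carry the business identifier through every log line is SLF4J's MDC; a minimal sketch (PaymentService is a hypothetical name, and the log pattern must include %X{orderId} for the field to appear):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void pay(String orderId, long amountCents) {
        // Put the business identifier into the logging context so every line
        // logged while handling this order carries orderId for tracing.
        MDC.put("orderId", orderId);
        try {
            log.info("Payment started, amountCents={}", amountCents);
            // ... call the payment gateway ...
            log.info("Payment succeeded");
        } finally {
            MDC.remove("orderId"); // always clean up the thread-local context
        }
    }
}
```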

Custom metrics (suggested)

Define thresholds for critical business metrics, e.g., a spike in error return codes within 5 minutes, or an error‑keyword count exceeding a threshold within 5 minutes.

Health‑check endpoint (mandatory)

Expose a standard endpoint such as /actuator/health that reflects real service status (DB, Redis connections, etc.). Static "OK" responses are forbidden.
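Spring Boot Actuator already ships health indicators for common dependencies (DataSource, Redis, and others); the sketch below shows the shape of a custom check that probes a real connection instead of returning a static OK:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

import javax.sql.DataSource;
import java.sql.Connection;

@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        // Probe a real dependency instead of returning a hard-coded "OK".
        try (Connection conn = dataSource.getConnection()) {
            if (conn.isValid(2)) { // 2-second validation timeout
                return Health.up().build();
            }
            return Health.down().withDetail("db", "connection invalid").build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
```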

Release Integration Standards

Baseline Monitoring (SRE / Operations unified config)

Resource metrics: CPU, memory, disk, network I/O.

Application metrics: JVM (GC frequency, heap usage), HTTP 5xx ratio, thread‑pool utilization.

Middleware metrics: databases, Redis, message queues, etc.

Business Alarm Configuration (R&D owner)

Core API success rate: Alert if success rate < 95 % (based on HTTP status and business return codes).

Core API response time: Alert if P99 latency > 2 s.

ERROR log monitoring: Trigger an alert when ERROR log count ≥ 1.

Business anomaly monitoring: e.g., no orders for 5 minutes, payment success rate drops to zero.
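In practice these rules usually live in the metrics/alerting platform rather than in application code, but a windowed success‑rate check can be sketched like this (SuccessRateAlarm and firePage are hypothetical names):

```java
import java.util.concurrent.atomic.LongAdder;

public class SuccessRateAlarm {
    private final LongAdder total = new LongAdder();
    private final LongAdder failed = new LongAdder();

    public void record(boolean success) {
        total.increment();
        if (!success) failed.increment();
    }

    /** Called by a scheduler at the end of each window (e.g., every minute). */
    public void evaluateAndReset(double minSuccessRate) {
        long t = total.sumThenReset();
        long f = failed.sumThenReset();
        if (t > 0 && (double) (t - f) / t < minSuccessRate) {
            firePage("Core API success rate below " + (minSuccessRate * 100) + "%");
        }
    }

    private void firePage(String message) {
        System.err.println("[ALERT] " + message); // placeholder for the alert channel
    }
}
```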

Alarm Severity Levels

P0 (phone + SMS/email): System completely unavailable, core flow blocked, major financial loss risk.

P1 (SMS/email + IM): Core feature degraded, non‑core unavailable, error‑rate spikes.

P2 (IM notification): Single‑machine exception, non‑core API errors, resource‑usage warnings.

Alarm Response and Handling

On‑call Rotation

Each service/team must maintain a 24 × 7 on‑call schedule for P0/P1 alerts.

On‑call personnel must keep phones reachable; silencing or powering off is prohibited.

SLA

P0: Acknowledge within 5 minutes and provide mitigation within 30 minutes.

P1: Acknowledge within 15 minutes.

If the response time exceeds the limit, the system auto‑escalates to team leader → director → CTO.
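A rough sketch of that escalation chain, assuming the alert platform invokes tick() periodically until the alert is acknowledged (role names and the notification channel are placeholders):

```java
import java.util.List;

public class Escalator {
    private static final List<String> CHAIN =
            List.of("on-call engineer", "team leader", "director", "CTO");
    private int level = 0; // index of the last role notified

    /** Called periodically (e.g., once per minute) until acknowledged. */
    public void tick(long minutesSinceFired, long ackSlaMinutes, boolean acknowledged) {
        if (acknowledged) return;
        // One level up for each missed acknowledgement window, capped at CTO.
        int target = (int) Math.min(minutesSinceFired / ackSlaMinutes, CHAIN.size() - 1L);
        while (level < target) {
            level++;
            notifyRole(CHAIN.get(level));
        }
    }

    private void notifyRole(String role) {
        System.out.println("[ESCALATION] paging " + role); // placeholder channel
    }
}
```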

Alarm Closure

For each P0/P1 alert, record root cause, action taken, and optimization suggestions in the alert platform or post‑mortem document.

Never ignore alerts; persistent false‑positives must have thresholds adjusted or be disabled.

Continuous Optimization and Assessment

Weekly review of alert statistics in team meetings.

Top‑3 most frequent alerts must be fixed or have thresholds tuned within a week.

Penalties: Unmonitored failures lasting > 1 hour are classified as “responsibility accidents”; missed P0 calls reduce quarterly performance weight; repeated false‑positive ERROR alerts (≥ 3 times/week) trigger formal criticism.

Design Guidelines

Actionability: Every alarm must indicate the required action; alarms that never drive action should be removed.

Tiered response: Core‑path failures are P0; operational slowdowns are P3; and so on.

Full‑stack view: Monitor business KPIs (e.g., order volume) in addition to infrastructure metrics.

Alarm Tier Definitions

P0 (Disaster)

Core business completely unavailable (e.g., payment API all fail, primary DB down, OOM, network outage).

Notification: phone + SMS + strong IM reminder.

SLA: respond within 5 minutes, recover within 30 minutes.

P1 (Severe)

Core business partially degraded or non‑core unavailable (e.g., order export fails, latency spikes, error rate > 5 %).

Notification: SMS + IM with strong reminder.

SLA: respond within 15 minutes, recover within 2 hours.

P2 (Warning)

System metric abnormal but business not yet impacted; risk of deterioration (e.g., CPU > 80 % for 3 min, disk > 85 %, slow‑SQL surge, single‑machine exception).

Notification: IM bot.

SLA: respond within 1 hour, handle the same day.

P3 (Info)

Informational alerts for analysis or routine checks (e.g., business volume fluctuation, non‑critical task failure, system restart notice).

Notification: email / silent IM.

SLA: acknowledge within 24 hours.

Monitoring Metrics System

Infrastructure Layer

CPU usage > 80 % for 3 minutes.

Memory usage > 90 % (distinguish cache vs actual usage).

Disk usage > 85 % (P2) or > 95 % (P0, may block log writes).

Network: bandwidth saturation, packet loss, latency, socket count.

Application / JVM Layer

Health‑check endpoint non‑200 (e.g., /actuator/health).

Full GC more than once per hour; single GC pause > 1 s.

Heap usage > 90 % with no decline (possible memory leak).

Thread‑pool saturation (ActiveCount ≥ MaxSize) for Tomcat or custom pools.
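The JVM exposes cumulative GC counts and pause times through its standard MXBeans; a minimal probe is sketched below (collector bean names vary by GC algorithm, so mapping a bean to "Full GC" is deployment‑specific):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    public static void main(String[] args) {
        // Each bean covers one collector, e.g., young-gen vs. old-gen/full GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```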

Middleware Layer

MySQL: master‑slave lag > 1 s, connection usage > 80 %, slow‑SQL surge, deadlocks, long transactions.

Redis: high memory fragmentation, evicted‑keys spike, command latency, fork latency, cache hit rate, replication lag.

Message Queue: lag > 10 000 and growing, consumer offline.

Scheduler: batch jobs not started or not finished by deadline, failure rate.

Business Layer

Key KPI trends: 5‑minute order volume down 50 % versus the same window yesterday, order success rate < 95 %.

Log keyword monitoring: any ERROR entry; WARN appearing more than 10 times within 5 minutes; specific keywords such as “Deadlock” or “DataIntegrityViolation”.

Threshold Setting Examples

Avoid instantaneous thresholds; use a duration or window to reduce noise. Example: CPU > 80 % for 3 minutes or 3 consecutive checks before alerting.
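A minimal sketch of the “N out of M” rule (class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class NOutOfM {
    private final Deque<Boolean> window = new ArrayDeque<>();
    private final int n, m;

    /** Fire only when at least n of the last m samples breach the threshold. */
    public NOutOfM(int n, int m) { this.n = n; this.m = m; }

    public boolean offer(boolean breached) {
        window.addLast(breached);
        if (window.size() > m) window.removeFirst();
        long breaches = window.stream().filter(b -> b).count();
        return window.size() == m && breaches >= n;
    }
}

// Usage: calling new NOutOfM(3, 3).offer(cpuUsage > 0.80) on each 1-minute
// sample implements "CPU > 80% for 3 consecutive checks".
```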

Alarm Noise Reduction and Governance

Silence: After an alarm fires, suppress repeat alerts for 1 hour if the issue persists.

Debounce: Require “N out of M” detections before alerting (e.g., 3 consecutive CPU spikes).

Grouping: Merge multiple identical alerts into a single aggregated message.

Effective time: P2/P3 alerts are sent only during work hours (09:00‑20:00); P0/P1 alerts are sent 24 × 7.
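A minimal sketch of the silence rule above (the 1‑hour window matches the example; alert keys and class names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AlertSilencer {
    private final Map<String, Instant> lastFired = new ConcurrentHashMap<>();
    private final Duration silence = Duration.ofHours(1);

    /** Returns true if the alert should actually be sent. */
    public boolean shouldSend(String alertKey) {
        Instant now = Instant.now();
        Instant prev = lastFired.putIfAbsent(alertKey, now);
        if (prev == null) return true;                    // first occurrence: send
        if (Duration.between(prev, now).compareTo(silence) >= 0) {
            lastFired.put(alertKey, now);                 // window expired: re-arm
            return true;
        }
        return false;                                     // within silence window
    }
}
```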

[Figure: Monitoring Overview Diagram]