Mastering SRE: Key Questions on Monitoring, Capacity, and Change Management

This article walks through key SRE questions on monitoring (objectives, implementation, core metric selection, link and event monitoring, alarm optimization), capacity planning and mitigation strategies, a real‑world health‑code outage case, and change‑management practices that improve reliability and release efficiency.

dbaplus Community

Aug 13, 2023

Monitoring

Purpose: Detect anomalies quickly, locate the root cause, and resolve incidents promptly.

How to Add Monitoring

Provide a unified entry portal for all metrics.

Ensure alerts on core metrics have high precision (few false alarms) and high recall (few missed incidents).

Instrument end‑to‑end business flows (upstream/downstream links).

Broaden coverage of basic metrics and indicators.
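Precision and recall can be measured by comparing fired alerts against confirmed incidents. The sketch below assumes each alert has been labelled after the fact with the incident it matched (or None for a false alarm). The function name and data shape are illustrative and do not come from the original article.

```python
def alert_precision_recall(alert_incident_ids, incident_ids):
    """Return (precision, recall) for the alerts fired in some period.

    alert_incident_ids: one entry per fired alert, holding the ID of the
        incident it matched, or None for a false alarm (assumed labelling).
    incident_ids: IDs of all confirmed incidents in the same period.
    """
    incidents = set(incident_ids)
    # Alerts that pointed at a real incident.
    true_alerts = [i for i in alert_incident_ids if i in incidents]
    # Distinct incidents that at least one alert caught.
    detected = set(true_alerts)

    precision = len(true_alerts) / len(alert_incident_ids) if alert_incident_ids else 0.0
    recall = len(detected) / len(incidents) if incidents else 0.0
    return precision, recall
```

Low precision means responders are paged for noise; low recall means incidents slip through without any alert.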

Choosing Core Metrics

Core metrics should directly reflect business health and guide problem discovery. Typical examples:

Search latency for a search engine.

Session duration for recommendation systems (e.g., Douyin, Kuaishou).

Upload/file count for storage services.

Transaction volume or amount for e‑commerce platforms.

Link (Path) Monitoring

A link represents a complete request flow, e.g., CDN → LVS → Nginx → Application Server → DB. Full‑path monitoring lets engineers view the entire chain and quickly isolate the failing segment, and it reduces how much business context has to be explained to responders during an incident.
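As a rough illustration of isolating the failing segment, the sketch below walks the chain from the downstream end and returns the deepest hop whose error rate exceeds a threshold. Errors usually propagate upstream, so the deepest unhealthy hop is a reasonable place to start looking. The threshold value and the per-segment error-rate input are assumptions made for this example.

```python
CHAIN = ["CDN", "LVS", "Nginx", "Application Server", "DB"]


def deepest_unhealthy_segment(error_rates, threshold=0.05, chain=CHAIN):
    """Return the most downstream segment whose error rate exceeds threshold.

    error_rates: dict mapping segment name to its error rate (0.0 to 1.0).
    Returns None when every segment is within the threshold.
    """
    for segment in reversed(chain):
        if error_rates.get(segment, 0.0) > threshold:
            return segment
    return None
```

This is only a heuristic: a hop can look unhealthy because of its own dependencies, so the result narrows the search rather than proving the root cause.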

Basic Monitoring

Basic host‑level metrics (CPU, memory, disk I/O, network I/O, process health) are essential for pinpointing the exact component that is malfunctioning during an incident.

Event Monitoring

Change events (deployment, configuration updates).

Operational events (service start/stop, scaling actions).

Network events (latency spikes, packet loss).
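Event data is most useful when it can be lined up against the time an anomaly started. The following sketch assumes events are stored as dictionaries with a "type" and a "time" field (a hypothetical schema) and lists those that occurred shortly before the anomaly, newest first.

```python
from datetime import datetime, timedelta


def events_before_anomaly(events, anomaly_time, window=timedelta(minutes=30)):
    """Return events within `window` before the anomaly, newest first.

    events: list of dicts with "type" (str) and "time" (datetime) keys.
    """
    start = anomaly_time - window
    recent = [e for e in events if start <= e["time"] <= anomaly_time]
    return sorted(recent, key=lambda e: e["time"], reverse=True)
```

The 30-minute window is an arbitrary default; a slow-burning problem may need a much longer look-back.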

Alarm Optimization

Alarm merging: Consolidate alerts with the same root cause into a single notification.

Alarm escalation: Implement tiered alert levels so that critical alerts trigger faster, higher‑severity responses.
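A minimal sketch of both ideas follows. It assumes each alert already carries a root-cause key (in practice, deriving that key is the hard part) and uses three hypothetical severity tiers mapped to notification channels.

```python
from collections import defaultdict

# Hypothetical tiers and channels, ordered from least to most severe.
SEVERITY_ORDER = ["info", "warning", "critical"]
CHANNEL = {"info": "dashboard", "warning": "chat", "critical": "phone call"}


def merge_alerts(alerts):
    """Merge alerts sharing a root cause into one notification each.

    alerts: list of dicts with "root_cause" and "severity" keys.
    Each notification takes the highest severity seen in its group.
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["root_cause"]].append(alert)

    notifications = []
    for cause, items in groups.items():
        worst = max((a["severity"] for a in items), key=SEVERITY_ORDER.index)
        notifications.append({
            "root_cause": cause,
            "count": len(items),
            "severity": worst,
            "channel": CHANNEL[worst],
        })
    return notifications
```

Routing by the worst severity in the group keeps a critical alert from being hidden behind lower-severity duplicates.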

Capacity Planning

Purpose: Balance resource usage, system stability, and business growth so that limited infrastructure can handle as much traffic as possible.

Measuring Capacity

Entry‑level services: measure by QPS (queries per second).

Internal services: measure primarily by CPU utilization (most services are CPU‑bound).
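To make the QPS-based measure concrete, the sketch below estimates how many instances an entry-level service needs for a given peak. It assumes the per-instance QPS limit comes from load testing and that each instance should run below a target share of that limit. The 50% default is illustrative, not a figure from the article.

```python
import math


def instances_needed(peak_qps, qps_per_instance, target_utilization=0.5):
    """Instances required to serve peak_qps with headroom.

    qps_per_instance: load-tested maximum QPS of one instance.
    target_utilization: fraction of that maximum each instance may carry.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    return math.ceil(peak_qps / (qps_per_instance * target_utilization))
```

For example, a peak of 12,000 QPS with instances that each sustain 500 QPS and a 50% target gives 12000 / 250 = 48 instances.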

Data Sources

Capacity data is collected from load‑testing results, routine monitoring, and operational experience, then stored in a dedicated capacity‑management platform.

Handling Insufficient Capacity

Rapid scaling or auto‑scaling using cloud resources.

Rate limiting to protect downstream services.

Service degradation: keep only essential features (e.g., status indicators) while disabling non‑critical functionality.

Caching recent query results to reduce repeat load.
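Rate limiting is often implemented with a token bucket. The article does not name an algorithm, so the sketch below is one common choice rather than the author's method. The clock parameter exists only to make the class testable.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second with bursts of `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens stored
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests for which allow() returns False would typically be rejected quickly (for instance with an HTTP 429 response) so that they never reach downstream services.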

Case Study: Xi'an Health‑Code Outage (Dec 2021)

The system crashed when a sudden traffic surge exceeded its capacity, preventing users from scanning QR codes for nucleic‑acid testing.

Root cause: Unexpected load, insufficient scaling capacity, and lack of traffic diversion.

Mitigation steps:

Apply rate limiting to protect the service.

Scale out quickly using cloud resources.

Degrade non‑essential features, retaining only critical status information.

Cache recent queries to reduce repeat requests.
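The degradation and caching steps can be sketched together. The code below is a hypothetical illustration, not a description of how the actual health-code system was built: it caches the critical status for a short time and omits non-critical data while a degraded flag is set.

```python
import time

_MISSING = object()  # sentinel distinguishing "no entry" from a cached None


class TTLCache:
    """Tiny time-to-live cache; entries expire ttl_seconds after storage."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expiry_time, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry[0] <= self.clock():
            return _MISSING
        return entry[1]

    def put(self, key, value):
        self.entries[key] = (self.clock() + self.ttl, value)


def handle_query(user_id, cache, lookup_status, degraded):
    """Return the critical status, plus extras only when not degraded."""
    status = cache.get(user_id)
    if status is _MISSING:
        status = lookup_status(user_id)  # expensive backend call
        cache.put(user_id, status)
    response = {"status": status}
    if not degraded:
        # Placeholder for non-critical data that is dropped under load.
        response["history"] = f"history for {user_id}"
    return response
```

The TTL must be short enough that a stale status is acceptable for the business. That trade-off is a product decision, not something the code can settle.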

Change Management

Goal: Balance deployment efficiency against system stability. Over 60% of incidents are change‑related.

Reducing Change Impact

Standardized processes: Define graded release standards, time‑window policies, and approval workflows.

Shuttle‑bus mechanism: Limit core‑service releases to 1–2 times per day.

Approval windows: Typically 10 am–7 pm on weekdays.

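Forced pause: Automatically pause the rollout once a configurable share of instances (30–100%, set according to business needs) has been updated, so the release can be verified before it continues.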

Manual checks: Attach relevant monitoring metrics to deployment tickets.

Fast rollback / traffic shifting: Immediately revert or reroute traffic when an anomaly is detected.

Automated checks: Batch‑verify key metrics after deployment.

Automatic fault handling: Auto‑remove faulty instances or shift traffic across zones without human intervention.
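Several of these controls (graded release, a pause point, automated checks, fast rollback) can be combined in a single rollout loop. The sketch below is one possible shape under stated assumptions: the deploy, healthy, and rollback callables are hypothetical hooks, and the 30% first batch is read from the configurable 30–100% range above.

```python
def staged_rollout(instances, deploy, healthy, rollback, batch_percents=(30, 100)):
    """Deploy in batches, checking health at each pause point.

    instances: ordered list of instance identifiers.
    deploy / rollback: callables taking one instance.
    healthy: callable taking the list of updated instances, returning bool.
    batch_percents: cumulative share of instances updated at each pause.
    Returns True on full success, False if the release was rolled back.
    """
    updated = []
    for percent in batch_percents:
        target = max(1, len(instances) * percent // 100)
        for instance in instances[len(updated):target]:
            deploy(instance)
            updated.append(instance)
        # Pause point: verify key metrics before continuing.
        if not healthy(updated):
            for instance in reversed(updated):
                rollback(instance)
            return False
    return True
```

A real pipeline would usually wait for metrics to settle at each pause and might shift traffic away instead of rolling back. Both are omitted here for brevity.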

Intelligent Checking Tool

Graded releases require extensive metric validation, which can slow down deployments. An intelligent checking tool automates metric verification by applying default algorithms and incorporating upstream/downstream service awareness, reducing manual effort and improving release speed.
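The article does not say which default algorithms such a tool applies. As a stand-in, the sketch below flags any metric whose post-release mean deviates from its pre-release mean by more than a relative threshold, and runs that check for the released service and its neighbors. All names, the 20% threshold, and the data layout are assumptions.

```python
from statistics import mean


def check_metrics(before, after, max_relative_change=0.2):
    """Return names of metrics that moved more than max_relative_change.

    before / after: dicts mapping metric name to a list of samples.
    """
    flagged = []
    for name, baseline in before.items():
        base = mean(baseline)
        current = mean(after[name])
        if base == 0:
            changed = current != 0
        else:
            changed = abs(current - base) / abs(base) > max_relative_change
        if changed:
            flagged.append(name)
    return flagged


def check_release(service, neighbors, before, after):
    """Check the released service plus its upstream/downstream neighbors.

    neighbors: dict mapping a service to the related services to check.
    before / after: dicts mapping service name to its metric samples.
    """
    services = [service, *neighbors.get(service, [])]
    return {svc: check_metrics(before[svc], after[svc]) for svc in services}
```

A simple mean comparison is easy to reason about but sensitive to normal daily traffic patterns; comparing against the same time window on a previous day is a common refinement.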
