Mastering SRE: Key Questions on Monitoring, Capacity, and Change Management
This article provides a comprehensive SRE guide covering senior role definitions, monitoring objectives and implementation, core metric selection, link and event monitoring, capacity planning and mitigation strategies, a real‑world health‑code outage case, and change‑management best practices to improve reliability and efficiency.
Monitoring
Purpose: Detect anomalies quickly, locate the root cause, and resolve incidents promptly.
How to Add Monitoring
Provide a unified entry portal for all metrics.
Ensure alerts on core metrics have high precision (few false alarms) and high recall (few missed incidents).
Instrument end‑to‑end business flows (upstream/downstream links).
Broaden the coverage of basic metrics and indicators.
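Precision and recall for alerting can be made concrete by comparing fired alerts against real incidents over a review period. The sketch below is only an illustration: the `AlertStats` record and the sample counts are hypothetical, not figures from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    """Hypothetical tally from an alert review period."""
    true_positives: int   # alerts that matched a real incident
    false_positives: int  # alerts with no real incident behind them
    false_negatives: int  # incidents that produced no alert


def precision(stats: AlertStats) -> float:
    """Share of fired alerts that pointed at a real incident."""
    fired = stats.true_positives + stats.false_positives
    return stats.true_positives / fired if fired else 0.0


def recall(stats: AlertStats) -> float:
    """Share of real incidents that were caught by an alert."""
    incidents = stats.true_positives + stats.false_negatives
    return stats.true_positives / incidents if incidents else 0.0


# Made-up sample: 20 alerts fired, 18 were real, 6 incidents were missed.
stats = AlertStats(true_positives=18, false_positives=2, false_negatives=6)
print(f"precision={precision(stats):.2f} recall={recall(stats):.2f}")
```

Low precision means on-call engineers learn to ignore the alert; low recall means incidents are found by users first.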
Choosing Core Metrics
Core metrics should directly reflect business health and guide problem discovery. Typical examples:
Search latency for a search engine.
Session duration for recommendation systems (e.g., Douyin, Kuaishou).
Upload/file count for storage services.
Transaction volume or amount for e‑commerce platforms.
Link (Path) Monitoring
A link is a complete request path, e.g., CDN → LVS → Nginx → Application Server → DB. Full-path monitoring lets engineers view the entire chain and quickly isolate the failing segment, without first having to learn each service's business context from its owners.
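One simple way to isolate the failing segment is sketched below. It assumes that errors propagate upstream toward the client, so the most downstream hop with an elevated error rate is the likeliest origin. The hop names come from the example chain; the function name, the 1% threshold, and the sample error rates are hypothetical.

```python
from typing import Dict, List, Optional

# Request path from the example, ordered from client side to data side.
CHAIN: List[str] = ["CDN", "LVS", "Nginx", "Application Server", "DB"]


def deepest_unhealthy_hop(
    chain: List[str],
    error_rates: Dict[str, float],
    threshold: float = 0.01,  # assumed: more than 1% errors counts as unhealthy
) -> Optional[str]:
    """Return the most downstream hop whose error rate exceeds the threshold.

    Hops missing from error_rates are treated as healthy.
    """
    suspect = None
    for hop in chain:
        if error_rates.get(hop, 0.0) > threshold:
            suspect = hop  # keep overwriting so the last match wins
    return suspect


# Made-up snapshot: everything up to the application server is failing,
# while the database itself looks healthy.
rates = {
    "CDN": 0.04,
    "LVS": 0.04,
    "Nginx": 0.05,
    "Application Server": 0.05,
    "DB": 0.002,
}
print(deepest_unhealthy_hop(CHAIN, rates))
```

The heuristic is deliberately crude: it ignores latency, timeouts, and partial failures, but it shows why having every hop on one view shortens diagnosis.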
Basic Monitoring
Basic host‑level metrics (CPU, memory, disk I/O, network I/O, process health) are essential for pinpointing the exact component that is malfunctioning during an incident.
Event Monitoring
Change events (deployment, configuration updates).
Operational events (service start/stop, scaling actions).
Network events (latency spikes, packet loss).
Alarm Optimization
Alarm merging: Consolidate alerts with the same root cause into a single notification.
Alarm escalation: Implement tiered alert levels so that critical alerts trigger faster, higher-severity responses.
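Merging and escalation can be sketched as follows. This is a minimal illustration under stated assumptions: each alert already carries a `root_cause` grouping key (in practice this would come from correlation rules), the three severity tiers are hypothetical, and escalating after 15 unacknowledged minutes is one possible policy, not one described in the article.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed tiers, lowest to highest.
SEVERITY_ORDER = ["info", "warning", "critical"]


def merge_alerts(alerts: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Collapse alerts sharing a root_cause into one notification each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["root_cause"]].append(alert)

    merged = []
    for cause, items in groups.items():
        # The merged notification inherits the highest severity in the group.
        top = max(items, key=lambda a: SEVERITY_ORDER.index(a["severity"]))
        merged.append({
            "root_cause": cause,
            "severity": top["severity"],
            "count": len(items),
            "sources": sorted({a["source"] for a in items}),
        })
    return merged


def escalate(severity: str, minutes_unacked: int, limit: int = 15) -> str:
    """Raise the severity one tier if nobody acknowledged within the limit."""
    idx = SEVERITY_ORDER.index(severity)
    if minutes_unacked >= limit and idx < len(SEVERITY_ORDER) - 1:
        idx += 1
    return SEVERITY_ORDER[idx]


alerts = [
    {"source": "nginx-01", "root_cause": "db-primary-down", "severity": "warning"},
    {"source": "app-07", "root_cause": "db-primary-down", "severity": "critical"},
    {"source": "cdn-edge", "root_cause": "cert-expiry", "severity": "info"},
]
for notification in merge_alerts(alerts):
    print(notification)
```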
Capacity Planning
Purpose: Balance resource usage, system stability, and business growth so that limited infrastructure can serve as much traffic as possible.
Measuring Capacity
Entry-point services (those that receive external traffic): measure by QPS (queries per second).
Internal services: measure primarily by CPU utilization (most services are CPU‑bound).
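The two measures can be tied together with a rough estimate. The sketch below assumes CPU utilization grows linearly with QPS, which real services only approximate, and uses a hypothetical 70% CPU ceiling as the safe operating limit. The function names and sample numbers are illustrative.

```python
def estimated_max_qps(
    observed_qps: float,
    observed_cpu: float,
    cpu_ceiling: float = 0.70,  # assumed safe CPU utilization limit
) -> float:
    """Linearly extrapolate the QPS at which CPU reaches the ceiling."""
    if not 0 < observed_cpu <= 1:
        raise ValueError("observed_cpu must be a fraction in (0, 1]")
    return observed_qps * cpu_ceiling / observed_cpu


def headroom(current_qps: float, max_qps: float) -> float:
    """Fraction of estimated capacity that is still unused."""
    return 1 - current_qps / max_qps


# Made-up reading: 1200 QPS at 40% CPU.
max_qps = estimated_max_qps(observed_qps=1200, observed_cpu=0.40)
print(f"estimated max: {max_qps:.0f} QPS, headroom: {headroom(1200, max_qps):.0%}")
```

A load test gives a better ceiling than extrapolation, since latency usually degrades before CPU saturates.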
Data Sources
Capacity data is collected from load‑testing results, routine monitoring, and operational experience, then stored in a dedicated capacity‑management platform.
Handling Insufficient Capacity
Rapid scaling or auto‑scaling using cloud resources.
Rate limiting to protect downstream services.
Service degradation: keep only essential features (e.g., status indicators) while disabling non-critical functionality.
Caching recent query results to reduce repeat load.
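Rate limiting is often implemented with a token bucket. The sketch below is one common variant, not a method named in the article. The class name, parameters, and injectable `clock` argument are illustrative choices; the clock makes the limiter easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate=100, capacity=200)
if limiter.allow():
    pass  # forward the request downstream
else:
    pass  # reject early, e.g. with HTTP 429, to protect downstream services
```

Rejecting at the entry point is cheaper than letting excess requests queue up inside downstream services.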
Case Study: Xi'an Health‑Code Outage (Dec 2021)
The system crashed when a sudden traffic surge exceeded its capacity, preventing users from scanning QR codes for nucleic‑acid testing.
Root cause: Unexpected load, insufficient scaling capacity, and lack of traffic diversion.
Mitigation steps:
Apply rate limiting to protect the service.
Scale out quickly using cloud resources.
Degrade non‑essential features, retaining only critical status information.
Cache recent queries to reduce repeat requests.
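Degradation and caching can be combined in a request handler, as the sketch below shows. It is a generic illustration, not a description of how the actual health-code system works: `TTLCache`, `handle_request`, the loader callbacks, and the 30-second TTL are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Remember recent results for `ttl_seconds` to absorb repeat queries."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the backend call
        value = loader(key)
        self._store[key] = (now, value)
        return value


def handle_request(user_id: str, cache: TTLCache, degraded: bool,
                   load_status: Callable[[str], Any],
                   load_extras: Callable[[str], Any]) -> Dict[str, Any]:
    """Always serve the critical status; skip extras while degraded."""
    response = {"status": cache.get_or_load(user_id, load_status)}
    if not degraded:
        response["extras"] = load_extras(user_id)
    return response


cache = TTLCache(ttl_seconds=30)
reply = handle_request(
    "user-1", cache, degraded=True,
    load_status=lambda uid: "green",            # stand-in for the status backend
    load_extras=lambda uid: {"history": []},    # stand-in for non-critical data
)
print(reply)
```

A short TTL trades a little staleness for a large cut in backend load when many users refresh the same page repeatedly.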
Change Management
Goal: Balance deployment efficiency against system stability. Over 60% of incidents are change-related.
Reducing Change Impact
Standardized processes: Define graded release standards, time-window policies, and approval workflows.
Shuttle-bus mechanism: Batch core-service releases into scheduled slots, limited to 1–2 per day.
Approval windows: Typically 10 am–7 pm on weekdays.
Forced pause: Restrict instance usage to a configurable 30–100%, based on business needs.
Manual checks: Attach relevant monitoring metrics to deployment tickets.
Fast rollback / traffic shifting: Immediately revert or reroute traffic when an anomaly is detected.
Automated checks: Batch-verify key metrics after deployment.
Automatic fault handling: Auto-remove faulty instances or shift traffic across zones without human intervention.
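The time-window and release-frequency rules are easy to enforce in a deployment gate. The sketch below is a minimal example: it assumes naive local timestamps, takes the 10 am–7 pm weekday window and the limit of two releases per day from the list above, and uses hypothetical function names.

```python
from datetime import datetime, time

WINDOW_START = time(10, 0)   # 10 am
WINDOW_END = time(19, 0)     # 7 pm, exclusive
DAILY_RELEASE_LIMIT = 2      # upper end of "1-2 times per day"


def in_change_window(moment: datetime) -> bool:
    """True on Monday-Friday between WINDOW_START and WINDOW_END."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and WINDOW_START <= moment.time() < WINDOW_END


def may_release(moment: datetime, releases_today: int,
                daily_limit: int = DAILY_RELEASE_LIMIT) -> bool:
    """Gate a core-service release on the window and the daily limit."""
    return in_change_window(moment) and releases_today < daily_limit


# 8 January 2024 was a Monday.
print(may_release(datetime(2024, 1, 8, 11, 0), releases_today=0))
```

A real gate would also need an audited override path for emergency fixes outside the window.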
Intelligent Checking Tool
Graded releases require extensive metric validation, which can slow down deployments. An intelligent checking tool automates metric verification by applying default algorithms and incorporating upstream/downstream service awareness, reducing manual effort and improving release speed.
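The article does not say which default algorithms such a tool applies. As one plausible stand-in, the sketch below flags a metric when its post-deployment mean drifts more than three baseline standard deviations from the baseline mean. It checks the released service together with a downstream dependency. All metric names, sample values, and the three-sigma threshold are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def metric_regressed(baseline: List[float], post_deploy: List[float],
                     max_sigma: float = 3.0) -> bool:
    """Flag a metric whose post-deploy mean is far from the baseline mean.

    The baseline needs at least two samples for a standard deviation.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any change at all counts as a deviation.
        return mean(post_deploy) != mu
    return abs(mean(post_deploy) - mu) / sigma > max_sigma


def check_release(baselines: Dict[str, List[float]],
                  current: Dict[str, List[float]],
                  max_sigma: float = 3.0) -> List[str]:
    """Return the names of all metrics that look regressed."""
    return [name for name in baselines
            if metric_regressed(baselines[name], current[name], max_sigma)]


# Made-up samples for the released service and one downstream dependency.
baselines = {
    "checkout.latency_ms": [100, 102, 98, 101, 99],
    "checkout.error_rate": [0.010, 0.012, 0.011, 0.009, 0.010],
    "payment.latency_ms": [50, 52, 49, 51, 48],
}
current = {
    "checkout.latency_ms": [130, 128, 131],
    "checkout.error_rate": [0.011, 0.010, 0.010],
    "payment.latency_ms": [51, 50, 49],
}
print(check_release(baselines, current))
```

Running the same check over every stage of a graded release removes the need for an engineer to inspect each dashboard by hand.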
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
