Why Preventing Small Issues Is the Key to System Stability

The article explains how early detection and preventive measures—such as comprehensive monitoring, rate limiting, chaos testing, and proper SLOs—are essential for maintaining system stability and avoiding larger incidents, drawing on SRE principles and the incident triangle theory.

Tech Architecture Stories

Dec 28, 2024

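“有不尽者，亦宜防微杜渐，而禁于未然。” ("Where anything remains unaddressed, guard against it while it is still small, and forbid it before it occurs.") From the History of Yuan (《元史》), Biographies, Volume 73, Biography of Zhang Zhen.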

Interpretation: Even potential problems that have not yet manifested should be pre‑emptively guarded against.

In day-to-day stability and microservice governance work, we take measures before, during, and after incidents. They usually start with comprehensive monitoring and alerting, then extend to throttling, circuit breaking, load testing, full-chain testing, chaos injection, and disaster-recovery drills such as AZ or data-center failover exercises.

The further down that list a measure sits, the heavier it is to carry out, yet all of them share one goal: avoiding incidents serious enough to attract negative attention. Together they embody the principle “prevent small issues early and stop them before they happen.”
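To make one of the measures listed above concrete, here is a minimal sketch of circuit breaking. It is an illustration under stated assumptions, not the article's own implementation: the class name `CircuitBreaker`, the parameters `max_failures` and `reset_after`, and the injectable `clock` are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Toy circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before one trial call
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a failing dependency.
                raise CircuitOpenError("dependency is failing; call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function performs the remote call. This sketch is not thread-safe and counts every exception as a failure; a production breaker would usually need locking and a narrower definition of failure.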

Prevent Small Issues Early

When any early sign appears—like an alarm, rising RPC failure rate, or increasing error logs—we must investigate the root cause and resolve it at the source instead of ignoring it, following the industrial “incident triangle” theory.

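The theory, also known as Heinrich's law, holds that for every major accident there are roughly 29 minor-injury accidents and 300 no-injury incidents, and that the three are interrelated.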

Reducing minor incidents also reduces the number of major accidents.
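One of the early signs named above, a rising RPC failure rate, can be watched with a sliding window over recent call outcomes. The sketch below is only an illustration: the class name `FailureRateMonitor` and the `window`, `threshold`, and `min_calls` values are assumptions chosen for the example, not settings recommended by the article.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks the failure rate over the most recent `window` calls."""

    def __init__(self, window=100, threshold=0.05, min_calls=20):
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold            # alert when the rate exceeds this
        self.min_calls = min_calls            # avoid alerting on tiny samples

    def record(self, failed):
        # The deque silently drops the oldest outcome once the window is full.
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        enough_data = len(self.outcomes) >= self.min_calls
        return enough_data and self.failure_rate() > self.threshold
```

Calling `record(failed)` after every RPC and checking `should_alert()` gives a small, early signal to investigate before it grows. In practice this role is normally played by the monitoring system rather than application code.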

As discussed in the article “The Essence and Underlying Logic of Incident Post‑mortems,” a key action item in the Google SRE Postmortem Checklist is “Prevent” (alongside “Mitigation”).

Therefore, in routine stability governance and in post-mortems alike, preventing small issues is crucial: no warning sign should be ignored.

From personal experience, a common trap is that regular alarm reviews surface recurring problems that cannot be fully eliminated, leading to team fatigue and extra burden. The solution lies in setting more reasonable SLOs and error budgets.
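The arithmetic behind an error budget is simple enough to sketch. The function below assumes a request-based SLO (a target fraction of successful requests over some period); the name `error_budget` and the keys of the returned dictionary are hypothetical, chosen only for this illustration.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error-budget usage for a request-based SLO.

    slo_target is the required success fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,                         # negative means overspent
        "consumed": failed_requests / allowed_failures, # 1.0 means fully spent
        "exhausted": remaining < 0,
    }
```

For example, with a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so 250 failures consume about a quarter of it. While budget remains, a recurring low-volume alarm can be treated as tolerated rather than as something the team must drive to zero, which is how a realistic SLO relieves the fatigue described above.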

Prevent Before It Happens

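Beyond reacting to early signs, purely preventive work is receiving more and more emphasis. These are actions taken before anything has gone wrong, on the assumption that they will either avoid failures altogether or enable rapid loss mitigation, recovery, or even self-healing.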

Reference Links

[1] Incident Triangle: https://zh.wikipedia.org/wiki/%E4%BA%8B%E6%95%85%E4%B8%89%E8%A7%92

[2] Google SRE Postmortem Checklist: https://docs.google.com/document/d/1iaEgF0ICSmKKLG3_BT5VnK80gfOenhhmxVnnUcNSQBE


