Google's STAMP Framework: Redefining SRE for AI‑Driven Systems
Google’s SRE team is shifting from traditional error‑budget approaches to the STAMP (Systems-Theoretic Accident Model and Processes) framework, applying control theory and system‑level analysis to manage the growing complexity of AI‑powered services, improve safety, and proactively prevent hazardous states.
Billions of users rely on Google products daily, and the reliability of these services is critical. Over the past 25 years, Google’s SRE teams have engineered reliability across the stack, using SLOs, error budgets, isolation strategies, post‑mortems, progressive rollouts, and more.
With the rise of AI and ML systems, traditional SRE methods face new challenges. Concepts like SLOs and error budgets no longer suffice for zero‑tolerance failure scenarios, especially when privacy breaches, data loss, or regulatory compliance issues demand absolute prevention.
1. Limitations of Traditional SRE Methods
Ideas such as error budgets work well for stateless web services, but some classes of failure in today's products cannot tolerate any error budget at all. The types of failures we must prevent now exceed what error budgets can address: privacy leaks, data loss, and compliance violations require absolute prevention, not just low-frequency occurrence and fast recovery.
Systems are becoming more complex each year. Automation enables scaling, while AI and ML are now core to almost every product, making cost and energy efficiency as important as user‑visible features.
SRE aims not only to respond to incidents but to anticipate and prevent them. In Google’s massive codebase, predicting failures is extremely difficult, and AI only adds to this challenge.
2. New Answer: STAMP Framework
The solution lies in a paradigm shift. System theory, control theory, and system‑thinking provide SRE with a way to understand and manage complexity at planetary scale. Google’s SRE has adopted the STAMP (Systems‑Theoretic Accident Model and Processes) framework developed by MIT professor Nancy Leveson.
STAMP shifts focus from preventing single‑component failures to understanding and managing complex system interactions. It includes tools such as CAST (Causal Analysis based on System Theory) for post‑incident investigation and STPA (System‑Theoretic Process Analysis) for hazard analysis. STAMP is based on control theory and views accidents as the result of inadequate or inappropriate control rather than merely component failures.
3. Four Basic Conditions of Control Theory
Goal Condition – The controller must have one or more goals (e.g., maintain a set point).
Action Condition – The controller must be able to affect the system’s state.
Model Condition – The controller must contain a model of the system.
Observability Condition – The controller must be able to determine the system’s state.
These four conditions provide a structured way to think about control in complex systems.
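The four conditions can be made concrete with a toy controller. The sketch below is hypothetical (a thermostat-style example, not Google's code) and labels where each condition shows up:

```python
# Minimal sketch of the four control conditions, using a
# thermostat-style controller. All names here are illustrative.

class Controller:
    def __init__(self, setpoint):
        self.setpoint = setpoint        # Goal condition: a target to maintain
        self.model_estimate = setpoint  # Model condition: internal model of the process

    def observe(self, sensor_reading):
        # Observability condition: the controller can determine the
        # system's state through feedback.
        self.model_estimate = sensor_reading

    def act(self):
        # Action condition: the controller can affect the system's state.
        error = self.setpoint - self.model_estimate
        if error > 0:
            return "heat_on"
        elif error < 0:
            return "heat_off"
        return "hold"

ctrl = Controller(setpoint=21.0)
ctrl.observe(19.5)
print(ctrl.act())  # -> "heat_on"
```

If any one of the four conditions is violated, e.g. the sensor (observability) lies, the controller can issue actions that are locally correct by its model yet unsafe for the real system, which is exactly the failure mode STAMP is built to analyze.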
4. From Linear Causality to System Control
Traditional accident analysis treats an outage as a linear chain of events. STAMP reframes accidents as control problems, asking not "which service failed?" but "what interaction between system parts lacked sufficient control?"
In complex systems, most accidents arise from interactions among components that individually operate as designed, yet collectively create an unsafe state.
5. Hazardous State: Giving Engineers More Time
STAMP formalizes the accident concept at the system level as a “hazardous state.” A hazardous state is a set of system conditions that, together with a worst‑case environment, can lead to loss.
Unlike discrete events, a hazardous state can persist for a long time before an accident occurs, giving engineers a larger window to intervene.
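Because a hazardous state is a persistent condition rather than a discrete event, it can be monitored continuously. The check below is a hypothetical sketch (names and the headroom threshold are assumptions) of treating under-provisioning as a hazardous state to alert on before any loss occurs:

```python
# Hypothetical sketch: monitor for the hazardous state itself,
# rather than waiting for a discrete failure event.

def is_hazardous(quota, observed_demand, headroom=1.1):
    # The system is in a hazardous state whenever the quota leaves
    # less than the required headroom over real demand, even if no
    # outage has happened yet.
    return quota < observed_demand * headroom

# While the condition holds, engineers still have a window to intervene:
assert is_hazardous(quota=100, observed_demand=95)      # under-provisioned
assert not is_hazardous(quota=150, observed_demand=95)  # safe margin
```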
6. Real Case: 2021 Google Quota Adjuster Incident
In 2021 Google enforced resource quotas on internal services through an automated quota adjuster: when a service consistently used less than its quota, the adjuster automatically reduced the quota.
From an STPA perspective, this adjustment is a control action. The safety question is: when could this action become unsafe?
If the adjuster reduced a service’s quota below its actual demand, the service would be under‑provisioned – an unsafe condition.
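From this analysis, STPA derives a safety constraint the controller must enforce: never reduce a quota below actual demand. A minimal sketch of such a guard, with hypothetical names and an assumed safety margin, might look like:

```python
# Hypothetical sketch of the safety constraint STPA would derive for
# the quota adjuster: a quota must never drop below real demand.

def safe_new_quota(proposed_quota, peak_demand, margin=1.2):
    # Safety constraint: keep the quota above observed peak demand
    # plus a safety margin; clamp unsafe reductions instead of
    # under-provisioning the service.
    floor = peak_demand * margin
    return max(proposed_quota, floor)

assert safe_new_quota(proposed_quota=50, peak_demand=80) == 96.0   # clamped
assert safe_new_quota(proposed_quota=200, peak_demand=80) == 200   # allowed
```

The design point is that the constraint lives in the control path itself, so an unsafe control action is blocked even when the controller's model of demand is wrong.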
STPA analyzes each interaction to determine how control must be exercised for safety. Unsafe control actions (UCAs) fall into four categories:
No required control action is provided.
An incorrect or insufficient control action is provided.
The control action is provided at the wrong time or in the wrong order.
The control action is stopped too early or applied for too long.
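Applied to the quota adjuster's "reduce quota" action, the four categories can be enumerated directly. The examples below are illustrative instantiations, not a verbatim record of Google's analysis:

```python
# Hypothetical sketch: the four UCA categories instantiated for the
# "reduce quota" control action from the incident above.

UCA_CATEGORIES = {
    "not_provided": "A needed quota increase is never issued while demand grows.",
    "unsafe_provided": "Quota is reduced below the service's actual demand.",
    "wrong_timing_or_order": "The reduction is applied long after the decision, "
                             "when usage has already changed.",
    "wrong_duration": "A temporary reduction is left in place too long.",
}

for category, example in UCA_CATEGORIES.items():
    print(f"{category}: {example}")
```

Walking each control action through all four categories is what makes the analysis systematic: every interaction gets the same checklist.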
By modeling the system as a control‑feedback loop, we identified problems in both control and feedback paths.
7. Advantages of STPA
As STPA is applied to more systems, we see that feedback paths are often less understood than control paths, yet they are equally important for safety.
In the 2021 incident, incorrect feedback about resource usage reached the adjuster, which then set a quota far below what the service actually needed. Because the reduction was not enforced immediately, the system sat in a hazardous state for weeks, a window in which engineers could have intervened, but the faulty decision went undetected and the chance to prevent loss was missed.
When the reduction finally took effect weeks later, it caused a major outage.
As Leveson writes in *Engineering a Safer World*, "In STAMP, understanding why an accident occurred requires determining why control was ineffective. Preventing future accidents shifts focus from preventing failures to designing and implementing controls that enforce necessary constraints."
8. Looking Forward
Instead of viewing complexity as an error, Google’s SRE team is leveraging control theory, STPA, CAST, and related methods to move toward a more comprehensive, proactive reliability approach that designs safety from the ground up.
The evolution toward system‑safety methods gives engineers a new way to understand the systems they build and provides stronger guarantees about how those systems behave.
Complexity is everywhere, and Google’s engineers are preparing to meet it so that the next era can deliver the same exceptional performance.
Thought question: Does your system have similarly complex interactions, and are traditional SRE methods no longer sufficient for your reliability goals?