11 Hard‑Earned Lessons from Two Decades of Google Site Reliability
Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.
Preface
Twenty years can bring massive change, especially when you are busy scaling. Two decades ago Google ran a small data center with a few thousand servers linked by 2.4 G network rings, managed by Python scripts and a tiny machine database (MDB). Over time the fleet and network grew thousands of times, while manual effort per server dropped and reliability improved.
1. Mitigate proportionally to incident severity
Choosing a mitigation that is riskier than the problem can cause cascading failures, as seen in a YouTube outage caused by an aggressive load‑shedding step that worsened the situation.
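The principle can be sketched in code. In this hypothetical model (the mitigation names and blast-radius numbers are invented for illustration), each mitigation is annotated with the fraction of traffic it would itself disrupt, and we pick the mildest one that still covers the incident:

```python
# Hypothetical catalog of mitigations, ordered from least to most
# disruptive. The second field is the fraction of traffic the
# mitigation itself puts at risk (its "blast radius").
MITIGATIONS = [
    ("rate_limit_abusive_clients", 0.01),
    ("shed_low_priority_traffic", 0.10),
    ("drain_one_region", 0.25),
    ("global_load_shed", 1.00),
]

def choose_mitigation(impacted_fraction: float) -> str:
    """Return the mildest mitigation whose blast radius covers the
    incident, so the cure is never riskier than the disease."""
    for name, blast_radius in MITIGATIONS:
        if blast_radius >= impacted_fraction:
            return name
    return MITIGATIONS[-1][0]
```

An aggressive responder who jumps straight to `global_load_shed` for a 5%-of-traffic incident is exactly the disproportionate reaction this lesson warns against.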
2. Fully test recovery mechanisms before emergencies
Practice and verify recovery procedures in advance so they meet needs and engineers know how to execute them, reducing risk during real incidents.
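As a minimal sketch of what "verify in advance" can mean for backups (the `restore` callback here is a hypothetical stand-in for your real restore procedure): actually run the restore and compare the result against the original data, rather than trusting that the backup exists.

```python
import hashlib

def verify_restore(original: bytes, restore) -> bool:
    """Run the restore procedure and check the result byte-for-byte.
    A backup that has never been restored is not a recovery mechanism."""
    restored = restore()
    return hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
```

Running a check like this on a schedule, not just once, is what turns a backup into a tested recovery path.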
3. Canary all changes
Even seemingly harmless configuration changes can have unexpected impact; a gradual, canary‑style rollout could have limited a YouTube cache outage to a small subset before it spread globally.
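A canary rollout can be sketched as a loop over increasing fleet fractions with a health gate between stages. Everything here is a simplified, hypothetical model: `apply_change` and `healthy` stand in for your real deployment and monitoring hooks, and the stage fractions are arbitrary.

```python
def canary_rollout(apply_change, healthy, stages=(0.01, 0.05, 0.25, 1.0)):
    """Roll a change out to increasing fractions of the fleet,
    halting as soon as a health check regresses."""
    for fraction in stages:
        apply_change(fraction)          # push to this fraction of the fleet
        if not healthy(fraction):       # gate on monitoring before widening
            return ("halted", fraction)
    return ("complete", 1.0)
```

A change that would have broken everything instead breaks only the first stage, which is the entire point: the blast radius of a bad change is capped at the current canary fraction.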
4. Have a “big red emergency button”
Implement a simple, reliable way to abort dangerous operations and restore services quickly, ensuring every critical dependency has an emergency stop.
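In its simplest form, the "big red button" is one flag that every risky automated operation must consult before acting. The class below is a hypothetical sketch of that contract, not any real Google mechanism:

```python
class BigRedButton:
    """Emergency stop: a single, simple flag that gates every
    dangerous automated operation."""

    def __init__(self):
        self._pressed = False

    def press(self):
        """Pressed by a human during an incident; stops all guarded ops."""
        self._pressed = True

    def guard(self, operation):
        """Run `operation` only if the button has not been pressed."""
        if self._pressed:
            return "aborted"
        return operation()
```

The value lies in the simplicity: because the check is trivial and has no dependencies, it keeps working precisely when everything else is failing.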
5. Unit tests are not enough—add integration tests
Unit tests verify individual components, but integration tests are needed to confirm that components work together in realistic environments, as a Calendar failure showed.
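The distinction can be made concrete with a small invented example: a config parser (unit) and the datastore write that consumes it (integration). Both components below are hypothetical, written pytest-style:

```python
def parse_quota(raw: str) -> int:
    """Parse a per-user quota from a raw config string."""
    return int(raw.strip())

def apply_quota(store: dict, user: str, raw: str) -> None:
    """Parse the quota and persist it to the (stand-in) datastore."""
    store[user] = parse_quota(raw)

def test_parse_quota_unit():
    # Unit test: the parser alone behaves correctly.
    assert parse_quota(" 42 ") == 42

def test_apply_quota_integration():
    # Integration test: parser and datastore write work *together*,
    # catching interface mismatches no unit test would see.
    store = {}
    apply_quota(store, "alice", "42")
    assert store["alice"] == 42
```

Each unit test can pass while the combination still fails, for example if the caller and parser disagree about the raw format, which is exactly the class of failure only the integration test catches.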
6. Maintain independent backup communication channels
Relying solely on services like Hangouts or Meet proved fragile during a massive logout event; non‑dependent backup channels must be tested and ready.
7. Provide an intentional performance degradation mode
Graceful degradation provides a minimal functional experience during instability, improving overall user experience even when full performance cannot be maintained.
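A sketch of the pattern, with invented names throughout: when the backend powering the full experience is overloaded or failing, fall back to a cheap static response instead of an error page.

```python
# Hypothetical static fallback served when the full experience
# (e.g. personalized recommendations) is unavailable.
STATIC_FALLBACK = ["popular-1", "popular-2", "popular-3"]

def handle_request(user_id, recommendations_service, overloaded: bool):
    """Serve a degraded but functional response when the backend is
    overloaded or throwing errors; serve the full response otherwise."""
    if overloaded:
        return {"items": STATIC_FALLBACK, "degraded": True}
    try:
        return {"items": recommendations_service(user_id), "degraded": False}
    except Exception:
        return {"items": STATIC_FALLBACK, "degraded": True}
```

The degraded path should be exercised regularly in production, not only during incidents, so that it is known to work when it is actually needed.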
8. Test disaster‑resilience and recovery
Disaster‑resilience testing checks that services continue operating under failure, while recovery testing verifies they can return to normal after a shutdown.
9. Automate mitigation actions
When clear signals indicate a fault, automated mitigation can shorten MTTR, allowing engineers to focus on root‑cause analysis after the service is stabilized.
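As a hedged sketch of the idea (zone names and the threshold are made up), an automated mitigation might drain any zone whose error rate clearly signals a fault, with a safety rail so automation never drains everything at once:

```python
def auto_mitigate(error_rates: dict, threshold: float = 0.05):
    """Given per-zone error rates, return the zones to drain
    automatically so traffic fails over to healthy zones."""
    to_drain = [zone for zone, rate in error_rates.items() if rate > threshold]
    healthy = [zone for zone in error_rates if zone not in to_drain]
    if not healthy:
        return []  # safety rail: never let automation drain every zone
    return to_drain
```

Mitigation fires in seconds on a clear signal; the human investigation of *why* the zone was unhealthy happens afterwards, with the service already stabilized.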
10. Shorten release intervals to reduce error risk
Frequent, well‑tested releases lower the chance of large‑scale failures, as illustrated by a payment system outage caused by a delayed rollout of a database field removal.
11. Avoid relying on a single hardware version
Using a single hardware model for critical functions simplifies operations but creates a single point of failure; diversity mitigates the impact of undiscovered vulnerabilities.