From Ops to SRE: What Google’s Site Reliability Model Means for Your Team
The article reflects on the shift from traditional operations to Site Reliability Engineering (SRE), comparing Google’s SRE practices with those of a Chinese cloud provider, and explores infrastructure, tooling, team structure, and cultural challenges while drawing practical lessons for engineers.
Soil: Infrastructure
After years of discussion, the author’s company renamed the long‑standing "PE" title to "SRE" (distinguishing it from Google’s SRE). SRE, originating from Google’s Site Reliability Engineer role, is portrayed as a prestigious position that blends product design, high scalability, and reliability with operations responsibilities.
"We often talk about transformation to DevOps or SRE, but the real shift is that developers take over Ops work while Ops must find a new path; DevOps becomes a joke for Ops."
The author cites Google's book Site Reliability Engineering: How Google Runs Production Systems (the "Google SRE book") as a comprehensive source for understanding the role.
1.1 Hardware Failures
In 2010, the author spent a day each week fixing hardware in a small data center, leading to two initiatives: automated repair of disks and memory, and fault prediction (though most faults were only detected after hours). These efforts reduced manual toil but raised questions about why monitoring programs remained silent when hardware hung.
"Why do machines/hard drives hang while your detection program does nothing?"
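One common answer to that question is to treat silence itself as a signal: a hung machine cannot report its own failure, so the check has to run somewhere else and look for missing reports. The following is a minimal sketch of that idea, not the author's actual tooling. The `Heartbeat` record, the `find_silent_hosts` function, and the 120-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    host: str
    last_seen: float  # Unix timestamp of the host's most recent report (assumed field)


def find_silent_hosts(heartbeats, now, timeout_s=120.0):
    """Return hosts whose last report is older than timeout_s, sorted by name.

    The check runs on a separate machine, so a host that hangs is still
    noticed: its heartbeat simply stops advancing.
    """
    return sorted(hb.host for hb in heartbeats if now - hb.last_seen > timeout_s)


if __name__ == "__main__":
    reports = [Heartbeat("node-a", 100.0), Heartbeat("node-b", 290.0)]
    print(find_silent_hosts(reports, now=300.0))
```

Hosts flagged this way could then be handed to an automated repair queue rather than to a person, which is the direction the two initiatives above point in.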
Google’s approach, as described in "The Datacenter as a Computer," emphasizes cleaning up failed hardware on a daily basis, supported by precise failure reports from Google System Health, reinforcing the principle “Tolerating faults, not hiding them.”
Google’s infrastructure team does extensive work, acting more as maintainers than participants in hardware failure handling.
1.2 Hardware Selection
Google avoids dedicated physical servers for specific software, giving SREs less involvement in hardware selection. In contrast, the author’s team must negotiate hardware choices with business owners, balancing cost, procurement cycles, and application compatibility.
1.3 Core Software Systems
1.3.1 Borg
Borg is not developed by the SRE team, yet SREs both use and maintain it. Early automation scripts (Python) handled service restarts, placement tracking, and log parsing.
"Initial automation gave us enough time to turn cluster management into an autonomous system rather than a scripted one." "Machine damage and lifecycle management no longer required any manual action."
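To make the "scripted" stage concrete, here is a small sketch of the kind of log-parsing script that could decide which services to restart. It is an illustration only: the log line format, the `services_to_restart` function, and the threshold of three fatal errors are assumptions, not details taken from Google's or the author's scripts.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration:
#   "<timestamp> <service> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(INFO|WARN|ERROR|FATAL)\s+(.*)$")


def services_to_restart(log_lines, fatal_threshold=3):
    """Return services with at least fatal_threshold FATAL lines, sorted by name."""
    fatal_counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        service, level = match.group(2), match.group(3)
        if level == "FATAL":
            fatal_counts[service] += 1
    return sorted(s for s, n in fatal_counts.items() if n >= fatal_threshold)
```

A script like this still needs a person or a cron job to run it and act on the result, which is the gap the quoted move toward an autonomous system is meant to close.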
1.3.2 Borgmon
Further monitoring tools built on Borg are mentioned.
1.3.3 Jupiter & BwE
Google’s Jupiter and BwE provide a massive virtual network with thousands of virtual ports and petabit‑scale bandwidth. While the author’s team does not directly maintain the network, they often play a decisive role in troubleshooting.
1.3.4 Storage Systems
The author’s cloud provider lacks a dedicated "D Service"; instead, storage is managed directly by a distributed file system (Pangu) with higher‑level services like MaxCompute, OTS, and OSS built on top.
1.3.5 Distributed Lock Service
Compared to Google’s Chubby, the author’s lock service operates only at the cluster level.
Capability: SRE‑Built Products
The author compares products explicitly developed by Google SREs with those in their own organization, noting that many ideas originate from SREs but are implemented by development teams.
2.1 Auxon
Both Google and the author’s SREs face capacity‑planning challenges. Google’s Auxon is an intent‑based planning tool, whereas the author’s organization relies on finance‑driven systems with less SRE involvement.
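The difference between intent-based and request-based planning can be shown with a toy example. In the sketch below, a service owner states what is required (peak load and a redundancy level such as N+2) and the tool derives the resource count, instead of the owner asking for a fixed number of machines. This is not Auxon's algorithm, which the SRE book describes as solver-based; the `Intent` fields and the `replicas_needed` function are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Intent:
    service: str
    peak_qps: float         # expected peak load
    qps_per_replica: float  # measured capacity of one replica
    spare_replicas: int     # redundancy requirement, e.g. 2 for "N+2"


def replicas_needed(intent):
    """Derive a replica count from the stated intent."""
    base = math.ceil(intent.peak_qps / intent.qps_per_replica)
    return base + intent.spare_replicas
```

For example, `replicas_needed(Intent("frontend", 1000, 300, 2))` returns 6: four replicas to carry the peak load plus two spares. If the measured per-replica capacity changes, the plan is recomputed from the same intent rather than renegotiated.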
2.2 Layer‑3 Load Balancer
Google’s Maglev load balancer is highlighted as a core SRE skill, while the author’s experience is limited to commercial load balancers.
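For readers who have only used commercial load balancers, the core idea behind Maglev's backend selection can be sketched in a few lines. The code below follows the lookup-table population scheme described in Google's public Maglev paper: each backend gets its own permutation of table slots and backends take turns claiming their next preferred free slot. The hash construction (salted SHA-256), the tiny table size, and the function names are assumptions made for illustration, not Maglev's real implementation.

```python
import hashlib


def _hash(name, salt):
    """Deterministic 64-bit hash; salted SHA-256 is an illustrative choice."""
    digest = hashlib.sha256((salt + ":" + name).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_lookup_table(backends, table_size=13):
    """Populate a Maglev-style lookup table.

    table_size must be prime so that every skip value walks all slots.
    """
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_hash(b, "offset") % table_size for b in backends]
    skips = [_hash(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * table_size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk backend i's permutation until a free slot is found.
            slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == table_size:
                return table


def pick_backend(table, flow_key):
    """Map a flow identifier (e.g. a 5-tuple string) to a backend."""
    return table[_hash(flow_key, "flow") % len(table)]
```

Because backends claim slots in turns, each one ends up with a nearly equal share of the table, and the same flow key always maps to the same backend for a given backend set. A production table would be far larger than 13 entries; the small size here only keeps the example easy to inspect.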
2.3 Sisyphus (Tesla)
The author’s team built an automated release platform called Tesla, similar in spirit to Google’s Sisyphus, providing deployment, upgrade, and notification capabilities.
2.4 Other Tools
Google’s Cron and WorkFlow are mentioned, though it is unclear if they were SRE‑authored. The author notes analogous internal services that have become “no‑man’s land” after organizational changes.
Methodology: Practices and Culture
3.1 Training Newcomers
The author’s SRE team relies heavily on on‑the‑job learning, with limited systematic training compared to Google’s structured programs.
3.2 Taking Over Services
When the team inherits a service, they improve architecture, monitoring, key metrics, incident handling, and documentation, often while the service is already in production.
3.3 Team Composition
Google’s SRE structure includes roles like Tech Lead, SRM, PM/TPM, while the author’s team blends traditional Ops with tech leads, product managers, data analysts, and operations staff, resulting in a more dynamic environment.
Conclusion
Reading the entire Google SRE book provides valuable case studies, such as handling unstable services, balancing short‑term interventions with root‑cause analysis, and recognizing that there are no shortcuts or one‑size‑fits‑all solutions in reliability engineering.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
