Why SRE Matters: Bridging Product Development and Reliability Engineering
This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.
Preface
In technical organizations, product/foundation technology development and SRE are often distinguished by whether the role centers on coding. Product developers who move to SRE often wonder whether they must give up coding or drift away from product and technology work.
Drawing on experience in both development and reliability work, I share my understanding of SRE and discuss how product/foundation development and reliability assurance can collaborate to better serve the business.
SRE Overview
The concept of SRE originated at Google and is described in the book "Site Reliability Engineering: How Google Runs Production Systems". In it, Google SRE members describe how they take a holistic view of the software lifecycle and why this helps Google build, deploy, monitor, and operate some of the largest software systems.
Quote: "Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems."
Another vivid description: "SRE is what happens when a software engineer is tasked with what used to be called operations."
The goal of SRE is to build scalable, highly‑available systems by applying software‑engineering methods to infrastructure and operational problems.
In the Google SRE book, the typical work split is at most 50% on operational tasks and at least 50% on engineering work that ensures infrastructure stability and scalability.
Responsibility: Ensure infrastructure stability and scalability
Core: Problem solving
Method: Accumulate experience from operational incidents and improve efficiency through coding
Software Lifecycle
Google SRE likens software engineering to raising a child: the development phase is painful, but the majority of effort is spent on nurturing the system after it is built.
An estimated 40% to 90% of a system's total cost is incurred after it goes live.
Project phases typically allocate less effort to design and construction than to post‑launch maintenance. Two role types are needed:
Focus on designing and building the software system (product/foundation development)
Focus on the entire system lifecycle, from design to deployment, continuous improvement, and eventual decommission (SRE)
Both share the common goal of achieving project objectives and serving the business.
Value of Stability Assurance
Direct involvement in customer incidents makes the impact of stability tangible:
Feel the anxiety caused by incidents through customer feedback on severity and urgency.
Experience gratitude or anger after issue resolution.
Observe revenue impact and customer base changes after incidents.
Notice delays in product planning caused by stability problems.
Thus stability assurance delivers:
Reliable product experience meeting customer expectations.
Accelerated business iteration by allowing teams to focus on delivering new features.
How SRE Ensures Stability
Stability issues often have these characteristics:
Human‑induced, relying on expert experience.
Result from a combination of factors.
Inevitable.
A 100% guarantee is unnecessary.
In practice, most online stability problems stem from improper releases and operational actions, which heavily depend on expert knowledge.
These problems are systemic, caused by missing monitoring, logging, troubleshooting processes, or poor coordination, leading to longer resolution times and greater customer impact.
Issues are unavoidable—traffic spikes, hardware failures, uncovered inputs, etc.
Businesses often have SLAs; failing to meet them triggers compensation, yet striving for higher stability beyond internal SLOs yields diminishing returns.
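The trade-off between an SLA and an internal SLO can be made concrete with the error-budget arithmetic described in the Google SRE book. The following is a minimal sketch, not something taken from this article: the function names, the 30-day window, and the sample targets are all illustrative assumptions.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the window can absorb while still meeting the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = allowed_downtime_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical targets: each extra "nine" shrinks the budget tenfold.
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, and every additional nine cuts that by a factor of ten, which is one way to see why pushing far beyond the internal SLO yields diminishing returns.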
SRE must deeply understand problem characteristics, design systematic solutions, and focus on resolving the most frequent issues.
A reference solution rests on three pillars, each of which is an actionable starting point:
Controllability
Observability
Stability‑assurance best practices
Controllability includes three dimensions:
Release Management – address human‑caused stability issues during releases, including pre‑change reviews and in‑release change control.
Operation Management – mitigate stability problems caused by manual operations, with unified cluster operation entry, permission management, and audit.
Design Review – embed stability best practices during system design, covering cluster solution reviews and critical feature design reviews.
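The release-management dimension above can be sketched as an automated gate that a release must pass before it proceeds. This is only an illustration under stated assumptions: the ChangeRequest fields, the weekday 10:00 to 17:00 release window, and the specific rules are hypothetical and not prescribed by the article.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class ChangeRequest:
    """Hypothetical record describing one planned release."""
    service: str
    approvers: list[str] = field(default_factory=list)
    has_rollback_plan: bool = False
    canary_percent: int = 0  # share of traffic in the first rollout stage


def release_blockers(change: ChangeRequest, now: datetime,
                     min_approvers: int = 1,
                     window: tuple[time, time] = (time(10, 0), time(17, 0)),
                     ) -> list[str]:
    """Return the reasons the change may not proceed; an empty list means go."""
    blockers = []
    # Pre-change review: at least one reviewer must have signed off.
    if len(change.approvers) < min_approvers:
        blockers.append("pre-change review missing")
    if not change.has_rollback_plan:
        blockers.append("no rollback plan")
    # In-release control: require a partial first stage instead of 0% or 100%.
    if not 0 < change.canary_percent < 100:
        blockers.append("no staged (canary) rollout")
    # Assumed policy: weekdays only, inside the approved window.
    if now.weekday() >= 5 or not (window[0] <= now.time() < window[1]):
        blockers.append("outside approved release window")
    return blockers
```

A deployment pipeline could call release_blockers before each rollout and refuse to continue while the list is non-empty, which moves the review from expert memory into tooling.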
Observability covers several key dimensions:
Monitoring – improve perception of system runtime, build and maintain collection/visualization systems.
Logging – enhance problem traceability, build and maintain log collection, storage, query, and analysis systems.
Inspection – proactively probe system functionality, build inspection services and common inspection logic.
Alerting – ensure timely notification of anomalies, build and manage alert systems, configurations, channels, and analysis.
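Inspection and alerting, as described above, can be combined in a small loop: probes exercise system functionality, and an alert fires only after repeated failures so that a single blip does not page anyone. The sketch below is hypothetical throughout; the Inspector class, the consecutive-failure threshold, and the notify callback stand in for whatever inspection service and alert channel a team actually uses.

```python
from typing import Callable, Dict


class Inspector:
    """Runs named probes and notifies once a probe fails repeatedly."""

    def __init__(self, notify: Callable[[str], None], threshold: int = 3):
        self.notify = notify          # alert channel, e.g. a chat or pager hook
        self.threshold = threshold    # consecutive failures before alerting
        self.probes: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}

    def register(self, name: str, probe: Callable[[], bool]) -> None:
        self.probes[name] = probe
        self.failures[name] = 0

    def run_once(self) -> None:
        """Execute every probe once and update the failure counters."""
        for name, probe in self.probes.items():
            try:
                healthy = bool(probe())
            except Exception:
                healthy = False  # a crashing probe counts as a failed inspection
            if healthy:
                self.failures[name] = 0
                continue
            self.failures[name] += 1
            # Notify exactly when the threshold is reached, not on every later failure.
            if self.failures[name] == self.threshold:
                self.notify(f"{name}: {self.threshold} consecutive inspection failures")
```

In use, a scheduler would call run_once at a fixed interval, with probes such as a test login or a read of a known key registered by each service team.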
Stability‑assurance best practices are abstracted from historical issues and industry experience. They take the form of awareness, processes, standards, and tools that are embedded from system design onward and applied throughout the lifecycle, for example as templated checklists and standards:
Project quality acceptance criteria
Project safety production standards
Pre‑release checklist
Tech Review template
Kick‑off template
Project management guidelines
etc.
[Figure: example checklist grading]
When best practices are documented, tools or services can apply them at low cost, turning stability assurance into infrastructure.
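As a rough illustration of turning a documented checklist into a tool, the sketch below encodes a few pre-release items as data and evaluates a project against them. Everything here is assumed for the example: the item names, the "must"/"should" grades, and the project fields are hypothetical, not the article's actual checklist.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CheckItem:
    name: str
    grade: str  # assumed grading: "must" blocks the release, "should" only warns
    check: Callable[[Dict], bool]


# Hypothetical pre-release checklist; real items would come from past incidents.
PRE_RELEASE_CHECKLIST: List[CheckItem] = [
    CheckItem("rollback plan documented", "must",
              lambda p: bool(p.get("rollback_plan"))),
    CheckItem("alerts configured", "must",
              lambda p: p.get("alert_rules", 0) > 0),
    CheckItem("load test completed", "should",
              lambda p: bool(p.get("load_tested", False))),
]


def evaluate(project: Dict, checklist: List[CheckItem] = PRE_RELEASE_CHECKLIST) -> Dict:
    """Run every item and split the failures by grade."""
    failed = [item for item in checklist if not item.check(project)]
    blocking = [item.name for item in failed if item.grade == "must"]
    warnings = [item.name for item in failed if item.grade == "should"]
    return {"passed": not blocking, "blocking": blocking, "warnings": warnings}
```

Because the checklist is plain data, new lessons from incidents can be added as items without changing the evaluation logic, and the same function can run in a release pipeline or a review tool.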
SRE continuously iterates methodology and practice, designing top‑down and feeding back bottom‑up to reliably guarantee stability.
Win‑Win Collaboration
Product/foundation development: focus on designing and building software systems.
SRE: focus on the entire system lifecycle, from design to deployment, continuous improvement, and decommission.
These roles cooperate to meet business needs and create greater value.
SRE supports multiple projects, gaining a broad view of incident types and solutions, which it distills into theories, tools, and services that aid development and can be productized for wider customers.
Product/foundation developers bring deep business and technical knowledge, delivering direct business value and informing stability requirements, which SRE then helps to realize.
Both roles must work side‑by‑side toward the shared goal of business success.
Conclusion
SRE serves many businesses horizontally, accumulating deep insight into stability challenges, while vertically it embeds best‑practice techniques throughout the system lifecycle; together with development it creates value through both technology and management.
The key is to solve problems and create greater value.
References
Douban: SRE book listing – https://book.douban.com/subject/26875239/
Wikipedia: Site reliability engineering – https://en.wikipedia.org/wiki/Site_reliability_engineering
Wikipedia: Controllability – https://en.wikipedia.org/wiki/Controllability
Wikipedia: Observability – https://en.wikipedia.org/wiki/Observability
Google SRE site – https://sre.google/
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.