Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Efficient Ops

Mar 28, 2023

Preface

In technical work, product/foundation technology development and SRE are often distinguished by whether the role focuses on coding. When product developers move to SRE, they often wonder whether they must give up coding or drift away from product and technology work.

Drawing on my experience in both development and reliability work, I share my understanding of SRE and discuss how product/foundation development and reliability assurance can collaborate to better serve the business.

SRE Overview

The concept of SRE originated at Google and was popularized by the book "Site Reliability Engineering: How Google Runs Production Systems". In it, members of Google's SRE team describe how they take a holistic view of the software lifecycle and why this helps Google build, deploy, monitor, and operate some of the largest software systems.

Quote: "Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems."

Another vivid description: "SRE is what happens when a software engineer is tasked with what used to be called operations."

The goal of SRE is to build scalable, highly‑available systems by applying software‑engineering methods to infrastructure and operational problems.

The Google SRE book caps operational work at 50 % of an SRE's time, leaving at least 50 % for engineering work that ensures infrastructure stability and scalability.

Responsibility: Ensure infrastructure stability and scalability

Core: Problem solving

Method: Accumulate experience from operational incidents and improve efficiency through coding

Software Lifecycle

The Google SRE book likens software engineering to raising a child: the development phase is painful, but the majority of effort is spent on nurturing the system after it is built.

By the book's estimate, 40 %–90 % of a system's total cost is incurred after it goes live.

Project phases typically allocate less effort to design and construction than to post‑launch maintenance. Two role types are needed:

Focus on designing and building the software system (product/foundation development)

Focus on the entire system lifecycle, from design to deployment, continuous improvement, and eventual decommission (SRE)

Both share the common goal of achieving project objectives and serving the business.

Value of Stability Assurance

Handling customer incidents first-hand makes the impact of stability tangible:

You feel the anxiety incidents cause, conveyed through customers' feedback on severity and urgency.

You experience customers' gratitude, or anger, once an issue is resolved.

You see the effect on revenue and on the customer base after incidents.

You notice product plans slipping because of stability problems.

Thus stability assurance delivers:

Reliable product experience meeting customer expectations.

Accelerated business iteration by allowing teams to focus on delivering new features.

How SRE Ensures Stability

Stability issues often have these characteristics:

Human‑induced, relying on expert experience.

Result from a combination of factors.

Inevitable.

A 100 % guarantee is unnecessary.

In practice, most online stability problems stem from improper releases and operational actions, which heavily depend on expert knowledge.

These problems are systemic, caused by missing monitoring, logging, troubleshooting processes, or poor coordination, leading to longer resolution times and greater customer impact.

Issues are unavoidable: traffic spikes, hardware failures, inputs that testing never covered, and so on.

Businesses often have SLAs; failing to meet them triggers compensation, yet striving for higher stability beyond internal SLOs yields diminishing returns.
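The diminishing-returns point can be made concrete with a little arithmetic. The sketch below assumes an availability target measured over a 30-day window. The function names and the window length are illustrative choices, not something the article or the Google SRE book prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO tolerates over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already consumed (negative means the SLO is breached)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # Each extra "nine" shrinks the tolerated downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        print(f"SLO {target:.2%}: {error_budget_minutes(target):.2f} minutes per 30 days")
```

A 99.9 % target over 30 days leaves roughly 43 minutes of tolerated downtime, and each additional nine divides that by ten, which is why pushing far beyond the agreed target quickly costs more than it returns.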

SRE must deeply understand problem characteristics, design systematic solutions, and focus on resolving the most frequent issues.

A reference solution rests on three pillars, which are also three actionable areas where implementation can start:

Controllability

Observability

Stability‑assurance best practices

Controllability includes three dimensions:

Release Management – address human‑caused stability issues during releases, including pre‑change reviews and in‑release change control.

Operation Management – mitigate stability problems caused by manual operations, with unified cluster operation entry, permission management, and audit.

Design Review – embed stability best practices during system design, covering cluster solution reviews and critical feature design reviews.
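As one way to picture release management, here is a minimal sketch of a pre-change review gate. Everything in it is an assumption made for illustration: the `ChangeRequest` fields, the rule thresholds, and the change window are hypothetical and would come from a team's own release policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import List, Tuple


@dataclass
class ChangeRequest:
    """Hypothetical description of a planned release."""
    service: str
    approvers: List[str] = field(default_factory=list)
    has_rollback_plan: bool = False
    first_batch_percent: int = 100  # share of instances changed in the first batch
    scheduled_at: datetime = field(default_factory=datetime.now)


def review_change(
    cr: ChangeRequest,
    min_approvers: int = 1,
    max_first_batch_percent: int = 10,
    window: Tuple[time, time] = (time(10, 0), time(17, 0)),
) -> List[str]:
    """Return the list of policy violations; an empty list means the change may proceed."""
    problems = []
    if len(cr.approvers) < min_approvers:
        problems.append(f"needs at least {min_approvers} approver(s)")
    if not cr.has_rollback_plan:
        problems.append("missing rollback plan")
    if cr.first_batch_percent > max_first_batch_percent:
        problems.append(f"first batch exceeds {max_first_batch_percent}% of instances")
    if not (window[0] <= cr.scheduled_at.time() <= window[1]):
        problems.append("scheduled outside the change window")
    return problems


if __name__ == "__main__":
    risky = ChangeRequest("api", [], False, 50, datetime(2023, 3, 28, 23, 0))
    for problem in review_change(risky):
        print("BLOCKED:", problem)
```

Encoding the review as code means the same rules apply to every release, instead of depending on whichever expert happens to be reviewing.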

Observability covers several key dimensions:

Monitoring – improve visibility into the system's runtime state; build and maintain metric collection and visualization systems.

Logging – enhance problem traceability, build and maintain log collection, storage, query, and analysis systems.

Inspection – proactively probe system functionality, build inspection services and common inspection logic.

Alerting – ensure timely notification of anomalies, build and manage alert systems, configurations, channels, and analysis.
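Inspection and alerting work together, and a small sketch can show how. It assumes each probe is a function returning `True` when healthy, and that an alert should fire once a probe has failed a set number of times in a row. The class and function names are hypothetical, and real alerting systems offer far richer rules.

```python
from typing import Callable, Dict, List

Probe = Callable[[], bool]  # returns True when the checked function is healthy


class ConsecutiveFailureAlert:
    """Fires once when a probe reaches `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures: Dict[str, int] = {}

    def record(self, name: str, ok: bool) -> bool:
        """Record one probe result; return True if an alert should fire now."""
        if ok:
            self.failures[name] = 0  # any success resets the streak
            return False
        self.failures[name] = self.failures.get(name, 0) + 1
        # Fire only at the moment the threshold is reached, to avoid repeat notifications.
        return self.failures[name] == self.threshold


def run_inspection(probes: Dict[str, Probe], alert: ConsecutiveFailureAlert) -> List[str]:
    """Run every probe once and return the names whose alert fired in this round."""
    fired = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that crashes is treated as a failed check
        if alert.record(name, ok):
            fired.append(name)
    return fired
```

Requiring several consecutive failures is a common way to keep a single transient error from paging someone, at the cost of a slightly later notification.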

Stability‑assurance best practices are abstracted from historical issues and industry experience into awareness, processes, standards, and tools. They are embedded from system design onward and applied throughout the lifecycle, for example through templated checklists such as:

Project quality acceptance criteria

Project safety production standards

Pre‑release checklist

Tech Review template

Kick‑off template

Project management guidelines

etc.

[Figure: example checklist grading]

When best practices are documented, tools or services can apply them at low cost, turning stability assurance into infrastructure.
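A minimal sketch of what such a tool might look like follows. It assumes a checklist whose items are either required or recommended, and that a release is approved only when every required item is satisfied. The item keys, the two-level grading, and the score are hypothetical and are not the grading scheme used in the original article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckItem:
    key: str
    description: str
    required: bool  # required items block a release; recommended ones only warn


# Hypothetical pre-release checklist; a real one would come from the team's standards.
PRE_RELEASE_CHECKLIST = [
    CheckItem("rollback_plan", "Rollback steps are documented and tested", True),
    CheckItem("monitoring", "Dashboards and alerts cover the change", True),
    CheckItem("load_test", "Load test results are attached", False),
]


def evaluate(checklist: List[CheckItem], answers: Dict[str, bool]) -> dict:
    """Grade the answers against the checklist; unanswered items count as not done."""
    missing_required = [i.key for i in checklist if i.required and not answers.get(i.key, False)]
    missing_recommended = [i.key for i in checklist if not i.required and not answers.get(i.key, False)]
    done = sum(1 for i in checklist if answers.get(i.key, False))
    return {
        "approved": not missing_required,
        "score": done / len(checklist) if checklist else 1.0,
        "missing_required": missing_required,
        "missing_recommended": missing_recommended,
    }


if __name__ == "__main__":
    print(evaluate(PRE_RELEASE_CHECKLIST, {"rollback_plan": True, "monitoring": True}))
```

Keeping the checklist as data lets the same evaluator serve other templates, such as a design review or a kick-off checklist, without code changes.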

SRE continuously iterates on methodology and practice, designing top‑down and feeding lessons back bottom‑up, so that stability can be guaranteed reliably.

Win‑Win Collaboration

Product/foundation development: focus on designing and building software systems.

SRE: focus on the entire system lifecycle, from design to deployment, continuous improvement, and decommission.

These roles cooperate to meet business needs and create greater value.

SRE supports multiple projects, gaining a broad view of incident types and solutions, which it distills into theories, tools, and services that aid development and can be productized for wider customers.

Product/foundation developers bring deep business and technical knowledge, delivering direct business value and informing stability requirements, which SRE then helps to realize.

Both roles must work side‑by‑side toward the shared goal of business success.

Conclusion

SRE serves many businesses horizontally, accumulating deep insight into stability challenges; vertically, it embeds best‑practice techniques throughout the system lifecycle. Together with development, it creates value through both technology and management.

The key is to solve problems and create greater value.

References

Douban: SRE – https://book.douban.com/subject/26875239/

Wikipedia: Site reliability engineering – https://en.wikipedia.org/wiki/Site_reliability_engineering

Wikipedia: Controllability – https://en.wikipedia.org/wiki/Controllability

Wikipedia: Observability – https://en.wikipedia.org/wiki/Observability

Google SRE site – https://sre.google/

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
