Tag

Site Reliability Engineering

Efficient Ops
Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work (Infrastructure, Platform, and Business), covers hiring challenges and daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, and user support, and offers practical interview and career advice.

Oncall · SRE · Site Reliability Engineering
22 min read
DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLO, SLI, SLA, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

DevOps · Google · SLI
4 min read
Efficient Ops
May 21, 2023 · Operations

From Apollo to Google: How Margaret Hamilton Shaped Modern SRE

This article traces the origins of Site Reliability Engineering from Margaret Hamilton’s pioneering work on the Apollo program, through Google’s formal SRE team creation, and highlights the key differences between SRE and traditional operations practices.

Google · Margaret Hamilton · SRE
7 min read
DevOps Cloud Academy
May 10, 2023 · Operations

Understanding the Role of Site Reliability Engineering (SRE) in DevOps

This article explains why Site Reliability Engineering (SRE) and DevOps are both essential for modern software development, compares their objectives, outlines their complementary roles, and highlights the fundamental differences that help organizations achieve faster releases with higher reliability.

DevOps · SRE · Site Reliability Engineering
8 min read
Efficient Ops
Feb 7, 2023 · Operations

Why SRE Is Essential for Reliable Internet Services – Chinese Experts Share Insights

Site Reliability Engineering (SRE), introduced by Google in 2003, has become a cornerstone for ensuring the reliability and stability of large‑scale internet platforms, and Chinese experts now share home‑grown practices and a new book that distills two decades of SRE experience for building high‑availability applications.

Book · DevOps · Reliability
3 min read
DevOps Cloud Academy
Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability
29 min read
Bilibili Tech
Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework: categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget burn‑rate (consumption‑rate) alerting strategies to improve stability and provide comprehensive quality dashboards.

SLO · Site Reliability Engineering · alerting
18 min read
DevOps
Jul 8, 2022 · Operations

Nine Essential Skills Every Modern Site Reliability Engineer Should Master

The article outlines the nine core competencies—network expertise, Linux/Unix knowledge, cloud computing, CI/CD pipelines, QA automation, security engineering, DevOps, incident management, and post‑incident review—that enable SREs to ensure the availability, performance, and reliability of complex distributed systems.

DevOps · SRE · Site Reliability Engineering
6 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · SLI · SLO
21 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · SLI/SLO · SRE
22 min read
IT Architects Alliance
Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden signals of monitoring, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · SLI · SLO
15 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Error Budget · SLI · SLO
7 min read
Efficient Ops
Mar 31, 2021 · Operations

Top 7 SRE Interview Questions Every Candidate Should Master

This article outlines the seven most important Site Reliability Engineering interview questions, explains why they matter, and provides an overview of the upcoming SRE Foundation course that equips professionals with the principles, practices, and tools needed for reliable, scalable systems.

Interview Questions · SRE · SRE Foundation
9 min read
Architects Research Society
Mar 30, 2021 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Tools, and Practices

The article provides a comprehensive overview of Site Reliability Engineering (SRE), explaining its origins, definition by Google, required skill sets, typical responsibilities, tools used, and how the role has evolved within DevOps and modern cloud‑native environments.

DevOps · Reliability · SRE
9 min read
DevOps
Mar 18, 2021 · Operations

Understanding Site Reliability Engineering (SRE) and Its Role in Software Stability

This article explains how Site Reliability Engineering (SRE) combines software engineering with operations to ensure scalable, highly reliable systems, and outlines the collaboration between product development and SRE roles, the software lifecycle, the value of stability, and practical frameworks for observability, controllability, and best‑practice implementation.

Observability · SRE · Site Reliability Engineering
12 min read
Efficient Ops
Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course · Reliability · SRE
7 min read
Efficient Ops
Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps · Reliability · SRE
7 min read
Efficient Ops
Aug 23, 2020 · Operations

Unlock Reliable Services: SRE Foundation Course Highlights at GOPS 2020

The SRE Foundation course presented at the GOPS 2020 Global Operations Conference in Shenzhen introduces core Site Reliability Engineering principles, practical tools, and certification preparation through eight detailed modules, targeting a wide range of IT professionals and business stakeholders.

DevOps · SRE · Site Reliability Engineering
6 min read
DevOps
Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Chaos Engineering · Fault Injection · Observability
21 min read
Efficient Ops
Jul 28, 2020 · Operations

How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Agile · SRE · Site Reliability Engineering
11 min read