Tagged articles

Site Reliability Engineering

47 articles · Page 1 of 1

Jun 15, 2026 · Operations

When AI Generates Code 10× Faster, Who Safeguards System Reliability?

The article analyzes Google’s SRE whitepaper on AI‑driven operations, detailing how generative AI accelerates code production 4‑10×, introduces five SRE AI autonomy levels, three core AI‑ops components, and a safety architecture that decouples decision‑making from execution to prevent catastrophic failures.

AI Ops · Automation · Google

0 likes · 12 min read

DevOps Coach

Dec 29, 2025 · Operations

Mastering System Reliability: Lessons from Google, Netflix, and Meta

Learn how Google, Netflix, and Meta pioneered modern reliability practices—SRE’s data‑driven metrics, Netflix’s chaos engineering, and Meta’s self‑healing automation—and get a step‑by‑step handbook to apply these concepts, avoid common traps, and build resilient systems at any scale.

Automation · Site Reliability Engineering · chaos engineering

0 likes · 11 min read

Continuous Delivery 2.0

Oct 13, 2025 · Operations

How Google’s SRE Evolved Over 20 Years: From Crisis to Industry Standard

This article traces Google Site Reliability Engineering from its 2003 inception addressing scale crises, through organizational growth, core principles, team structures, and recent security integrations, showing how SRE transformed operations into a software‑engineering discipline that drives reliable, scalable digital services.

Error Budget · Google · Operations

0 likes · 13 min read

Continuous Delivery 2.0

Oct 11, 2025 · Operations

Mastering Enterprise SRE: From Core Concepts to Practical Implementation

This comprehensive guide explains the core principles of Site Reliability Engineering, outlines a phased roadmap for enterprise adoption, details essential monitoring, automation, and reliability platforms, and addresses team structure, talent development, common challenges, and real‑world success stories to help organizations build effective SRE practices.

Automation · SRE · Site Reliability Engineering

0 likes · 16 min read

MaGe Linux Operations

Sep 28, 2025 · Operations

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Incident Management · SRE · Site Reliability Engineering

0 likes · 5 min read

MaGe Linux Operations

Dec 26, 2024 · Operations

What Does Modern IT Operations Involve? A Complete Guide to Roles & Evolution

This article provides a comprehensive overview of internet operations, detailing the three core pillars of service‑centered stability, security, and efficiency, describing the classification of operation roles, their responsibilities, the evolution of operational practices, and practical advice for aspiring operation engineers.

Site Reliability Engineering · infrastructure

0 likes · 20 min read

Efficient Ops

Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work—Infrastructure, Platform, and Business—covers hiring challenges, daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, user support, and offers practical interview and career advice.

Oncall · Operations · SRE

0 likes · 22 min read

DeWu Technology

Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on C‑end link reliability, Alibaba’s system stability operations, Tencent Game’s cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE

0 likes · 7 min read

DevOps

Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book and explains five core SRE foundations, including SLOs, SLIs, SLAs, and monitoring, with real‑world case studies from Google and Kingsoft Office; it also promotes an upcoming SRE‑DevOps live session.

Google · SLI · SLO

0 likes · 4 min read

Tech Architecture Stories

Jul 23, 2023 · Operations

Why Every Backend Engineer Should Read Google’s SRE Handbook

The article recommends two essential Google SRE books for backend developers, explains what SRE is and how it differs from traditional operations, and shows how concepts such as SLI/SLO, incident postmortems, and reliability engineering can be applied to improve system availability and stability.

Backend Development · Operations · SRE

0 likes · 4 min read

Efficient Ops

May 21, 2023 · Operations

From Apollo to Google: How Margaret Hamilton Shaped Modern SRE

This article traces the origins of Site Reliability Engineering from Margaret Hamilton’s pioneering work on the Apollo program, through Google’s formal SRE team creation, and highlights the key differences between SRE and traditional operations practices.

Google · Margaret Hamilton · Operations

0 likes · 7 min read

DevOps Cloud Academy

May 10, 2023 · Operations

Understanding the Role of Site Reliability Engineering (SRE) in DevOps

This article explains why Site Reliability Engineering (SRE) and DevOps are both essential for modern software development, compares their objectives, outlines their complementary roles, and highlights the fundamental differences that help organizations achieve faster releases with higher reliability.

DevOps · SRE · Site Reliability Engineering

0 likes · 8 min read

Efficient Ops

Feb 7, 2023 · Operations

Why SRE Is Essential for Reliable Internet Services – Chinese Experts Share Insights

Site Reliability Engineering (SRE), introduced by Google in 2003, has become a cornerstone for ensuring the reliability and stability of large‑scale internet platforms, and Chinese experts now share home‑grown practices and a new book that distills two decades of SRE experience for building high‑availability applications.

Book · DevOps · Operations

0 likes · 3 min read

DevOps Cloud Academy

Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability

0 likes · 29 min read

Bilibili Tech

Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework—categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget and consumption‑rate alerting strategies to improve stability and provide comprehensive quality dashboards.

Alerting · Metrics · Monitoring

0 likes · 18 min read

DevOps

Jul 8, 2022 · Operations

Nine Essential Skills Every Modern Site Reliability Engineer Should Master

The article outlines the nine core competencies—network expertise, Linux/Unix knowledge, cloud computing, CI/CD pipelines, QA automation, security engineering, DevOps, incident management, and post‑incident review—that enable SREs to ensure the availability, performance, and reliability of complex distributed systems.

Cloud Computing · DevOps · SRE

0 likes · 6 min read

Architecture Talk

Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · Monitoring · SLI

0 likes · 14 min read

IT Architects Alliance

Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI

0 likes · 21 min read

Architect

Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Monitoring · Oncall · Operations

0 likes · 22 min read

IT Architects Alliance

Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden monitoring metrics, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · Monitoring · Operations

0 likes · 15 min read

IT Architects Alliance

Dec 1, 2021 · Operations

What Does an SRE Actually Do? A Deep Dive into Roles and Practices

This article explains the origins of Site Reliability Engineering, breaks down its three main layers—Infrastructure, Platform, and Business SRE—covers day‑1 and day‑2 deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning, user support, and offers practical advice for aspiring SREs.

Oncall · Operations · SLI

0 likes · 24 min read

Programmer DD

Nov 16, 2021 · Operations

What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its origins at Google, the challenges of hiring, the three-layer model of infrastructure, platform, and business SRE, and provides detailed responsibilities, on‑call practices, SLI/SLO management, capacity planning, and career advice for aspiring SREs.

Oncall · Platform · SLI

0 likes · 23 min read

ByteDance ADFE Team

Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI

0 likes · 7 min read

Efficient Ops

Mar 31, 2021 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Tools, and Practices

The article provides a comprehensive overview of Site Reliability Engineering (SRE), explaining its origins, definition by Google, required skill sets, typical responsibilities, tools used, and how the role has evolved within DevOps and modern cloud‑native environments.

DevOps · Operations · Reliability

0 likes · 9 min read

DevOps

Mar 18, 2021 · Operations

Understanding Site Reliability Engineering (SRE) and Its Role in Software Stability

Site Reliability Engineering (SRE) combines software engineering with operations to ensure scalable, highly reliable systems, outlining the collaboration between product development and SRE roles, the software lifecycle, stability value, and practical frameworks for observability, controllability, and best‑practice implementation.

SRE · Site Reliability Engineering · Software Lifecycle

0 likes · 12 min read

21CTO

Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Observability · Operations · SRE

0 likes · 13 min read

Efficient Ops

Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course · Reliability · SRE

0 likes · 7 min read

21CTO

Jan 2, 2021 · Operations

Designing & Operating Highly Available Scalable Systems: Google’s SRE Secrets

This article presents a comprehensive overview of Site Reliability Engineering (SRE) as shared by Google SRE expert Ramón Medrano Llamas, covering SRE fundamentals, a typical day’s workflow, design principles for massive scale, fault‑tolerant architecture, monitoring, SLI/SLO metrics, redundancy strategies, disaster recovery, and operational best practices.

Operations · SRE · Scalable Systems

0 likes · 13 min read

Efficient Ops

Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps · Reliability · SRE

0 likes · 7 min read

Efficient Ops

Aug 23, 2020 · Operations

Unlock Reliable Services: SRE Foundation Course Highlights at GOPS 2020

The SRE Foundation course presented at the GOPS 2020 Global Operations Conference in Shenzhen introduces core Site Reliability Engineering principles, practical tools, and certification preparation through eight detailed modules, targeting a wide range of IT professionals and business stakeholders.

DevOps · SRE · Site Reliability Engineering

0 likes · 6 min read

DevOps

Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Fault Injection · Observability · Reliability

0 likes · 21 min read

Efficient Ops

Jul 28, 2020 · Operations

How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Agile · SRE · Site Reliability Engineering

0 likes · 11 min read

dbaplus Community

Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · SRE · Site Reliability Engineering

0 likes · 10 min read

dbaplus Community

Dec 30, 2019 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

Automation · Cloud Computing · Operations

0 likes · 28 min read

Alibaba Cloud Developer

Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations · SRE · Site Reliability Engineering

0 likes · 27 min read

Efficient Ops

Jan 29, 2018 · Operations

Will Operations Be Replaced? Exploring the Role, Skills, and Future of DevOps

This article demystifies operations engineering by explaining its core responsibilities, addressing common myths about being replaced by cloud or DevOps, outlining the product lifecycle handoff, comparing Ops engineers with Ops developers, and proposing a skill‑level framework to guide career growth.

DevOps · Site Reliability Engineering · skill assessment

0 likes · 12 min read

MaGe Linux Operations

Nov 26, 2017 · Operations

What Google’s SRE Reveals About Modern Operations and SLO Design

This article shares key insights from the book “SRE Google Operations Unveiled,” explaining Google’s infrastructure, the role of SRE, and how Service Level Objectives (SLOs) help balance reliability, cost, and innovation in modern operations.

Google · SLO · SRE

0 likes · 9 min read

Architects Research Society

Oct 20, 2017 · Operations

Understanding Site Reliability Engineering (SRE): Definitions, Tools, Roles, and Evolution

The article explains Site Reliability Engineering (SRE) as a discipline that blends software engineering with operations, detailing its origins, key responsibilities, required skill sets, tools, impact on reliability and downtime costs, and how the role has evolved with modern cloud and DevOps practices.

DevOps · Reliability · SRE

0 likes · 9 min read

Efficient Ops

Sep 27, 2017 · Operations

From Ops to SRE: What Google’s Site Reliability Model Means for Your Team

The article reflects on the shift from traditional operations to Site Reliability Engineering (SRE), comparing Google’s SRE practices with those of a Chinese cloud provider, and explores infrastructure, tooling, team structure, and cultural challenges while drawing practical lessons for engineers.

DevOps · Google · SRE

0 likes · 19 min read

Efficient Ops

Jul 25, 2017 · Operations

Why Google’s SRE Model Matters: Lessons for Modern Ops Teams

This article explains the origins, responsibilities, and team structures of Google Site Reliability Engineering (SRE), compares it with traditional operations roles in companies like Yahoo, Alibaba, and Facebook, and offers practical guidance for building effective SRE or application‑operations teams today.

DevOps · SRE · Site Reliability Engineering

0 likes · 25 min read

Efficient Ops

Jun 10, 2017 · Operations

What Google’s SRE Book Reveals About Modern Operations

This article introduces the Chinese translation of Google’s SRE book, shares behind‑the‑scenes stories of its creation, and distills key concepts such as the AAA model, Borg architecture, SLOs, toil reduction, and the cultural shift required for reliable large‑scale services.

DevOps · Google · SRE

0 likes · 20 min read

DevOps

Apr 18, 2017 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Responsibilities, Skills, and Differences from DevOps

This article explains the concept of Site Reliability Engineering (SRE), its origins at Google, core responsibilities such as IT operations and availability improvement, required skill sets, how it differs from DevOps, and guidance on adopting SRE practices within organizations.

DevOps · IT Service Management · Operations

0 likes · 12 min read

High Availability Architecture

Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix

0 likes · 6 min read

Efficient Ops

Sep 18, 2016 · Operations

Who Was the World’s First SRE? Uncovering Margaret Hamilton’s Legacy

This article explores the origins of Site Reliability Engineering, highlights Margaret Hamilton as the likely first SRE through her work on NASA’s Apollo program, and draws lessons on reliability, disaster prevention, and the evolution of modern SRE practices.

Apollo program · Margaret Hamilton · SRE

0 likes · 10 min read

MaGe Linux Operations

May 27, 2016 · Operations

Why Google Relies on Software Engineers to Run Its Services: Inside SRE

The article explains Google’s Site Reliability Engineering (SRE) philosophy, how it empowers software engineers to automate operations, the balance between development and reliability, the concept of error budgets, and the cultural shift that turned DevOps into a core practice for large‑scale services.

DevOps · Error Budget · Operations Automation

0 likes · 10 min read

21CTO

Apr 21, 2016 · Operations

Why Google Lets Software Engineers Run Its Services: Inside Site Reliability Engineering

Google’s near‑perfect uptime is achieved by Site Reliability Engineering, a philosophy that empowers software engineers to automate operations, balance development with reliability, and treat system availability as a core product feature.

DevOps · Google · SRE

0 likes · 10 min read
