Tagged articles
46 articles
DevOps Coach
Dec 29, 2025 · Operations

Mastering System Reliability: Lessons from Google, Netflix, and Meta

Learn how Google, Netflix, and Meta pioneered modern reliability practices—SRE’s data‑driven metrics, Netflix’s chaos engineering, and Meta’s self‑healing automation—and get a step‑by‑step handbook to apply these concepts, avoid common traps, and build resilient systems at any scale.

Automation · Site Reliability Engineering · chaos engineering
0 likes · 11 min read
Continuous Delivery 2.0
Oct 13, 2025 · Operations

How Google’s SRE Evolved Over 20 Years: From Crisis to Industry Standard

This article traces Google Site Reliability Engineering from its 2003 inception addressing scale crises, through organizational growth, core principles, team structures, and recent security integrations, showing how SRE transformed operations into a software‑engineering discipline that drives reliable, scalable digital services.

Error Budget · Google · Operations
0 likes · 13 min read
Continuous Delivery 2.0
Oct 11, 2025 · Operations

Mastering Enterprise SRE: From Core Concepts to Practical Implementation

This comprehensive guide explains the core principles of Site Reliability Engineering, outlines a phased roadmap for enterprise adoption, details essential monitoring, automation, and reliability platforms, and addresses team structure, talent development, common challenges, and real‑world success stories to help organizations build effective SRE practices.

Automation · SRE · Site Reliability Engineering
0 likes · 16 min read
MaGe Linux Operations
Sep 28, 2025 · Operations

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Collaboration · SRE · Site Reliability Engineering
0 likes · 5 min read
MaGe Linux Operations
Dec 26, 2024 · Operations

What Does Modern IT Operations Involve? A Complete Guide to Roles & Evolution

This article provides a comprehensive overview of internet operations, detailing the three core pillars of service‑centered stability, security, and efficiency, describing the classification of operations roles, their responsibilities, the evolution of operational practices, and practical advice for aspiring operations engineers.

Infrastructure · Site Reliability Engineering
0 likes · 20 min read
Efficient Ops
Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work—Infrastructure, Platform, and Business—covers hiring challenges, daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, user support, and offers practical interview and career advice.

Oncall · Operations · SRE
0 likes · 22 min read
DeWu Technology
Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on C‑end link reliability, Alibaba’s system stability operations, Tencent Game’s cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE
0 likes · 7 min read
DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLO, SLI, SLA, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

Google · SLI · SLO
0 likes · 4 min read
Tech Architecture Stories
Jul 23, 2023 · Operations

Why Every Backend Engineer Should Read Google’s SRE Handbook

The article recommends two essential Google SRE books for backend developers, explains what SRE is and how it differs from traditional operations, and shows how concepts such as SLI/SLO, incident postmortems, and reliability engineering can be applied to improve system availability and stability.

Backend Development · Operations · SRE
0 likes · 4 min read
Efficient Ops
May 21, 2023 · Operations

From Apollo to Google: How Margaret Hamilton Shaped Modern SRE

This article traces the origins of Site Reliability Engineering from Margaret Hamilton’s pioneering work on the Apollo program, through Google’s formal SRE team creation, and highlights the key differences between SRE and traditional operations practices.

Google · Margaret Hamilton · Operations
0 likes · 7 min read
DevOps Cloud Academy
May 10, 2023 · Operations

Understanding the Role of Site Reliability Engineering (SRE) in DevOps

This article explains why Site Reliability Engineering (SRE) and DevOps are both essential for modern software development, compares their objectives, outlines their complementary roles, and highlights the fundamental differences that help organizations achieve faster releases with higher reliability.

DevOps · SRE · Site Reliability Engineering
0 likes · 8 min read
DevOps Cloud Academy
Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability
0 likes · 29 min read
Bilibili Tech
Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework—categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget and consumption‑rate alerting strategies to improve stability and provide comprehensive quality dashboards.

Alerting · Metrics · SLO
0 likes · 18 min read
DevOps
Jul 8, 2022 · Operations

Nine Essential Skills Every Modern Site Reliability Engineer Should Master

The article outlines the nine core competencies—network expertise, Linux/Unix knowledge, cloud computing, CI/CD pipelines, QA automation, security engineering, DevOps, incident management, and post‑incident review—that enable SREs to ensure the availability, performance, and reliability of complex distributed systems.

DevOps · SRE · Site Reliability Engineering
0 likes · 6 min read
Architecture Talk
Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO
0 likes · 14 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · Operations
0 likes · 21 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO
0 likes · 22 min read
IT Architects Alliance
Dec 1, 2021 · Operations

What Does an SRE Actually Do? A Deep Dive into Roles and Practices

This article explains the origins of Site Reliability Engineering, breaks down its three main layers (Infrastructure, Platform, and Business SRE), covers day‑1 and day‑2 deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning, and user support, and offers practical advice for aspiring SREs.

Infrastructure · Oncall · Operations
0 likes · 24 min read
Programmer DD
Nov 16, 2021 · Operations

What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its origins at Google, the challenges of hiring, the three-layer model of infrastructure, platform, and business SRE, and provides detailed responsibilities, on‑call practices, SLI/SLO management, capacity planning, and career advice for aspiring SREs.

Infrastructure · Oncall · SLI
0 likes · 23 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI
0 likes · 7 min read
Efficient Ops
Mar 31, 2021 · Operations

Top 7 SRE Interview Questions Every Candidate Should Master

This article outlines the seven most important Site Reliability Engineering interview questions, explains why they matter, and provides an overview of the upcoming SRE Foundation course that equips professionals with the principles, practices, and tools needed for reliable, scalable systems.

SRE · SRE Foundation · Site Reliability Engineering
0 likes · 9 min read
DevOps
Mar 18, 2021 · Operations

Understanding Site Reliability Engineering (SRE) and Its Role in Software Stability

Site Reliability Engineering (SRE) combines software engineering with operations to ensure scalable, highly reliable systems, outlining the collaboration between product development and SRE roles, the software lifecycle, stability value, and practical frameworks for observability, controllability, and best‑practice implementation.

SRE · Site Reliability Engineering · software lifecycle
0 likes · 12 min read
21CTO
Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Observability · Operations · SRE
0 likes · 13 min read
Efficient Ops
Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course · Reliability · SRE
0 likes · 7 min read
21CTO
Jan 2, 2021 · Operations

Designing & Operating Highly Available Scalable Systems: Google’s SRE Secrets

This article presents a comprehensive overview of Site Reliability Engineering (SRE) as shared by Google SRE expert Ramón Medrano Llamas, covering SRE fundamentals, a typical day’s workflow, design principles for massive scale, fault‑tolerant architecture, monitoring, SLI/SLO metrics, redundancy strategies, disaster recovery, and operational best practices.

Operations · SRE · Scalable Systems
0 likes · 13 min read
Efficient Ops
Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps · Reliability · SRE
0 likes · 7 min read
Efficient Ops
Aug 23, 2020 · Operations

Unlock Reliable Services: SRE Foundation Course Highlights at GOPS 2020

The SRE Foundation course presented at the GOPS 2020 Global Operations Conference in Shenzhen introduces core Site Reliability Engineering principles, practical tools, and certification preparation through eight detailed modules, targeting a wide range of IT professionals and business stakeholders.

DevOps · SRE · Site Reliability Engineering
0 likes · 6 min read
DevOps
Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Distributed Systems · Fault Injection · Observability
0 likes · 21 min read
Efficient Ops
Jul 28, 2020 · Operations

How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Infrastructure · SRE · Site Reliability Engineering
0 likes · 11 min read
dbaplus Community
Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · Grafana · SRE
0 likes · 10 min read
dbaplus Community
Dec 30, 2019 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

Automation · Operations · SRE
0 likes · 28 min read
Alibaba Cloud Developer
Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations · SRE · Site Reliability Engineering
0 likes · 27 min read
Efficient Ops
Jan 29, 2018 · Operations

Will Operations Be Replaced? Exploring the Role, Skills, and Future of DevOps

This article demystifies operations engineering by explaining its core responsibilities, addressing common myths about being replaced by cloud or DevOps, outlining the product lifecycle handoff, comparing Ops engineers with Ops developers, and proposing a skill‑level framework to guide career growth.

DevOps · Site Reliability Engineering · skill assessment
0 likes · 12 min read
Efficient Ops
Sep 27, 2017 · Operations

From Ops to SRE: What Google’s Site Reliability Model Means for Your Team

The article reflects on the shift from traditional operations to Site Reliability Engineering (SRE), comparing Google’s SRE practices with those of a Chinese cloud provider, and explores infrastructure, tooling, team structure, and cultural challenges while drawing practical lessons for engineers.

DevOps · Google · SRE
0 likes · 19 min read
Efficient Ops
Jul 25, 2017 · Operations

Why Google’s SRE Model Matters: Lessons for Modern Ops Teams

This article explains the origins, responsibilities, and team structures of Google Site Reliability Engineering (SRE), compares it with traditional operations roles in companies like Yahoo, Alibaba, and Facebook, and offers practical guidance for building effective SRE or application‑operations teams today.

DevOps · SRE · Site Reliability Engineering
0 likes · 25 min read
Efficient Ops
Jun 10, 2017 · Operations

What Google’s SRE Book Reveals About Modern Operations

This article introduces the Chinese translation of Google’s SRE book, shares behind‑the‑scenes stories of its creation, and distills key concepts such as the AAA model, Borg architecture, SLOs, toil reduction, and the cultural shift required for reliable large‑scale services.

DevOps · Google · Infrastructure
0 likes · 20 min read
High Availability Architecture
Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix
0 likes · 6 min read
Efficient Ops
Sep 18, 2016 · Operations

Who Was the World’s First SRE? Uncovering Margaret Hamilton’s Legacy

This article explores the origins of Site Reliability Engineering, highlights Margaret Hamilton as the likely first SRE through her work on NASA’s Apollo program, and draws lessons on reliability, disaster prevention, and the evolution of modern SRE practices.

Apollo program · Margaret Hamilton · SRE
0 likes · 10 min read
MaGe Linux Operations
May 27, 2016 · Operations

Why Google Relies on Software Engineers to Run Its Services: Inside SRE

The article explains Google’s Site Reliability Engineering (SRE) philosophy, how it empowers software engineers to automate operations, the balance between development and reliability, the concept of error budgets, and the cultural shift that turned DevOps into a core practice for large‑scale services.

DevOps · Error Budget · Operations Automation
0 likes · 10 min read