Tagged articles
23 articles
FunTester
Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving the error budget, automating remediation of high‑impact incidents, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR
10 min read
DevOps Coach
Jan 3, 2026 · Operations

From DevOps Chaos to Platform Power: How Observability Becomes a Strategic Capability

The article explores how large organizations transform chaotic, tool‑centric observability practices into a platform capability driven by SLOs, error budgets, GitOps, and service‑mesh telemetry, using real‑world case studies to show measurable improvements in reliability, deployment speed, and team culture.

DORA metrics · Error Budget · SLO
25 min read
DevOps Coach
Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DORA · Error Budget · Golden Signals
18 min read
Continuous Delivery 2.0
Oct 13, 2025 · Operations

How Google’s SRE Evolved Over 20 Years: From Crisis to Industry Standard

This article traces Google Site Reliability Engineering from its 2003 inception addressing scale crises, through organizational growth, core principles, team structures, and recent security integrations, showing how SRE transformed operations into a software‑engineering discipline that drives reliable, scalable digital services.

Error Budget · Google · Operations
13 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Observability · Operations
13 min read
Tech Architecture Stories
Dec 28, 2024 · Operations

Why Preventing Small Issues Is the Key to System Stability

The article explains how early detection and preventive measures—such as comprehensive monitoring, rate limiting, chaos testing, and proper SLOs—are essential for maintaining system stability and avoiding larger incidents, drawing on SRE principles and the incident triangle theory.

Error Budget · Operations · SRE
4 min read
58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing · Error Budget · Observability
16 min read
dbaplus Community
Feb 4, 2024 · Operations

How Ant Group Leverages SLO and AIOps for Fine‑Grained Operations

This article details Ant Group's practical implementation of Service Level Objectives (SLO) and AIOps to achieve fine‑grained operations, covering SLO fundamentals, health‑score architecture, GitOps‑based data pipelines, error‑budget alerting, AI‑driven anomaly detection, fault localization techniques, and real‑world case studies on dashboards, Kubernetes SLOs, and emergency response workflows.

Error Budget · Fault Localization · Kubernetes
38 min read
dbaplus Community
Aug 28, 2023 · Operations

How to Define SLIs, SLOs, SLAs and Build Reliable, Observable Systems

This guide explains how SRE teams should define service level indicators, objectives, and agreements, design reliable and observable architectures, manage error budgets, assess risks, handle incidents, and integrate development practices to improve system stability and performance.

Error Budget · Reliability · SLI
15 min read
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Operations
14 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
14 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Observability · Operations
12 min read
Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Operations
21 min read
Architecture Talk
Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO
14 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI
7 min read
dbaplus Community
Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR
30 min read
Efficient Ops
Feb 5, 2020 · Operations

Balancing Stability and Speed: Google SRE Lessons for Modern Ops Teams

This article examines the inherent tension between operations and development, explains Google’s error‑budget and SLO approach, and shares practical DevOps, on‑call, automation, and talent strategies that help ops teams improve efficiency while maintaining product reliability.

Automation · Error Budget · On-Call
9 min read
MaGe Linux Operations
Jan 31, 2020 · Operations

Balancing Stability and Speed: Lessons from Google SRE for Modern Ops

This article examines the tension between operations and development teams, explains Google's SRE error‑budget model, and shares practical reflections on engineering ops, on‑call rotation, automation, and talent development to achieve a sustainable balance between product stability and rapid innovation.

Automation · DevOps · Error Budget
8 min read
MaGe Linux Operations
May 27, 2016 · Operations

Why Google Relies on Software Engineers to Run Its Services: Inside SRE

The article explains Google’s Site Reliability Engineering (SRE) philosophy, how it empowers software engineers to automate operations, the balance between development and reliability, the concept of error budgets, and the cultural shift that turned DevOps into a core practice for large‑scale services.

DevOps · Error Budget · Operations Automation
10 min read