Tagged articles
26 articles
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Observability · Operations
0 likes · 13 min read
JD Cloud Developers
Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

SLA · SLI · SLO
0 likes · 23 min read
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLIs/SLOs and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
0 likes · 14 min read
dbaplus Community
Aug 28, 2023 · Operations

How to Define SLIs, SLOs, SLAs and Build Reliable, Observable Systems

This guide explains how SRE teams should define service level indicators, objectives, and agreements, design reliable and observable architectures, manage error budgets, assess risks, handle incidents, and integrate development practices to improve system stability and performance.

Error Budget · Reliability · SLI
0 likes · 15 min read
Tech Architecture Stories
Aug 15, 2023 · Cloud Native

Unlocking Microservice Success: The Interplay of Metrics, Governance, and Validation

This article explains how measurement (SLI/SLO), governance (architecture refactoring, MTTx), and validation (chaos engineering, disaster drills) interrelate in microservice systems, illustrating how observability drives governance actions, governance improves metrics, and validation reinforces both through continuous testing.

Microservices · Observability · SLI
0 likes · 4 min read
DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLO, SLI, SLA, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

Google · SLI · SLO
0 likes · 4 min read
MaGe Linux Operations
May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta · Observability · Reliability
0 likes · 13 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Observability · Operations
0 likes · 12 min read
Architects Research Society
Aug 24, 2022 · Operations

Choosing Appropriate SLIs and Defining SLOs for Reliable Services

This guide explains how to select suitable service‑level indicators (SLIs), define customer‑centric service‑level objectives (SLOs), use error budgets, and iteratively improve reliability for various system types such as services, data processing, and storage, with practical recommendations for Google Cloud environments.

Google Cloud · SLI · SLO
0 likes · 10 min read
DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Infrastructure · Oncall · Reliability
0 likes · 21 min read
Architecture Talk
Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO
0 likes · 14 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · Operations
0 likes · 21 min read
IT Architects Alliance
Dec 1, 2021 · Operations

What Does an SRE Actually Do? A Deep Dive into Roles and Practices

This article explains the origins of Site Reliability Engineering, breaks down its three main layers (Infrastructure, Platform, and Business SRE), covers day‑1 and day‑2 deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning, and user support, and offers practical advice for aspiring SREs.

Infrastructure · Oncall · Operations
0 likes · 24 min read
Programmer DD
Nov 16, 2021 · Operations

What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its origins at Google, the challenges of hiring, the three-layer model of infrastructure, platform, and business SRE, and provides detailed responsibilities, on‑call practices, SLI/SLO management, capacity planning, and career advice for aspiring SREs.

Infrastructure · Oncall · SLI
0 likes · 23 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI
0 likes · 7 min read
ITPUB
Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Metrics · Operations · SLI
0 likes · 10 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Operations · SLA · SLI
0 likes · 11 min read