Tag

SLO

Xiaokun's Architecture Exploration Notes
Jun 1, 2025 · Operations

Understanding SLA, SLO, and SLI: Key Metrics for High‑Availability Systems

This article explains the differences between SLA, SLO, and SLI, shows how to express user expectations as concrete service level agreements, and introduces essential high‑availability metrics such as availability percentages, MTBF, MTTR, RPO, RTO, WRT, and MTD for reliable system design.

High Availability · SLA · SLI
9 min read

Efficient Ops
Mar 4, 2025 · Operations

Mastering SRE: How to Define SLIs, SLOs, and Build Reliable Cloud‑Native Systems

This article explains how SRE teams should collaboratively define Service Level Indicators, Objectives, and Agreements, and then covers reliability, performance, observability signals, error budgeting, risk management, incident handling, and the engineering work needed to build robust cloud‑native platforms.

Error Budget · Observability · SLI
13 min read

58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing · Error Budget · Monitoring
16 min read

Efficient Ops
Mar 25, 2024 · Operations

Why SRE Exists and How It Solves Modern Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how SRE teams use SLOs, monitoring, and scenario drills to improve system reliability, performance, and observability in complex production environments.

DevOps · Monitoring · SLO
12 min read

Efficient Ops
Dec 20, 2023 · Operations

How Bilibili Implements SLO Engineering to Boost Service Reliability

This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application and business level SLIs, alerting strategies, SLO‑driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery, illustrating how reliability is systematically improved.

Monitoring · Reliability Engineering · SLO
16 min read

Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
14 min read

Continuous Delivery 2.0
Sep 27, 2023 · Operations

Applying the VALET Pattern Language for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET pattern language (Volume, Availability, Latency, Errors, and Tickets) to unify service‑level objectives, automate data collection, build dashboards, and improve SRE practices across its massive retail and e‑commerce infrastructure.

Home Depot · Monitoring · SLO
9 min read

DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLO, SLI, SLA, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

DevOps · Google · SLI
4 min read

Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Reliability Engineering
14 min read

Efficient Ops
May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

Quality Operations · Reliability Engineering · SLO
11 min read

Bilibili Tech
Oct 29, 2022 · Operations

Stability Building and SLO Operations After the “713 Incident”

The deck outlines post‑incident stability enhancements and the adoption of Service Level Objectives after the “713” fault, detailing failure analysis, reliability upgrades, monitoring practices, and the definition and operation of SLOs to sustain system quality, illustrated through architecture diagrams and reliability metrics.

Reliability Engineering · SLO · incident management
1 min read

Efficient Ops
Aug 31, 2022 · Operations

How Intelligent Operations and Observability Transform Cloud‑Native Environments

In this talk, Wu Yakun from Guance Cloud explains the shortcomings of traditional operations, introduces intelligent, data‑driven approaches for the cloud‑native era, and outlines how unified data collection, observability, and SLO‑based monitoring can dramatically improve fault detection and system reliability.

Cloud Native · Intelligent Operations · Monitoring
16 min read

Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Observability · SLI
12 min read

Architects Research Society
Aug 24, 2022 · Operations

Choosing Appropriate SLIs and Defining SLOs for Reliable Services

This guide explains how to select suitable service‑level indicators (SLIs), define customer‑centric service‑level objectives (SLOs), use error budgets, and iteratively improve reliability for various system types such as services, data processing, and storage, with practical recommendations for Google Cloud environments.

Error Budget · Google Cloud · Monitoring
10 min read

Bilibili Tech
Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework—categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget and consumption‑rate alerting strategies to improve stability and provide comprehensive quality dashboards.

Metrics · Monitoring · SLO
18 min read

Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Reliability Engineering
21 min read

DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Infrastructure · Oncall · SLI
21 min read

IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · SLI
21 min read

IT Architects Alliance
Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden monitoring metrics, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · Monitoring · SLI
15 min read