Tagged articles
18 articles
Page 1 of 1
Open Source Linux
Open Source Linux
Oct 11, 2024 · Operations

Essential IT Operations Metrics: Definitions, Formulas, and Benchmarks

This article explains why operations metrics are vital for businesses, describes how tracking availability, failure rate, MTTR, MTBF, response time, throughput, error rate, capacity utilization, latency, data integrity, backup success, recovery time, security patch time, server and network utilization can improve reliability, reduce costs, and boost competitiveness.

AvailabilityIT OperationsMTBF
0 likes · 7 min read
Essential IT Operations Metrics: Definitions, Formulas, and Benchmarks
Efficient Ops
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudgetMTBFMTTR
0 likes · 14 min read
Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability
Efficient Ops
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error BudgetMTBFOperations
0 likes · 14 min read
Mastering SRE: How Error Budgets and SLOs Drive System Reliability
HelloTech
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

MTBFMTTROperations
0 likes · 15 min read
Guidelines for Incident Postmortem and Fault Review
ITPUB
ITPUB
Oct 4, 2022 · Operations

What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage

The article examines B‑Station’s July 2021 outage, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to achieve resilient systems.

MTBFMTTRcircuit breaker
0 likes · 15 min read
What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage
Architects Research Society
Architects Research Society
Sep 14, 2022 · Fundamentals

Understanding Stability and Reliability Testing in Software Development

This article explains the definitions, objectives, importance, and various types of stability and reliability testing—including stress, recovery, failover, and stability tests—while highlighting how these practices reduce system failures, improve MTBF/MTTR, and support informed decision‑making for software quality assurance.

MTBFMTTRPerformance Testing
0 likes · 11 min read
Understanding Stability and Reliability Testing in Software Development
DevOps
DevOps
Aug 17, 2022 · Operations

Measuring Success in Continuous Delivery: Four Key Metrics and Practical Tips

This article explains why measuring is essential for continuous delivery, introduces four valuable metrics—deployable package count, cycle time, mean time between failures, and mean time to recovery—and offers practical tips to improve delivery speed and reliability.

Continuous DeliveryDevOpsMTBF
0 likes · 7 min read
Measuring Success in Continuous Delivery: Four Key Metrics and Practical Tips
21CTO
21CTO
Jul 16, 2021 · Operations

What Bilibili’s Outage Teaches About Achieving True High Availability

The article analyzes Bilibili’s recent service outage, explains why high availability matters, introduces key metrics like MTBF and MTTR, and outlines practical strategies such as redundancy, rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to build resilient systems.

MTBFMTTROperations
0 likes · 18 min read
What Bilibili’s Outage Teaches About Achieving True High Availability
Wukong Talks Architecture
Wukong Talks Architecture
Jul 14, 2021 · Operations

Understanding High Availability: Lessons from the Bilibili Outage

This article analyzes Bilibili's recent service disruption, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region active‑active deployments to improve system reliability.

Distributed SystemsHAMTBF
0 likes · 13 min read
Understanding High Availability: Lessons from the Bilibili Outage
dbaplus Community
dbaplus Community
Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error BudgetMTBFMTTR
0 likes · 30 min read
Mastering Fault Management: Building a Robust SRE Stability Framework
Architect
Architect
Jan 22, 2016 · Operations

System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

AvailabilityMTBFYunOS
0 likes · 15 min read
System Reliability and Availability: Insights from the Alipay Outage and YunOS