Tag

MTBF

Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
14 min read
Architects Research Society
Oct 5, 2023 · Fundamentals

Understanding Stability and Reliability Testing in Software Development

This article explains the definitions, objectives, importance, and types of stability and reliability testing in software development, highlighting how these tests improve system availability, reduce failure risk, and guide corrective actions to lower maintenance costs.

MTBF · MTTR · quality assurance
14 min read
Continuous Delivery 2.0
Sep 25, 2023 · Operations

Understanding MTTR, MTBF, and MTTF: Fault Metrics for Reliability Engineering

This article explains the essential fault metrics MTTR, MTBF, and MTTF, their definitions, calculations, and practical importance for SRE and operations teams to improve system availability, guide maintenance strategies, and make data‑driven reliability decisions.

MTBF · MTTF · MTTR
11 min read
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Operations
14 min read
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

High Availability · MTBF · MTTR
15 min read
Architects Research Society
Sep 14, 2022 · Fundamentals

Understanding Stability and Reliability Testing in Software Development

This article explains the definitions, objectives, importance, and various types of stability and reliability testing—including stress, recovery, failover, and stability tests—while highlighting how these practices reduce system failures, improve MTBF/MTTR, and support informed decision‑making for software quality assurance.

MTBF · MTTR · performance testing
11 min read
DevOps
Aug 17, 2022 · Operations

Measuring Success in Continuous Delivery: Four Key Metrics and Practical Tips

This article explains why measuring is essential for continuous delivery, introduces four valuable metrics—deployable package count, cycle time, mean time between failures, and mean time to recovery—and offers practical tips to improve delivery speed and reliability.

Continuous Delivery · DevOps · MTBF
7 min read
Wukong Talks Architecture
Jul 14, 2021 · Operations

Understanding High Availability: Lessons from the Bilibili Outage

This article analyzes Bilibili's recent service disruption, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region active‑active deployments to improve system reliability.

Distributed Systems · HA · High Availability
13 min read
Continuous Delivery 2.0
Jan 19, 2021 · Operations

Understanding MTTR, MTBF, and MTTF: Key Reliability Metrics for SRE

This article explains the definitions, calculations, and practical importance of MTTR, MTBF, and MTTF for reliability engineering, showing how accurate data and proper metric use enable SRE teams to improve system availability, plan maintenance, and reduce downtime.

MTBF · MTTF · MTTR
13 min read
Efficient Ops
Apr 10, 2018 · Operations

Why Durability and Availability Matter: Uncovering the Real Meaning Behind Storage Reliability

This article demystifies reliability by clarifying the difference between durability and availability, exposing common misconceptions about MTBF, analyzing real‑world disk failure data, and presenting a practical formula for calculating the health probability of distributed storage systems.

Distributed Systems · MTBF · availability
13 min read
Architects' Tech Alliance
Aug 31, 2016 · Fundamentals

Understanding Reliability Evaluation for Storage, Servers, and Distributed Systems

The article explains how reliability of storage, servers, and distributed systems is assessed using standards, models like MTBF/MTTR, RAS features, CAP/BASE theories, and end‑to‑end solutions, emphasizing the gap between theoretical metrics and real‑world operational data.

Distributed Systems · MTBF · RAS
13 min read
Architect
Jan 22, 2016 · Operations

System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

Cloud Computing · MTBF · YunOS
15 min read