Tag

MTTR

Bilibili Tech
Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis. It targets 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery (the 1‑3‑5‑10 goals). The ERC has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1‑3‑5‑10 performance goals.

MTTR · SRE · automation
0 likes · 22 min read
DevOps Operations Practice
Jun 17, 2024 · Operations

Key DevOps Metrics: Deployment Frequency, Lead Time, Change Failure Rate, MTTR, and Customer Satisfaction

This article explains essential DevOps metrics (deployment frequency, lead time for changes, change failure rate, mean time to recovery, and customer satisfaction), detailing why they matter, how to measure them, and practical ways to improve each one for more efficient and reliable software delivery.

Change Failure Rate · Customer Satisfaction · Deployment Frequency
0 likes · 9 min read
JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

Incident Response · MTTR · SRE
0 likes · 26 min read
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
0 likes · 14 min read
Architects Research Society
Oct 5, 2023 · Fundamentals

Understanding Stability and Reliability Testing in Software Development

This article explains the definitions, objectives, importance, and types of stability and reliability testing in software development, highlighting how these tests improve system availability, reduce failure risk, and guide corrective actions to lower maintenance costs.

MTBF · MTTR · quality assurance
0 likes · 14 min read
Continuous Delivery 2.0
Sep 25, 2023 · Operations

Understanding MTTR, MTBF, and MTTF: Fault Metrics for Reliability Engineering

This article explains the essential fault metrics MTTR, MTBF, and MTTF, their definitions, calculations, and practical importance for SRE and operations teams to improve system availability, guide maintenance strategies, and make data‑driven reliability decisions.

MTBF · MTTF · MTTR
0 likes · 11 min read
JD Retail Technology
Jun 14, 2023 · Backend Development

Reducing MTTR in a High‑Availability SaaS Platform through Chaos Engineering and Middleware Resilience

This article explains how a SaaS platform for employee incentives reduces mean time to recovery (MTTR) during large‑scale promotions by applying chaos‑engineering drills, automating fault detection, and leveraging JSF middleware features such as timeout‑retry, adaptive load balancing, and circuit breaking to improve overall system stability.

Backend Resilience · Chaos Engineering · MTTR
0 likes · 13 min read
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

High Availability · MTBF · MTTR
0 likes · 15 min read
Architects Research Society
Sep 14, 2022 · Fundamentals

Understanding Stability and Reliability Testing in Software Development

This article explains the definitions, objectives, importance, and various types of stability and reliability testing—including stress, recovery, failover, and stability tests—while highlighting how these practices reduce system failures, improve MTBF/MTTR, and support informed decision‑making for software quality assurance.

MTBF · MTTR · performance testing
0 likes · 11 min read
DevOps
Aug 17, 2022 · Operations

Measuring Success in Continuous Delivery: Four Key Metrics and Practical Tips

This article explains why measuring is essential for continuous delivery, introduces four valuable metrics—deployable package count, cycle time, mean time between failures, and mean time to recovery—and offers practical tips to improve delivery speed and reliability.

Continuous Delivery · DevOps · MTBF
0 likes · 7 min read
vivo Internet Technology
Nov 17, 2021 · Operations

Design and Architecture of a Unified Alert Convergence System for Monitoring

The paper presents a unified alert convergence system that centralizes metric calculation, detection, and alarm handling across monitoring subsystems. It employs mechanisms such as convergence, claiming, silencing, and escalation, plus a Redis‑based delayed queue integrated via Kafka or REST, to reduce alarm fatigue, improve MTTA/MTTR, and enable future AIOps.

Alert Convergence · MTTA · MTTR
0 likes · 18 min read
Wukong Talks Architecture
Jul 14, 2021 · Operations

Understanding High Availability: Lessons from the Bilibili Outage

This article analyzes Bilibili's recent service disruption, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region active‑active deployments to improve system reliability.

HA · High Availability · MTBF
0 likes · 13 min read
HelloTech
Jul 12, 2021 · Operations

Introduction to System Stability: Concepts, Metrics, and Practices

The article explains Haro’s approach to system stability—defining high‑availability, key metrics such as SLA, RPO/RTO, MTTR/MTBF, and the 5‑5‑10 rule—while outlining cultural and technical safeguards, full‑team participation, process integration, and incremental tooling to prevent faults and ensure rapid recovery.

High Availability · MTTR · RPO
0 likes · 11 min read
Continuous Delivery 2.0
Jan 19, 2021 · Operations

Understanding MTTR, MTBF, and MTTF: Key Reliability Metrics for SRE

This article explains the definitions, calculations, and practical importance of MTTR, MTBF, and MTTF for reliability engineering, showing how accurate data and proper metric use enable SRE teams to improve system availability, plan maintenance, and reduce downtime.

MTBF · MTTF · MTTR
0 likes · 13 min read
Big Data Technology Architecture
Apr 29, 2020 · Databases

Enhancing HBase CAP Model and MTTR with Kafka‑Based IO Decoupling and Native AP Support

The article analyzes HBase's CP‑oriented CAP limitations, proposes native AP support via Replica, decouples WAL IO to Kafka, optimizes MTTR, introduces multi‑datacenter active/active disaster recovery, and redesigns client write paths and LogSplit processing for higher availability and throughput.

CAP · Database Architecture · HBase
0 likes · 11 min read