Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics (Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO) and their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics: how they are defined, what makes them hard to compute, what the experiments show, and why they give a more user‑centric view of service reliability.

dbaplus Community
Classic Availability Metrics

Commonly used metrics for quantifying service availability include:

Success‑Ratio: proportion of successful requests among all requests. Simple, but skewed toward high‑frequency users, whose requests dominate the count.

Incident‑Ratio: uptime minutes divided by total minutes. Intuitive, but it treats the service as either fully up or fully down, which is hard to apply to large distributed systems.

MTTF / MTTR / MTBF: mean time to failure, mean time to recovery, and mean time between failures. Useful for macro‑level health, but only applicable to binary up/down states.

Error‑Budget: the amount of unavailability a predefined availability target allows (1 minus the target), often expressed as a time budget. Easy to understand but lacks granularity.

SLA / SLO / SLI: service‑level agreement, objective, and indicator. Industry standard but complex to set and maintain.
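
As a rough illustration of how the simplest of these metrics are computed, the sketch below expresses Success‑Ratio, Incident‑Ratio, and a time‑based error budget as plain arithmetic. The function names and the sample numbers (a 99.9 % target over 30 days) are illustrative assumptions, not values from the original article.

```python
def success_ratio(successful_requests: int, total_requests: int) -> float:
    """Share of requests that succeeded."""
    return successful_requests / total_requests


def incident_ratio(up_minutes: float, total_minutes: float) -> float:
    """Share of wall-clock minutes during which the service counted as up."""
    return up_minutes / total_minutes


def error_budget_minutes(target: float, period_days: int) -> float:
    """Downtime allowed by an availability target over a period, in minutes."""
    total_minutes = period_days * 24 * 60
    return (1.0 - target) * total_minutes


# Hypothetical numbers, chosen only to show the arithmetic.
sr = success_ratio(990, 1000)             # 0.99
ir = incident_ratio(43_170, 43_200)       # 30 minutes of incidents in 30 days
budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes
```

Note that the first function counts requests while the other two count wall‑clock time; none of them counts time as an individual user experiences it.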

User‑Uptime

Google G Suite introduced User‑Uptime, defined as the total time users experience the service as working divided by the total time they are active (working time plus failing time). The metric sums the per‑user times across all users before dividing:

user‑uptime = Σ_u (successful_time_u) / Σ_u (active_time_u)
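
A minimal sketch of the formula above, assuming the per‑user successful and active times have already been measured. The dictionary layout and the sample values are hypothetical. The point is that the metric is a ratio of sums, which weights each user by how long they were active, rather than a mean of per‑user ratios.

```python
from typing import Dict, Tuple

# user id -> (successful_time, active_time), both in the same unit (minutes here)
PerUserTimes = Dict[str, Tuple[float, float]]


def user_uptime(per_user: PerUserTimes) -> float:
    """Sum successful time over all users, divide by summed active time."""
    total_success = sum(success for success, _ in per_user.values())
    total_active = sum(active for _, active in per_user.values())
    if total_active == 0:
        raise ValueError("no active time recorded")
    return total_success / total_active


def mean_of_ratios(per_user: PerUserTimes) -> float:
    """For contrast only: the unweighted mean of per-user ratios."""
    ratios = [s / a for s, a in per_user.values() if a > 0]
    return sum(ratios) / len(ratios)


sample = {"u1": (50.0, 60.0), "u2": (10.0, 10.0), "u3": (0.0, 30.0)}
aggregate = user_uptime(sample)      # 60 / 100 = 0.6
unweighted = mean_of_ratios(sample)  # (0.833... + 1.0 + 0.0) / 3, about 0.611
```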

Advantages

Reflects actual user‑perceived availability.

Handles heterogeneous user behavior without manually set thresholds.

Key challenges

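How to label the time between two consecutive requests as uptime or downtime, given that only the requests themselves are observed (for example, the interval between a successful request and a later failed one).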

How to ignore long idle periods. Google uses a cutoff time (the 99th percentile of inter‑request intervals, e.g., 30 minutes for Gmail) to exclude such gaps from the calculation.
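
One way to handle both challenges can be sketched as follows. This is an assumption‑laden illustration, not Google's implementation: each interval between two consecutive requests is labeled by the outcome of the request that starts it, any gap longer than the cutoff is dropped entirely, and the time after a user's last request is not counted. The `label_intervals` name and the sample request log are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[float, bool]  # (timestamp in minutes, request succeeded?)


def label_intervals(requests: List[Request], cutoff: float) -> Tuple[float, float]:
    """Return (uptime, downtime) for one user's time-sorted requests."""
    up = down = 0.0
    for (t0, ok), (t1, _) in zip(requests, requests[1:]):
        gap = t1 - t0
        if gap > cutoff:
            continue  # assumption: an idle gap counts as neither up nor down
        if ok:
            up += gap    # the user last saw a success: treat the gap as uptime
        else:
            down += gap  # the user last saw a failure: treat the gap as downtime
    return up, down


# Hypothetical log: two failures around minute 10, then an 85-minute idle gap.
log = [(0, True), (5, True), (10, False), (12, False),
       (15, True), (100, True), (104, True)]
up, down = label_intervals(log, cutoff=30)  # the 15 -> 100 gap is ignored
user_ratio = up / (up + down)               # 14 / 19
```

With a 30‑minute cutoff the 85‑minute gap contributes nothing, so the user is credited with 14 minutes of uptime and 5 minutes of downtime.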

Experimental simulation

Google simulated traffic with a 15‑minute fault injected between minutes 30‑45. Two representative users (user0, user1) showed that Success‑Ratio can remain high while User‑Uptime captures the degradation experienced by individual users.
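
The divergence between the two metrics can be reproduced with a small deterministic toy model. It is not Google's simulation: the 60‑minute horizon, the request rates, and the `make_user` helper are assumptions made for illustration. A light user is paired with a heavy user who either backs off or retries aggressively while the fault is active.

```python
from typing import List, Tuple

Request = Tuple[float, bool]  # (minute, request succeeded?)
FAULT_START, FAULT_END, HORIZON = 30.0, 45.0, 60.0  # assumed one-hour run


def make_user(normal_gap: float, fault_gap: float) -> List[Request]:
    """Hypothetical user: one request every normal_gap minutes, or every
    fault_gap minutes while the fault is active. All requests sent during
    the fault fail; all others succeed."""
    t, out = 0.0, []
    while t <= HORIZON:
        failing = FAULT_START <= t < FAULT_END
        out.append((t, not failing))
        t += fault_gap if failing else normal_gap
    return out


def metrics(users: List[List[Request]]) -> Tuple[float, float]:
    """Return (success_ratio, user_uptime) over all users."""
    ok = sum(1 for reqs in users for _, success in reqs if success)
    total = sum(len(reqs) for reqs in users)
    up = down = 0.0
    for reqs in users:
        # Label each gap by the outcome of the request that starts it.
        for (t0, success), (t1, _) in zip(reqs, reqs[1:]):
            if success:
                up += t1 - t0
            else:
                down += t1 - t0
    return ok / total, up / (up + down)


light = make_user(5, 5)       # one request every 5 minutes throughout
backoff = make_user(1, 5)     # heavy user that slows down while failing
storm = make_user(1, 0.25)    # heavy user that retries 4 times per minute

sr_backoff, uu_backoff = metrics([light, backoff])  # about 0.903 and 0.75
sr_storm, uu_storm = metrics([light, storm])        # about 0.471 and 0.75
```

In this toy model User‑Uptime stays at 75 % in both scenarios, matching the 15 faulty minutes out of 60, while Success‑Ratio swings from about 90 % to about 47 % depending only on how the heavy client behaves during the fault.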

Results from thousands of runs:

Mean availability: Success‑Ratio ≈ 75.8 %, User‑Uptime ≈ 74.2 % (both close to the expected 75 %).

Standard deviation: Success‑Ratio = 0.078, User‑Uptime = 0.049 – User‑Uptime is less noisy because it is less affected by retry behavior.

Windowed User‑Uptime and Minimal Cumulative Ratio (MCR)

To expose worst‑case availability over different time scales, Google defined Windowed User‑Uptime (WUU):

Choose a window size (e.g., 1 min, 5 min, 1 h).

Slice the observation period into equal windows.

Compute User‑Uptime for each slice.

Take the minimum value across all slices – this is the WUU for that window.
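
The four steps above can be sketched as follows, assuming uptime and downtime have already been summed over all users into fixed‑size base buckets (one minute each here). The bucket layout, the `windowed_user_uptime` name, and the sample data are hypothetical, and skipping slices with no active users is an assumption.

```python
from typing import List, Optional, Tuple

Bucket = Tuple[float, float]  # (uptime, downtime) summed over all users


def windowed_user_uptime(buckets: List[Bucket], window: int) -> Optional[float]:
    """Worst User-Uptime over aligned slices of `window` base buckets."""
    worst: Optional[float] = None
    for start in range(0, len(buckets), window):
        chunk = buckets[start:start + window]
        up = sum(u for u, _ in chunk)
        down = sum(d for _, d in chunk)
        if up + down == 0:
            continue  # assumption: skip slices in which nobody was active
        ratio = up / (up + down)
        worst = ratio if worst is None else min(worst, ratio)
    return worst


# Hypothetical data: 12 one-minute buckets with 10 user-minutes of activity
# each, and a short incident in minutes 4 and 5.
buckets: List[Bucket] = [(10.0, 0.0)] * 12
buckets[4] = (5.0, 5.0)
buckets[5] = (8.0, 2.0)

curve = {w: windowed_user_uptime(buckets, w) for w in (1, 2, 4, 12)}
# 1 -> 0.5, 2 -> 0.65, 4 -> 0.825, 12 -> 113/120 (about 0.942)
```

Short windows expose how bad the worst moment was, while the largest window converges to the overall User‑Uptime for the period.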

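The series of minima across increasing window sizes forms the Minimal Cumulative Ratio (MCR) curve, analogous to the Minimum‑Mutator‑Utilization (MMU) concept in garbage‑collector theory. The curve is monotonically non‑decreasing when each window size is a whole multiple of the next smaller one: a large window's ratio is a weighted average of the ratios of the smaller windows it contains, so its minimum cannot fall below theirs.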

Example over three days:

1‑minute windows (4 320 slices) → minimum 92 %.

5‑minute windows (864 slices) → minimum 94.3 %.

15‑minute, 1‑hour, 6‑hour, 1‑day windows → minima 96 %, 97.2 %, 99.2 %, 99.96 % respectively.

Plotting these points yields the MCR curve, revealing both overall availability and the severity/duration of the worst incidents.

Comparison on Google G Suite data (2019)

User‑Uptime consistently exceeds Success‑Ratio, indicating that end users perceived higher availability than the request‑level metric suggests.

High‑frequency users (≈ 1 % of users but ≈ 62 % of requests) inflate Success‑Ratio during faults, while User‑Uptime remains stable.

MCR visualizations show that User‑Uptime provides clearer insight into the impact of abusive or edge‑case traffic that Success‑Ratio masks.

Conclusion

Traditional metrics such as Success‑Ratio, Incident‑Ratio, and MTTR still have value, but User‑Uptime and its windowed extension add a user‑centric dimension that is less susceptible to skew from high‑frequency or abusive traffic. They complement existing SRE tools and support more nuanced reliability decisions, but they are not a silver bullet.

References

Meaningful Availability (USENIX NSDI 2020) – https://www.usenix.org/system/files/nsdi20spring_hauer_prepub.pdf

GC MMU Theory – https://www.cs.cmu.edu/~guyb/papers/gc2001.pdf

Four Golden Signals – https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden_signals

Amazon AWS SLA – https://aws.amazon.com/cn/compute/sla/

Alibaba Cloud ECS SLA – http://terms.aliyun.com/legal-agreement/terms/suit_bu1_ali_cloud/suit_bu1_ali_cloud201909241949_62160.html

NetEase Cloud IM SLA – https://yunxin.163.com/clauses?serviceType=1