How to Quantify Data Reliability in Distributed Storage Systems

This article analyzes the quantitative model for data reliability in distributed storage, covering factors such as disk count, replication factor, recovery time, annualized failure rate, and copyset configuration, and derives formulas to estimate yearly data loss probability for both replica and erasure‑coding schemes.

vivo Internet Technology

Jul 28, 2021

1. Introduction

The reliability of a distributed storage system is measured by the probability that stored data remains readable, that is, one minus the probability of data loss. This article quantifies that probability for replication and erasure-coding schemes.

2. Failure Sources

Data loss originates from three categories of failures:

Hardware failures : disk, network, server or data‑center outages.

Software bugs : kernel or design defects.

Operational errors : human mistakes.

Disk failures are the most frequent hardware failure and therefore the primary focus for reliability modeling.

3. Factors Influencing Data Reliability

The following parameters determine the yearly reliability of a storage cluster:

N : total number of disks in the cluster.

R : replication factor (number of copies of each object).

T : recovery time after a disk failure.

AFR : Annualized Failure Rate of a single disk (probability of failure within one year).

S : number of copysets in the cluster, where a copyset is a group of R disks that together hold all replicas of some data.

The overall yearly reliability is a function of these five parameters; Section 4 derives the model.

3.1 Disk Annual Failure Rate (AFR)

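AFR is derived from the Mean Time Between Failures (MTBF) by dividing the hours in a year by the MTBF. For a disk with MTBF = 120 000 h, the AFR is:

$$\mathrm{AFR} \approx \frac{24 \times 365}{\mathrm{MTBF}} = \frac{8760}{120\,000} \approx 7.3\,\%$$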
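The division of hours per year by MTBF is a first-order approximation. As a minimal sketch, assuming disk lifetimes are exponentially distributed (a constant failure rate, which is an assumption of this example and not something the article states), the two forms can be compared directly. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 h


def afr_linear(mtbf_hours: float) -> float:
    """First-order approximation: hours per year divided by MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """Probability of failing within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf = 120_000  # hours, the value used in the text
    print(f"linear:      {afr_linear(mtbf):.4f}")
    print(f"exponential: {afr_exponential(mtbf):.4f}")
```

For MTBF = 120 000 h the linear form gives about 7.3 % and the exponential form about 7.0 %; the gap narrows as MTBF grows.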

Google’s production clusters report an AFR distribution over five years, illustrated below:

3.2 CopySet (Replica Group)

A copyset is the set of disks that hold all replicas of a particular object. If all disks in a copyset fail, the object is lost. The diagram shows random copyset placement across nine disks.

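The number of copysets ranges from a minimum of N/R (disks partitioned into disjoint groups) to a maximum of C(N, R) (every combination of R disks is in use). Where a cluster sits between these bounds determines how dispersed the data is and thus affects the loss probability.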
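As a quick check of the two bounds on the number of copysets, N/R and C(N, R), the sketch below evaluates them for the nine-disk example. The replication factor R = 3 is an assumption of this example (it matches the three-replica scheme used later), and `copyset_bounds` is an illustrative name.

```python
import math


def copyset_bounds(n_disks: int, replication: int) -> tuple[int, int]:
    """Return (minimum, maximum) number of distinct copysets."""
    # Minimum: disks are split into disjoint groups of `replication` disks.
    min_sets = n_disks // replication
    # Maximum: every combination of `replication` disks holds some data.
    max_sets = math.comb(n_disks, replication)
    return min_sets, max_sets


if __name__ == "__main__":
    print(copyset_bounds(9, 3))  # (3, 84)
```

With fully random placement, any three simultaneous failures among the nine disks hit one of the 84 copysets; with three disjoint groups, only 3 of the 84 possible triples cause loss.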

3.3 Recovery Time (T)

Shorter recovery time reduces the window during which additional failures can cause data loss. Assuming a disk bandwidth of 200 MB/s and allocating 20 % (40 MB/s) for recovery, the recovery time T follows from dividing the amount of data to restore by the recovery bandwidth available to it.
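A rough sketch of that calculation follows. The 8 TB of data to restore and the `parallel_disks` parameter (how many disks share the recovery work) are hypothetical values chosen for illustration; only the 200 MB/s bandwidth and the 20 % share come from the text. Decimal units are assumed (1 MB = 10^6 bytes).

```python
def recovery_hours(data_bytes: float,
                   disk_bandwidth_bytes_per_s: float,
                   recovery_share: float,
                   parallel_disks: int = 1) -> float:
    """Estimate recovery time in hours.

    recovery_share is the fraction of each disk's bandwidth reserved for
    recovery; parallel_disks is the number of disks recovering concurrently.
    """
    recovery_bw = disk_bandwidth_bytes_per_s * recovery_share * parallel_disks
    return data_bytes / recovery_bw / 3600.0


if __name__ == "__main__":
    data = 8e12        # 8 TB to restore (hypothetical)
    bandwidth = 200e6  # 200 MB/s per disk, from the text
    share = 0.2        # 20 % reserved for recovery, from the text
    print(f"single disk: {recovery_hours(data, bandwidth, share):.1f} h")
    print(f"100 disks:   {recovery_hours(data, bandwidth, share, 100):.2f} h")
```

Under these assumptions a single disk at 40 MB/s needs roughly 56 hours, which is why spreading recovery across many disks matters so much for the model.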

4. Reliability Model Derivation

4.1 Disk Failures and Poisson Distribution

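Disk failures are modeled as a Poisson process. The probability of observing n failures in t hours among N disks is:

$$P(n, t) = \frac{(\lambda t)^{n}}{n!}\, e^{-\lambda t}$$

Here λ is the expected number of disk failures per hour. AFR is converted to an hourly failure rate, which this article calls FIT, and λ is FIT × N:

$$\mathrm{FIT} = \frac{\mathrm{AFR}}{24 \times 365}, \qquad \lambda = \mathrm{FIT} \times N$$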
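The Poisson model can be evaluated numerically with a few lines of code. This is a minimal sketch, assuming FIT = AFR / 8760 (the article's hourly rate, which is not the same as the standard FIT unit of failures per 10^9 device-hours). The cluster size of 1000 disks and the 24-hour window are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def failures_per_hour(n_disks: int, afr: float) -> float:
    """Expected disk failures per hour: lambda = FIT * N, with FIT = AFR / 8760."""
    fit = afr / HOURS_PER_YEAR
    return fit * n_disks


def poisson_pmf(n: int, lam: float, t_hours: float) -> float:
    """Probability of exactly n failures in t_hours for failure rate lam."""
    mean = lam * t_hours
    return mean ** n * math.exp(-mean) / math.factorial(n)


if __name__ == "__main__":
    lam = failures_per_hour(n_disks=1000, afr=0.073)  # hypothetical cluster
    for n in range(4):
        print(f"P({n} failures in 24 h) = {poisson_pmf(n, lam, 24):.4f}")
```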

4.2 Yearly Reliability for Replication

For a three‑replica scheme, data loss occurs when:

The first disk fails during the year.

A second disk fails within the recovery window tr.

A third disk fails within the same window.

The three failed disks belong to the same copyset (probability Pc).

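Treating these events as independent, the combined loss probability P is the product of the four probabilities, where P1, P2 and P3 denote the probabilities of the first, second and third failure above:

$$P = P_1 \cdot P_2 \cdot P_3 \cdot P_c$$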

Yearly reliability is 1 − P.
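The four conditions above can be combined in code as a rough sketch. Several choices here are assumptions of this example and not taken from the article: each failure term is computed as the Poisson probability of at least one failure among the remaining disks, and Pc is taken as S / C(N, R), which holds when the failed disks are a uniformly random set of R disks. All parameter values in the usage are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_at_least_one(n_disks: int, fit: float, hours: float) -> float:
    """P(at least one failure among n_disks within `hours`), Poisson model."""
    return -math.expm1(-fit * n_disks * hours)


def replica_loss_probability(n: int, afr: float, tr_hours: float,
                             copysets: int, r: int = 3) -> float:
    """Yearly loss probability for an r-replica scheme (illustrative model)."""
    fit = afr / HOURS_PER_YEAR
    # First failure at some point during the year.
    p = p_at_least_one(n, fit, HOURS_PER_YEAR)
    # Each further failure must occur within the recovery window tr.
    for k in range(1, r):
        p *= p_at_least_one(n - k, fit, tr_hours)
    # Assumed Pc: chance that r random disks form one of the S copysets.
    pc = min(1.0, copysets / math.comb(n, r))
    return p * pc


if __name__ == "__main__":
    loss = replica_loss_probability(n=1000, afr=0.073, tr_hours=1.0, copysets=1000)
    print(f"yearly loss probability: {loss:.3e}")
    print(f"yearly reliability:      {1 - loss:.12f}")
```

Because the first term saturates near 1 for large clusters, this product is best read as an order-of-magnitude estimate, useful for comparing configurations more than for absolute guarantees.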

4.3 Yearly Reliability for Erasure Coding

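For an erasure-coding scheme with parameters (D, E), where up to E blocks may be lost without losing data, data loss requires E + 1 overlapping disk failures within one copyset. With E = 4, the loss probability is therefore the product of five failure probabilities (a first failure during the year, then four more within the recovery window tr) and Pc.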
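The same product structure used for replicas extends to erasure coding by adding one failure term per tolerated block. The sketch below is illustrative only: it assumes each term is the Poisson probability of at least one failure among the remaining disks, and it takes Pc as an input because the article's expression for Pc in the erasure-coding case is not available here. The parameter values are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def ec_loss_probability(n: int, afr: float, tr_hours: float,
                        e: int, pc: float) -> float:
    """Yearly loss probability when e + 1 overlapping failures lose data."""
    fit = afr / HOURS_PER_YEAR
    # First failure at some point during the year.
    p = -math.expm1(-fit * n * HOURS_PER_YEAR)
    # e further failures, each within the recovery window.
    for k in range(1, e + 1):
        p *= -math.expm1(-fit * (n - k) * tr_hours)
    return p * pc


if __name__ == "__main__":
    # pc = 1.0 is a pessimistic placeholder, not a derived value.
    for e in (2, 4):
        loss = ec_loss_probability(n=1000, afr=0.073, tr_hours=1.0, e=e, pc=1.0)
        print(f"E = {e}: loss probability <= {loss:.3e}")
```

Each additional tolerated failure multiplies the result by a small per-window failure probability, which is where the reliability gain of a larger E comes from in this model.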

5. Reliability Estimation

The model shows that reliability depends on the following variables:

N : total number of disks.

FIT : hourly failure rate derived from AFR.

t : observation period (one year).

tr : recovery time (hours), which is a function of recovery bandwidth, disk capacity and block size.

R : replication factor.

Z : total storage capacity of the cluster.

B : block or file size used for data placement.

Plugging realistic cluster parameters into the formulas yields quantitative “nines” of reliability. Example scenarios illustrate how increasing N, improving recovery bandwidth, lowering AFR, or adjusting block size affect the overall reliability.
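Converting a loss probability into "nines" is a one-line calculation. A minimal sketch, where `nines` is an illustrative helper name and the sample probabilities are hypothetical inputs, not results from the article:

```python
import math


def nines(loss_probability: float) -> float:
    """Number of nines of reliability for a given yearly loss probability."""
    if not 0.0 < loss_probability < 1.0:
        raise ValueError("loss_probability must be strictly between 0 and 1")
    return -math.log10(loss_probability)


if __name__ == "__main__":
    for p in (1e-3, 1e-9, 1e-11):  # hypothetical loss probabilities
        print(f"P = {p:.0e} -> {nines(p):.1f} nines")
```

A loss probability of 1e-11 corresponds to eleven nines, that is, a reliability of 99.999999999 %.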

6. Key Findings

AFR is the dominant factor; reducing AFR directly improves reliability.

Increasing recovery bandwidth (while preserving service availability) shortens tr and markedly raises reliability.

When recovery speed is constrained, decreasing data dispersion (i.e., reducing the number of copysets by using larger block granularity) further improves reliability.


