
Understanding Reliability Evaluation for Storage, Servers, and Distributed Systems

The article explains how reliability of storage, servers, and distributed systems is assessed using standards, models like MTBF/MTTR, RAS features, CAP/BASE theories, and end‑to‑end solutions, emphasizing the gap between theoretical metrics and real‑world operational data.

Architects' Tech Alliance

Reliability has long been a hot topic of discussion, and with the rise of distributed systems it has become a required discipline for every storage, distributed-system, converged-system, and server vendor.

Because data is a customer's core asset, the reliability of the devices that handle it is especially important. A server's RAS features (Reliability, Availability, and Serviceability; Intel later added Manageability for monitoring, managing, and predicting errors) directly determine its tier and quality, whether under RAS 1.0 or the currently popular RAS 2.0. Storage is even more tightly bound to data, so storage systems must be inherently reliable. How, then, should storage reliability be evaluated? Vendors claim five-nines (99.999 %), six-nines, or even higher reliability. How are these numbers derived, and do they truly indicate system reliability?

Regarding storage reliability assessment, there are different industry standards and theoretical frameworks, and each vendor has its own interpretation. Examples include the US electronic-equipment reliability standard MIL-HDBK-217, the UK telecom reliability standard HRD5, and BELLCORE TR-332. The most widely used reliability model is BELLCORE TR-332, "Reliability Prediction Procedure for Electronic Equipment", an international standard for commercial products. It provides several unit reliability prediction methods; a common one is the parts-count method, which calculates a failure rate assuming a 40 °C operating temperature and 50 % electrical stress.
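
To make the parts-count idea concrete, here is a minimal sketch. The component names and failure rates below are purely illustrative, not actual TR-332 base values; the method simply sums each component type's failure rate times its quantity:

```python
# Illustrative parts-count reliability prediction. Failure rates are made up
# for illustration (NOT from BELLCORE TR-332) and are expressed in FITs
# (failures per 1e9 device-hours).
components = {
    # name: (quantity, failure_rate_fit)
    "DRAM chip":       (16, 50.0),
    "NAND package":    (8, 120.0),
    "controller ASIC": (1, 200.0),
    "capacitor":       (40, 2.0),
}

# System failure rate is the sum of (quantity * per-part failure rate).
lambda_system_fit = sum(qty * fit for qty, fit in components.values())

# Assuming a constant failure rate, MTBF in hours = 1e9 / lambda(FIT).
mtbf_hours = 1e9 / lambda_system_fit

print(f"System failure rate: {lambda_system_fit:.0f} FIT")
print(f"Predicted MTBF: {mtbf_hours:,.0f} hours")
```

Real TR-332 predictions additionally apply temperature, electrical-stress, and quality factors to each base rate; this sketch shows only the summation skeleton.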

For systems composed of redundant units, Markov state diagrams are generally used for reliability modeling; for systems composed of series units, the availabilities of the units are multiplied to obtain system availability. Starting from module and component MTBF (Mean Time Between Failures, generally used for repairable systems) and MTTR (Mean Time To Repair), and accounting for the redundant and series structure of the components, unit availability is calculated as A = MTBF / (MTBF + MTTR).
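
The composition rules above can be sketched in a few lines (the MTBF/MTTR figures are illustrative, and the redundant-pair formula is the simple steady-state approximation, not a full Markov model):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a repairable unit: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*unit_avail: float) -> float:
    """Series units: all must be up, so availabilities multiply."""
    a = 1.0
    for u in unit_avail:
        a *= u
    return a

def parallel_1oo2(unit_avail: float) -> float:
    """Redundant 1-out-of-2 pair: the system is down only if both units are down."""
    return 1.0 - (1.0 - unit_avail) ** 2

# Illustrative figures: a controller with 100,000 h MTBF and 4 h MTTR,
# deployed as a dual-controller pair in series with a backplane.
a_ctrl = availability(100_000, 4)
a_pair = parallel_1oo2(a_ctrl)
a_sys = series(a_pair, availability(500_000, 8))

print(f"single controller: {a_ctrl:.6f}")
print(f"dual controllers : {a_pair:.9f}")
print(f"system           : {a_sys:.6f}")
```

Note that `parallel_1oo2` assumes independent failures and repairs; a real Markov model also captures shared repair crews and common-cause failures, which lower the result.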

Storage reliability can be modeled against different standards and theories, and prediction methods also vary. First, the choice of model and method affects the result; second, vendor promotions often lack official supporting material and usually present the best case, while real customer environments are complex. Most importantly, these are theoretical values, and the gap between theory and practice is often large. Customers should pay more attention to a vendor's real-world operating time, failure rates, deployed unit counts, and industry distribution.

These paper-based claims may sway certain customers. However, in my view, as hardware technology and manufacturing processes improve, the reliability of chips and modules from different hardware vendors is generally at the same level, and most hardware comes from a handful of chip manufacturers. Device vendors build their products from these chips; aside from architecture, management, usability, and service optimizations, reliability ultimately depends on the underlying chips and architecture. Therefore, discussing reliability solely at the product level is of limited value; it should be considered together with product features, software enhancements, and solution-level capabilities.

On servers

Server RAS design mainly originates with the CPU, chipset, and server manufacturers. CPU-provided RAS functions are enabled during BIOS initialization and mainly serve as the foundation for higher-level RAS features and extensions. System-level RAS relies on vendor designs such as partitioning, memory mirroring, and hot-swap of CPUs and memory.

The main lens for evaluating a server system is RAS, of which reliability is an important indicator. Server RAS covers every component, especially the CPU, memory, and the management and clock modules. Each module's RAS spans many aspects; for memory, RAS involves cache protection, fault detection, error correction, and the isolation and location of faulty memory.
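
Hardware memory ECC is far more capable than this (SECDED, chipkill), but the core idea of error correction can be sketched with a classic Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword:

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (single-error-correcting)."""
    d = [(nibble >> i) & 1 for i in range(4)]       # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                          # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                          # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                          # parity over positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]      # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(codeword: int) -> int:
    """Correct up to one flipped bit, then return the 4 data bits."""
    bits = [(codeword >> i) & 1 for i in range(7)]   # positions 1..7 at index 0..6
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)            # position of the bad bit (0 = none)
    if syndrome:
        bits[syndrome - 1] ^= 1                      # flip it back
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

word = hamming74_encode(0b1011)
flipped = word ^ (1 << 3)                            # a single bit flips in transit
assert hamming74_decode(flipped) == 0b1011           # the error is corrected
```

DRAM ECC uses wider SECDED codes (e.g., 72 bits protecting 64), but the syndrome-lookup principle is the same.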

To strengthen server-level RAS capability and competitiveness, server vendors need differentiated firmware-level designs, such as proactive warnings for at-risk components like memory and CPUs, and OS-independent, out-of-band fault collection, analysis, localization, warning, and task dispatch.

On distributed systems

Distributed systems (including distributed storage systems) can also use RAS, MTBF, and MTTR as reliability metrics, but the widely recognized professional framework is CAP, in which availability is an important factor. CAP theory states that no distributed system design can simultaneously satisfy consistency, availability, and partition tolerance. Since all three cannot be met, claims of five-nines or six-nines availability are hard to substantiate; most vendors relax consistency to achieve availability and partition tolerance, though consistency remains a key design consideration.

In distributed system design, the BASE theory, a compromise on and extension of CAP, is heavily used. The reliability competitiveness of distributed systems seems to rest more on upper-layer software capabilities (perhaps because they run on inexpensive x86 servers and the hardware aims for low cost), such as multiple data copies, erasure coding (EC, which stripes data across nodes and tolerates multiple node failures), data scrubbing, ECC, fault-domain isolation within storage pools, and fast data repair. Solution-level reliability is also crucial, including remote replication and multiple availability zones and regions.
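
As a minimal illustration of the idea behind erasure coding: production systems use Reed-Solomon codes that tolerate multiple lost shards, but the simplest case, a single XOR parity shard (RAID-5 style), already shows how a lost node's data is rebuilt from the survivors:

```python
def xor_bytes(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_shards):
    """Append one parity shard: parity = XOR of all data shards."""
    return data_shards + [xor_bytes(data_shards)]

def recover(shards, lost_index):
    """Rebuild a single lost shard by XORing all surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    return xor_bytes(survivors)

shards = encode([b"node", b"fail", b"safe"])  # 3 data shards + 1 parity shard
rebuilt = recover(shards, lost_index=1)       # pretend the node holding shard 1 died
print(rebuilt)                                # prints b'fail'
```

A k+m Reed-Solomon scheme generalizes this: any m of the k+m shards can be lost, at the cost of Galois-field arithmetic instead of plain XOR.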

On storage systems

Storage systems, especially enterprise storage, hold customers' core data assets. Theoretical five-nines or six-nines figures derived from reliability models alone cannot meet real business reliability requirements. The end-to-end reliability of the storage solution is therefore a core competitive advantage.

Hard disks are the basic storage unit; sulfidation (sulfur corrosion), high pressure, high temperature, dust, vibration, and head wear can all cause disk failures. Beyond protecting individual disks against sulfidation and dust, system design must also handle high temperature, vibration, bad-block repair, and fault prediction. Volume mirroring, snapshots, multipathing, performance acceleration (including hardware acceleration), and data-protection functions likewise require system-level design and rigorous testing to ensure reliable, stable storage and data safety.

RAID is the most basic technology for data redundancy and reliability. RAID variants are now numerous: besides block-based schemes such as CRAID, FastRAID, RAID 2.0, and VRAID, there are composite RAID optimizations like RAID-DP, RAID 5E, RAID MP, and RAID ADG, many of which support distributing data across cabinets, improving both reconstruction speed and cabinet-level reliability.

In data centers, storage systems provide end-to-end data consistency to ensure data integrity, e.g., the SCSI-based DIX + T10 PI end-to-end consistency technology. When the storage system supports standard T10 PI, data consistency can be guaranteed end to end together with specific databases (e.g., Oracle 11g), operating systems (e.g., Oracle Linux 5 or 6), and HBAs (specific Emulex and QLogic models).
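
The mechanism behind T10 PI can be sketched as follows. Real implementations follow the SCSI Block Commands specification, where each 512-byte block carries an 8-byte tuple (a 2-byte guard CRC, a 2-byte application tag, and a 4-byte reference tag); the dict below is a simplification of that layout, while the CRC polynomial 0x8BB7 is the one T10 PI actually specifies for the guard tag:

```python
def crc16_t10dif(data: bytes) -> int:
    """Bit-by-bit CRC-16 with the T10-DIF polynomial 0x8BB7
    (MSB first, initial value 0, no reflection, no final XOR)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def protect(block: bytes, ref_tag: int, app_tag: int = 0) -> dict:
    """Attach a simplified PI tuple (guard CRC, app tag, ref tag) to a block."""
    return {"data": block, "guard": crc16_t10dif(block),
            "app": app_tag, "ref": ref_tag}

def verify(pi: dict, expected_ref: int) -> bool:
    """On read, recompute the guard CRC and check the reference tag."""
    return crc16_t10dif(pi["data"]) == pi["guard"] and pi["ref"] == expected_ref

block = bytes(512)                               # a zero-filled 512-byte sector
pi = protect(block, ref_tag=42)                  # ref tag typically carries low LBA bits
assert verify(pi, expected_ref=42)               # clean read passes

corrupted = dict(pi, data=b"\x01" + block[1:])   # one byte flips silently in transit
assert not verify(corrupted, expected_ref=42)    # the guard CRC catches it
```

Because every hop (HBA, fabric, controller, disk) can re-verify the guard and reference tags, corruption is caught at the hop where it occurs rather than discovered much later.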

Without such protection, silent data corruption in storage systems cannot be detected in real time, making recovery difficult or impossible; end-to-end consistency features therefore exist precisely to ensure data reliability.

Cross‑region, multi‑data‑center reliability enables customers' applications and data to tolerate disasters across cities or data centers; this is the real capability customers need. Supporting technologies include data replication, active‑active, multi‑data‑center (e.g., 3DC) solutions. Each solution fits different scenarios and requirements, depending on RTO, RPO, distance, network latency, etc. Hence, rather than merely promoting a product's hardware reliability numbers, building software and solution strength and proving reliability with real‑world operational data is more valuable.


Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
