
How Tencent Cloud Keeps Big Data Disks Reliable: Inside Their Health Assurance Plan

This article examines the challenges of hard‑disk reliability in large‑scale big‑data services and explains how Tencent Cloud reduces failure rates through hardware and software optimization, custom vendor collaborations, pre‑deployment health checks, a disk health scoring system, and improved usage patterns — the combined strategies that keep data storage stable and performant.

Tencent Tech

Big Data Disk Application Status

Hard disks act as a server’s long‑term memory: a failure can cause data loss much as damage to the brain’s hippocampus impairs recall. In B2B big‑data services, where massive volumes must be stored precisely, this is critical. Tencent Cloud has maintained an extremely low disk failure rate through strong technical safeguards.

Technical Reliability Challenges

As disk capacities grow from 5‑6 TB to 16‑18 TB, more platters and heads are packed into the same 3.5" form factor, increasing mechanical failure probability and reducing tolerance to vibration. The head‑disk clearance shrinks to 1‑2 nm, raising the risk of head‑disk collisions and signal degradation. Higher areal density pushes signal‑to‑noise ratios toward the Shannon limit, dramatically increasing bit‑error rates.

Usage‑Level Factors Affecting Disk Reliability

Disk failures manifest as drop‑outs or read‑only errors, caused by internal component issues, link problems, or protocol anomalies. Analysis of 90+ failed disks shows 75% have internal anomalies (head‑disk, servo, read/write errors) while 25% appear normal, with failures often triggered by high workloads, HBA/RAID resets, or transient communication timeouts.

Higher workloads increase retry and error‑correction activity, which can throttle I/O and raise drop‑out probability. Comparative studies of three internal business scenarios reveal that the scenario with the highest workload also has the highest failure rate and lowest health scores.

Optimization Directions

1. Customized Mechanisms

Tencent Cloud’s supply‑chain team partnered with Seagate to co‑develop custom disk features and logging, establishing a joint testing pipeline that has noticeably reduced Seagate’s annualized failure rate.

2. Pre‑deployment Health Checks

Disk failure rates follow the bathtub curve: an early‑failure phase, a long stable period, and a wear‑out phase. Running systematic health checks before servers go live intercepts early‑life defects, moving disks quickly into the stable phase and reducing dead‑on‑arrival (DOA) incidents.
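A burn‑in gate of this kind can be sketched as a simple threshold check over SMART counters. The attribute names below follow common `smartctl -A` output, and the zero‑tolerance limits are illustrative assumptions, not Tencent Cloud's actual acceptance criteria:

```python
# Sketch of a pre-deployment burn-in gate (illustrative thresholds).
# In practice the counters would be read from `smartctl -A /dev/sdX`.

EARLY_LIFE_LIMITS = {
    "Reallocated_Sector_Ct": 0,   # any remapped sector during burn-in is a red flag
    "Current_Pending_Sector": 0,  # sectors awaiting remap
    "Offline_Uncorrectable": 0,   # unrecoverable read errors
    "UDMA_CRC_Error_Count": 0,    # link-level errors (cable/backplane)
}

def passes_burn_in(smart: dict) -> bool:
    """Return True if a disk's SMART counters stay within early-life limits."""
    return all(smart.get(attr, 0) <= limit
               for attr, limit in EARLY_LIFE_LIMITS.items())

# Example: one clean disk, one with pending sectors
clean = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0,
         "Offline_Uncorrectable": 0, "UDMA_CRC_Error_Count": 0}
suspect = dict(clean, Current_Pending_Sector=2)
```

Disks failing the gate would be rejected before deployment, so they never enter the early‑failure portion of the bathtub curve in production.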

3. Disk Health Scoring System

A mathematical model evaluates dynamic health parameters and statistical deviations across clusters, assigning a health score to each disk. Operations can monitor scores in real time and proactively replace low‑scoring disks in critical workloads.
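The article does not disclose the scoring formula, but the idea of combining per‑disk parameters with statistical deviation across a cluster can be sketched as follows. Here each disk starts at 100 points and is penalized when an attribute sits more than one standard deviation above the cluster norm; the weighting is a made‑up assumption for illustration:

```python
from statistics import mean, pstdev

def health_scores(cluster: dict, weight: float = 20.0) -> dict:
    """Illustrative health-score model (not Tencent Cloud's actual formula).

    cluster maps disk id -> {attribute: value}; returns disk id -> score 0..100.
    A disk loses up to `weight` points per attribute whose z-score versus the
    cluster exceeds 1 (capped at z = 3).
    """
    attrs = {a for smart in cluster.values() for a in smart}
    scores = {disk: 100.0 for disk in cluster}
    for attr in attrs:
        values = [cluster[d].get(attr, 0) for d in cluster]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # no deviation across the cluster for this attribute
        for disk in cluster:
            z = (cluster[disk].get(attr, 0) - mu) / sigma
            if z > 1:  # markedly worse than the cluster norm
                scores[disk] -= weight * min(z, 3) / 3
    return {d: max(0.0, round(s, 1)) for d, s in scores.items()}

# Example: one disk with an outlying reallocated-sector count scores lower
cluster = {
    "sda": {"realloc": 0, "seek_err": 1},
    "sdb": {"realloc": 0, "seek_err": 1},
    "sdc": {"realloc": 40, "seek_err": 1},
}
scores = health_scores(cluster)
```

Operations teams could then alert on scores below a threshold and schedule proactive replacement for disks serving critical workloads.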

4. Usage Habit Optimization

Adopting multi‑replica Hadoop architectures, writing once and reading many times, and placing data near compute nodes mitigates the impact of single‑disk failures. When a disk fails, it is marked dirty and traffic is rerouted to healthy disks, ensuring seamless data availability.
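The dirty‑disk rerouting described above can be sketched with a minimal replica‑placement pool. The class and method names are hypothetical; in real Hadoop/HDFS this logic lives inside the DataNode and NameNode:

```python
# Minimal sketch of dirty-disk avoidance in replica placement
# (hypothetical names; HDFS implements this internally).

class DiskPool:
    def __init__(self, disks):
        self.disks = list(disks)
        self.dirty = set()

    def mark_dirty(self, disk):
        """Take a failed disk out of rotation; traffic shifts to healthy disks."""
        self.dirty.add(disk)

    def place_replicas(self, n=3):
        """Pick n healthy disks for a block's replicas (write once, read many)."""
        healthy = [d for d in self.disks if d not in self.dirty]
        if len(healthy) < n:
            raise RuntimeError("not enough healthy disks for replication")
        return healthy[:n]

# Example: after sdb fails, new writes land only on healthy disks
pool = DiskPool(["sda", "sdb", "sdc", "sdd"])
pool.mark_dirty("sdb")
placement = pool.place_replicas(3)
```

Because every block already exists on multiple replicas, reads continue from healthy copies while the failed disk is replaced and re‑replication restores the target replica count.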

Conclusion

As SSDs replace low‑capacity HDDs, large‑capacity mechanical disks remain essential for big‑data workloads, but their reliability margin shrinks. Until next‑generation magnetic recording technologies arrive, cloud providers must deepen hardware collaborations, enforce rigorous health checks, and design fault‑tolerant architectures to keep disks online and maintain competitive advantage in B2B big‑data services.
