How Tencent Cloud Keeps Big Data Disks Reliable: Inside Their Health Assurance Plan
This article examines the hard‑disk reliability challenges that come with large‑scale big‑data services and explains how Tencent Cloud keeps failure rates low. Its strategies span hardware and software optimization, custom vendor collaborations, pre‑deployment health checks, a disk health scoring system, and usage‑pattern improvements that together keep data storage stable and performant.
Big Data Disk Application Status
Hard disks act as a server's long‑term memory: when one fails, the effect resembles damage to the brain's hippocampus, and stored information can be lost. In B2B big‑data services, where massive volumes of data must be stored precisely, this is critical. Tencent Cloud has maintained an extremely low disk failure rate through strong technical safeguards.
Technical Reliability Challenges
As disk capacities grow from 5‑6 TB to 16‑18 TB, more platters and heads are packed into the same 3.5" form factor, increasing mechanical failure probability and reducing tolerance to vibration. The head‑disk clearance shrinks to 1‑2 nm, raising the risk of head‑disk collisions and signal degradation. Higher areal density pushes signal‑to‑noise ratios toward the Shannon limit, dramatically increasing bit‑error rates.
Usage‑Level Factors Affecting Disk Reliability
Disk failures manifest as drop‑outs or read‑only errors, caused by internal component issues, link problems, or protocol anomalies. Analysis of 90+ failed disks shows 75% have internal anomalies (head‑disk, servo, read/write errors) while 25% appear normal, with failures often triggered by high workloads, HBA/RAID resets, or transient communication timeouts.
Higher workloads increase retry and error‑correction activity, which can throttle I/O and raise drop‑out probability. Comparative studies of three internal business scenarios reveal that the scenario with the highest workload also has the highest failure rate and lowest health scores.
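The correlation above suggests a simple operational rule: watch per‑disk workload alongside retry and error‑correction counters, and flag disks where both run hot. The sketch below illustrates one such rule in Python; the metric names and thresholds are illustrative assumptions, not Tencent Cloud's actual criteria.

```python
from dataclasses import dataclass

@dataclass
class DiskStats:
    disk_id: str
    iops: float          # average I/O operations per second in the window
    retry_count: int     # read/write retries in the sample window
    ecc_corrections: int # error-correction events in the window

def workload_risk(stats: DiskStats, iops_limit: float = 300.0,
                  retry_limit: int = 50) -> bool:
    """Flag a disk whose workload-driven retry/ECC activity suggests
    elevated drop-out risk (thresholds are illustrative)."""
    return stats.iops > iops_limit and (
        stats.retry_count + stats.ecc_corrections > retry_limit)

busy = DiskStats("sdb", iops=420.0, retry_count=38, ecc_corrections=21)
idle = DiskStats("sdc", iops=90.0, retry_count=2, ecc_corrections=0)
print(workload_risk(busy))  # True
print(workload_risk(idle))  # False
```

In practice such a flag would feed an alerting pipeline rather than trigger automatic replacement, since retries under heavy load are expected behavior up to a point.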
Optimization Directions
1. Customized Mechanisms
Tencent Cloud’s supply‑chain team partnered with Seagate to co‑develop custom disk features and logging, establishing a joint testing pipeline that has noticeably reduced Seagate’s annualized failure rate.
2. Pre‑deployment Health Checks
Disk failure follows the bathtub curve (early failure, stable period, wear‑out). Conducting systematic health checks before servers go live helps intercept early‑life defects, moving disks quickly into the stable phase and reducing DOA incidents.
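A pre‑deployment check of this kind can be sketched as a screening rule over SMART counters: a disk that already shows defects before go‑live sits on the early‑failure slope of the bathtub curve and should be rejected. The attribute names and zero‑tolerance limits below are illustrative assumptions, not the actual acceptance criteria.

```python
# Hypothetical early-failure limits: any reallocation, pending sector,
# uncorrectable error, or spin retry before go-live is treated as suspect.
EARLY_FAILURE_LIMITS = {
    "reallocated_sector_count": 0,
    "pending_sector_count": 0,
    "uncorrectable_error_count": 0,
    "spin_retry_count": 0,
}

def passes_burn_in(smart: dict) -> bool:
    """Return True only if every tracked SMART counter is within its limit."""
    return all(smart.get(attr, 0) <= limit
               for attr, limit in EARLY_FAILURE_LIMITS.items())

new_disk = {"reallocated_sector_count": 0, "pending_sector_count": 0,
            "uncorrectable_error_count": 0, "spin_retry_count": 0}
doa_disk = {"reallocated_sector_count": 3, "pending_sector_count": 1,
            "uncorrectable_error_count": 0, "spin_retry_count": 0}
print(passes_burn_in(new_disk))  # True
print(passes_burn_in(doa_disk))  # False
```

A real pipeline would gather these counters with a tool such as smartmontools after a sustained read/write burn‑in, so that latent defects are exercised before the screening rule runs.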
3. Disk Health Scoring System
A mathematical model evaluates dynamic health parameters and statistical deviations across clusters, assigning a health score to each disk. Operations can monitor scores in real time and proactively replace low‑scoring disks in critical workloads.
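One way to realize "statistical deviations across clusters" is to penalize each disk in proportion to how far its error counters sit above the cluster mean. The model below is a minimal sketch under that assumption; the attributes, weights, and scoring scale are hypothetical, not Tencent Cloud's actual model.

```python
import statistics

def health_scores(cluster, weights):
    """Score each disk from 100 downward: a disk loses weight * z points
    for every attribute where it deviates above the cluster mean."""
    scores = {disk: 100.0 for disk in cluster}
    for attr, weight in weights.items():
        values = [stats[attr] for stats in cluster.values()]
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values) or 1.0  # avoid division by zero
        for disk, stats in cluster.items():
            z = (stats[attr] - mean) / stdev
            if z > 0:  # only above-average error counts reduce the score
                scores[disk] -= weight * z
    return {d: max(0.0, round(s, 1)) for d, s in scores.items()}

cluster = {
    "sda": {"read_errors": 1, "seek_errors": 0},
    "sdb": {"read_errors": 2, "seek_errors": 1},
    "sdc": {"read_errors": 40, "seek_errors": 12},  # cluster outlier
}
weights = {"read_errors": 20.0, "seek_errors": 10.0}
scores = health_scores(cluster, weights)
print(scores)
```

Scoring against the cluster rather than against fixed thresholds is what makes the outlier stand out: two disks with identical counters can score differently depending on how their peers behave.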
4. Usage Habit Optimization
Adopting multi‑replica Hadoop architectures, writing once and reading many times, and placing data near compute nodes mitigates the impact of single‑disk failures. When a disk fails, it is marked dirty and traffic is rerouted to healthy disks, ensuring seamless data availability.
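The dirty‑mark and reroute behavior can be sketched as a replica placer that simply skips disks marked dirty, so new writes land only on healthy disks. The class and method names below are hypothetical; real HDFS placement also weighs rack locality and free space.

```python
class DiskPool:
    """Toy replica placer that routes writes around disks marked dirty."""

    def __init__(self, disks):
        self.disks = list(disks)
        self.dirty = set()

    def mark_dirty(self, disk):
        """Take a failed disk out of the write path without stopping service."""
        self.dirty.add(disk)

    def place_replicas(self, block_id, replication=3):
        healthy = [d for d in self.disks if d not in self.dirty]
        if len(healthy) < replication:
            raise RuntimeError("not enough healthy disks for replication factor")
        # Simple deterministic spread across healthy disks; a real placer
        # would also consider rack awareness and per-disk utilization.
        start = hash(block_id) % len(healthy)
        return [healthy[(start + i) % len(healthy)] for i in range(replication)]

pool = DiskPool(["sda", "sdb", "sdc", "sdd"])
pool.mark_dirty("sdb")
targets = pool.place_replicas("block-42")
print(targets)  # three distinct healthy disks, never "sdb"
```

Because the remaining replicas still satisfy the replication factor, readers never notice the failed disk; re‑replication of the lost copies can then proceed in the background.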
Conclusion
As SSDs replace low‑capacity HDDs, large‑capacity mechanical disks remain essential for big‑data workloads, but their reliability margin shrinks. Until next‑generation magnetic recording technologies arrive, cloud providers must deepen hardware collaborations, enforce rigorous health checks, and design fault‑tolerant architectures to keep disks online and maintain competitive advantage in B2B big‑data services.
Tencent Tech
Tencent's official tech account. Delivering quality technical content to serve developers.