
How Tencent Cloud Recovered Lost Data in a 2‑Day Storage Crisis

In a two‑day incident, Tencent Cloud's CBS team diagnosed cell failures, implemented directed reads and a dual‑cell merge strategy, and restored three‑copy data integrity, while uncovering monitoring gaps and tool limitations that inform future storage operations.

ITPUB

Sep 21, 2016

Background : After more than a year of working with distributed storage, the team faced a critical situation where all three data replicas became abnormal, threatening data loss for cloud storage customers.

Alert : An alarm reported five "dead" small tables and no free tables available to migrate them to, prompting an immediate operational response.

Initial Analysis : A cell machine had failed the previous day; the system automatically removed it, leaving only two cells (Cell2 and Cell3) and reducing data copies to two. The distribution after removal is shown below:

Automatic disaster migration was triggered, but migration failed due to read errors from the problematic cell's disk, as seen in the next image:

Disk Inspection : Using dmesg, the team confirmed that the disk in the failing cell exhibited errors, illustrated below:

Because the online dbtrasf migration module did not support specifying a cell IP, the team devised a directed‑read strategy.

Directed Read Strategy

After discussing the problem with developers, the team had the migration tool modified to allow reading from a specific cell. Development took about 30 minutes, followed by self‑testing. The updated tool was then deployed to the test and simulation environments, where data consistency checks showed no anomalies. The production migration followed and moved four of the five small tables successfully; one table still failed and required further attention.
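A minimal sketch of what a directed read could look like, assuming the migration tool chooses its source from a list of cells. The names `Cell`, `choose_source`, and `pinned_ip` are hypothetical; the source does not describe the tool's actual interface.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cell:
    """Hypothetical description of one machine holding a replica."""
    name: str
    ip: str
    healthy: bool = True


def choose_source(cells: Sequence[Cell], pinned_ip: Optional[str] = None) -> Cell:
    """Return the cell to read from.

    With pinned_ip set, only that cell is acceptable (a directed read).
    Without it, fall back to the first cell marked healthy.
    """
    if pinned_ip is not None:
        for cell in cells:
            if cell.ip == pinned_ip:
                return cell
        raise ValueError(f"no cell with ip {pinned_ip}")
    for cell in cells:
        if cell.healthy:
            return cell
    raise RuntimeError("no healthy cell available")
```

Pinning the source this way lets an operator bypass a cell whose disk is known to return read errors, instead of relying on the default selection order.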

Multiple Sector Errors

Further analysis with smartctl revealed 14 sector errors on 10.53.65.214 (Cell3) and over 800 sector errors on 10.53.65.101 (Cell2), indicating varying degrees of data corruption across the three copies. The sector error distribution is illustrated:

Dual‑Cell Merge Plan : To recover the remaining table, the team decided to merge data from both cells:

Attempt to read from Cell3 (10.53.65.214).

If reading fails, attempt to read from Cell2 (10.53.65.101).

Sample log of the read attempts:

read from the first[diskid=290763668122043122, lba=1069470973952, sid=1]
[2016-09-09 16:13:38] read from the second[diskid=290763668122043122, lba=1069470973952, sid=1]

Even after the dual‑cell merge, the problematic table could not be fully restored, prompting further investigation.
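The read order used in the dual‑cell merge can be sketched as follows. This is an illustration under assumptions, not the team's tool: `merge_read` and the `Reader` type are hypothetical, and a failed sector read is modeled as an `OSError`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A reader takes a logical block address and returns the block's bytes,
# or raises OSError when the underlying sector cannot be read.
Reader = Callable[[int], bytes]


def merge_read(
    lbas: Iterable[int], first: Reader, second: Reader
) -> Tuple[Dict[int, bytes], List[int]]:
    """Read each block from `first`, falling back to `second` on failure.

    Returns the recovered blocks and the LBAs that neither source could serve.
    """
    recovered: Dict[int, bytes] = {}
    lost: List[int] = []
    for lba in lbas:
        for reader in (first, second):
            try:
                recovered[lba] = reader(lba)
                break
            except OSError:
                continue
        else:
            # Both readers failed for this block.
            lost.append(lba)
    return recovered, lost
```

A non-empty `lost` list would correspond to blocks that are unreadable on both cells, which is one way a merge of two damaged copies can still fall short of a full restore.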

Final Recovery via Removed Cell

Developers discovered that the metadata for the faulty diskid was identical across all three cells, meaning the removed cell still held intact data. By modifying the tool to read from the removed machine, they successfully restored the missing data, as confirmed by the final migration log and the image below:

With three replicas restored, the two‑day, one‑night data rescue concluded successfully.

Lessons Learned

Monitoring Gaps : Existing monitoring only captured I/O errors; fine‑grained disk‑level monitoring is needed to detect sparse sector failures.
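One possible shape for such a check, as a sketch only: compare two samples of a per-disk bad‑sector counter and flag any disk whose count grew, even by one. The function name and the idea of sampling a counter per disk ID are assumptions; the source does not say how the finer-grained monitoring is built.

```python
from typing import Dict, List


def disks_with_new_sector_errors(
    previous: Dict[str, int], current: Dict[str, int], min_increase: int = 1
) -> List[str]:
    """Return disk IDs whose bad-sector count grew by at least min_increase.

    Disks missing from `previous` are compared against zero, so a disk that
    first appears with errors is flagged as well.
    """
    flagged = []
    for disk_id, count in current.items():
        if count - previous.get(disk_id, 0) >= min_increase:
            flagged.append(disk_id)
    return sorted(flagged)
```

Alerting on the change in the counter rather than on I/O errors alone is what would surface a handful of bad sectors before a migration trips over them.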

Tool Support : The incident highlighted the lack of flexible data‑repair tools; new solutions and utilities are being designed for CBS.

Program Logic Improvements : Current retry logic only attempts the same cell; future versions will add cross‑cell retry mechanisms.
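A cross‑cell retry policy might look like the sketch below: retry a bounded number of times on one cell, then move to the next. The function name, the `attempts_per_cell` parameter, and the use of `OSError` for a failed read are illustrative assumptions, not the CBS implementation.

```python
from typing import Callable, Optional, Sequence

# A reader returns the bytes at a logical block address or raises OSError.
Reader = Callable[[int], bytes]


def read_with_cross_cell_retry(
    lba: int, readers: Sequence[Reader], attempts_per_cell: int = 2
) -> bytes:
    """Try each cell in order, retrying a few times before moving on."""
    last_error: Optional[OSError] = None
    for reader in readers:
        for _ in range(attempts_per_cell):
            try:
                return reader(lba)
            except OSError as exc:
                last_error = exc
    raise OSError(f"lba {lba} unreadable on all cells") from last_error
```

Bounding the retries per cell keeps a persistently bad sector from stalling the read while still tolerating a transient failure.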

Collaboration : The case deepened cooperation between development and operations, showcasing professional responsibility and passion.
