How Tencent Cloud Recovered Lost Data in a 2‑Day Storage Crisis
In a two‑day incident, Tencent Cloud's CBS team diagnosed cell failures, implemented directed reads and a dual‑cell merge strategy, and restored three‑copy data integrity, while uncovering monitoring gaps and tool limitations that inform future storage operations.
Background : After more than a year of working with distributed storage, the team faced a critical situation where all three data replicas became abnormal, threatening data loss for cloud storage customers.
Alert : An alarm reported five small tables ("dead" tables) with no free tables for migration, prompting immediate operational response.
Initial Analysis : A cell machine had failed the previous day; the system automatically removed it, leaving only two cells (Cell2 and Cell3) and reducing data copies to two. The distribution after removal is shown below:
Automatic disaster migration was triggered, but migration failed due to read errors from the problematic cell's disk, as seen in the next image:
Disk Inspection : Using dmesg, the team confirmed that the disk in the failing cell exhibited errors, illustrated below:
Because the online dbtrasf migration module did not support specifying a cell IP, the team devised a directed‑read strategy.
Directed Read Strategy
After discussion with the developers, the migration tool was modified so that it could read from a specified cell. Development took about 30 minutes, followed by self‑testing. The updated tool was then deployed to the test and simulation environments, where data consistency checks showed no anomalies. The production migration followed and moved four of the five small tables successfully; one table still failed and required further attention.
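The idea of a directed read followed by a consistency check can be sketched roughly as below. This is only an illustration under stated assumptions: the `Cell` class, `select_cell`, `migrate`, and `checksum` are hypothetical names, cells are modeled as in-memory dictionaries, and none of this reflects the actual interface of the CBS migration tool.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Cell:
    """Hypothetical stand-in for one replica holder: maps LBA -> block data."""
    cell_id: str
    blocks: Dict[int, bytes]

    def read(self, lba: int) -> bytes:
        return self.blocks[lba]


def select_cell(cells: List[Cell], source_id: Optional[str] = None) -> Cell:
    """Pick the cell to read from; without source_id, fall back to the first cell."""
    if source_id is None:
        return cells[0]
    for cell in cells:
        if cell.cell_id == source_id:
            return cell
    raise ValueError(f"unknown cell: {source_id}")


def migrate(cells: List[Cell], lbas: Iterable[int],
            source_id: Optional[str] = None) -> Dict[int, bytes]:
    """Copy the requested blocks from one explicitly chosen cell."""
    source = select_cell(cells, source_id)
    return {lba: source.read(lba) for lba in lbas}


def checksum(blocks: Dict[int, bytes]) -> str:
    """Order-independent digest over (lba, data) pairs for a consistency check."""
    digest = hashlib.sha256()
    for lba in sorted(blocks):
        digest.update(lba.to_bytes(8, "big"))
        digest.update(blocks[lba])
    return digest.hexdigest()
```

Comparing `checksum(migrated)` with the checksum of the chosen source is one simple way to confirm that a directed migration copied what was intended before running it in production.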
Multiple Sector Errors
Further analysis with smartctl revealed 14 sector errors on 10.53.65.214 (Cell3) and more than 800 sector errors on 10.53.65.101 (Cell2), indicating varying degrees of data corruption across the three copies. The sector error distribution is illustrated:
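Whether damaged copies can still cover for one another depends on whether their bad regions overlap. A minimal sketch of that reasoning follows, assuming the bad sectors on each cell have already been mapped to the same logical block addresses (the article does not describe how that mapping is done); the function name and the sample numbers are hypothetical.

```python
from typing import Dict, Set


def unrecoverable_lbas(bad_by_cell: Dict[str, Set[int]]) -> Set[int]:
    """Return the LBAs that are bad on every listed cell.

    Any LBA outside this set is readable from at least one cell,
    so only the intersection is lost to a cross-cell merge.
    """
    sets = list(bad_by_cell.values())
    if not sets:
        return set()
    return set.intersection(*sets)


# Illustrative data only, not taken from the incident.
example = {
    "cell2": {100, 101, 102, 500},
    "cell3": {102, 900},
}
overlap = unrecoverable_lbas(example)  # only LBA 102 is bad on both cells
```

If the intersection is empty, reading each block from whichever cell still holds a good copy is enough; a non-empty intersection means another source is required.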
Dual‑Cell Merge Plan : To recover the remaining table, the team decided to merge data from both cells:
Attempt to read from Cell3 (10.53.65.214).
If reading fails, attempt to read from Cell2 (10.53.65.101).
Sample log of the read attempts:
read from the first[diskid=290763668122043122, lba=1069470973952, sid=1]
[2016-09-09 16:13:38] read from the second[diskid=290763668122043122, lba=1069470973952, sid=1]
Even after the dual‑cell merge, the problematic table could not be fully restored, prompting further investigation.
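The read-with-fallback order described above can be sketched as follows. This is an assumption-laden illustration, not the team's tool: `ReadError`, `make_reader`, and `merged_read` are hypothetical names, and each cell is simulated by a dictionary plus a set of unreadable LBAs.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


class ReadError(Exception):
    """Raised when a block cannot be read from a given cell."""


Reader = Callable[[int], bytes]


def make_reader(blocks: Dict[int, bytes], bad: Set[int]) -> Reader:
    """Build a simulated cell reader that fails on the given bad LBAs."""
    def read(lba: int) -> bytes:
        if lba in bad:
            raise ReadError(f"sector error at lba={lba}")
        return blocks[lba]
    return read


def merged_read(
    lbas: Iterable[int], readers: List[Tuple[str, Reader]]
) -> Tuple[Dict[int, Tuple[str, bytes]], List[int]]:
    """Try each cell in order for every LBA.

    Returns (recovered, failed): recovered maps lba -> (cell name, data),
    failed lists the LBAs that no cell could serve.
    """
    recovered: Dict[int, Tuple[str, bytes]] = {}
    failed: List[int] = []
    for lba in lbas:
        for name, read in readers:
            try:
                recovered[lba] = (name, read(lba))
                break
            except ReadError:
                continue  # fall through to the next cell
        else:
            failed.append(lba)  # every cell failed for this block
    return recovered, failed
```

Because the fallback order is just a list, appending another reader extends the merge to an additional source without changing the loop.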
Final Recovery via Removed Cell
Developers discovered that the metadata for the faulty diskid was identical across all three cells, meaning the removed cell still held intact data. By modifying the tool to read from the removed machine, they successfully restored the missing data, as confirmed by the final migration log and the image below:
With three replicas restored, the two‑day, one‑night data rescue concluded successfully.
Lessons Learned
Monitoring Gaps : Existing monitoring only captured I/O errors; fine‑grained disk‑level monitoring is needed to detect sparse sector failures.
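One possible shape for such disk-level monitoring is to watch the sector-related SMART counters rather than waiting for I/O errors to surface. The sketch below parses text in the attribute-table layout printed by `smartctl -A`; the sample text is illustrative rather than taken from the incident, and the function name, watched attribute set, and threshold are assumptions.

```python
from typing import Dict

# Sector-related SMART attributes commonly reported for ATA disks.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Illustrative sample in the `smartctl -A` table layout (made-up values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21034
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       14
"""


def sector_alerts(smart_text: str, threshold: int = 0) -> Dict[str, int]:
    """Return watched attributes whose raw value exceeds the threshold."""
    alerts: Dict[str, int] = {}
    for line in smart_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        if name not in WATCHED:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # skip raw values that are not plain integers
        if value > threshold:
            alerts[name] = value
    return alerts
```

Calling `sector_alerts(SAMPLE)` flags the pending and uncorrectable counters even though the disk as a whole has not yet failed, which is the kind of sparse damage that coarse I/O-error alerts can miss.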
Tool Support : The incident highlighted the lack of flexible data‑repair tools; new solutions and utilities are being designed for CBS.
Program Logic Improvements : Current retry logic only attempts the same cell; future versions will add cross‑cell retry mechanisms.
Collaboration : The case deepened cooperation between development and operations, showcasing professional responsibility and passion.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
