What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review
This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.
Hello, I’m Su San.
Yesterday Yuque published an outage announcement for October 23. The announcement link is provided in the original post.
Incident Timeline
14:07 – The data storage operations team received an alert from the monitoring system; the cause was a bug in a new operations tool that took storage nodes offline during an upgrade.
14:15 – The hardware team was contacted to try to bring the offline machines back online.
15:00 – Because the machines were too old to be brought back directly, the recovery plan switched to restoring data from backup systems.
15:10 – Data restoration began from backups; due to the large data volume, the process took a long time.
19:00 – Data restoration completed; a two‑hour data integrity verification followed.
21:00 – The storage system passed integrity checks and began integration testing with the Yuque team.
22:00 – All Yuque services were restored, and no user data was lost.
Key observations:
14:07 – Early Detection
The monitoring system detected the issue seven minutes before users noticed it, demonstrating effective early warning capabilities. In many companies, incidents discovered by monitoring before user impact receive lighter penalties.
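To make the early-warning idea concrete, one simple form of such monitoring is a heartbeat-based liveness check that flags storage nodes before users feel the impact. A minimal sketch — the node names, threshold, and data shape are hypothetical, not Yuque's actual monitoring setup:

```python
def offline_nodes(heartbeats, now, max_age_s=60):
    """Return nodes whose last heartbeat is older than max_age_s seconds.

    heartbeats: dict mapping node name -> last heartbeat time (unix seconds).
    An alert would fire as soon as this list is non-empty.
    """
    return sorted(node for node, ts in heartbeats.items() if now - ts > max_age_s)
```

In practice a check like this runs on a short interval and pages the on-call team the moment the list is non-empty — which is presumably how the storage team got ahead of users here.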
14:15 – Hardware Intervention
The brief note about contacting the hardware team hints at physical actions in the data center, such as swapping drives, though details are sparse.
15:00 – Decision Window
During the 45‑minute window, the original plan to bring machines back online was deemed infeasible, leading to a new strategy: restore from backups. This required leadership approval and communication of a potentially long outage.
The announced estimate: restoration by 6 PM at the earliest, and by the end of the day at the latest.
The estimate reflects an optimistic three‑hour repair window plus additional time for unforeseen issues.
15:10 – Data Restoration
Restoration from backup began, with the duration proportional to data size. The author notes personal experience with similar data‑fixing efforts that can take weeks.
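The "duration proportional to data size" point reduces to back-of-the-envelope arithmetic: restore time is roughly total bytes divided by sustained restore throughput. A sketch with purely illustrative figures, not Yuque's actual numbers:

```python
def restore_eta_hours(data_tb, throughput_mb_s):
    """Rough restore duration in hours: total megabytes / sustained MB/s."""
    total_mb = data_tb * 1024 * 1024  # TB -> MB (binary units)
    return total_mb / throughput_mb_s / 3600.0
```

For example, a few terabytes at a sustained ~100 MB/s already lands in the multi-hour range, which is consistent with the nearly four hours observed in this incident.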
19:00 – Completion and Verification
Restoration finished in just under four hours (15:10 to 19:00), about an hour longer than the optimistic estimate. A subsequent two‑hour data integrity check confirmed that no data had been lost, a critical step for maintaining user trust.
21:00 & 22:00 – Final Checks and Service Resumption
After passing integrity checks, the storage system was integrated with Yuque, and full service was restored with all user data intact.
Improvement Measures
Guiding principle: "Monitorable, Gradual, Rollbackable".
Capability upgrade: Move from same‑region multi‑replica to a two‑region three‑center high‑availability architecture.
Regular drills: Conduct periodic disaster‑recovery and emergency response exercises.
All developers, operations, and QA staff should keep this nine‑character mantra — monitorable, gradual, rollbackable — in mind when designing, coding, testing, and deploying.
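The "gradual" and "rollbackable" parts of the mantra can be sketched as a staged rollout: upgrade a small wave of nodes, verify health, and only then widen the blast radius, so a rollback touches only the wave already upgraded. The stage fractions and node naming below are illustrative assumptions:

```python
STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per wave (illustrative)

def next_wave(nodes, upgraded, stages=STAGES):
    """Return the next batch of nodes to upgrade.

    nodes: ordered list of all nodes; upgraded: set already on the new version.
    Each wave grows the upgraded set to the next stage fraction; between waves
    the operator checks health metrics and can roll back just the upgraded set.
    """
    total = len(nodes)
    done = len(upgraded)
    for fraction in stages:
        target = max(1, int(total * fraction))
        if done < target:
            remaining = [n for n in nodes if n not in upgraded]
            return remaining[: target - done]
    return []  # rollout complete
```

Had the operations tool at the root of this outage been rolled out in waves like this, its bug might have surfaced on a single node rather than taking down the storage tier.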
Final Thoughts
The outage serves as a valuable learning case. Yuque offered six months of membership as compensation.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.