What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review
This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.
Hello, I’m Su San.
Yesterday Yuque published an outage announcement for October 23. The announcement link is provided in the original post.
Incident Timeline
14:07 – The data storage operations team received an alert from the monitoring system; the cause was a bug in a new operations tool that took storage nodes offline during an upgrade.
14:15 – The hardware team was contacted to try to bring the offline machines back online.
15:00 – Because the machines were too old to be brought back directly, the recovery plan switched to restoring data from backup systems.
15:10 – Data restoration began from backups; due to the large data volume, the process took a long time.
19:00 – Data restoration completed; a two‑hour data integrity verification followed.
21:00 – The storage system passed integrity checks and began integration testing with the Yuque team.
22:00 – All Yuque services were restored, and no user data was lost.
Key observations:
14:07 – Early Detection
The monitoring system detected the issue seven minutes before users noticed it, demonstrating effective early warning capabilities. In many companies, incidents discovered by monitoring before user impact receive lighter penalties.
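To make the early-warning idea concrete, one simple form of such monitoring is a heartbeat-based liveness check that flags storage nodes before users feel the impact. A minimal sketch — the node names, threshold, and data shape are hypothetical, not Yuque's actual monitoring setup:

```python
def offline_nodes(heartbeats, now, max_age_s=60):
    """Return nodes whose last heartbeat is older than max_age_s seconds.

    heartbeats: dict mapping node name -> last heartbeat time (unix seconds).
    An alert would fire as soon as this list is non-empty.
    """
    return sorted(node for node, ts in heartbeats.items() if now - ts > max_age_s)
```

In practice a check like this runs on a short interval and pages the on-call team the moment the list is non-empty — which is presumably how the storage team got ahead of users here.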
14:15 – Hardware Intervention
The brief note about contacting the hardware team hints at physical actions in the data center, such as swapping drives, though details are sparse.
15:00 – Decision Window
During the 45‑minute window, the original plan to bring machines back online was deemed infeasible, leading to a new strategy: restore from backups. This required leadership approval and communication of a potentially long outage.
The announced estimate: restoration by 6 PM at the earliest, and by the end of the day at the latest.
The estimate reflects an optimistic three‑hour repair window plus additional time for unforeseen issues.
15:10 – Data Restoration
Restoration from backup began, with the duration proportional to data size. The author notes personal experience with similar data‑fixing efforts that can take weeks.
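The "duration proportional to data size" point reduces to back-of-the-envelope arithmetic: restore time is roughly total bytes divided by sustained restore throughput. A sketch with purely illustrative figures, not Yuque's actual numbers:

```python
def restore_eta_hours(data_tb, throughput_mb_s):
    """Rough restore duration in hours: total megabytes / sustained MB/s."""
    total_mb = data_tb * 1024 * 1024  # TB -> MB (binary units)
    return total_mb / throughput_mb_s / 3600.0
```

For example, a few terabytes at a sustained ~100 MB/s already lands in the multi-hour range, which is consistent with the nearly four hours observed in this incident.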
19:00 – Completion and Verification
Restoration finished in just under four hours (15:10 to 19:00), about an hour longer than the optimistic estimate. A subsequent two‑hour data integrity check confirmed that no data had been lost, a critical step for maintaining user trust.
21:00 & 22:00 – Final Checks and Service Resumption
After passing integrity checks, the storage system was integrated with Yuque, and full service was restored with all user data intact.
Improvement Measures
Guiding principle: "Monitorable, Gradual, Rollbackable".
Capability upgrade: Move from same‑region multi‑replica to a two‑region three‑center high‑availability architecture.
Regular drills: Conduct periodic disaster‑recovery and emergency response exercises.
All developers, operations, and QA staff should keep this nine‑character mantra — monitorable, gradual, rollbackable — in mind when designing, coding, testing, and deploying.
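The "gradual" and "rollbackable" parts of the mantra can be sketched as a staged rollout: upgrade a small wave of nodes, verify health, and only then widen the blast radius, so a rollback touches only the wave already upgraded. The stage fractions and node naming below are illustrative assumptions:

```python
STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per wave (illustrative)

def next_wave(nodes, upgraded, stages=STAGES):
    """Return the next batch of nodes to upgrade.

    nodes: ordered list of all nodes; upgraded: set already on the new version.
    Each wave grows the upgraded set to the next stage fraction; between waves
    the operator checks health metrics and can roll back just the upgraded set.
    """
    total = len(nodes)
    done = len(upgraded)
    for fraction in stages:
        target = max(1, int(total * fraction))
        if done < target:
            remaining = [n for n in nodes if n not in upgraded]
            return remaining[: target - done]
    return []  # rollout complete
```

Had the operations tool at the root of this outage been rolled out in waves like this, its bug might have surfaced on a single node rather than taking down the storage tier.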
Final Thoughts
The outage serves as a valuable learning case. Yuque offered six months of membership as compensation.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.