How Youzan Manages Online Incidents: A Step‑by‑Step Guide
This article outlines Youzan's end-to-end process for managing online incidents, from fault detection and coordination through root-cause analysis, recovery, and review to action tracking in JIRA. It covers the practical workflow, the fault data analysis, and the continuous-improvement practices that support reliable service delivery.
Online faults refer to situations where an IT service provided to customers becomes partially or completely unavailable, including performance degradation such as increased latency that harms user experience.
In early‑stage startups, the rush to release new features often outweighs quality concerns, creating technical debt that can trigger online faults, leading to degraded customer experience and financial loss.
The goal of fault management is to restore services to normal operation as quickly as possible while minimizing adverse business impact, thereby preserving service quality and availability.
After a fault occurs, an emergency response team locates, analyzes, and resolves the issue, then conducts a post‑mortem review to create actionable items that improve handling efficiency and prevent recurrence.
The following sections briefly introduce Youzan's fault‑management practice.
Fault‑Handling Process Overview
Youzan uses JIRA for cross-department collaboration, and the online fault-management workflow is built on it. Fault tickets follow a defined workflow, and each follow-up action for a fault is created as a sub-task under the main fault ticket, using JIRA's default sub-task workflow.
Confirm Fault and Notify Coordinator
When a customer, internal staff member, or monitoring system reports a potential fault, the reporter quickly verifies that it is a real fault.
Once confirmed, a fault JIRA ticket is created and the fault coordinator (from the R&D efficiency team) is notified to synchronize information between business, technical, and product teams.
The coordinator makes sure all relevant departments are informed and posts the fault to the "Availability Assurance" WeChat group. Investigation and discussion then take place either in that group or in a dedicated fault-handling group.
Locate and Resolve the Fault
To avoid unrelated noise, the fault‑handling team forms an emergency response group (via WeChat or co‑location) to improve efficiency.
After identifying the root cause, the handler updates the coordinator with the cause and estimated repair time. For long‑running incidents, the team provides progress updates to business stakeholders every half hour.
Fault Recovery
If the fault is caused by a recent release, the code is rolled back to a stable version preceding the incident.
After recovery, the handler confirms with affected business parties whether any data needs restoration, reports the impact to the coordinator, and assists the business side in repairing the data promptly.
Organize Fault Review
A fault review is scheduled within 24 hours after resolution and includes a walkthrough of the incident, root‑cause analysis, preventive measures, and fault classification. The output is a fault analysis report.
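The classification step of the review can be illustrated with a small sketch. It assumes a fault is graded P1 to P4 from two inputs, impact scope and duration. The `Fault` type, the `RULES` table, and every threshold in it are hypothetical placeholders, not Youzan's actual criteria.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; the real criteria are
# not given in this article.
# Each rule: (level, minimum share affected, minimum duration in minutes).
RULES = [
    ("P1", 0.50, 30),
    ("P2", 0.20, 15),
    ("P3", 0.05, 5),
]


@dataclass
class Fault:
    affected_ratio: float   # share of users or orders affected, 0.0 to 1.0
    duration_minutes: int   # time from start of impact to recovery


def classify(fault: Fault) -> str:
    """Return the most severe level whose scope and duration thresholds are both met."""
    for level, min_ratio, min_minutes in RULES:
        if fault.affected_ratio >= min_ratio and fault.duration_minutes >= min_minutes:
            return level
    return "P4"  # anything below the P3 thresholds


if __name__ == "__main__":
    print(classify(Fault(affected_ratio=0.25, duration_minutes=20)))
```

Keeping the thresholds in a table rather than in nested conditionals would let each business unit supply its own rules without changing the function.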
Faults are classified into four levels (P1–P4) based on business impact scope and duration, with each business unit defining its own criteria. The current fault‑report template is shown below:
Synchronize Fault Report
Participants in the fault review typically include the handler, coordinator, responsible owner, and the owner’s team lead; the report author may join voluntarily.
To keep all technical teammates informed, the fault owner shares the final report in the product‑technology group.
Create Action Sub‑Tasks in JIRA
The fault owner creates a sub‑task for each action under the main JIRA fault ticket, sets the sub‑task’s due date to the action’s deadline, and assigns it to the responsible executor.
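If the sub-tasks were created through JIRA's REST API instead of the web UI, the request body for one action might look like the sketch below. It only builds the JSON and does not send it. The ticket key `FAULT-123`, the project key `FAULT`, the user name, and the summary are hypothetical, and the `assignee` shape shown is the JIRA Server style (JIRA Cloud identifies users by `accountId` instead).

```python
import json
from datetime import date


def build_subtask_payload(parent_key, project_key, summary, assignee, due):
    """Build the JSON body for creating one action sub-task under a fault ticket."""
    return {
        "fields": {
            "project": {"key": project_key},
            "parent": {"key": parent_key},       # the main fault ticket
            "issuetype": {"name": "Sub-task"},   # JIRA's default sub-task type name
            "summary": summary,
            "duedate": due.isoformat(),          # the action's deadline, YYYY-MM-DD
            "assignee": {"name": assignee},      # Server style; an assumption here
        }
    }


# Hypothetical example values, for illustration only.
payload = build_subtask_payload(
    parent_key="FAULT-123",
    project_key="FAULT",
    summary="Add alert on order-queue backlog",
    assignee="zhang.san",
    due=date(2024, 3, 15),
)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```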
Track Faults and Actions
JIRA boards provide a visual tool for moving tasks across workflow states. Youzan uses a Kanban board with three lanes: Faults, Overdue Fault Actions, and Pending Fault Actions.
If an action’s due date has passed, it appears in the Overdue lane; otherwise, it stays in the Pending lane. The coordinator regularly follows up on overdue actions and posts reminders in the product‑technology group.
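The lane rule above can be written as a small function. This is a sketch: the lane names come from the board description, while the function name and the handling of completed actions (they leave both lanes) are assumptions.

```python
from datetime import date
from typing import Optional


def lane_for(action_due: date, done: bool, today: date) -> Optional[str]:
    """Return the board lane for a fault action, or None once it is done."""
    if done:
        return None  # assumption: completed actions drop off both lanes
    if action_due < today:
        return "Overdue Fault Actions"   # due date has passed
    return "Pending Fault Actions"       # due today or later


if __name__ == "__main__":
    print(lane_for(date(2024, 3, 1), done=False, today=date(2024, 3, 10)))
```

An action due today is still treated as pending, since its due date has not yet passed.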
Fault Data Analysis
By analyzing fault data stored in JIRA and Confluence (exported to Numbers), Youzan examines metrics such as monthly fault count, average resolution time, fault‑level distribution, fault‑type distribution, source breakdown, and per‑team fault counts.
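A few of these metrics can be computed with a short script. The sketch below assumes the exported fault records have been loaded as tuples of level, team, start time, and resolution time. The field layout and the sample rows are hypothetical, not Youzan's data.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical export rows: (level, team, started, resolved)
FAULTS = [
    ("P2", "trade", datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 40)),
    ("P4", "trade", datetime(2024, 1, 20, 9, 0), datetime(2024, 1, 20, 9, 20)),
    ("P3", "pay", datetime(2024, 2, 2, 14, 0), datetime(2024, 2, 2, 15, 0)),
]


def summarize(faults):
    """Compute monthly counts, level and team distributions, and mean resolution time."""
    return {
        "monthly_count": Counter(start.strftime("%Y-%m") for _, _, start, _ in faults),
        "level_distribution": Counter(level for level, _, _, _ in faults),
        "per_team_count": Counter(team for _, team, _, _ in faults),
        "avg_resolution_minutes": mean(
            (end - start).total_seconds() / 60 for _, _, start, end in faults
        ),
    }


if __name__ == "__main__":
    for name, value in summarize(FAULTS).items():
        print(name, value)
```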
Combined with release-frequency data, the analysis shows that months with more releases tend to have more online issues and faults, which suggests that reducing release frequency and standardizing the release process can lower the fault rate.
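One way to quantify such a relationship is a Pearson correlation between monthly release counts and monthly fault counts. The article does not say which method was used, so this is an illustrative sketch only, and the monthly figures are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical monthly figures, for illustration only.
releases_per_month = [40, 55, 30, 70, 65, 45]
faults_per_month = [3, 5, 2, 7, 6, 4]

r = pearson(releases_per_month, faults_per_month)

if __name__ == "__main__":
    print(f"correlation between releases and faults: {r:.2f}")
```

A coefficient near 1 indicates that the two series rise and fall together. It shows association only, so it supports the observation about release frequency without proving that releases cause the faults.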
Conclusion
Designing a process is straightforward; enforcing and monitoring it is the challenge. Youzan’s R&D efficiency team tracks and supervises the online fault‑handling workflow, ensuring every incident undergoes review and produces a complete analysis report that is shared with all technical staff. Each action is assigned an owner and a clear deadline.
After more than a year of fault management, Youzan has accumulated data that guides improvement. The practice has also raised fault awareness, strengthened respect for the production environment, and built a culture of rapid response.
There is still room for improvement. Faults in internal systems are not yet managed. Fault information is scattered across JIRA, Confluence, and reports, with no unified search or automatically generated reporting platform. Event management is still immature: many incidents are reported by customers rather than detected by monitoring.
Note: This article is reproduced from the Youzan technology team blog, author Yang Bo.