Treat Every Ops Change Like a Project: Lessons from a Simple Raid Rebuild
The article uses a real-world RAID-rebuild incident to illustrate why operations teams must understand a change's background, schedule, and risk, act as project managers, follow a formal change process, and treat production environments with the utmost respect.
Operations is often an awkward role in IT, misunderstood by fresh graduates and other engineers, and its impact on KPIs is hard to demonstrate.
Many newcomers have never encountered the role and are unclear about its responsibilities.
Some seasoned engineers dismiss it as "low‑level" work that only involves restarts and configuration pushes.
Veterans in ops feel their contributions are hard to quantify, leading to questions about the purpose of their work.
Ops cannot be taught solely in school; it requires hands‑on experience.
A small incident and the thoughts it triggered
A colleague posted a request to rebuild RAID on a server and ignore alerts.
127.0.0.1 rebuild RAID, ignore the alerts @ColleagueB @ColleagueC
Although the request seemed simple, recent production failures made it clear that hidden risks could be severe.
Ops need to understand the “change background”
Me: A, do you know the background of this change?
A: X told me we need to redo the RAID.
Me: Why redo it?
A: To deploy an FTP service on RAID 5; the current setup is non-RAID.
Understanding the purpose prevents blind execution that can lead to costly incidents.
Each change is like merging onto a busy road: the more merges, the higher the accident risk. Teams must assess whether a change is truly necessary and challenge requirements when appropriate.
Caption: A change can be as dangerous as a car crash.
Ops need to know the “right time” for change
Me: When will you do it?
A: As soon as I get the task.
Me: There's a product demo from 2 to 3 PM; a mistake could ruin it. What would the client think?
A: I didn't consider that.
Principles for timing:
Avoid peak business periods and critical windows.
Do not overlap with other changes in the same product line.
Do not overlap with changes in related product lines.
In the example, A missed the demo schedule because his information channels were limited. The fix is to treat each change as a project: synchronize its progress with every stakeholder, and let each of them assess the risk from their own vantage point.
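The timing principles above can be made mechanical with a pre-flight window check. The sketch below is illustrative only: the `BLOCKED_WINDOWS` entries (demo, traffic peak) and the interval boundaries are hypothetical, not taken from the article.

```python
from datetime import datetime

# Hypothetical blocked windows: business peaks, demos, overlapping changes.
BLOCKED_WINDOWS = [
    (datetime(2024, 5, 10, 14, 0), datetime(2024, 5, 10, 15, 0), "product demo"),
    (datetime(2024, 5, 10, 9, 0), datetime(2024, 5, 10, 11, 0), "traffic peak"),
]

def change_window_ok(start, end):
    """Return (ok, reason): reject any change that overlaps a blocked window."""
    for w_start, w_end, label in BLOCKED_WINDOWS:
        if start < w_end and end > w_start:  # the two intervals overlap
            return False, f"overlaps {label} ({w_start:%H:%M}-{w_end:%H:%M})"
    return True, "clear"

# A 2:30-4:00 PM RAID rebuild collides with the 2-3 PM demo.
print(change_window_ok(datetime(2024, 5, 10, 14, 30), datetime(2024, 5, 10, 16, 0)))
```

Had A run even this crude a check, the demo conflict would have surfaced before the change started rather than after.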
Ops must act as the “change project manager”
Me: Do you know the server's prior state?
A: No.
Me: If critical services are still running on it, what could happen?
A: X said the machine is new and has no services.
Me: Ops must be the final safety net, questioning assumptions and confirming the facts.
Like firefighters confirming no hidden explosives before entering a blaze, ops must verify upstream and downstream dependencies before proceeding.
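The "final safety net" check can be encoded rather than trusted to memory. A minimal sketch, assuming a hypothetical `running_services` inventory supplied by the host (the function name and the baseline-agent list are illustrative, not a real API):

```python
def safe_to_rebuild(running_services, allowed=("sshd", "node_exporter")):
    """Only proceed if nothing beyond the baseline agents is running.

    running_services: hypothetical list of process names reported by the host.
    """
    unexpected = [s for s in running_services if s not in allowed]
    if unexpected:
        raise RuntimeError(f"abort change: services still running: {unexpected}")
    return True

# "X said the machine is new" -- verify the claim instead of trusting it.
print(safe_to_rebuild(["sshd"]))  # only a baseline agent: safe to proceed
```

Passing `["sshd", "ftpd"]` would raise instead of returning, which is exactly the behavior you want from a safety net: fail loudly before the change, not after.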
Follow the change process
Me: Why rebuild the RAID first and only then ask others to ignore the alerts?
A: It seems fine; it's just a quick heads-up.
Me: Why not disable the alerts first, then make the change?
A: Is that really necessary?
Me: Overlooking details like this can cause massive outages. I once caused 600 nodes to crash; detection was delayed by 5 minutes and the losses were huge.
Typical change workflow:
Requirement confirmation
Stakeholder identification
Solution discussion
Solution and schedule finalization
Change ticket creation
Ticket review
Approval and reporting
Change announcement
Implementation
Effect feedback (and rollback if needed)
Following the process helps identify risks, standardize documentation, and enable automation, reducing failure probability dramatically.
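The workflow above lends itself to enforcement in tooling: each step may only begin once its predecessor is complete, which also forces the ordering lesson from the alerts anecdote (announcement before implementation). A minimal sketch; the `ChangeTicket` class and step identifiers are illustrative, not a real change-management system:

```python
# Step names follow the article's workflow, in order.
STEPS = [
    "requirement_confirmation", "stakeholder_identification",
    "solution_discussion", "solution_finalization",
    "ticket_creation", "ticket_review", "approval",
    "announcement", "implementation", "effect_feedback",
]

class ChangeTicket:
    def __init__(self):
        self.done = []

    def complete(self, step):
        """Mark a step done; reject any step attempted out of sequence."""
        expected = STEPS[len(self.done)]
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.done.append(step)

t = ChangeTicket()
t.complete("requirement_confirmation")
# t.complete("implementation")  # would raise: earlier steps still pending
```

Encoding the sequence this way is what makes the process automatable: a ticket cannot reach "implementation" without a reviewed, approved, and announced change behind it.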
One truth
Operations orchestrates the system as a whole, and attention to detail is essential. The guiding principle: respect the production environment.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and will accompany you through your operations career, growing together.