Treat Every Ops Change Like a Project: Lessons from a Simple Raid Rebuild
The article uses a real-world RAID-rebuild incident to illustrate why operations teams must understand a change's background, schedule, and risk, act as project managers, follow a formal change process, and treat production environments with the utmost respect.
Operations is often an awkward role in IT, misunderstood by fresh graduates and other engineers, and its impact on KPIs is hard to demonstrate.
Many newcomers have never encountered the role and are unclear about its responsibilities.
Some seasoned engineers dismiss it as "low‑level" work that only involves restarts and configuration pushes.
Veterans in ops feel their contributions are hard to quantify, leading to questions about the purpose of their work.
Ops cannot be taught solely in school; it requires hands‑on experience.
A small incident and the thoughts it triggered
A colleague posted a request to rebuild RAID on a server and ignore alerts.
127.0.0.1 rebuild RAID, ignore the alerts @ColleagueB @ColleagueC
Although the request seemed simple, recent production failures made it clear that hidden risks could be severe.
Ops need to understand the “change background”
Me: A, do you know the background of this change?
A: X told me we need to redo the RAID.
Me: Why redo it?
A: To deploy an FTP service on RAID 5; the current setup is non-RAID.
Understanding the purpose prevents blind execution that can lead to costly incidents.
Each change is like merging onto a busy road: the more merges, the higher the accident risk. Teams must assess whether a change is truly necessary and challenge requirements when appropriate.
Caption: A change can be as dangerous as a car crash.
Ops need to know the “right time” for change
Me: When will you do it?
A: As soon as I get the task.
Me: There's a product demo from 2 to 3 PM; a mistake could ruin it. What would the client think?
A: I didn't consider that.
Principles for timing:
Avoid peak business periods and critical windows.
Do not overlap with other changes in the same product line.
Do not overlap with changes in related product lines.
In the example, A missed the demo schedule because his information channels were limited. The fix is to treat each change as a project: synchronize its progress with every stakeholder, and let each of them assess the risk from their own vantage point.
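The timing principles above can be made mechanical with a pre-flight window check. The sketch below is illustrative only: the `BLOCKED_WINDOWS` entries (demo, traffic peak) and the interval boundaries are hypothetical, not taken from the article.

```python
from datetime import datetime

# Hypothetical blocked windows: business peaks, demos, overlapping changes.
BLOCKED_WINDOWS = [
    (datetime(2024, 5, 10, 14, 0), datetime(2024, 5, 10, 15, 0), "product demo"),
    (datetime(2024, 5, 10, 9, 0), datetime(2024, 5, 10, 11, 0), "traffic peak"),
]

def change_window_ok(start, end):
    """Return (ok, reason): reject any change that overlaps a blocked window."""
    for w_start, w_end, label in BLOCKED_WINDOWS:
        if start < w_end and end > w_start:  # the two intervals overlap
            return False, f"overlaps {label} ({w_start:%H:%M}-{w_end:%H:%M})"
    return True, "clear"

# A 2:30-4:00 PM RAID rebuild collides with the 2-3 PM demo.
print(change_window_ok(datetime(2024, 5, 10, 14, 30), datetime(2024, 5, 10, 16, 0)))
```

Had A run even this crude a check, the demo conflict would have surfaced before the change started rather than after.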
Ops must act as the “change project manager”
Me: Do you know the server's prior state?
A: No.
Me: If critical services are still running on it, what could happen?
A: X said the machine is new and has no services.
Me: Ops must be the final safety net, questioning assumptions and confirming the facts.
Like firefighters confirming no hidden explosives before entering a blaze, ops must verify upstream and downstream dependencies before proceeding.
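The "final safety net" check can be encoded rather than trusted to memory. A minimal sketch, assuming a hypothetical `running_services` inventory supplied by the host (the function name and the baseline-agent list are illustrative, not a real API):

```python
def safe_to_rebuild(running_services, allowed=("sshd", "node_exporter")):
    """Only proceed if nothing beyond the baseline agents is running.

    running_services: hypothetical list of process names reported by the host.
    """
    unexpected = [s for s in running_services if s not in allowed]
    if unexpected:
        raise RuntimeError(f"abort change: services still running: {unexpected}")
    return True

# "X said the machine is new" -- verify the claim instead of trusting it.
print(safe_to_rebuild(["sshd"]))  # only a baseline agent: safe to proceed
```

Passing `["sshd", "ftpd"]` would raise instead of returning, which is exactly the behavior you want from a safety net: fail loudly before the change, not after.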
Follow the change process
Me: Why rebuild the RAID first and only then ask others to ignore the alerts?
A: It seems fine; it's just a quick heads-up.
Me: Why not disable the alerts first, then make the change?
A: Is that really necessary?
Me: Overlooking details like this can cause massive outages. I once caused 600 nodes to crash; detection was delayed by 5 minutes and the losses were huge.
Typical change workflow:
Requirement confirmation
Stakeholder identification
Solution discussion
Solution and schedule finalization
Change ticket creation
Ticket review
Approval and reporting
Change announcement
Implementation
Effect feedback (and rollback if needed)
Following the process helps identify risks, standardize documentation, and enable automation, reducing failure probability dramatically.
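The workflow above lends itself to enforcement in tooling: each step may only begin once its predecessor is complete, which also forces the ordering lesson from the alerts anecdote (announcement before implementation). A minimal sketch; the `ChangeTicket` class and step identifiers are illustrative, not a real change-management system:

```python
# Step names follow the article's workflow, in order.
STEPS = [
    "requirement_confirmation", "stakeholder_identification",
    "solution_discussion", "solution_finalization",
    "ticket_creation", "ticket_review", "approval",
    "announcement", "implementation", "effect_feedback",
]

class ChangeTicket:
    def __init__(self):
        self.done = []

    def complete(self, step):
        """Mark a step done; reject any step attempted out of sequence."""
        expected = STEPS[len(self.done)]
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.done.append(step)

t = ChangeTicket()
t.complete("requirement_confirmation")
# t.complete("implementation")  # would raise: earlier steps still pending
```

Encoding the sequence this way is what makes the process automatable: a ticket cannot reach "implementation" without a reviewed, approved, and announced change behind it.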
One truth
Operations orchestrates the system as a whole, and attention to detail is essential. The guiding principle: respect the production environment.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and will accompany you through your operations career, growing together.