Why Ops Must Treat Every Change Like a Project: Lessons from a RAID Incident
The article reflects on the often‑overlooked challenges of IT operations, using a real‑world RAID change incident to illustrate why understanding change background, timing, and process is essential for risk mitigation and treating each change as a mini‑project.
Operations Is an Awkward Role
Operations is often considered an awkward position in IT: many fresh graduates have never encountered it, other technical staff view it as low‑level work, and ops engineers themselves struggle to show their impact through KPIs.
Operations skills cannot be learned in school alone; they require hands‑on practice. This article does not try to define what ops means. It discusses the operational awareness that anyone working on a production line should have.
A Small Incident Sparks Big Thoughts
A colleague (A) reported a need to restart a server and redo RAID. Although the request seemed simple, recent production failures prompted a deeper conversation.
127.0.0.1 redo RAID, ignore alerts @colleagueB @colleagueC
The discussion revealed three key questions for any change:
Do you understand the background of the change?
Is the timing appropriate?
Are you aware of the server’s prior state and potential impact?
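The three questions above can be kept as an explicit pre-change checklist. The following is only a minimal sketch: the `ChangeRequest` record, its field names, and the `open_questions` helper are hypothetical and not part of any tool mentioned in the article.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record of what the operator has confirmed before a change."""
    target: str
    background_understood: bool = False  # operator can explain why the change is needed
    timing_checked: bool = False         # window reviewed against peaks, demos, other changes
    prior_state_verified: bool = False   # operator inspected the server instead of trusting hearsay


def open_questions(change: ChangeRequest) -> list[str]:
    """Return the checklist questions that are still unanswered."""
    questions = []
    if not change.background_understood:
        questions.append("Do you understand the background of the change?")
    if not change.timing_checked:
        questions.append("Is the timing appropriate?")
    if not change.prior_state_verified:
        questions.append("Are you aware of the server's prior state and potential impact?")
    return questions


# Models a request where none of the three questions has been answered yet.
request = ChangeRequest(target="127.0.0.1")
for question in open_questions(request):
    print("UNANSWERED:", question)
```

A change would proceed only once `open_questions` returns an empty list; how each flag gets set (ticket fields, a review form, a chat bot) is left open here.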
Understanding the Change Background
A answered that the RAID rebuild was needed to deploy an FTP service, but he was about to proceed without asking why the deployment required it.
Me: Do you know why the change is needed?
A: Because X told me to redo RAID.
Blindly executing changes without understanding their purpose increases risk, much like driving without checking for hazards.
Choosing the Right Time
When asked about the timing, A said he would act immediately. I pointed out that the change would collide with a product demo scheduled for 2‑3 pm, which shows why changes must avoid peak periods.
Me: What if the change causes the demo to fail?
A: I didn’t think of that.
Principles for timing:
Avoid business peak periods.
Ensure changes on the same product line are mutually exclusive, so that no two of them run at the same time.
Coordinate with other product lines.
In this case, A missed the demo notice and so violated the first principle: avoid business peak periods.
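The timing principles amount to checking the planned change window against a list of protected windows (business peaks, other changes on the same product line, events announced by other product lines). Below is a rough sketch of that check. The function names, the calendar date, and the change window are illustrative assumptions; the source only states that the demo ran from 2 to 3 pm.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b):
    """Two half-open intervals overlap when each starts before the other ends."""
    return start_a < end_b and start_b < end_a


def timing_conflicts(change_start, change_end, protected_windows):
    """Return the labels of protected windows that the planned change would overlap."""
    return [
        label
        for label, start, end in protected_windows
        if overlaps(change_start, change_end, start, end)
    ]


day = datetime(2024, 1, 15)  # arbitrary date, chosen only for illustration
protected = [
    ("product demo", day.replace(hour=14), day.replace(hour=15)),
]

# A change started "immediately" at 14:30 and expected to take an hour.
conflicts = timing_conflicts(
    day.replace(hour=14, minute=30),
    day.replace(hour=15, minute=30),
    protected,
)
print(conflicts)  # ['product demo']
```

An empty result does not prove the window is safe. It only means that no announced event collides with it, so the list of protected windows has to be kept current.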
Acting as the Change Project Manager
Ops should treat each change as a project, synchronizing progress with stakeholders and conducting risk assessments.
Me: Do you know the server’s previous state?
A: No.
Me: If core services are still running, what could happen?
A: X said the machine was new.
Ops must question assumptions, similar to firefighters confirming no hidden explosives before entering a fire.
Following a Structured Change Process
The recommended change workflow is:
Requirement → Stakeholder Confirmation → Solution Discussion → Solution & Timing Confirmation → Change Request Draft → Review → Approval → Announcement → Implementation → Feedback (optional rollback).
Following this process helps identify risks, standardize documentation, and reduce failure probability.
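One way to make the workflow hard to skip is to model it as an ordered set of stages and refuse any jump. This is a minimal sketch, assuming a strictly linear flow; the `Stage` and `ChangeWorkflow` names are hypothetical, and the optional rollback is left out for brevity.

```python
from enum import Enum


class Stage(Enum):
    """Workflow stages in the order listed above."""
    REQUIREMENT = 1
    STAKEHOLDER_CONFIRMATION = 2
    SOLUTION_DISCUSSION = 3
    SOLUTION_AND_TIMING_CONFIRMATION = 4
    CHANGE_REQUEST_DRAFT = 5
    REVIEW = 6
    APPROVAL = 7
    ANNOUNCEMENT = 8
    IMPLEMENTATION = 9
    FEEDBACK = 10


class ChangeWorkflow:
    """Tracks one change and only allows moving to the next stage in order."""

    def __init__(self):
        self.stage = Stage.REQUIREMENT

    def advance(self, to: Stage) -> None:
        if to.value != self.stage.value + 1:
            raise ValueError(f"cannot jump from {self.stage.name} to {to.name}")
        self.stage = to


workflow = ChangeWorkflow()
try:
    # Acting immediately on a request skips eight stages.
    workflow.advance(Stage.IMPLEMENTATION)
except ValueError as error:
    print(error)  # cannot jump from REQUIREMENT to IMPLEMENTATION
```

A real process would probably allow moving backwards as well, for example from Review to Solution Discussion. The sketch only shows the forward path.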
Final Takeaway
Even a tiny change can teach many lessons; ops engineers must be meticulous, treat changes as projects, and respect production environments.
Respect the production environment.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
