Why Ops Must Treat Every Change Like a Project: Lessons from a RAID Incident

The article reflects on the often‑overlooked challenges of IT operations, using a real‑world RAID change incident to illustrate why understanding change background, timing, and process is essential for risk mitigation and treating each change as a mini‑project.

Open Source Linux

Mar 14, 2021

Operations Is an Awkward Role

Operations is often seen as an awkward position in IT: many fresh graduates have never encountered it, other technical staff regard it as low‑level work, and ops engineers themselves struggle to show their impact in KPIs.

Operations skills cannot be learned in school alone; they require hands‑on practice. This article does not try to define what ops means. Instead, it discusses the operational awareness that working on production systems requires.

A Small Incident Sparks Big Thoughts

A colleague (A) announced that he needed to restart a server and redo its RAID. The request seemed simple, but recent production failures prompted a deeper conversation.

127.0.0.1 redo RAID, ignore alerts @colleagueB @colleagueC

The discussion revealed three key questions for any change:

Do you understand the background of the change?

Is the timing appropriate?

Are you aware of the server’s prior state and potential impact?

Understanding the Change Background

A answered that the RAID had to be redone so that an FTP service could be deployed, but he was ready to proceed without asking why.

Me: Do you know why the change is needed?

A: Because X told me to redo RAID.

Blindly executing changes without understanding their purpose increases risk, much like driving without checking for hazards.

Choosing the Right Time

When asked about the timing, A said he would act immediately. I pointed out that the change would overlap a product demo scheduled for 2 to 3 pm, which shows why changes need to avoid peak periods.

Me: What if the change causes the demo to fail?

A: I didn’t think of that.

Principles for timing:

a. Avoid business peak periods.

b. Ensure changes on the same product line are mutually exclusive.

c. Coordinate with other product lines.

In this case, A missed the demo notice, violating principle a.
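
The three timing principles can be checked mechanically before a change is scheduled. The following is a minimal sketch, not the author's tooling: the `Window` class, the `timing_conflicts` function, the "storage" product line, and the calendar date are all hypothetical. Only the 2 to 3 pm demo comes from the incident.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    """A scheduled time window on a product line (hypothetical model)."""
    label: str
    product_line: str
    start: datetime
    end: datetime
    is_peak: bool = False  # True for demos and other business peaks


def overlaps(a: Window, b: Window) -> bool:
    # Two half-open intervals [start, end) overlap when each one
    # starts before the other ends.
    return a.start < b.end and b.start < a.end


def timing_conflicts(change: Window, calendar: list[Window]) -> list[str]:
    """Return the timing principles (a, b, c) that the change would violate."""
    problems = []
    for entry in calendar:
        if not overlaps(change, entry):
            continue
        if entry.is_peak:
            problems.append(f"a: overlaps business peak '{entry.label}'")
        elif entry.product_line == change.product_line:
            problems.append(f"b: overlaps change '{entry.label}' on the same product line")
        else:
            problems.append(f"c: coordinate with '{entry.label}' on {entry.product_line}")
    return problems


# The date and product line are invented for illustration.
demo = Window("product demo", "storage",
              datetime(2021, 3, 1, 14, 0), datetime(2021, 3, 1, 15, 0), is_peak=True)
raid_now = Window("redo RAID", "storage",
                  datetime(2021, 3, 1, 14, 30), datetime(2021, 3, 1, 15, 30))
raid_evening = Window("redo RAID", "storage",
                      datetime(2021, 3, 1, 19, 0), datetime(2021, 3, 1, 20, 0))

print(timing_conflicts(raid_now, [demo]))
print(timing_conflicts(raid_evening, [demo]))
```

A check like this only catches conflicts that are actually on the calendar, so it complements reading announcements such as the demo notice rather than replacing it.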

Acting as the Change Project Manager

Ops should treat each change as a project, synchronizing progress with stakeholders and conducting risk assessments.

Me: Do you know the server’s previous state?

A: No.

Me: If core services are still running, what could happen?

A: X said the machine was new.

Ops must question assumptions, similar to firefighters confirming no hidden explosives before entering a fire.
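
One way to make "question assumptions" concrete is to record, for each of the three key questions, whether the answer was checked first-hand or merely relayed. The sketch below is illustrative only: the `Answer` class, the question keys, and the `open_items` function are hypothetical names, and the sample answers paraphrase what A said.

```python
from __future__ import annotations

from dataclasses import dataclass

# Keys for the three questions every change should answer (hypothetical names).
QUESTIONS = ("background", "timing", "prior_state_and_impact")


@dataclass
class Answer:
    text: str
    verified: bool = False  # True only if the engineer checked it first-hand


def open_items(answers: dict[str, Answer]) -> list[str]:
    """List the questions that are unanswered or rest on unverified claims."""
    items = []
    for question in QUESTIONS:
        answer = answers.get(question)
        if answer is None or not answer.text.strip():
            items.append(f"{question}: unanswered")
        elif not answer.verified:
            items.append(f"{question}: unverified ({answer.text})")
    return items


# A's answers from the conversation, none of them checked first-hand.
answers_from_a = {
    "background": Answer("X told me to redo RAID"),
    "timing": Answer("right away"),
    "prior_state_and_impact": Answer("X said the machine was new"),
}

for item in open_items(answers_from_a):
    print(item)
```

Under this model the change would be held until the list is empty, for example after A confirms on the machine itself that no core services are running.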

Following a Structured Change Process

The recommended change workflow is:

Requirement → Stakeholder Confirmation → Solution Discussion → Solution & Timing Confirmation → Change Request Draft → Review → Approval → Announcement → Implementation → Feedback (optional rollback).

Following this process helps identify risks, standardize documentation, and reduce failure probability.
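
The workflow above can be enforced as an ordered sequence of stages, so that a change cannot jump straight from a request to implementation. This is a minimal sketch under that assumption: the `Stage` and `ChangeRequest` names are hypothetical, and the optional rollback is not modeled.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of the change workflow, in the order they must be completed."""
    REQUIREMENT = 1
    STAKEHOLDER_CONFIRMATION = 2
    SOLUTION_DISCUSSION = 3
    SOLUTION_AND_TIMING_CONFIRMATION = 4
    CHANGE_REQUEST_DRAFT = 5
    REVIEW = 6
    APPROVAL = 7
    ANNOUNCEMENT = 8
    IMPLEMENTATION = 9
    FEEDBACK = 10


class ChangeRequest:
    """Tracks one change and refuses to skip stages (hypothetical class)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.REQUIREMENT
        self.history = [Stage.REQUIREMENT]

    def advance(self, to: Stage) -> None:
        # Only the immediately following stage is a legal next step.
        if to != self.stage + 1:
            raise ValueError(
                f"cannot go from {self.stage.name} to {to.name}: "
                "stages must be completed in order"
            )
        self.stage = to
        self.history.append(to)


change = ChangeRequest("redo RAID")
change.advance(Stage.STAKEHOLDER_CONFIRMATION)
try:
    # Jumping straight to implementation, as A intended, is rejected.
    change.advance(Stage.IMPLEMENTATION)
except ValueError as err:
    print(err)
```

Keeping the full `history` on the request also gives reviewers a record of when each stage was reached, which supports the standardized documentation the process calls for.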

Final Takeaway

Even a tiny change can teach many lessons: ops engineers must be meticulous and treat every change as a project.

Respect the production environment.
