
Why Every Ops Change Should Be Treated Like a Project

This article shares practical lessons from a real‑world ops incident, emphasizing the need for clear change background, optimal timing, project‑style management, and strict process adherence to reduce risk and improve production reliability.

Efficient Ops

Operations occupies an awkward position in IT: fresh graduates rarely encounter it, other engineers may dismiss it as merely restarting services, and seasoned ops staff struggle to show their contribution in the company's KPIs, which leaves many of them wondering what their role is for.

Operations cannot be taught in school; it must be learned through practice. This piece is not about the meaning of ops per se, but about the mindset anyone working on production systems should have, illustrated by a recent small incident.

A Small Incident and the Reflections

A colleague posted a request to restart a server and redo RAID:

127.0.0.1 redo RAID, ignore alerts @colleagueB @colleagueC

At first glance the request seemed harmless, but given recent production‑related losses, I probed deeper and discovered hidden risks.

Ops Need to Understand the Change Background

Me: Do you know the background of this change?
A: X told me we need to redo RAID.
Me: Why redo RAID?
A: To deploy an FTP service using RAID-5; the current setup is non-RAID.

A could repeat the reason he had been given, but he had never questioned it, and acting without questioning the purpose is dangerous. Each change is like merging into traffic: without assessing risk and value, you increase the chance of a "collision." Always ask whether the change is truly necessary and whether the requester's demand should be challenged.

Car crashes are as fierce as tigers, and so are changes.

Ops Need to Choose the Right Timing for Changes

Me: When will you do it?
A: I want to do it right away.
Me: There is a product demo from 2 to 3 PM; if you cause an error, the demo fails. How would the client react?
A: I didn't think of that.

Since any change can cause a fault, you must confirm the best window for it. Principles:

a. Avoid peak business periods and critical windows.
b. Do not overlap with other changes in the same product line.
c. Do not overlap with changes in related product lines.

A's information channel was too narrow to surface the demo schedule, so his plan violated principle a. The way to mitigate this risk is to treat the change as a project and keep every stakeholder informed of its progress, so that they can assess the risk from their own side.
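The three timing principles can be checked mechanically once the schedule is written down somewhere. The following is a minimal sketch, not a description of any real scheduling tool: the `Window` type, the `find_conflicts` function, the product line names, and the calendar date are all hypothetical, and only the 2 to 3 PM demo comes from the incident above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    """A labelled time interval: a planned change or a business event."""
    label: str
    product_line: str
    start: datetime
    end: datetime

    def overlaps(self, other: "Window") -> bool:
        # Half-open intervals: windows that merely touch do not overlap.
        return self.start < other.end and other.start < self.end


def find_conflicts(change, critical_windows, other_changes, related_lines):
    """Return the reasons why the proposed change window is unsafe."""
    reasons = []
    # Principle a: avoid peak business periods and critical windows.
    for window in critical_windows:
        if change.overlaps(window):
            reasons.append(f"a: overlaps critical window '{window.label}'")
    for other in other_changes:
        if not change.overlaps(other):
            continue
        if other.product_line == change.product_line:
            # Principle b: another change in the same product line.
            reasons.append(
                f"b: overlaps change '{other.label}' in the same product line"
            )
        elif other.product_line in related_lines:
            # Principle c: a change in a related product line.
            reasons.append(
                f"c: overlaps change '{other.label}' "
                f"in related line '{other.product_line}'"
            )
    return reasons


# Hypothetical date and product lines; the demo time is from the incident.
day = datetime(2024, 1, 15)
raid = Window("redo RAID", "ftp",
              day.replace(hour=14, minute=30), day.replace(hour=15, minute=30))
demo = Window("product demo", "sales",
              day.replace(hour=14), day.replace(hour=15))
conflicts = find_conflicts(raid, [demo], [], set())
print(conflicts)  # ["a: overlaps critical window 'product demo'"]
```

A check like this is only as good as the calendar behind it; it would not have helped A unless the demo had been recorded where ops could see it, which is the point of involving stakeholders.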

Ops Need to Be the Project Manager of Changes

Me: Do you know the server's current state?
A: No, I didn't think about it.
Me: If critical services are still running, what could happen?
A: X said the machine is brand-new, no services.
Me: Ops must be the final line of defense, questioning assumptions and confirming facts.

Like firefighters checking for hidden explosives before entering a blaze, ops must verify upstream and downstream dependencies before proceeding. Every change starts the moment the requirement arrives and should be managed as a multi‑day project with clear stakeholders, risk controls, and step‑by‑step planning.
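One cheap way to confirm a claim such as "no services" instead of taking it on trust is to probe the machine before touching it. The sketch below uses only Python's standard `socket` module; the function name `open_tcp_ports` and the idea of passing in a list of ports to probe are assumptions made for illustration, not part of the original incident.

```python
import socket


def open_tcp_ports(host, ports, timeout=0.5):
    """Return the ports on host that currently accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an error.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

An empty result is only one signal. It says nothing about data already on the disks, scheduled jobs, or clients that depend on the host, so it complements the conversation with upstream and downstream owners rather than replacing it.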

Ops Need to Follow a Change Process

Me: Why do RAID first and then ask colleagues to ignore alerts?
A: It's fine, just a quick reminder.
Me: Why not follow the process and disable alerts before the change?
A: Not necessary, SA does many ops daily.
Me: Ignoring details leads to big mistakes. I once caused 600 nodes to crash because I didn't follow the process. If alerts flood in, a critical fault could go unnoticed for five minutes. Do you know what that would cost?
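Disabling alerts before the change, and restoring them afterwards, is easy to forget when done by hand and easy to guarantee in code. The sketch below shows the pattern with a context manager. `InMemoryAlerting` is a hypothetical stand-in for whatever silence API a monitoring system offers; no real product's interface is implied.

```python
from contextlib import contextmanager


class InMemoryAlerting:
    """Hypothetical stand-in for a monitoring system's silence API."""

    def __init__(self):
        self.silenced = set()

    def silence(self, host, reason):
        self.silenced.add(host)

    def unsilence(self, host):
        self.silenced.discard(host)


@contextmanager
def alerts_silenced(alerting, host, reason):
    """Silence alerts for one host for the duration of a change."""
    alerting.silence(host, reason)
    try:
        yield
    finally:
        # Always restore alerting, even if the change fails midway.
        alerting.unsilence(host)


alerting = InMemoryAlerting()
with alerts_silenced(alerting, "127.0.0.1", "redo RAID"):
    # The change itself would run here.
    during = "127.0.0.1" in alerting.silenced
after = "127.0.0.1" in alerting.silenced
```

Scoping the silence to one host and one block of work keeps the rest of the fleet's alerts meaningful, so colleagues do not have to decide which alerts to ignore.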

Typical change workflow: requirement confirmation → stakeholder identification → solution discussion → solution and timing approval → change ticket creation → ticket review → approval & reporting → change announcement → implementation → post-implementation feedback (→ rollback plan).

Following this workflow lets you think through risks, standardize low-risk procedures, automate repetitive steps, and dramatically reduce failure probability. Mature ops teams embed such processes in their platforms, yet many still bypass them due to perceived pressure or staffing shortages.
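Embedding the workflow in a platform mostly means refusing to let a stage be skipped. The sketch below is one minimal way to express that, assuming a hypothetical `ChangeTicket` class. The stage names are taken from the workflow above; the parenthesized rollback plan is left out of the linear list on the assumption that it is a conditional branch rather than a mandatory step.

```python
STAGES = [
    "requirement confirmation",
    "stakeholder identification",
    "solution discussion",
    "solution and timing approval",
    "change ticket creation",
    "ticket review",
    "approval & reporting",
    "change announcement",
    "implementation",
    "post-implementation feedback",
]


class ChangeTicket:
    """Tracks one change and enforces that stages complete in order."""

    def __init__(self, title):
        self.title = title
        self.completed = []

    @property
    def next_stage(self):
        """The next stage that must be completed, or None when done."""
        if len(self.completed) == len(STAGES):
            return None
        return STAGES[len(self.completed)]

    def complete(self, stage):
        expected = self.next_stage
        if stage != expected:
            raise ValueError(
                f"cannot complete {stage!r}; next required stage is {expected!r}"
            )
        self.completed.append(stage)


ticket = ChangeTicket("redo RAID on 127.0.0.1")
ticket.complete("requirement confirmation")
try:
    # Jumping straight to implementation skips the stages in between.
    ticket.complete("implementation")
except ValueError as err:
    print(err)
```

A strict linear list is the simplest possible model. A real platform would likely allow fast paths for standardized low-risk changes, which is where the standardization and automation mentioned above pay off.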

A True Saying

One lesson from my time at Alibaba has always stayed with me: respect the production environment.

Author: 大数据之心
Link: https://www.jianshu.com/p/16e952ca444a
Source: 简书
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
