
10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Open Source Linux

System failures are a perpetual pain point for operations engineers. High availability is a common KPI, and while its definition differs across companies, the methods for avoiding failures converge.

1. Ensure every change has a rollback and is tested in the same environment

Every change needs a rollback plan, and both the change and its rollback should be tested in an identical environment first. Untested changes are the most likely to cause unexpected failures, as years of operations experience at Alibaba show.
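
As a rough illustration of pairing a change with its rollback, the sketch below backs up a single file, applies a new version, runs a verification command, and restores the backup if verification fails. The function name, arguments, and messages are hypothetical, and real changes (schema migrations, package upgrades) need a rollback designed for that specific change.

```shell
#!/usr/bin/env bash
# Sketch only: apply_with_rollback and its arguments are illustrative names.
#
# apply_with_rollback TARGET NEW_VERSION VERIFY_CMD...
#   TARGET       file to change
#   NEW_VERSION  file holding the new content
#   VERIFY_CMD   command that must succeed after the change
apply_with_rollback() {
    local target=$1 new_version=$2
    shift 2
    local backup="${target}.rollback.$$"

    # "command" bypasses any interactive aliases for cp, mv, and rm.
    # Keep a copy of the current state so the change can be undone.
    command cp -p -- "$target" "$backup" || return 2

    # Apply the change, then run the verification command.
    if command cp -- "$new_version" "$target" && "$@"; then
        command rm -f -- "$backup"
        echo "change applied"
        return 0
    fi

    # Verification failed: restore the previous state.
    command mv -f -- "$backup" "$target"
    echo "change rolled back"
    return 1
}
```

A call might look like `apply_with_rollback app.conf app.conf.new grep -q new_setting app.conf`, where the file names and the grep check are placeholders for whatever verification fits the change.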

2. Treat destructive operations with extreme caution

Destructive commands such as DROP TABLE, DROP DATABASE, TRUNCATE TABLE, or DELETE FROM ... are hard to reverse. Even a simple rm -r can wipe data if misused. To mitigate this, alias dangerous commands to prompt for confirmation:

alias rm='rm -i --'

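The trailing -- ends option parsing, so anything typed after the alias, including -f (which would otherwise override -i), is treated as a file name rather than an option. When flags are really needed, bypass the alias with \rm or command rm. Similarly, add interactive flags to cp and mv: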

alias cp='cp -i --'
alias mv='mv -i --'
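
Another option in the same spirit, sketched here as an assumption rather than something the article prescribes, is to move files into a holding directory instead of deleting them, so a mistake can be undone. The function name safe_rm and the TRASH_DIR variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: safe_rm and TRASH_DIR are illustrative names.
# Moves each argument into a timestamped directory instead of deleting it.
safe_rm() {
    local trash="${TRASH_DIR:-$HOME/.trash}"
    local dest item
    dest="$trash/$(date +%Y%m%d-%H%M%S)-$$"
    command mkdir -p -- "$dest" || return 1

    for item in "$@"; do
        # Refuse anything that does not exist (dangling symlinks are allowed).
        if [ ! -e "$item" ] && [ ! -L "$item" ]; then
            echo "safe_rm: no such file: $item" >&2
            return 1
        fi
        command mv -- "$item" "$dest/" || return 1
        echo "moved $item -> $dest/"
    done
}
```

The holding directory still has to be purged on a schedule, and this does nothing for data destroyed inside a database, so it complements backups rather than replacing them.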

3. Set informative command prompts

Configure your MySQL client and shell prompts so you always know which user, host, database, and directory you are operating in. Example MySQL prompt, as set in the [mysql] section of my.cnf (backslashes are doubled in option files):

prompt="\\u@\\h : \\d \\r:\\m:\\s> "

Resulting prompt (user and host shown as placeholders):

user@host : woqutech 08:24:36>

For Bash, customize PS1 and PROMPT_COMMAND to display user, host, and current directory, and to set the terminal title:

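# Start each prompt on a fresh line and show user, host, and working directory
export PS1='\n[\u@\h \w]\$ '
# Before each prompt, set the terminal title to user@short-hostname
PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}\007"'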
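
To make production sessions stand out even more, the prompt can carry an environment label. The sketch below assumes a hypothetical naming convention in which production hostnames contain "prod" and staging hostnames contain "stg" or "stage"; adjust the patterns to your own scheme.

```shell
#!/usr/bin/env bash
# Sketch only: env_tag, build_ps1, and the hostname patterns are assumptions.

# Map a hostname to an environment label.
env_tag() {
    case "$1" in
        *prod*)        echo "PROD" ;;
        *stg*|*stage*) echo "STAGING" ;;
        *)             echo "DEV" ;;
    esac
}

# Build a PS1 string: a bold red tag on production hosts, a plain tag elsewhere.
build_ps1() {
    local tag
    tag=$(env_tag "$1")
    if [ "$tag" = "PROD" ]; then
        # \[ and \] mark non-printing colour codes so Bash measures the prompt correctly.
        printf '%s' '\[\e[1;31m\][PROD]\[\e[0m\] [\u@\h \w]\$ '
    else
        printf '%s' "[$tag] "'[\u@\h \w]\$ '
    fi
}

# Usage in ~/.bashrc:
#   PS1=$(build_ps1 "$HOSTNAME")
```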

4. Backup and verify backup integrity

Both hot (real-time) and cold (offline) backups are essential. Use tools such as mysqldump for logical backups, xtrabackup for physical backups, and pt-slave-delay for delayed replication. Always test restores on a separate instance and verify that backups can be applied, e.g., by running the prepare step for xtrabackup (--apply-log with the older innobackupex wrapper, --prepare in current releases). Consistency checks with pt-table-checksum, and repairs with pt-table-sync, are recommended.
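
A cheap first layer of verification is to confirm that a backup file is intact before attempting a restore. The following sketch assumes gzip-compressed dumps and a hypothetical convention of storing a SHA-256 hash in a sidecar file next to each backup; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: record_checksum, verify_backup, and the .sha256 sidecar
# convention are assumptions, not part of any backup tool.

# Store the hash of a finished backup next to it.
record_checksum() {
    local backup=$1
    sha256sum < "$backup" | awk '{print $1}' > "${backup}.sha256"
}

# Check that the file is non-empty, is a valid gzip stream, and still
# matches the recorded hash. Prints a one-word or two-word verdict.
verify_backup() {
    local backup=$1 expected actual
    [ -s "$backup" ] || { echo "empty or missing"; return 1; }
    gzip -t -- "$backup" 2>/dev/null || { echo "corrupt gzip"; return 1; }
    expected=$(cat "${backup}.sha256" 2>/dev/null) || { echo "no checksum"; return 1; }
    actual=$(sha256sum < "$backup" | awk '{print $1}')
    [ "$expected" = "$actual" ] || { echo "checksum mismatch"; return 1; }
    echo "ok"
}
```

A matching checksum only shows that the file has not changed since it was written. It does not prove the data can be restored, so the restore test on a separate instance remains necessary.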

5. Treat production environments with respect

Audit production accounts, enforce least‑privilege access, rotate and encrypt passwords, isolate production from external networks, avoid using development or test procedures in production, and assign dedicated personnel for releases.

6. Handovers and vacations are high‑risk periods

Document all routine tasks, clarify critical databases and accounts, and ensure thorough knowledge transfer before any personnel change. Verify that incoming operators double‑check every step and confirm details with the outgoing engineer.

7. Build alerting and performance monitoring

Set up monitoring to capture historical trends and trigger alerts for replication issues, I/O latency, and MySQL command statistics (e.g., Com_delete, Com_insert, Com_update, Com_select). Use tools like Oracle AWR, the MySQL Performance Schema, and flash-card metrics (logical_written_bytes, physical_read_bytes, etc.).
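
Counters such as Com_delete are cumulative, so an alert usually compares the rate between two samples rather than the raw value. The sketch below shows that arithmetic; the function names and the threshold are assumptions, and how the samples are collected (for example with SHOW GLOBAL STATUS) is left out.

```shell
#!/usr/bin/env bash
# Sketch only: per_second and exceeds are illustrative helper names.

# per_second PREVIOUS CURRENT INTERVAL_SECONDS
# Prints the per-second rate, or "n/a" if the counter went backwards
# (for example after a server restart) or the interval is not positive.
per_second() {
    awk -v prev="$1" -v cur="$2" -v secs="$3" 'BEGIN {
        if (secs <= 0 || cur < prev) { print "n/a"; exit 1 }
        printf "%.1f\n", (cur - prev) / secs
    }'
}

# exceeds RATE LIMIT
# Exit status 0 when RATE is strictly above LIMIT, non-zero otherwise.
exceeds() {
    awk -v r="$1" -v l="$2" 'BEGIN { exit !(r + 0 > l + 0) }'
}

# Example with made-up sample values:
#   rate=$(per_second 1000 1600 60) && exceeds "$rate" 5 && echo "ALERT: Com_delete at ${rate}/s"
```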

8. Use automatic failover cautiously

Automatic HA solutions (e.g., Heartbeat) can reduce downtime but must be evaluated for data lag, read‑only status, and potential loss of in‑flight transactions. Ensure the standby is fully synchronized before relying on automatic switchover.
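
One way to express that caution is a gate that an automated switchover must pass before it promotes the standby. This is a sketch under stated assumptions: the caller has already obtained the replication lag in seconds and the standby's read_only setting, and the function name, arguments, and messages are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: failover_allowed and its messages are illustrative.
#
# failover_allowed LAG_SECONDS MAX_LAG_SECONDS STANDBY_READ_ONLY
#   LAG_SECONDS        replication delay reported by the standby
#   MAX_LAG_SECONDS    largest delay the business can tolerate
#   STANDBY_READ_ONLY  ON or OFF
failover_allowed() {
    local lag=$1 max_lag=$2 read_only=$3

    # A non-numeric lag (for example NULL when replication is stopped)
    # means the standby's state is unknown, so refuse.
    case "$lag" in
        ''|*[!0-9]*) echo "refuse: replication lag unknown"; return 1 ;;
    esac

    if [ "$lag" -gt "$max_lag" ]; then
        echo "refuse: standby is ${lag}s behind"
        return 1
    fi

    # Assumption: a standby that is already writable may have diverged.
    if [ "$read_only" != "ON" ]; then
        echo "refuse: standby is already writable"
        return 1
    fi

    echo "allow"
}
```

A refusal here should page a human rather than retry silently, since each refusal reason points to a condition in which automatic promotion could lose or corrupt data.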

9. Be meticulous: check, re‑check, and obsess over details

Adopt a disciplined change process: announce changes early, review scripts with peers, test in staging, copy to production, verify file paths, log out and back in to confirm the correct host, and finally execute the script in a controlled session (e.g., using screen to survive network interruptions).
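
Some of these checks can be scripted so they are never skipped. The sketch below is a hypothetical preflight function that confirms the current host and the script path before anything runs; the function name and messages are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: preflight and its messages are illustrative.
#
# preflight EXPECTED_HOST SCRIPT_PATH
# Succeeds only when the short hostname matches and the script is readable.
preflight() {
    local expected_host=$1 script=$2 actual_host
    actual_host=$(uname -n)

    # Compare short names so "db1" matches "db1.example.com".
    if [ "${actual_host%%.*}" != "${expected_host%%.*}" ]; then
        echo "abort: on ${actual_host}, expected ${expected_host}"
        return 1
    fi

    if [ ! -r "$script" ]; then
        echo "abort: cannot read ${script}"
        return 1
    fi

    echo "ok: ${script} on ${actual_host}"
}

# Example, run inside a screen session started with: screen -S change
#   preflight db-host ./change.sh && bash ./change.sh
```

The host name and script path in the example are placeholders. The check does not replace peer review; it only catches the specific mistake of running the right script in the wrong place.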

10. Simplicity is the ultimate sophistication

Prefer built-in commands and lightweight scripts over heavyweight third-party tools. Stick to the Unix philosophy: use the simplest, most reliable solution that meets the requirement.

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
