Operations 17 min read

10 Proven Practices to Prevent System Failures in Operations

This article shares ten practical operations strategies—ranging from change‑rollback procedures and cautious handling of destructive commands to robust backup verification, alerting, and meticulous hand‑over practices—that together help teams dramatically reduce system outages and maintain high availability.

Efficient Ops

Aug 21, 2024

10 Proven Practices to Prevent System Failures in Operations

1. Ensure every change has a tested rollback

All operational changes must include a rollback plan that has been tested in an identical environment; untested changes are the most common source of failures.

2. Treat destructive commands with extreme caution

Commands such as DROP TABLE, DROP DATABASE, TRUNCATE, or recursive rm -r can cause irreversible data loss. Adding interactive prompts mitigates accidental deletions:

alias rm='rm -i --'

alias cp='cp -i --'

alias mv='mv -i --'

3. Configure informative command prompts

Set MySQL and shell prompts to display user, host, database, and time, so you always know the context of your actions.

prompt="\\u@\\h : \\d \\r:\\m:\\s> "

export PS1='
\e[1;37m[\e[m\e[1;31m\u\e[m\e[1;31@\e[m\e[1;31\h\e[m \e[4m`pwd`\e[m\e[1;37m]\e[m\e[1;36m\e[m
\$'

PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}"; echo -ne "\007"'

4. Backup and verify backup integrity

Implement both hot (real‑time) and cold backups, using tools such as mysqldump, xtrabackup, or delayed replication. Always test restores on a separate instance and verify consistency with tools like pt-table-checksum and pt-table-sync.

5. Treat the production environment with respect

Audit production accounts, enforce least‑privilege for root and other users, encrypt passwords, isolate production from external networks, and avoid performing development tasks on live systems.

6. Hand‑overs and vacations are high‑risk periods

Document all routine procedures, verify critical steps with the original operator, and ensure thorough knowledge transfer before any personnel change.

7. Build alerting and performance monitoring

Configure alerts for replication issues, I/O latency, and key MySQL metrics (e.g., Com_delete, Com_insert, Com_update, Com_select) so problems are detected early.

8. Use automatic failover cautiously

Automatic HA tools (e.g., Heartbeat) can reduce downtime, but verify that the standby is fully synchronized and writable before relying on it.

9. Be obsessive about verification

Notify stakeholders well in advance, review scripts collectively, double‑check file paths, and run changes in a controlled background session (e.g., screen -S mysession) to avoid accidental interruptions.

10. Simplicity wins

Prefer built‑in Unix commands and native MySQL features over complex third‑party tools; a simple, well‑understood solution is less likely to introduce new failure modes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Linux mysql

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.