10 Proven Practices to Prevent System Failures in Operations
This article shares ten practical operations strategies—ranging from change‑rollback procedures and cautious handling of destructive commands to robust backup verification, alerting, and meticulous hand‑over practices—that together help teams dramatically reduce system outages and maintain high availability.
1. Ensure every change has a tested rollback
All operational changes must include a rollback plan that has been tested in an identical environment; untested changes are the most common source of failures.
2. Treat destructive commands with extreme caution
Commands such as
DROP TABLE,
DROP DATABASE,
TRUNCATE, or recursive
rm -rcan cause irreversible data loss. Adding interactive prompts mitigates accidental deletions:
<code>alias rm='rm -i --'</code> <code>alias cp='cp -i --'</code> <code>alias mv='mv -i --'</code>3. Configure informative command prompts
Set MySQL and shell prompts to display user, host, database, and time, so you always know the context of your actions.
<code>prompt="\\u@\\h : \\d \\r:\\m:\\s> "</code> <code>export PS1='\n\e[1;37m[\e[m\e[1;31m\u\e[m\e[1;31@\e[m\e[1;31\h\e[m \e[4m`pwd`\e[m\e[1;37m]\e[m\e[1;36m\e[m\n\$'</code> <code>PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}"; echo -ne "\007"'</code>4. Backup and verify backup integrity
Implement both hot (real‑time) and cold backups, using tools such as
mysqldump,
xtrabackup, or delayed replication. Always test restores on a separate instance and verify consistency with tools like
pt-table-checksumand
pt-table-sync.
5. Treat the production environment with respect
Audit production accounts, enforce least‑privilege for root and other users, encrypt passwords, isolate production from external networks, and avoid performing development tasks on live systems.
6. Hand‑overs and vacations are high‑risk periods
Document all routine procedures, verify critical steps with the original operator, and ensure thorough knowledge transfer before any personnel change.
7. Build alerting and performance monitoring
Configure alerts for replication issues, I/O latency, and key MySQL metrics (e.g.,
Com_delete,
Com_insert,
Com_update,
Com_select) so problems are detected early.
8. Use automatic failover cautiously
Automatic HA tools (e.g., Heartbeat) can reduce downtime, but verify that the standby is fully synchronized and writable before relying on it.
9. Be obsessive about verification
Notify stakeholders well in advance, review scripts collectively, double‑check file paths, and run changes in a controlled background session (e.g.,
screen -S mysession) to avoid accidental interruptions.
10. Simplicity wins
Prefer built‑in Unix commands and native MySQL features over complex third‑party tools; a simple, well‑understood solution is less likely to introduce new failure modes.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.