10 Essential Ops Practices to Prevent System Failures

This article compiles ten practical operations‑engineer guidelines—ranging from change rollbacks and safe command aliases to backup verification, monitoring, and cautious automated failover—to help maintain high availability and avoid costly production incidents.

dbaplus Community

Rule 1: Always test changes with a rollback plan

Perform every change first in an environment identical to production, and verify there that it can be rolled back. Untested changes are the most likely source of unexpected failures, so experienced operators treat any operation without a rollback option as a high‑risk action.
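For a configuration change, one way to make the rollback path concrete is to snapshot the file first and restore it automatically when a post-change check fails. The following is only a minimal sketch: the function name apply_with_rollback and its argument order are hypothetical and do not come from the original article.

```bash
# Hypothetical helper: replace a file, run a check, roll back if the check fails.
# Usage: apply_with_rollback <target_file> <new_file> <check command...>
apply_with_rollback() {
    local target="$1" new="$2" backup
    shift 2
    backup="${target}.bak.$$"

    cp -p "$target" "$backup" || return 1     # snapshot the current version
    if ! cp "$new" "$target"; then            # apply the change
        mv "$backup" "$target"
        return 1
    fi

    if "$@"; then                             # post-change verification
        rm -f "$backup"
        return 0
    fi

    echo "check failed, restoring $target" >&2
    mv "$backup" "$target"                    # roll back
    return 2
}
```

A call might look like apply_with_rollback /etc/my.cnf ./my.cnf.new ./check_config.sh, where check_config.sh stands for whatever verification script fits the change (an assumed name).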

Rule 2: Handle destructive operations with extreme care

Commands such as DROP TABLE, DROP DATABASE, TRUNCATE TABLE, or rm -rf can permanently erase data. To mitigate accidental loss, alias dangerous commands to prompt for confirmation:

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

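This forces an interactive prompt before each deletion or overwrite. The aliases are a safety net rather than a guarantee: Bash expands aliases only in interactive shells by default, and a later -f (as in rm -rf) overrides -i.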
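A stricter guard is to make the operator retype the name of the target before a destructive command runs. The sketch below illustrates that idea; confirm_destructive is a hypothetical helper, not something the original article defines.

```bash
# Hypothetical guard: require the operator to retype the target name
# before a destructive command is allowed to run.
# Usage: confirm_destructive <target> <command...>
confirm_destructive() {
    local target="$1" answer
    shift
    printf 'About to run: %s\nType "%s" to confirm: ' "$*" "$target" >&2
    read -r answer || return 1
    if [ "$answer" != "$target" ]; then
        echo "confirmation mismatch, aborting" >&2
        return 1
    fi
    "$@"                                      # run the command only after a match
}
```

For example, confirm_destructive old_logs rm -r ./old_logs runs rm only after the operator types old_logs (the directory name is made up for illustration).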

Rule 3: Configure informative command prompts

Set MySQL and shell prompts so you always know the current user, host, database, and time. Example MySQL prompt, placed in the [mysql] section of my.cnf:

prompt="\u@\h : \d \r:\m:\s> "

The prompt then appears in the form:

user@host : woqutech 08:24:36>

For Bash, a typical PS1 setting is:

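# \[ and \] mark non-printing color codes so Bash computes the prompt width correctly
export PS1='\n\[\e[1;37m\][\[\e[m\]\[\e[1;31m\]\u@\h\[\e[m\] \[\e[4m\]$PWD\[\e[m\]\[\e[1;37m\]]\[\e[m\]\[\e[1;36m\]\n\$'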

Additionally, PROMPT_COMMAND can set the terminal title for each tab, preventing accidental operations on the wrong session.
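As a rough illustration of the PROMPT_COMMAND idea, the sketch below writes user, host, and working directory into the tab title before each prompt. The function name set_title is hypothetical, and the escape sequence is the xterm-style title sequence, which not every terminal honors.

```bash
# Set the terminal/tab title to user@host:cwd before each prompt is drawn.
# \033]0;...\007 is the xterm title escape sequence; support depends on the terminal.
set_title() {
    printf '\033]0;%s@%s:%s\007' "${USER:-$(id -un)}" "${HOSTNAME:-$(uname -n)}" "$PWD"
}

# Bash runs the value of PROMPT_COMMAND just before printing PS1.
PROMPT_COMMAND=set_title
```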

Rule 4: Backup and verify backup integrity

Implement both hot (real‑time) and cold backups. Use tools such as mysqldump for logical backups, xtrabackup for physical backups, and pt‑slave‑delay for delayed replication. After each backup, restore it to a test instance to confirm that data can be recovered, and regularly run pt‑table‑checksum and pt‑table‑sync to ensure master‑slave consistency.
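Before spending time on a restore test, a cheap first check is whether the dump finished at all. A mysqldump file that completed normally ends with a "-- Dump completed" comment, so a truncated file can be caught early. The sketch below assumes comments were not disabled (for example with --skip-comments or --compact); dump_is_complete is a hypothetical name.

```bash
# Quick sanity check for a mysqldump file: succeed only if the file is
# non-empty and its last line is the "-- Dump completed" footer.
# Assumes the dump was taken without --skip-comments or --compact.
dump_is_complete() {
    [ -s "$1" ] || return 1                   # missing or empty file
    tail -n 1 "$1" | grep -q '^-- Dump completed'
}
```

This only detects truncated dumps. It does not replace restoring the backup to a test instance as described above.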

Rule 5: Treat production environments with reverence

Audit production accounts, restrict root access, enforce strong password policies, isolate production from external networks, and avoid running development or testing tasks on live servers. Document all privileged accounts and regularly review their necessity.

Rule 6: Hand‑offs and vacations are high‑risk periods

During personnel changes, prepare detailed hand‑over documentation, review all scripts and procedures together, and confirm critical steps with the outgoing engineer. Unexpected failures increase by over 50% when knowledge transfer is incomplete.

Rule 7: Build alerting and performance monitoring

Deploy monitoring tools (e.g., Nagios, Cacti, Percona‑monitor‑plugins) to collect metrics such as I/O latency, MySQL command counters, and flash storage throughput. Configure alerts for replication failures, I/O stalls, or abnormal metric thresholds so issues are caught before they impact users.
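A replication alert can be sketched as a small check that reads the output of SHOW SLAVE STATUS\G and maps it to the exit codes Nagios-style plugins use (0 OK, 1 WARNING, 2 CRITICAL). The function name check_replication and the 60-second default threshold are assumptions for illustration, not values from the original article.

```bash
# Read `SHOW SLAVE STATUS\G` output on stdin and report replication health.
# $1 is the maximum tolerated lag in seconds (hypothetical default: 60).
check_replication() {
    local max_lag="${1:-60}"
    awk -v max="$max_lag" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            if (io != "Yes" || sql != "Yes") {
                print "CRITICAL: replication threads not running"; exit 2
            }
            if (lag == "" || lag == "NULL") {
                print "WARNING: lag unknown"; exit 1
            }
            if (lag + 0 > max + 0) {
                print "WARNING: lag " lag "s exceeds " max "s"; exit 1
            }
            print "OK: lag " lag "s"
        }'
}
```

It could be fed with mysql -e 'SHOW SLAVE STATUS\G' | check_replication 30 from whatever scheduler or monitoring agent is already in place.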

Rule 8: Use automated failover cautiously

Automated HA solutions (e.g., Heartbeat) can switch VIPs quickly, but they may ignore data lag, incomplete binlog sync, or read‑only status on the standby. Verify that the standby is fully synchronized and that business‑critical transactions have been replicated before trusting an automatic switchover.
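One check that can run before a switchover is whether the standby has applied everything it has received, by comparing the read and executed binlog coordinates in SHOW SLAVE STATUS\G. The sketch below uses the hypothetical name standby_caught_up. It cannot tell whether the master held events that never reached the standby, so it is a necessary check rather than a sufficient one.

```bash
# Read `SHOW SLAVE STATUS\G` output on stdin; succeed only if the SQL thread
# has executed up to the position the I/O thread has read from the master.
standby_caught_up() {
    awk '
        $1 == "Master_Log_File:"       { read_file = $2 }
        $1 == "Relay_Master_Log_File:" { exec_file = $2 }
        $1 == "Read_Master_Log_Pos:"   { read_pos  = $2 }
        $1 == "Exec_Master_Log_Pos:"   { exec_pos  = $2 }
        END {
            if (read_file == "" || read_pos == "") exit 1   # no replication info
            exit !(read_file == exec_file && read_pos == exec_pos)
        }'
}
```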

Rule 9: Be obsessive about checking

Adopt a “check, double‑check, triple‑check” mindset: pre‑announce changes, review scripts with peers, copy scripts to production, verify paths, log out and back in to confirm the session, and finally execute the operation while monitoring output in a separate window.
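Part of this checking can be built into the change script itself, for instance by refusing to run on any host other than the intended one. The guard below is a minimal sketch; require_host is a hypothetical helper name.

```bash
# Hypothetical guard for the top of a change script: abort unless the script
# is running on the host the operator intended.
require_host() {
    local expected="$1" actual
    actual=$(uname -n)                        # node name of the current machine
    if [ "$actual" != "$expected" ]; then
        echo "refusing to run: on '$actual', expected '$expected'" >&2
        return 1
    fi
}
```

A script would start with a line such as require_host db-standby-01 || exit 1, where db-standby-01 is a made-up host name.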

Rule 10: Simplicity is the ultimate sophistication

Prefer built‑in Unix commands and lightweight scripts over heavyweight third‑party tools. Avoid unnecessary complexity in MySQL configuration, and only introduce new hardware or HA software after thorough, long‑duration testing.

Additional useful commands: screen -S woqutech to start a detachable session and screen -dr woqutech to re‑attach after a network interruption.
