10 Essential Ops Practices to Prevent System Failures
This article compiles ten practical operations‑engineer guidelines—ranging from change rollbacks and safe command aliases to backup verification, monitoring, and cautious automated failover—to help maintain high availability and avoid costly production incidents.
Rule 1: Always test changes with a rollback plan
Every change should first be performed in an environment identical to production, and you should verify there that it can be rolled back. Untested changes are the most likely source of unexpected failures, so experienced operators treat any operation without a rollback option as high-risk.
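One way to make the rollback part of the change itself is sketched below for a single configuration file. The function name apply_with_rollback, its argument order, and the backup file naming are assumptions made for illustration; the caller supplies whatever verification command fits the service.

```shell
#!/bin/sh
# Hypothetical helper: apply a new version of a file, verify it, roll back on failure.
# Usage: apply_with_rollback TARGET NEW_FILE VERIFY_COMMAND [ARGS...]
apply_with_rollback() {
    target=$1
    candidate=$2
    shift 2
    backup="${target}.rollback.$$"

    # Keep a copy of the current state so the change can be undone.
    cp -p "$target" "$backup" || return 2

    # Apply the change; if even the copy fails, restore immediately.
    cp "$candidate" "$target" || { mv "$backup" "$target"; return 2; }

    # Run the caller's verification command against the new state.
    if "$@"; then
        rm -f "$backup"
        echo "change applied: $target"
        return 0
    fi

    # Verification failed: put the previous version back.
    mv "$backup" "$target"
    echo "verification failed, rolled back: $target" >&2
    return 1
}
```

A call might look like apply_with_rollback /etc/app.conf ./app.conf.new ./healthcheck.sh, where both the paths and the health-check script are placeholders.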
Rule 2: Handle destructive operations with extreme care
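Commands such as DROP TABLE, DROP DATABASE, TRUNCATE TABLE, or rm -rf can permanently erase data. To reduce the risk of accidental loss, alias the dangerous shell commands so that they prompt for confirmation:
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
This forces an interactive check before a file is deleted or overwritten. Note that a -f given later on the command line overrides -i, so the alias does not guard rm -rf, and it offers no protection for SQL statements.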
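Aliases only take effect in interactive shells, and a confirmation prompt is easy to answer with a reflexive "y". A complementary approach, sketched here under assumptions, is to move files aside instead of deleting them. The function name trash, the TRASH_DIR location, and the list of refused paths are all hypothetical choices.

```shell
#!/bin/sh
# Hypothetical wrapper: move targets into a timestamped trash directory instead of deleting.
# TRASH_DIR is an assumed location; keeping it on the same filesystem makes mv a cheap rename.
TRASH_DIR="${TRASH_DIR:-$HOME/.trash}"

trash() {
    [ "$#" -gt 0 ] || { echo "usage: trash FILE..." >&2; return 2; }

    # Refuse obviously catastrophic targets before touching anything.
    for item in "$@"; do
        case "$item" in
            /|/etc|/usr|/var|/home|.|..)
                echo "refusing to trash '$item'" >&2
                return 1
                ;;
        esac
    done

    # One directory per invocation, so a mistake can be undone as a unit.
    dest="$TRASH_DIR/$(date +%Y%m%d-%H%M%S)-$$"
    mkdir -p "$dest" || return 2
    mv -- "$@" "$dest"/ || return 1
    echo "moved $# item(s) to $dest"
}
```

Files moved this way still occupy disk space, so a real setup would also need a scheduled cleanup of old trash directories.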
Rule 3: Configure informative command prompts
Set MySQL and shell prompts so you always know the current user, host, database, and time. Example MySQL prompt:
prompt="\u@\h : \d \r:\m:\s> "
When placed in the [mysql] section of my.cnf, the prompt takes the form:
user@host : woqutech 08:24:36>
For Bash, a typical PS1 setting is:
export PS1='\n\e[1;37m[\e[m\e[1;31m\u\e[m\e[1;31m@\e[m\e[1;31m\h\e[m \e[4m$(pwd)\e[m\e[1;37m]\e[m\e[1;36m\n\$'
Additionally, PROMPT_COMMAND can set the terminal title for each tab, which helps prevent running commands in the wrong session.
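A minimal sketch of the PROMPT_COMMAND idea follows. It assumes an xterm-compatible terminal that honors the OSC 0 title sequence; the function name set_terminal_title is made up for this example.

```shell
#!/bin/sh
# Assumed helper name. Emits the xterm "set window title" escape sequence (OSC 0)
# with user, host and working directory, without printing anything visible.
set_terminal_title() {
    printf '\033]0;%s@%s:%s\007' "${USER:-$(id -un)}" "$(uname -n)" "$PWD"
}

# Bash runs the command named in PROMPT_COMMAND before drawing each prompt.
PROMPT_COMMAND=set_terminal_title
```

Placed in ~/.bashrc on every server, this would keep each tab labeled with the machine it is actually connected to.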
Rule 4: Backup and verify backup integrity
Implement both hot (real‑time) and cold backups. Use tools such as mysqldump for logical backups, xtrabackup for physical backups, and pt‑slave‑delay for delayed replication. After each backup, restore it to a test instance to confirm that data can be recovered, and regularly run pt‑table‑checksum and pt‑table‑sync to ensure master‑slave consistency.
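The dump-then-restore check could be scripted roughly as follows. This is a sketch under assumptions: the host names, the scratch schema restore_check, and the function name are hypothetical, and credentials are assumed to come from an option file such as ~/.my.cnf.

```shell
#!/bin/sh
# Sketch only: SOURCE_HOST, TEST_HOST and the restore_check schema are hypothetical.
# Usage: backup_and_verify DATABASE DUMP_FILE
backup_and_verify() {
    db=$1
    dump_file=$2
    source_host=${SOURCE_HOST:-db-prod.example}
    test_host=${TEST_HOST:-db-restore-test.example}

    # Logical backup taken inside one transaction for a consistent InnoDB snapshot.
    mysqldump --host="$source_host" --single-transaction --routines "$db" > "$dump_file" || return 1

    # mysqldump ends a complete dump with a "-- Dump completed" comment
    # (when comments are enabled, the default); a truncated file lacks it.
    tail -n 1 "$dump_file" | grep -q '^-- Dump completed' || {
        echo "dump looks truncated: $dump_file" >&2
        return 1
    }

    # Restore into a scratch schema on the test instance to prove the dump is loadable.
    mysql --host="$test_host" -e "CREATE DATABASE IF NOT EXISTS restore_check" || return 1
    mysql --host="$test_host" restore_check < "$dump_file" || return 1

    echo "backup verified: $dump_file"
}
```

A successful load only shows that the dump is syntactically restorable. Comparing row counts or checksums against the source would be a sensible next step.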
Rule 5: Treat production environments with reverence
Audit production accounts, restrict root access, enforce strong password policies, isolate production from external networks, and avoid running development or testing tasks on live servers. Document all privileged accounts and regularly review their necessity.
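As one small piece of such an audit, the sketch below lists operating-system accounts that carry UID 0. The function names and the rule that only "root" is expected are assumptions for illustration; database accounts would need a separate review.

```shell
#!/bin/sh
# List every account with UID 0. The passwd path is a parameter so the check
# can be pointed at a copy of the file; it defaults to /etc/passwd.
list_uid0_accounts() {
    awk -F: '$3 == 0 { print $1 }' "${1:-/etc/passwd}"
}

# Hypothetical audit rule: fail when any UID 0 account other than root exists.
audit_uid0() {
    extra=$(list_uid0_accounts "$1" | grep -v '^root$' || true)
    if [ -n "$extra" ]; then
        # Unquoted on purpose: joins multiple names onto one line.
        echo "unexpected UID 0 accounts:" $extra >&2
        return 1
    fi
    echo "only root has UID 0"
}
```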
Rule 6: Hand‑offs and vacations are high‑risk periods
During personnel changes, ensure detailed hand‑over documentation, verify that all scripts and procedures are reviewed, and confirm critical steps with the outgoing engineer. Unexpected failures increase by over 50 % when knowledge transfer is incomplete.
Rule 7: Build alerting and performance monitoring
Deploy monitoring tools (e.g., Nagios, Cacti, Percona‑monitor‑plugins) to collect metrics such as I/O latency, MySQL command counters, and flash storage throughput. Configure alerts for replication failures, I/O stalls, or abnormal metric thresholds so issues are caught before they impact users.
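A replication alert can be approximated with a small check that follows the Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the default lag thresholds below are illustrative assumptions, not recommendations; the input is the text of SHOW SLAVE STATUS\G.

```shell
#!/bin/sh
# Reads `SHOW SLAVE STATUS\G` text on stdin and prints a Nagios-style result.
# Usage: check_replication [WARN_SECONDS] [CRIT_SECONDS]
check_replication() {
    warn=${1:-30}
    crit=${2:-300}
    report=$(cat)

    # The colon directly after the name avoids matching Slave_SQL_Running_State.
    io=$(printf '%s\n' "$report" | awk -F': ' '/Slave_IO_Running:/ { print $2 }')
    sql=$(printf '%s\n' "$report" | awk -F': ' '/Slave_SQL_Running:/ { print $2 }')
    lag=$(printf '%s\n' "$report" | awk -F': ' '/Seconds_Behind_Master:/ { print $2 }')

    if [ "$io" != "Yes" ] || [ "$sql" != "Yes" ]; then
        echo "CRITICAL: replication threads IO=$io SQL=$sql"
        return 2
    fi

    # Lag is NULL or missing when it cannot be measured.
    case "$lag" in
        ''|*[!0-9]*) echo "UNKNOWN: lag is '$lag'"; return 3 ;;
    esac

    if [ "$lag" -ge "$crit" ]; then echo "CRITICAL: lag ${lag}s"; return 2; fi
    if [ "$lag" -ge "$warn" ]; then echo "WARNING: lag ${lag}s"; return 1; fi
    echo "OK: lag ${lag}s"
    return 0
}
```

It would typically be fed by something like mysql -e 'SHOW SLAVE STATUS\G' | check_replication 30 300 on the replica.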
Rule 8: Use automated failover cautiously
Automated HA solutions (e.g., Heartbeat) can switch VIPs quickly, but they may ignore data lag, incomplete binlog sync, or read‑only status on the standby. Verify that the standby is fully synchronized and that business‑critical transactions have been replicated before trusting an automatic switchover.
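Part of that verification can be expressed as a pre-switchover gate. The sketch below, with hypothetical function names, compares the binlog coordinates the standby has read with those it has executed, using fields from SHOW SLAVE STATUS\G. It can only show that the standby applied everything it received, not that it received every transaction the old primary committed.

```shell
#!/bin/sh
# Print the value of one field from `SHOW SLAVE STATUS\G` text.
# Usage: slave_field REPORT_TEXT FIELD_NAME  (exact name match, so
# Master_Log_File does not also match Relay_Master_Log_File)
slave_field() {
    printf '%s\n' "$1" | awk -F': ' -v key="$2" '
        { name = $1; sub(/^ +/, "", name) }
        name == key { print $2 }'
}

# Reads the status text on stdin; returns 0 only when the standby has
# executed every event it has read from the primary.
standby_caught_up() {
    report=$(cat)
    read_file=$(slave_field "$report" Master_Log_File)
    exec_file=$(slave_field "$report" Relay_Master_Log_File)
    read_pos=$(slave_field "$report" Read_Master_Log_Pos)
    exec_pos=$(slave_field "$report" Exec_Master_Log_Pos)
    lag=$(slave_field "$report" Seconds_Behind_Master)

    if [ -z "$read_file" ] || [ -z "$read_pos" ]; then
        echo "no replication coordinates found" >&2
        return 1
    fi
    if [ "$lag" != "0" ]; then
        echo "standby is lagging or lag is unknown (lag='$lag')" >&2
        return 1
    fi
    if [ "$read_file" != "$exec_file" ] || [ "$read_pos" != "$exec_pos" ]; then
        echo "standby has unapplied events: read $read_file:$read_pos, executed $exec_file:$exec_pos" >&2
        return 1
    fi
    echo "standby has applied all received binlog events"
}
```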
Rule 9: Be obsessive about checking
Adopt a “check, double‑check, triple‑check” mindset: pre‑announce changes, review scripts with peers, copy scripts to production, verify paths, log out and back in to confirm the session, and finally execute the operation while monitoring output in a separate window.
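The "confirm the session" step lends itself to a mechanical guard. The wrapper below is a sketch with a made-up name, run_on: it refuses to execute a command unless the machine's node name matches the one the operator typed.

```shell
#!/bin/sh
# Hypothetical guard: run a command only if this session is on the expected host.
# Usage: run_on EXPECTED_HOST COMMAND [ARGS...]
run_on() {
    expected=$1
    shift
    actual=$(uname -n)

    if [ "$actual" != "$expected" ]; then
        echo "refusing to run: this is '$actual', expected '$expected'" >&2
        return 1
    fi

    # Leave a trace on stderr of what ran, where, and when.
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $actual: $*" >&2
    "$@"
}
```

Typing the expected host name explicitly, for example run_on db-staging-01 ./migrate.sh with placeholder names, makes the operator state their assumption before anything executes.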
Rule 10: Simplicity is the ultimate sophistication
Prefer built‑in Unix commands and lightweight scripts over heavyweight third‑party tools. Avoid unnecessary complexity in MySQL configuration, and only introduce new hardware or HA software after thorough, long‑duration testing.
Additional useful commands: screen -S woqutech to start a detachable session and screen -dr woqutech to re‑attach after a network interruption.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.