10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies to improve system reliability and availability: change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity.

Open Source Linux

Aug 23, 2024

System failures are a perpetual pain point for operations engineers. High availability is a common KPI, and although its definition differs across companies, the methods for avoiding failures converge.

1. Ensure every change has a rollback and is tested in the same environment

All changes must have a rollback plan that has been tested in an identical environment. Untested changes are the most likely to cause unexpected failures, as years of operations experience at Alibaba show.
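As a small illustration of pairing a change with its rollback, the sketch below replaces a configuration file, runs a validation command, and restores the previous copy if validation fails. The function name apply_with_rollback and the .rollback suffix are hypothetical choices for this example, not part of any standard tool.

```shell
# Hypothetical helper: apply a candidate config and roll back if validation fails.
# Arguments: live file, candidate file, then a validation command.
# The validation command is invoked with the live file path appended.
apply_with_rollback() {
    local live="$1" candidate="$2"
    shift 2
    # Keep a copy of the current file so the change can be undone.
    cp -p -- "$live" "$live.rollback" || return 1
    cp -- "$candidate" "$live" || return 1
    if ! "$@" "$live"; then
        echo "validation failed, rolling back $live" >&2
        cp -p -- "$live.rollback" "$live"
        return 1
    fi
}
```

A call might look like apply_with_rollback /etc/my.cnf ./my.cnf.new grep -q '^port=', where the grep stands in for whatever real validation the change requires. The same pattern should be rehearsed in a test environment before it is trusted in production.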

2. Treat destructive operations with extreme caution

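Destructive commands such as DROP TABLE, DROP DATABASE, TRUNCATE TABLE, or DELETE FROM ... are hard to reverse. Even a simple rm -r can wipe data if misused. To mitigate this, alias dangerous commands so that they prompt for confirmation:

alias rm='rm -i --'

Note that the trailing -- ends option parsing, so flags typed after the alias (such as -r) are treated as file names; a recursive delete then requires bypassing the alias deliberately, e.g. with \rm. Similarly, add interactive flags to cp and mv: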

alias cp='cp -i --'
alias mv='mv -i --'
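Aliases only take effect in interactive shells, and confirmation prompts are easy to dismiss out of habit. One complementary approach is to move files into a dated trash directory instead of deleting them. The following is a minimal sketch; the function name safe_rm and the TRASH_DIR variable are hypothetical and exist only for this example.

```shell
# Hypothetical helper: move files into a dated trash directory instead of deleting.
# TRASH_DIR is an assumed variable; it defaults to ~/.trash.
safe_rm() {
    local trash target
    trash="${TRASH_DIR:-$HOME/.trash}/$(date +%Y%m%d-%H%M%S)-$$"
    mkdir -p "$trash" || return 1
    for target in "$@"; do
        # Refuse to continue if a target does not exist (dangling symlinks are allowed).
        if [ ! -e "$target" ] && [ ! -L "$target" ]; then
            echo "safe_rm: $target: no such file or directory" >&2
            return 1
        fi
        mv -- "$target" "$trash/" || return 1
        echo "moved $target -> $trash/"
    done
}
```

The trash directory still needs periodic cleanup, and this does nothing for SQL statements such as DROP TABLE, where backups remain the only safety net.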

3. Set informative command prompts

Configure your MySQL client and shell prompts so you always know which user, host, database, and directory you are operating in. Example MySQL prompt:

prompt="\\u@\\h : \\d \\r:\\m:\\s> "

Resulting prompt example:

[email protected] : woqutech 08:24:36>

For Bash, customize PS1 and PROMPT_COMMAND to display the user, host, and current directory, and to set the terminal title:

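# Show user, host, and working directory in the prompt
export PS1='[\u@\h \w]\$ '
# Set the terminal title to user@short-hostname before each prompt
PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}\007"'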
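The prompt can also make production hosts visually distinct. The sketch below assumes a hypothetical naming convention in which production hosts are named prod-*; the env_tag function and the color choices are illustrative only and should be adapted to the local naming scheme.

```shell
# Hypothetical convention: production hosts are named prod-*.
env_tag() {
    case "$1" in
        prod-*) printf 'PROD' ;;
        *)      printf 'non-prod' ;;
    esac
}

# Bash prompt escapes: \[ and \] wrap non-printing color codes so that
# line editing still computes the prompt width correctly.
if [ "$(env_tag "${HOSTNAME:-$(uname -n)}")" = "PROD" ]; then
    PS1='\[\e[1;31m\][PROD]\[\e[0m\] [\u@\h \w]\$ '      # red tag on production
else
    PS1='\[\e[1;32m\][non-prod]\[\e[0m\] [\u@\h \w]\$ '  # green tag elsewhere
fi
```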

4. Backup and verify backup integrity

Both hot (real-time) and cold (offline) backups are essential. Use tools such as mysqldump for logical backups, xtrabackup for physical backups, and pt-slave-delay for delayed replication. Always test restores on a separate instance and verify that backups can be applied, e.g., by running the prepare step on a physical backup (--apply-log with the older innobackupex wrapper, --prepare with xtrabackup itself). Consistency checks with pt-table-checksum and pt-table-sync are recommended.
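A cheap first line of defense is to record a checksum when the backup is written and to verify it before any restore. The sketch below assumes GNU coreutils sha256sum is available; the function names are hypothetical. A matching checksum only shows that the file has not changed since it was written, so it complements a real restore test rather than replacing it.

```shell
# Record a SHA-256 checksum next to a backup file (assumes GNU sha256sum).
record_checksum() {
    sha256sum "$1" > "$1.sha256"
}

# Verify a backup against its recorded checksum.
# Returns non-zero if the file is missing or its content has changed.
verify_checksum() {
    sha256sum --check --status "$1.sha256"
}
```

Because sha256sum stores the path exactly as given, calling both functions with the same absolute path avoids surprises when the working directory changes.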

5. Treat production environments with respect

Audit production accounts, enforce least‑privilege access, rotate and encrypt passwords, isolate production from external networks, avoid using development or test procedures in production, and assign dedicated personnel for releases.

6. Handovers and vacations are high‑risk periods

Document all routine tasks, clarify critical databases and accounts, and ensure thorough knowledge transfer before any personnel change. Verify that incoming operators double‑check every step and confirm details with the outgoing engineer.

7. Build alerting and performance monitoring

Set up monitoring that captures historical trends and triggers alerts for replication issues, I/O latency, and MySQL command statistics (e.g., Com_delete, Com_insert, Com_update, Com_select). Use tools like Oracle AWR, the MySQL Performance Schema, and flash-card metrics (logical_written_bytes, physical_read_bytes, etc.).
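Counters such as Com_delete are cumulative since server start, so a trend or alert needs the rate between two samples rather than the raw value. The following sketch shows that arithmetic; the function name counter_rate is hypothetical, and the sampling command is left as a comment because it needs a reachable MySQL server.

```shell
# Print the per-second rate between two samples of a cumulative counter.
# Arguments: previous value, current value, seconds between samples.
counter_rate() {
    local prev="$1" curr="$2" interval="$3"
    if [ "$interval" -le 0 ]; then
        echo "counter_rate: interval must be positive" >&2
        return 1
    fi
    # A smaller current value suggests the server restarted and counters reset,
    # so treat the current value as the whole delta.
    if [ "$curr" -lt "$prev" ]; then
        prev=0
    fi
    # Integer division: fractional rates are truncated.
    echo $(( (curr - prev) / interval ))
}

# Collecting one sample (not run here):
# mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Com_delete'" | awk '{print $2}'
```

For example, counter_rate 100 700 60 prints 10, meaning ten statements per second over the minute. A sudden jump in the Com_delete rate is the kind of signal worth alerting on.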

8. Use automatic failover cautiously

Automatic HA solutions (e.g., Heartbeat) can reduce downtime but must be evaluated for data lag, read‑only status, and potential loss of in‑flight transactions. Ensure the standby is fully synchronized before relying on automatic switchover.
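One way to express the synchronization check is a gate that reads the replication status and refuses promotion when the standby is lagging or replication is broken. The sketch below parses SHOW SLAVE STATUS\G output from standard input. It assumes the older field names Slave_SQL_Running and Seconds_Behind_Master (newer MySQL releases use different names), and the function name standby_ready is hypothetical.

```shell
# Decide whether a standby looks safe to promote.
# Reads SHOW SLAVE STATUS\G output on stdin; argument: maximum lag in seconds.
# Returns 0 only if the SQL thread is running and the lag is within the limit.
standby_ready() {
    local max_lag="${1:-0}"
    awk -v max="$max_lag" '
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            # A missing or NULL lag means replication is stopped or broken.
            if (sql != "Yes" || lag == "" || lag == "NULL" || lag + 0 > max + 0)
                exit 1
            exit 0
        }'
}

# Possible use (not run here):
# mysql -e 'SHOW SLAVE STATUS\G' | standby_ready 5 || echo "do not fail over"
```

Reported lag is only one signal. It says nothing about read-only status or transactions that never reached the standby, so those still need separate checks.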

9. Be meticulous: check, re‑check, and obsess over details

Adopt a disciplined change process: announce changes early, review scripts with peers, test in staging, copy to production, verify file paths, log out and back in to confirm the correct host, and finally execute the script in a controlled session (e.g., using screen to survive network interruptions).
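The "confirm the correct host" step can also be enforced inside the change script itself, so that a script pasted into the wrong terminal stops before doing anything. This is a minimal sketch; require_host is a hypothetical name, and the expected host name would be supplied by whoever writes the change.

```shell
# Abort unless the script is running on the host it was written for.
require_host() {
    local expected="$1" actual
    actual="$(uname -n)"
    if [ "$actual" != "$expected" ]; then
        echo "refusing to run: on '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Typical first line of a change script (host name is illustrative):
# require_host db-prod-01 || exit 1
```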

10. Simplicity is the ultimate sophistication

Prefer built-in commands and lightweight scripts over heavyweight third-party tools. Stick to the Unix philosophy: use the simplest, most reliable solution that meets the requirement.

