10 Proven Ops Practices to Prevent System Failures
This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.
System failures are a perpetual pain point for operations engineers. High availability is a common KPI, and while its definition differs across companies, the methods for avoiding failures are largely the same.
1. Ensure every change has a rollback and is tested in the same environment
Every change must have a rollback plan that has been tested in an identical environment. Untested changes are the most likely to cause unexpected failures, as years of operations experience at Alibaba show.
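As a minimal sketch of this idea, assume the change is an edit to a single configuration file. The hypothetical helper apply_with_rollback below saves a copy of the file, applies the new version, runs a check command, and restores the saved copy if the check fails. The function name, its arguments, and the use of a check command are illustrative assumptions, not a procedure from the original article.

```shell
#!/usr/bin/env bash
# apply_with_rollback FILE NEW_FILE CHECK_CMD...
# Hypothetical helper: replaces FILE with NEW_FILE, then runs CHECK_CMD.
# If the check fails, the saved copy is restored and 1 is returned.
apply_with_rollback() {
    local target=$1 candidate=$2
    shift 2
    local backup="${target}.rollback"

    # Keep the rollback copy before touching anything.
    command cp -p -- "$target" "$backup" || return 2
    # Apply the change.
    command cp -- "$candidate" "$target" || return 2

    if "$@"; then
        echo "change applied: $target"
        return 0
    fi

    # The check failed: put the saved copy back.
    command cp -p -- "$backup" "$target"
    echo "check failed, rolled back: $target" >&2
    return 1
}
```

A call might look like apply_with_rollback app.conf app.conf.new ./healthcheck.sh, where healthcheck.sh is a placeholder for whatever verification fits the system. The same sequence should be rehearsed in the test environment before it is trusted in production.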
2. Treat destructive operations with extreme caution
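Destructive commands such as DROP TABLE, DROP DATABASE, TRUNCATE TABLE, or DELETE FROM ... are hard to reverse. Even a simple rm -r can wipe data if misused. To mitigate this, alias dangerous commands so that they prompt for confirmation:
alias rm='rm -i --'
Similarly, add the interactive flag to cp and mv:
alias cp='cp -i --'
alias mv='mv -i --'
Note that the trailing -- ends option parsing, so any flag typed after the alias (for example -r) is treated as a file name; drop the -- if you still need to pass options.
3. Set informative command prompts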
Configure your MySQL client and shell prompts so you always know which user, host, database, and directory you are operating in. Example MySQL prompt, set in the [mysql] section of my.cnf (backslashes are doubled in option files):
prompt="\\u@\\h : \\d \\r:\\m:\\s> "
Resulting prompt, where user and host stand for the actual account and server:
user@host : woqutech 08:24:36>
For Bash, customize PS1 and PROMPT_COMMAND to display user, host, and current directory, and to set the terminal title:
export PS1='\n[\u@\h \w]$ '
PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}\007"'
4. Backup and verify backup integrity
Both hot (real-time) and cold (offline) backups are essential. Use tools such as mysqldump for logical backups, xtrabackup for physical backups, and pt-slave-delay for delayed replication. Always test restores on a separate instance and verify that backups can actually be applied, e.g., with --apply-log for innobackupex (--prepare when calling xtrabackup directly). Checking source and replica consistency with pt-table-checksum, and repairing any differences with pt-table-sync, is also recommended.
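A cheap first check before a full restore test can be scripted. The sketch below assumes logical backups are gzip-compressed mysqldump files taken with comments enabled (the default), in which case the last line of a complete dump starts with "-- Dump completed". The helper name verify_dump is hypothetical, and this check does not replace restoring the dump on a separate instance.

```shell
#!/usr/bin/env bash
# verify_dump FILE
# Hypothetical helper: succeeds only if FILE is a readable gzip archive
# whose last line is the "-- Dump completed" marker written by mysqldump.
verify_dump() {
    local file=$1

    # gzip -t tests archive integrity without writing any output.
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt archive: $file" >&2
        return 1
    fi

    # A dump that was cut short will be missing the trailing marker.
    if gzip -dc "$file" | tail -n 1 | grep -q '^-- Dump completed'; then
        echo "ok: $file"
    else
        echo "truncated dump: $file" >&2
        return 1
    fi
}
```

Running verify_dump on each new backup file catches interrupted or damaged dumps early, before anyone depends on them during an incident.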
5. Treat production environments with respect
Audit production accounts, enforce least‑privilege access, rotate and encrypt passwords, isolate production from external networks, avoid using development or test procedures in production, and assign dedicated personnel for releases.
6. Handovers and vacations are high‑risk periods
Document all routine tasks, clarify critical databases and accounts, and ensure thorough knowledge transfer before any personnel change. Verify that incoming operators double‑check every step and confirm details with the outgoing engineer.
7. Build alerting and performance monitoring
Set up monitoring to capture historical trends and trigger alerts for replication issues, I/O latency, and MySQL command statistics (e.g., Com_delete, Com_insert, Com_update, Com_select). Use tools like Oracle AWR, the MySQL Performance Schema, and flash-card metrics (logical_written_bytes, physical_read_bytes, etc.).
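The Com_* values are cumulative counters, so a trend only appears when two samples are compared. The sketch below assumes each sample was saved as tab-separated "name value" lines, for example with mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Com_%'", and prints a per-second rate for each counter. The helper name status_rates and the file layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# status_rates BEFORE AFTER SECONDS
# BEFORE and AFTER are files of "name<TAB>value" lines taken SECONDS apart.
# Prints "name rate" for every counter present in both samples.
status_rates() {
    awk -v secs="$3" '
        # First file: remember the earlier value of each counter.
        NR == FNR { before[$1] = $2; next }
        # Second file: print the change per second.
        ($1 in before) {
            printf "%s %.1f\n", $1, ($2 - before[$1]) / secs
        }
    ' "$1" "$2"
}
```

With samples taken 10 seconds apart, a Com_select counter that moves from 100 to 400 is reported as 30.0 queries per second. Feeding these rates into a time-series store gives the historical baseline that alert thresholds are judged against.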
8. Use automatic failover cautiously
Automatic HA solutions (e.g., Heartbeat) can reduce downtime but must be evaluated for data lag, read‑only status, and potential loss of in‑flight transactions. Ensure the standby is fully synchronized before relying on automatic switchover.
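One way to make the "fully synchronized" condition explicit is a guard that the failover tooling calls before promoting the standby. The sketch below assumes the output of SHOW SLAVE STATUS\G is piped in on stdin and uses the field names from that statement (newer MySQL releases rename them under SHOW REPLICA STATUS). The helper name replica_ready and the lag threshold are illustrative assumptions.

```shell
#!/usr/bin/env bash
# replica_ready MAX_LAG
# Reads SHOW SLAVE STATUS\G output on stdin. Succeeds only if both
# replication threads are running and Seconds_Behind_Master <= MAX_LAG.
replica_ready() {
    awk -v max="$1" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            # A NULL or missing lag value never counts as ready.
            ok = (io == "Yes" && sql == "Yes" && lag ~ /^[0-9]+$/ && lag + 0 <= max + 0)
            exit (ok ? 0 : 1)
        }
    '
}
```

A switchover script could then run mysql -e 'SHOW SLAVE STATUS\G' | replica_ready 5 and refuse to promote the standby when the check fails. Zero reported lag still does not guarantee that every in-flight transaction reached the standby, so the result should be treated as a minimum condition.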
9. Be meticulous: check, re‑check, and obsess over details
Adopt a disciplined change process: announce changes early, review scripts with peers, test in staging, copy to production, verify file paths, log out and back in to confirm the correct host, and finally execute the script in a controlled session (e.g., using screen to survive network interruptions).
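The "confirm the correct host" step can also be enforced by the script itself. The sketch below is a hypothetical guard, require_host, placed at the top of a change script so that it refuses to run anywhere except the intended server. The function name and message wording are assumptions for illustration.

```shell
#!/usr/bin/env bash
# require_host EXPECTED
# Hypothetical guard for the top of a change script: refuses to continue
# unless the short host name matches the one the operator intended.
require_host() {
    local expected=$1 actual
    actual=$(uname -n)
    actual=${actual%%.*}        # drop any domain suffix
    if [ "$actual" != "$expected" ]; then
        echo "refusing to run: on '$actual', expected '$expected'" >&2
        return 1
    fi
    echo "host confirmed: $actual"
}
```

A change script would start with a line such as require_host db-prod-01 || exit 1, where db-prod-01 is a placeholder name. Starting the session with screen -S change and reattaching with screen -r change after a disconnect keeps the script running through network interruptions.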
10. Simplicity is the ultimate sophistication
Prefer built‑in commands and lightweight scripts over heavyweight third‑party tools. Stick to Unix philosophy: use the simplest, most reliable solution that meets the requirement.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.