Essential DBA & Ops Practices to Prevent System Failures
This article outlines ten practical guidelines that help DBAs and system administrators minimize costly outages: rollback-ready changes, cautious use of destructive commands, informative prompts, verified backups, respect for production, thorough handovers, alerting and monitoring, careful failover, meticulous checks, and simplicity.
1. Ensure changes are rollback‑able and tested in an identical environment
Operations is a discipline learned through experience and trial and error, so do the trial and error in a test environment that matches production, and protect the production site by making sure every change can be reverted if needed.
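As a minimal sketch of a revertible change at the file level, the helpers below copy a file before it is edited and restore it on demand. The names backup_file and rollback_file are hypothetical, and a database change would need its own rollback script rather than a file copy.

```shell
#!/usr/bin/env bash
# Sketch: keep a timestamped copy of a file before changing it,
# so the change can be reverted. Helper names are hypothetical.

# Copy the file next to itself with a timestamp suffix and print the backup path.
backup_file() {
    local target="$1"
    local backup
    backup="${target}.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$target" "$backup" && printf '%s\n' "$backup"
}

# Restore the file from a backup created by backup_file.
rollback_file() {
    local backup="$1"
    local target="$2"
    cp -p -- "$backup" "$target"
}
```

A typical use would be to capture the printed path, for example bak=$(backup_file /path/to/file), apply the change, and call rollback_file "$bak" /path/to/file if the change misbehaves.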
2. Treat destructive operations with extreme caution
Examples for Oracle: truncate table table_name, delete from table_name, drop table table_name. These are easy to run but costly to undo, even where recovery is possible.
Examples for Linux: rm -r removes a directory and everything beneath it, including all subdirectories. Many users alias rm, cp, and mv to their interactive forms to prevent accidents:
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
3. Configure informative command prompts
Before executing commands, know whether you are on the primary or standby, the current directory, schema, session, and time.
Oracle example:
set sqlprompt 'RAC-node1-primary@10g>>'
RAC-node1-primary@10g>>
For Linux, customize PS1 to display host, user, and directory.
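One possible Bash prompt along these lines is sketched below. The [PROD] label is an assumed, hypothetical tag that would be set per machine role; the escapes \u, \h, \w, \t, and \$ are standard Bash prompt sequences for user, host, working directory, time, and the privilege marker.

```shell
# Sketch for ~/.bashrc: show role, user, host, working directory, and time.
# ROLE_LABEL is a hypothetical variable; set it to match each machine's role.
ROLE_LABEL="[PROD]"
# Single quotes keep the backslash escapes literal so Bash expands them at prompt time.
PS1="${ROLE_LABEL}"' \u@\h:\w \t\$ '
export PS1
```

With this in place the prompt would look something like [PROD] oracle@db01:/u01/app 14:05:32$, where the user, host, and path shown here are made-up values.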
4. Backup and verify backup integrity
Backups are essential; they can be classified as cold/hot, real‑time/non‑real‑time, physical/logical. Even with real‑time hot backups, you still need non‑real‑time backups to recover from logical errors such as accidental DELETE statements.
Always validate backups by restoring them to an empty database.
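The same idea can be sketched at the file level: restore the archive into an empty scratch directory and compare the result with the source. This assumes a tar archive created with tar -czf archive -C source_dir . and a source that has not changed since the backup; verify_backup is a hypothetical helper, and a database backup would instead be restored into an empty instance and checked with queries.

```shell
#!/usr/bin/env bash
# Sketch: verify a tar backup by restoring it into an empty scratch
# directory and comparing it with the source. verify_backup is hypothetical.

verify_backup() {
    local archive="$1"
    local source_dir="$2"
    local scratch
    scratch=$(mktemp -d) || return 1
    # A corrupt archive fails at extraction; a silent content mismatch fails at diff.
    if tar -xzf "$archive" -C "$scratch" 2>/dev/null \
        && diff -r "$source_dir" "$scratch" >/dev/null; then
        rm -rf "$scratch"
        return 0
    fi
    rm -rf "$scratch"
    return 1
}
```

The function returns 0 only when the archive both extracts cleanly and matches the source, so it can gate a "backup OK" message in a scheduled job.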
5. Treat production environments with reverence
Adopt professional ethics as strict as those of accountants. Run regular health checks (e.g., Oracle RDA inspections, Linux password aging policies, network isolation).
6. Handover and vacation periods are high‑risk
When taking over work, confirm pending change plans more than once. Before leaving, document your procedures and prepare a detailed handover document that specifies actions and contacts.
7. Build alerting and performance monitoring
Alerting lets you know about anomalies instantly; monitoring provides historical performance data for trend analysis and optimization.
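A very small alerting check might look like the sketch below, which warns when filesystem usage crosses a threshold. The helper names check_usage and current_usage and the 90% default are assumptions for illustration; a real setup would send the alert line to mail or a paging system rather than print it.

```shell
#!/usr/bin/env bash
# Sketch: warn when a filesystem's usage exceeds a threshold.
# Helper names and the 90% default are assumptions for illustration.

# Print an ALERT line and return 1 when the used percentage exceeds the limit.
check_usage() {
    local used="$1"
    local limit="${2:-90}"
    local mount="${3:-/}"
    if [ "$used" -gt "$limit" ]; then
        printf 'ALERT: %s is at %s%% (limit %s%%)\n' "$mount" "$used" "$limit"
        return 1
    fi
    return 0
}

# Read the used percentage of a mount point from POSIX-format df output.
current_usage() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}
```

Run from cron as check_usage "$(current_usage /)" 90 /, this covers the alerting half; appending each current_usage reading with a timestamp to a log file would give the historical data that monitoring needs.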
8. Use automatic failover cautiously
In Oracle Data Guard, a failover that happens before a transaction has been replicated to the standby loses that transaction, which can mean lost orders and revenue.
9. Be meticulous and double‑check everything
Notify stakeholders weeks in advance via email and phone.
Write scripts on a test machine and conduct a peer review.
Copy scripts to production after testing.
Record the exact sequence of commands.
Confirm with all parties the steps, timing, impact, and rollback plan.
Log out, then log back in before running the script.
Execute the script while monitoring output from another terminal.
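The last steps above can be sketched as a small wrapper that makes the operator type the machine's hostname before a change script runs and writes all output to a log file. The name run_change is hypothetical, and the hostname check is one assumed way to guard against running on the wrong machine.

```shell
#!/usr/bin/env bash
# Sketch: require the operator to type the target hostname before a change
# script runs, and log its output. run_change is a hypothetical helper.

run_change() {
    local script="$1"
    local log="$2"
    local expected
    local answer
    expected=$(uname -n)
    printf 'About to run %s on %s. Type the hostname to continue: ' \
        "$script" "$expected" >&2
    read -r answer
    if [ "$answer" != "$expected" ]; then
        echo "Hostname mismatch; nothing was executed." >&2
        return 1
    fi
    # Append output to the log so it can be followed from another terminal.
    bash "$script" >>"$log" 2>&1
}
```

While the script runs, tail -f on the log file from a second terminal gives the live view of its output, and the log remains afterwards as a record of what was executed.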
10. Simplicity is the ultimate sophistication
Resist the temptation to adopt new architectures, tools, or hardware unless they are truly needed in production. Prefer built‑in Linux commands over complex third‑party software; simple text‑based tools are often more reliable.
Wishing all operations professionals smooth, fault‑free work.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
