Essential Ops Lessons: 22 Hard‑Earned Rules to Prevent Disasters
Drawing from three and a half years of DevOps experience, this guide compiles critical operational practices—ranging from cautious command execution and strict change control to robust backup, security hardening, continuous monitoring, performance tuning, and a disciplined mindset—to help engineers avoid costly outages and data loss.
1. Online Operation Guidelines
1. Testing Use When first gaining root access, I impulsively switched from PuTTY to Xshell and altered SSH settings without testing, which locked me out after a restart; a backup of sshd_config saved the day.
2. Confirm Before Enter A mistaken rsync command can delete source data faster than rm -rf; lacking a backup, production data was lost.
3. Avoid Multi‑Person Operations In a previous company, many people shared the root password, leading to conflicting changes and confusion during incident resolution.
4. Backup Before Changes Always back up configuration files (e.g., .conf) before editing, comment out original options, then modify copies.
5. Use rm -rf Sparingly A single typo can cause massive loss; treat deletions with extreme caution.
6. Backup Is Paramount Regular backups—every two hours for payment systems, every 20 minutes for loan platforms—are essential to prevent irreversible damage.
7. Stability Over Speed Prioritize a stable, reliable environment; avoid deploying untested software in production.
8. Confidentiality Is Critical Protect sensitive data and prevent exposure of backdoors.
2. Data‑Related Practices
Emphasize frequent, reliable backups; treat data loss as a severe risk.
3. Security Measures
9. SSH Hardening
Change the default port (recognizing that determined attackers can scan it).
Disable root login.
Use normal users with key authentication, sudo rules, IP restrictions, and user limits.
Employ host‑deny‑like tools to block repeated failed attempts.
Audit /etc/passwd for unauthorized users.
10. Firewall Enable a firewall in production, follow the principle of least privilege: drop all traffic by default and explicitly allow required service ports.
11. Fine‑Grained Permissions Run services with the least privileged accounts; avoid running anything as root.
12. Intrusion Detection & Log Monitoring Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf) and centralize logs for security‑related alerts.
4. Daily Monitoring
13. System Health Monitoring Track hardware utilization—CPU, memory, disk, network—and OS login activity.
14. Service Monitoring Monitor web, database, and load‑balancer metrics to quickly detect performance bottlenecks.
15. Log Monitoring Collect OS, application, and hardware error logs; use them to anticipate issues before they impact stability.
5. Performance Tuning
16. Understand Runtime Mechanisms Before tuning, study how software (e.g., Nginx vs. Apache) processes requests and be able to explain it clearly.
17. Tuning Framework & Order Identify bottlenecks via logs, then tune; prioritize hardware and OS before database configuration.
18. Change One Parameter at a Time Isolate each adjustment to avoid confusion.
19. Benchmarking Use baseline tests to verify the impact of changes and ensure they meet real‑world workload requirements.
6. Ops Mindset
20. Control Your Temper Avoid critical operations when stressed; fatigue increases the risk of catastrophic commands.
21. Take Data Responsibility Treat production data as non‑negotiable; always have a recovery plan.
22. Investigate Root Causes After an incident, dig deep—e.g., a MySQL crash caused by OOM due to insufficient memory and missing swap.
23. Separate Test and Production Verify actions on test machines and limit open terminals during critical changes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
