Essential Ops Rules: Avoid Disasters with Proven Server Management Practices
This article shares hard‑earned operational guidelines—ranging from cautious command usage and backup habits to security hardening, monitoring, performance tuning, and the right mindset—to help engineers prevent costly incidents and keep production systems stable and secure.
1. Online Operation Guidelines
1. Test before using – The author recounts practicing Linux on VMs early on, then describes a real‑world mistake: changing the SSH daemon configuration without a backup, which locked them out of the server until the original sshd_config was restored.
Another incident involved rsync: swapping the source and destination caused massive data loss, because syncing from an empty directory with deletion enabled wiped the directory that actually held the data.
2. Double‑check before pressing Enter – Mistyping a destructive command such as rm -rf /var can happen in an instant, especially under pressure or over a laggy connection.
By the time you realize the damage, your heart sinks; even experienced engineers suffer such accidents, so constant vigilance is essential.
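One defensive habit worth adopting: the shell's ${var:?} expansion aborts the command if the variable is unset or empty, so a blank variable can never silently turn a scoped delete into rm -rf /. A small sketch (the /tmp path is an example, not a real production directory):

```shell
# Example target; never point this at real data while experimenting.
target="/tmp/demo-cleanup"
mkdir -p "$target/old"

# ${target:?} makes the shell abort with an error if $target is
# unset or empty, instead of expanding to "" and deleting from /.
rm -rf "${target:?}/old"

# List first, delete second: a cheap final double-check.
ls -ld "$target" && rm -rf "${target:?}"
```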
3. Avoid multiple people operating the same server – In a chaotic environment where many admins share the root password, simultaneous changes lead to conflicting configurations and wasted troubleshooting time.
4. Backup before making changes – Always copy a configuration file (e.g., anything ending in .conf) before editing it, preferably commenting out the original lines and adding new ones below them. Regular backups would also have mitigated the earlier rsync disaster.
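The backup-then-comment workflow above can be sketched as follows; a temporary file stands in for a real config such as sshd_config, and GNU sed is assumed:

```shell
# Temporary stand-in for a real configuration file.
conf=$(mktemp)
echo "Port 22" > "$conf"

# Timestamped copy: repeated edits never clobber the same backup.
cp -p "$conf" "$conf.bak.$(date +%Y%m%d%H%M%S)"

# Comment out the original line, then add the replacement below it,
# so the old value stays visible for a quick rollback.
sed -i 's/^Port 22$/#Port 22\nPort 2222/' "$conf"
```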
2. Data Considerations
1. Use rm -rf with extreme caution – A single typo can erase critical databases.
2. Backup is paramount – The author’s former employer performed full backups every two hours for payment systems and every 20 minutes for a lending platform.
3. Stability over speed – Prioritize a stable, reliable environment; avoid untested software upgrades in production.
4. Confidentiality matters – With frequent data leaks, protecting sensitive information is non‑negotiable.
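The backup cadences mentioned above could be expressed as crontab entries like these; the script paths and names are hypothetical placeholders, not the employer's actual tooling:

```shell
# Full backup of the payment system every 2 hours.
0 */2 * * *   /opt/scripts/backup_payment.sh full
# Backup of the lending platform every 20 minutes.
*/20 * * * *  /opt/scripts/backup_lending.sh
```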
3. Security Practices
SSH hardening
Change the default port.
Disable root login.
Use regular users with key authentication, sudo rules, IP restrictions, and user limits.
Deploy intrusion‑prevention tools that block repeated failed attempts.
Audit /etc/passwd for unauthorized users.
Firewall – Enable a firewall in production and follow the principle of least privilege: drop everything, then allow only required ports.
Fine‑grained permissions – Run services with the lowest‑possible privileges, never as root.
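Concretely, the hardening steps above might translate into sshd_config directives and firewall rules like the following sketch. The port (2222), the group name (ops), and the firewalld commands are illustrative choices; adapt them to your distribution and policy.

```shell
# --- /etc/ssh/sshd_config (excerpt) ---
# Port 2222                     # non-default port
# PermitRootLogin no            # no direct root login
# PasswordAuthentication no     # key authentication only
# AllowGroups ops               # limit which users may connect
# MaxAuthTries 3                # throttle brute-force attempts

# Validate the config before reloading, so a typo cannot lock you out:
sshd -t && systemctl reload sshd

# Firewall, least privilege: default deny, then open only what is needed.
firewall-cmd --set-default-zone=drop
firewall-cmd --permanent --zone=drop --add-port=2222/tcp
firewall-cmd --permanent --zone=drop --add-port=443/tcp
firewall-cmd --reload
```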
Intrusion detection and log monitoring
Use third‑party tools to watch critical system and service configuration files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf).
Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.
Block scanning IPs via host‑deny lists; logs are invaluable for post‑incident analysis.
Fundamental security work dramatically improves system resilience.
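The host-deny idea can be sketched with standard text tools. Temporary files stand in for /var/log/secure and /etc/hosts.deny, the log format is assumed to match the usual "Failed password ... from <ip> ..." lines, and the threshold of 3 attempts is arbitrary:

```shell
# Stand-in for /var/log/secure, seeded with fake failed logins.
log=$(mktemp)
for i in 1 2 3 4; do
  echo "sshd[100]: Failed password for root from 203.0.113.7 port 2$i ssh2" >> "$log"
done
echo "sshd[100]: Failed password for root from 198.51.100.9 port 40 ssh2" >> "$log"

# Stand-in for /etc/hosts.deny.
deny=$(mktemp)

# Extract the IP after "from", count attempts per IP, and deny
# any address with 3 or more failures.
awk '/Failed password/ {for (i=1;i<=NF;i++) if ($i=="from") print $(i+1)}' "$log" \
  | sort | uniq -c \
  | awk '$1 >= 3 {print "sshd: " $2}' >> "$deny"
```

In production, point the same pipeline at the real log and deny file, or use a purpose-built tool that automates this loop.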
4. Daily Monitoring
System health – Track hardware utilization (CPU, memory, disk, network) and OS metrics such as login activity and critical file changes.
Service health – Monitor web, database, load‑balancer, and other application metrics to quickly detect performance bottlenecks.
Log monitoring – Beyond security logs, watch application and OS error logs; proactive monitoring prevents reactive firefighting.
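A cron-friendly health check for the system-health bullet might look like this sketch; the 90% thresholds are arbitrary, and Linux-style df and free output formats are assumed:

```shell
# Percentage of the root filesystem in use (strip the trailing %).
disk_used=$(df -P / | awk 'NR==2 {gsub(/%/, ""); print $5}')
# Percentage of memory in use.
mem_used=$(free | awk '/^Mem:/ {printf "%d", $3/$2*100}')

if [ "$disk_used" -ge 90 ]; then
  echo "ALERT: root filesystem at ${disk_used}%"
fi
if [ "$mem_used" -ge 90 ]; then
  echo "ALERT: memory at ${mem_used}%"
fi
echo "disk=${disk_used}% mem=${mem_used}%"
```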
5. Performance Tuning
1. Understand the software’s internals – Knowing why Nginx outperforms Apache, for example, guides effective parameter adjustments.
2. Follow a tuning framework – Identify bottlenecks via logs, define a tuning direction, then adjust OS/hardware before touching database settings.
3. Change one parameter at a time – Isolating effects prevents confusion.
4. Benchmark – Use realistic benchmark tests to verify that changes improve performance without harming stability.
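The "one parameter, then benchmark" loop might look like the following sketch; ab (Apache Bench), the local URL, and the somaxconn tweak are illustrative examples, not the article's recommendations:

```shell
# Baseline under the exact load you will compare against later.
ab -n 10000 -c 100 http://127.0.0.1/ > baseline.txt

# ONE change only, so any difference is attributable to it.
sysctl -w net.core.somaxconn=1024

# Re-run the identical test and compare.
ab -n 10000 -c 100 http://127.0.0.1/ > tuned.txt
grep 'Requests per second' baseline.txt tuned.txt
```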
6. Ops Mindset
Control emotions – High‑stress moments (e.g., an accidental rm -rf near the end of a shift) require calm decision‑making; avoid handling critical data when upset.
Take responsibility for data – Production data is not a playground; lack of backups leads to severe consequences.
Root‑cause analysis – When issues recur, dig deeper; the author cites a case where repeated MySQL crashes were traced to OOM kills caused by insufficient memory.
Separate test and production – Verify operations on test machines and limit open windows to reduce human error.
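For root-cause hunts like the MySQL example, the kernel log is usually where an OOM kill shows up. A sketch of where to look (these commands generally require root, and log paths vary by distribution):

```shell
# Kernel ring buffer, with human-readable timestamps.
dmesg -T | grep -i 'out of memory'

# On systemd machines, the persistent kernel journal.
journalctl -k | grep -i -E 'oom-killer|killed process'

# On older systems, the syslog file instead.
grep -i 'killed process' /var/log/messages
```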
Source: http://www.cnblogs.com/yihr/p/9593795.html?from=groupmessage Author: 油腻克斯