Essential Ops Rules: Avoid Disasters with Proven Server Management Practices
This article shares hard‑earned operational guidelines—ranging from cautious command usage and backup habits to security hardening, monitoring, performance tuning, and the right mindset—to help engineers prevent costly incidents and keep production systems stable and secure.
1. Online Operation Guidelines
1. Test before using – The author recounts practicing Linux on VMs early on, then describes a real‑world mistake: changing the SSH daemon configuration without a backup, which locked them out of the server until the original sshd_config was restored.
Another incident involved rsync: swapping the source and destination caused massive data loss, because syncing from an empty directory with deletion enabled wiped the directory that actually held the data.
2. Double‑check before pressing Enter – Mistyping a destructive command such as rm -rf /var can happen in an instant, especially under pressure or over a laggy connection.
By the time you realize the damage, your heart sinks; even experienced engineers suffer such accidents, so constant vigilance is essential.
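One defensive habit worth adopting: the shell's ${var:?} expansion aborts the command if the variable is unset or empty, so a blank variable can never silently turn a scoped delete into rm -rf /. A small sketch (the /tmp path is an example, not a real production directory):

```shell
# Example target; never point this at real data while experimenting.
target="/tmp/demo-cleanup"
mkdir -p "$target/old"

# ${target:?} makes the shell abort with an error if $target is
# unset or empty, instead of expanding to "" and deleting from /.
rm -rf "${target:?}/old"

# List first, delete second: a cheap final double-check.
ls -ld "$target" && rm -rf "${target:?}"
```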
3. Avoid multiple people operating the same server – In a chaotic environment where many admins share the root password, simultaneous changes lead to conflicting configurations and wasted troubleshooting time.
4. Backup before making changes – Always copy a configuration file (e.g., anything ending in .conf) before editing it, preferably commenting out the original lines and adding new ones below them. Regular backups would also have mitigated the earlier rsync disaster.
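The backup-then-comment workflow above can be sketched as follows; a temporary file stands in for a real config such as sshd_config, and GNU sed is assumed:

```shell
# Temporary stand-in for a real configuration file.
conf=$(mktemp)
echo "Port 22" > "$conf"

# Timestamped copy: repeated edits never clobber the same backup.
cp -p "$conf" "$conf.bak.$(date +%Y%m%d%H%M%S)"

# Comment out the original line, then add the replacement below it,
# so the old value stays visible for a quick rollback.
sed -i 's/^Port 22$/#Port 22\nPort 2222/' "$conf"
```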
2. Data Considerations
1. Use rm -rf with extreme caution – A single typo can erase critical databases.
2. Backup is paramount – The author’s former employer performed full backups every two hours for payment systems and every 20 minutes for a lending platform.
3. Stability over speed – Prioritize a stable, reliable environment; avoid untested software upgrades in production.
4. Confidentiality matters – With frequent data leaks, protecting sensitive information is non‑negotiable.
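The backup cadences mentioned above could be expressed as crontab entries like these; the script paths and names are hypothetical placeholders, not the employer's actual tooling:

```shell
# Full backup of the payment system every 2 hours.
0 */2 * * *   /opt/scripts/backup_payment.sh full
# Backup of the lending platform every 20 minutes.
*/20 * * * *  /opt/scripts/backup_lending.sh
```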
3. Security Practices
SSH hardening
Change the default port.
Disable root login.
Use regular users with key authentication, sudo rules, IP restrictions, and user limits.
Deploy intrusion‑prevention tools that block repeated failed attempts.
Audit /etc/passwd for unauthorized users.
Firewall – Enable a firewall in production and follow the principle of least privilege: drop everything, then allow only required ports.
Fine‑grained permissions – Run services with the lowest‑possible privileges, never as root.
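Concretely, the hardening steps above might translate into sshd_config directives and firewall rules like the following sketch. The port (2222), the group name (ops), and the firewalld commands are illustrative choices; adapt them to your distribution and policy.

```shell
# --- /etc/ssh/sshd_config (excerpt) ---
# Port 2222                     # non-default port
# PermitRootLogin no            # no direct root login
# PasswordAuthentication no     # key authentication only
# AllowGroups ops               # limit which users may connect
# MaxAuthTries 3                # throttle brute-force attempts

# Validate the config before reloading, so a typo cannot lock you out:
sshd -t && systemctl reload sshd

# Firewall, least privilege: default deny, then open only what is needed.
firewall-cmd --set-default-zone=drop
firewall-cmd --permanent --zone=drop --add-port=2222/tcp
firewall-cmd --permanent --zone=drop --add-port=443/tcp
firewall-cmd --reload
```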
Intrusion detection and log monitoring
Use third‑party tools to watch critical system and service configuration files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf).
Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.
Block scanning IPs via host‑deny lists; logs are invaluable for post‑incident analysis.
Fundamental security work dramatically improves system resilience.
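The host-deny idea can be sketched with standard text tools. Temporary files stand in for /var/log/secure and /etc/hosts.deny, the log format is assumed to match the usual "Failed password ... from <ip> ..." lines, and the threshold of 3 attempts is arbitrary:

```shell
# Stand-in for /var/log/secure, seeded with fake failed logins.
log=$(mktemp)
for i in 1 2 3 4; do
  echo "sshd[100]: Failed password for root from 203.0.113.7 port 2$i ssh2" >> "$log"
done
echo "sshd[100]: Failed password for root from 198.51.100.9 port 40 ssh2" >> "$log"

# Stand-in for /etc/hosts.deny.
deny=$(mktemp)

# Extract the IP after "from", count attempts per IP, and deny
# any address with 3 or more failures.
awk '/Failed password/ {for (i=1;i<=NF;i++) if ($i=="from") print $(i+1)}' "$log" \
  | sort | uniq -c \
  | awk '$1 >= 3 {print "sshd: " $2}' >> "$deny"
```

In production, point the same pipeline at the real log and deny file, or use a purpose-built tool that automates this loop.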
4. Daily Monitoring
System health – Track hardware utilization (CPU, memory, disk, network) and OS metrics such as login activity and critical file changes.
Service health – Monitor web, database, load‑balancer, and other application metrics to quickly detect performance bottlenecks.
Log monitoring – Beyond security logs, watch application and OS error logs; proactive monitoring prevents reactive firefighting.
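A cron-friendly health check for the system-health bullet might look like this sketch; the 90% thresholds are arbitrary, and Linux-style df and free output formats are assumed:

```shell
# Percentage of the root filesystem in use (strip the trailing %).
disk_used=$(df -P / | awk 'NR==2 {gsub(/%/, ""); print $5}')
# Percentage of memory in use.
mem_used=$(free | awk '/^Mem:/ {printf "%d", $3/$2*100}')

if [ "$disk_used" -ge 90 ]; then
  echo "ALERT: root filesystem at ${disk_used}%"
fi
if [ "$mem_used" -ge 90 ]; then
  echo "ALERT: memory at ${mem_used}%"
fi
echo "disk=${disk_used}% mem=${mem_used}%"
```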
5. Performance Tuning
1. Understand the software’s internals – Knowing why Nginx outperforms Apache, for example, guides effective parameter adjustments.
2. Follow a tuning framework – Identify bottlenecks via logs, define a tuning direction, then adjust OS/hardware before touching database settings.
3. Change one parameter at a time – Isolating effects prevents confusion.
4. Benchmark – Use realistic benchmark tests to verify that changes improve performance without harming stability.
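The "one parameter, then benchmark" loop might look like the following sketch; ab (Apache Bench), the local URL, and the somaxconn tweak are illustrative examples, not the article's recommendations:

```shell
# Baseline under the exact load you will compare against later.
ab -n 10000 -c 100 http://127.0.0.1/ > baseline.txt

# ONE change only, so any difference is attributable to it.
sysctl -w net.core.somaxconn=1024

# Re-run the identical test and compare.
ab -n 10000 -c 100 http://127.0.0.1/ > tuned.txt
grep 'Requests per second' baseline.txt tuned.txt
```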
6. Ops Mindset
Control emotions – High‑stress moments (e.g., an accidental rm -rf near the end of a shift) require calm decision‑making; avoid handling critical data when upset.
Take responsibility for data – Production data is not a playground; lack of backups leads to severe consequences.
Root‑cause analysis – When issues recur, dig deeper; the author cites a case where repeated MySQL crashes were traced to OOM kills caused by insufficient memory.
Separate test and production – Verify operations on test machines and limit open windows to reduce human error.
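For root-cause hunts like the MySQL example, the kernel log is usually where an OOM kill shows up. A sketch of where to look (these commands generally require root, and log paths vary by distribution):

```shell
# Kernel ring buffer, with human-readable timestamps.
dmesg -T | grep -i 'out of memory'

# On systemd machines, the persistent kernel journal.
journalctl -k | grep -i -E 'oom-killer|killed process'

# On older systems, the syslog file instead.
grep -i 'killed process' /var/log/messages
```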
Source: http://www.cnblogs.com/yihr/p/9593795.html?from=groupmessage Author: 油腻克斯