Essential Ops Playbook: Avoid Costly Mistakes in Server Management
This guide shares practical Linux server operation rules, emphasizing thorough testing, careful use of destructive commands, strict access control, regular backups, security hardening, continuous monitoring, and disciplined performance tuning to prevent costly outages and data loss.
Online Operation Standards
1. Test Before Use
Many people learn Linux on virtual machines, where a careless experiment costs nothing. The same habit of changing things without testing becomes dangerous once you hold root access on real servers.
On my first day at work I switched from PuTTY to Xshell and changed the SSH configuration without testing, which locked me out of the server until the original sshd_config was restored.
Another example: rsync copies from the first path to the second, and with a deletion option such as --delete it removes anything in the destination that is missing from the source. Reverse the two paths and production data is overwritten or deleted.
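One way to lower that risk is to preview a sync before running it. The sketch below only builds the rsync argument list and does not execute anything; the helper name build_rsync_command is hypothetical. It assumes rsync's standard --archive, --delete, --itemize-changes, and --dry-run options, and it defaults to a dry run so that deletions are listed rather than performed.

```python
import os


def build_rsync_command(source, destination, dry_run=True):
    """Return an rsync argument list that mirrors source into destination.

    rsync copies from the first path to the second; with --delete it also
    removes files in the destination that are missing from the source.
    """
    if os.path.abspath(source) == os.path.abspath(destination):
        raise ValueError("source and destination are the same path")
    command = ["rsync", "--archive", "--delete", "--itemize-changes"]
    if dry_run:
        # --dry-run reports what would change without touching any file.
        command.append("--dry-run")
    # A trailing slash on the source means "the contents of this directory".
    command += [source.rstrip("/") + "/", destination.rstrip("/") + "/"]
    return command
```

Print the list, check the source and destination order, read the dry-run output, and only then build the command again with dry_run=False.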
2. Confirm Before Pressing Enter
Commands like rm -rf /var are easy to mistype, especially when working quickly or over a slow connection.
By the time you realize the command has run, it is too late and your heart sinks.
Even if you have never made a mistake, a single slip can cause a disaster; never assume that operational incidents happen only to others.
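A confirmation step can be built into wrapper scripts. The following is a minimal sketch, not a complete safeguard: the PROTECTED_PATHS list and both function names are hypothetical, and it assumes a Linux path layout. The destructive action is accepted only when the target is not a protected directory and the operator has retyped the full path.

```python
import os

# Hypothetical list of paths that should never be removed recursively.
PROTECTED_PATHS = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def is_protected(path):
    """Return True if path normalizes to one of the protected directories."""
    return os.path.normpath(os.path.abspath(path)) in PROTECTED_PATHS


def confirmed(path, typed_answer):
    """Accept a destructive action only if the operator retyped the full path."""
    if is_protected(path):
        return False
    return typed_answer.strip() == os.path.normpath(os.path.abspath(path))
```

Retyping the path is slower than answering "y", which is the point: it forces a second look at exactly what is about to be removed.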
3. Avoid Multiple Operators
In a chaotic environment where many people know the root password, simultaneous changes can overwrite each other's work, making troubleshooting extremely frustrating.
4. Back Up Before Changing
Always back up configuration files (e.g., .conf files) before editing. When changing an option, comment out the original line, then copy it and modify the copy. Regular database backups would have made the rsync mishap recoverable.
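Backing up a file before an edit is easy to automate. This sketch assumes a timestamped .bak naming scheme, which is an illustrative choice and not something the original prescribes; backup_file is a hypothetical helper name.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, suffix_format="%Y%m%d-%H%M%S"):
    """Copy path to <name>.<timestamp>.bak next to it and return the backup Path."""
    original = Path(path)
    stamp = time.strftime(suffix_format)
    backup = original.with_name(f"{original.name}.{stamp}.bak")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    # copy2 preserves the modification time and permission bits.
    shutil.copy2(original, backup)
    return backup
```

Rolling back is then a single copy in the other direction, and the timestamp shows which version was live before each change.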
Data‑Related Guidelines
1. Use rm -rf with Extreme Caution
The internet is full of accounts of disastrous deletions; a tiny mistake can cause massive loss.
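One common alternative to deleting immediately is to move the target into a holding directory and purge it later. The sketch below illustrates that idea under assumptions: the quarantine function and the trash directory layout are hypothetical, and a real setup would also need a cleanup policy.

```python
import shutil
import time
from pathlib import Path


def quarantine(target, trash_dir):
    """Move target into trash_dir instead of deleting it; return the new path."""
    source = Path(target)
    if not source.exists():
        raise FileNotFoundError(source)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    # Prefix a timestamp so repeated removals of the same name do not collide.
    destination = trash / f"{time.strftime('%Y%m%d-%H%M%S')}-{source.name}"
    if destination.exists():
        raise FileExistsError(destination)
    shutil.move(str(source), str(destination))
    return destination
```

If the removal turns out to be a mistake, the data is still on disk and can be moved back.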
2. Backup Is Paramount
At my previous company, third-party payment services were backed up every two hours, while a loan platform was backed up every 20 minutes.
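A backup schedule is only useful if someone notices when it stops running. As a rough sketch, the check below compares the age of the newest file in a backup directory against the expected interval, such as the 20 minutes mentioned above. The function names and the one-directory-per-service layout are assumptions.

```python
import time
from pathlib import Path


def newest_backup_age(backup_dir, now=None):
    """Return the age in seconds of the most recently modified file, or None."""
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return None
    now = time.time() if now is None else now
    return now - max(p.stat().st_mtime for p in files)


def backup_is_fresh(backup_dir, max_age_seconds, now=None):
    """Return True if at least one backup is no older than max_age_seconds."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= max_age_seconds
```

For a 20-minute schedule, calling backup_is_fresh(directory, 20 * 60) from a monitoring job would flag a missed run. It only checks that a recent file exists, not that the backup can be restored.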
3. Stability Over Speed
Prioritize stability and availability over raw performance; avoid deploying untested software in production.
4. Confidentiality Is Critical
With frequent data leaks, protecting sensitive data is non-negotiable.
Security Practices
1. SSH Hardening
Change the default port.
Disable root login.
Use a regular user with key authentication, sudo rules, and IP restrictions.
Deploy brute-force protection tools (e.g., DenyHosts).
Audit /etc/passwd for accounts that are allowed to log in.
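Some of the checks in this list can be scripted. The sketch below is a simplified audit, not a full parser: it reads sshd_config and /etc/passwd content as strings, ignores Match blocks and the keyword=value syntax, and treats a small hypothetical set of shells as "no login". The keywords Port, PermitRootLogin, and PasswordAuthentication are standard OpenSSH options; the function names are made up for illustration.

```python
def audit_sshd_config(text):
    """Return a list of warnings for an sshd_config given as a string."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it finds for each keyword.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    warnings = []
    if settings.get("port", "22") == "22":
        warnings.append("SSH listens on the default port 22")
    if settings.get("permitrootlogin", "unset").lower() != "no":
        warnings.append("PermitRootLogin is not set to no")
    if settings.get("passwordauthentication", "yes").lower() != "no":
        warnings.append("PasswordAuthentication is not set to no")
    return warnings


def login_users(passwd_text):
    """Return account names from /etc/passwd content that have a login shell."""
    no_login = ("/sbin/nologin", "/usr/sbin/nologin", "/bin/false")
    users = []
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 7 and fields[6] not in no_login:
            users.append(fields[0])
    return users
```

Any name returned by login_users that nobody recognizes deserves a closer look.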
2. Firewall
Enable the firewall in production and follow the principle of least privilege: drop all traffic by default and allow only necessary ports.
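To make the default-deny idea concrete, the sketch below generates iptables commands as text and does not run them. It assumes iptables is the firewall in use and that only inbound TCP ports need opening; default_deny_rules is a hypothetical helper. The accept rules are emitted first and the DROP policy last, so an SSH session is not cut off halfway through applying them.

```python
def default_deny_rules(allowed_tcp_ports):
    """Return iptables commands that allow listed TCP ports and drop the rest."""
    rules = [
        # Keep loopback and already-established connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the default policy last so the accept rules exist before it applies.
    rules.append("iptables -P INPUT DROP")
    return rules
```

Review the generated list on a test machine first, and make sure the SSH port is in the allowed set before applying it remotely.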
3. Fine-Grained Permissions
Run services with the least privileged user; never run them as root.
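A quick way to spot services running as root is to scan the process list. This sketch assumes its input is the text printed by `ps -eo user,comm`; the function name and the watch list are hypothetical. Some daemons legitimately keep a root-owned parent process so they can bind privileged ports, so treat the result as a prompt for review, not a verdict.

```python
def root_owned_services(ps_output, service_names):
    """Return the watched services that appear under the root user.

    ps_output is expected to be the text printed by `ps -eo user,comm`.
    """
    watched = set(service_names)
    found = set()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "root" and parts[1].strip() in watched:
            found.add(parts[1].strip())
    return sorted(found)
```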
4. Intrusion Detection and Log Monitoring
Use third‑party tools to monitor critical system and service configuration files for changes.
Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.
Detect port scans and block offending IPs via /etc/hosts.deny.
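Monitoring configuration files for changes usually comes down to comparing checksums against a known-good baseline. The sketch below shows that idea with SHA-256; it stands in for a dedicated tool and is not a replacement for one. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each path to the SHA-256 hex digest of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(baseline, current):
    """Return sorted paths that were modified, added, or removed."""
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

Store the baseline somewhere an intruder on the monitored host cannot rewrite, otherwise the comparison proves nothing.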
Effective security starts with solid fundamentals; once basics are covered, advanced measures become easier to implement.
Daily Monitoring
1. System Monitoring
Track hardware utilization (CPU, memory, disk, network) and OS metrics such as login activity and critical file changes.
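As a small illustration, two of these metrics can be read with the Python standard library alone. The thresholds below (90% disk usage, a load of 1.0 per CPU) are arbitrary assumptions chosen for the example, and the function names are hypothetical; a production setup would normally rely on a dedicated monitoring system.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def alerts(disk_percent, load, disk_limit=90.0, load_limit=1.0):
    """Return human-readable alerts for values above their limits."""
    found = []
    if disk_percent > disk_limit:
        found.append(f"disk usage {disk_percent:.1f}% exceeds {disk_limit:.1f}%")
    if load > load_limit:
        found.append(f"load per CPU {load:.2f} exceeds {load_limit:.2f}")
    return found
```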
2. Service Monitoring
Monitor web, database, load balancer, and other application metrics to quickly detect performance bottlenecks.
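The simplest service check is whether the port accepts connections and how long that takes. The sketch below measures a TCP connect; tcp_check is a hypothetical helper, and a successful connect only shows that something is listening, not that the application behind it is healthy.

```python
import socket
import time


def tcp_check(host, port, timeout=2.0):
    """Return (is_up, seconds) for a TCP connection attempt to host:port."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started
    except OSError:
        return False, time.monotonic() - started
```

Recording the elapsed time alongside the up/down result makes gradual slowdowns visible before they turn into outages.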
3. Log Monitoring
Observe hardware, OS, and application error logs; without monitoring, you only learn about problems after they have caused damage, instead of catching them early.
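Authentication logs are a good first target. The sketch below counts failed SSH password attempts per source address, assuming the usual OpenSSH message format ("Failed password for ... from ADDRESS port N"); the threshold of five and the function names are illustrative assumptions.

```python
import re
from collections import Counter

# Matches OpenSSH lines such as:
# "Failed password for invalid user admin from 203.0.113.5 port 4242 ssh2"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def failed_logins_by_ip(lines):
    """Count failed SSH password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def offenders(lines, threshold=5):
    """Return addresses with at least threshold failures, busiest first."""
    counts = failed_logins_by_ip(lines)
    return [ip for ip, n in counts.most_common() if n >= threshold]
```

The resulting addresses are candidates for blocking, after checking that none of them belongs to a colleague who forgot a password.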
Performance Tuning
1. Understand Underlying Mechanisms
Before tuning, grasp how software (e.g., Nginx vs. Apache) works internally; otherwise, tuning is guesswork.
2. Tuning Framework and Order
Identify bottlenecks via logs, define a tuning direction, and address hardware/OS before database configuration.
3. Change One Parameter at a Time
Isolating changes prevents confusion.
4. Benchmark Testing
Use benchmarks to verify the impact of changes and to assess new software versions.
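The same before-and-after discipline can be shown in miniature. This sketch times a callable several times and reports the median, which is less sensitive to one-off spikes than a single run; the function names and default repeat counts are assumptions, and real service benchmarks would use a load-testing tool against a test environment.

```python
import statistics
import time


def benchmark(func, repeat=5, number=1000):
    """Return the median seconds per call over repeat rounds of number calls."""
    per_call = []
    for _ in range(repeat):
        started = time.perf_counter()
        for _ in range(number):
            func()
        per_call.append((time.perf_counter() - started) / number)
    return statistics.median(per_call)


def relative_change(before, after):
    """Return the fractional change from before to after (negative is faster)."""
    return (after - before) / before
```

Measure once before the change and once after it, with everything else held constant, so the difference can be attributed to that one parameter.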
Operational Mindset
1. Control Your Emotions
Avoid making critical changes when stressed; if possible, defer to a calmer time.
2. Take Responsibility for Data
Production data is not a toy; always ensure backups exist.
3. Investigate Root Causes
When recurring issues arise, dig deeper rather than applying quick fixes.
4. Test Before Production
Verify operations on test machines and avoid opening multiple terminals for critical tasks.
Source: http://www.cnblogs.com/yihr/p/9593795.html
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.