Essential Ops Lessons: Avoid Disasters with Backups, Permissions, and Monitoring
This article shares hard-earned operational guidelines for Linux servers, covering safe testing, cautious use of rm -rf, the importance of backups, strict access control, SSH hardening, firewall rules, intrusion detection, systematic monitoring, performance tuning, and maintaining a calm mindset to prevent costly incidents.
1. Online Operation Norms
When learning Linux on virtual machines, it’s easy to develop risky habits that become dangerous on real servers; always test changes before applying them.
Example: switching from PuTTY to XShell with key authentication, without testing the change first, locked the author out of the server. Recovery was possible only because a backup of sshd_config existed.
Another example: running rsync with the source and destination reversed deleted the contents of the source directory, which shows how critical backups are.
Before executing destructive commands like rm -rf /var, double‑check the command; a single mistake can cause severe downtime.
Multiple people operating the same server leads to configuration drift and confusion; always coordinate changes and avoid simultaneous edits.
Always back up configuration files (e.g., .conf) before modifying them.
Copy the file before editing it, and comment out the original option instead of deleting it so the previous value stays visible next to the new one.
Regular database backups can mitigate accidental rsync deletions.
Even a single backup can prevent catastrophic data loss.
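The backup habit described above can be sketched in a few lines of Python. This is only an illustration, not the author's tooling: the function name backup_file and the ".bak.<timestamp>" naming scheme are assumptions made for the example.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, stamp=None):
    """Copy `path` to `<name>.bak.<stamp>` in the same directory.

    `stamp` defaults to the current local time; the naming scheme is an
    assumption chosen for this sketch.
    """
    src = Path(path)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.bak.{stamp}")
    if dst.exists():
        # Never overwrite an older backup silently.
        raise FileExistsError(f"backup already exists: {dst}")
    shutil.copy2(src, dst)  # copy2 also preserves timestamps and mode bits
    return dst
```

Calling backup_file on a configuration file before opening it in an editor leaves a dated copy next to the original, so a bad edit can be reverted by copying the backup back.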
2. Data Handling
Never use rm -rf lightly; many incidents involve accidental deletion of critical databases.
Backups are indispensable: some companies perform full backups every two hours, others every 20 minutes.
Data confidentiality is also vital; with frequent leaks and router backdoors, sensitive data must be protected.
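Returning to the rm -rf warning in this section, one way to add a safety net is to check a path against a list of protected directories before any recursive delete runs. The sketch below is an assumption-laden illustration: the PROTECTED set and the function name is_safe_to_delete are hypothetical and would need to match the real system layout.

```python
import os

# Hypothetical list of directories that must never be removed recursively.
PROTECTED = {"/", "/var", "/etc", "/usr", "/home", "/boot", "/var/lib/mysql"}


def is_safe_to_delete(target):
    """Return False for empty, relative, or protected paths.

    Symbolic links are not resolved here; this only normalizes the text
    of the path (trailing slashes, "..", repeated slashes).
    """
    if not target or not os.path.isabs(target):
        return False
    # Collapse any run of leading slashes so "//var" is treated as "/var".
    normalized = os.path.normpath("/" + target.lstrip("/"))
    return normalized not in PROTECTED
```

A wrapper script could call is_safe_to_delete first and refuse to continue when it returns False, which catches typos such as a stray space or a trailing "/.." that changes the target.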
3. Security Practices
SSH Hardening
Change the default port (though scanners can still find it).
Disable root login.
Use regular users with key authentication, sudo rules, IP restrictions, and user limits.
Deploy brute-force protection tools (e.g., DenyHosts) to block repeated login attempts.
Audit /etc/passwd for authorized login users.
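Two of the checks in this list lend themselves to a small script: listing accounts whose shell allows login, and flagging sshd_config settings that leave root or password logins enabled. The following is a minimal sketch under stated assumptions: the function names and the NOLOGIN_SHELLS set are invented for the example, the parser ignores Match blocks and the "keyword=value" form, and it treats any setting not explicitly set to "no" as a finding.

```python
# Shells commonly used to deny interactive login (an assumed, partial list).
NOLOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}


def login_users(passwd_text):
    """Return account names in passwd-format text whose shell permits login."""
    users = []
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        if line.startswith("#") or len(fields) != 7:
            continue  # skip comments and malformed entries
        name, shell = fields[0], fields[6]
        if shell not in NOLOGIN_SHELLS:
            users.append(name)
    return users


def sshd_findings(config_text):
    """Flag root login and password login unless explicitly set to 'no'."""
    seen = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd keeps the first value it reads for each keyword.
            seen.setdefault(parts[0].lower(), parts[1].strip().lower())
    findings = []
    if seen.get("permitrootlogin") != "no":
        findings.append("PermitRootLogin is not explicitly 'no'")
    if seen.get("passwordauthentication") != "no":
        findings.append("PasswordAuthentication is not explicitly 'no'")
    return findings
```

Both functions take text rather than file paths, so they can be tried on a copy of /etc/passwd or sshd_config without touching the live files.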
Firewall
Enable a firewall in production and follow the principle of least privilege: drop all traffic by default and allow only required service ports.
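As an illustration of the default-drop principle, the sketch below builds a list of iptables commands from an allowlist of TCP ports. The article does not name a firewall tool, so the use of iptables, the function name default_deny_rules, and the choice to allow loopback and established connections are all assumptions of this example. The commands are only generated as strings, not executed.

```python
def default_deny_rules(allowed_tcp_ports):
    """Build iptables commands that accept listed TCP ports and drop the rest."""
    rules = [
        # Keep local traffic and replies to existing connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the default policy last so the accept rules are already in place.
    rules.append("iptables -P INPUT DROP")
    return rules
```

Printing the result of default_deny_rules([22, 443]) gives a reviewable rule set; applying it to a remote server should be tested first, since a missing SSH rule combined with a DROP policy locks the operator out.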
Fine‑grained Permissions
Run services with the least privileged user possible; avoid using root for services that can operate under a normal account.
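A service can enforce this rule on itself by refusing to start under root. The sketch below assumes a Unix-like system and uses a hypothetical function name, require_unprivileged; the optional euid argument exists only so the check can be exercised without actually being root.

```python
import os


def require_unprivileged(euid=None):
    """Raise PermissionError when the effective user ID is 0 (root)."""
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        raise PermissionError("refusing to run as root; use a service account")
    return euid
```

Calling require_unprivileged() at the top of a service's entry point turns an accidental root launch into an immediate, visible failure.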
Intrusion Detection & Log Monitoring
Use third‑party tools to monitor critical system and service configuration files for changes.
Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.
Block IPs that perform port scans and log these events for post‑incident analysis.
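The idea behind configuration-change monitoring can be shown with a hash baseline: record a SHA-256 digest of each watched file, then compare later. This is a simplified sketch of the technique, not a replacement for a dedicated tool; the function names snapshot and changed_files are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Record a baseline digest for every watched file."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(baseline, paths):
    """Return watched paths that are new, missing, or differ from baseline."""
    changed = []
    for p in map(str, paths):
        try:
            current = sha256_of(p)
        except FileNotFoundError:
            current = None  # a deleted file counts as a change
        if baseline.get(p) != current:
            changed.append(p)
    return changed
```

The baseline itself needs to be stored somewhere an intruder cannot rewrite it, otherwise the comparison proves nothing.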
4. Daily Monitoring
System Monitoring
Track hardware resource usage (memory, disk, CPU, network interfaces) as well as OS-level activity such as user logins.
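A threshold check on disk usage and load average is one small piece of such system monitoring. The sketch below uses only the Python standard library on a Unix-like system; the limits (90% disk, 1.5 load per CPU) and the function names are illustrative assumptions, not recommended values.

```python
import os
import shutil


def sample(path="/"):
    """Return (disk used percent, 1-minute load average per CPU)."""
    usage = shutil.disk_usage(path)
    disk_pct = 100.0 * usage.used / usage.total
    load_per_cpu = os.getloadavg()[0] / (os.cpu_count() or 1)
    return disk_pct, load_per_cpu


def evaluate(disk_pct, load_per_cpu, disk_limit=90.0, load_limit=1.5):
    """Return alert messages for every value at or above its limit."""
    alerts = []
    if disk_pct >= disk_limit:
        alerts.append(f"disk usage {disk_pct:.1f}% >= {disk_limit:.1f}%")
    if load_per_cpu >= load_limit:
        alerts.append(f"load per CPU {load_per_cpu:.2f} >= {load_limit:.2f}")
    return alerts
```

Running evaluate(*sample()) from a scheduled job and forwarding any non-empty result to an alert channel gives a basic early warning before a disk fills up.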
Service Monitoring
Monitor web, database, and load‑balancer services to quickly detect performance bottlenecks.
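At its simplest, service monitoring means checking that a port accepts connections and how long the connection takes. The following sketch assumes a plain TCP check is acceptable for the service in question; check_tcp is a hypothetical name, and real monitoring would usually also validate the application-level response.

```python
import socket
import time


def check_tcp(host, port, timeout=2.0):
    """Return (reachable, seconds) for one TCP connection attempt."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        reachable = False
    return reachable, time.perf_counter() - start
```

Recording the elapsed time alongside the reachable flag makes slow-but-alive services visible, which is often the first sign of a bottleneck.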
Log Monitoring
Collect and analyze logs from hardware, OS, and applications; lack of monitoring makes incident response passive.
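As one concrete instance of log analysis, failed SSH logins can be counted per source IP from lines in the style of /var/log/secure. This sketch assumes the usual OpenSSH "Failed password for ... from <ip> port <n>" message; the threshold of five attempts and the function names are arbitrary choices for illustration.

```python
import re
from collections import Counter

# Matches OpenSSH failure lines, with or without the "invalid user" prefix.
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def failed_logins_by_ip(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_ips(lines, threshold=5):
    """Return addresses with at least `threshold` failures, sorted."""
    counts = failed_logins_by_ip(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The resulting list can feed a review step or a blocklist, so that incident response starts from data instead of from a user complaint.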
5. Performance Tuning
Understand Runtime Mechanisms
Before tuning, grasp how software like Nginx or Apache works, why Nginx is fast, and be able to read source code if needed.
Tuning Framework & Order
Analyze bottlenecks, review logs, define tuning direction, then adjust parameters; prioritize hardware and OS before database configuration.
Change One Parameter at a Time
Isolating each change prevents confusion about its impact.
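The one-change rule can be checked mechanically by diffing the configuration before and after an edit. The sketch below models a configuration as a plain dictionary; the function names are hypothetical, and the nginx-style keys in the usage note are only sample data.

```python
_MISSING = object()  # sentinel so "absent" differs from any real value


def changed_keys(before, after):
    """Return the sorted parameter names whose values differ."""
    keys = set(before) | set(after)
    return sorted(
        k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


def assert_single_change(before, after):
    """Raise ValueError unless exactly one parameter differs."""
    diff = changed_keys(before, after)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed parameter, got {diff}")
    return diff[0]
```

For example, comparing {"worker_processes": 4, "keepalive_timeout": 65} with a copy where only keepalive_timeout was edited passes, while a copy that also touches worker_processes raises an error before the tuning run starts.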
Benchmark Testing
Validate tuning effectiveness and software stability with comprehensive benchmarks, referencing resources like "High Performance MySQL".
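A repeatable timing harness is the smallest form of such a benchmark. The sketch below is a generic illustration and not a method taken from the book cited above: it runs a callable several times after a warm-up and reports the median, which is less sensitive to outliers than a single run. The name benchmark and its defaults are assumptions.

```python
import statistics
import time


def benchmark(func, repeats=5, warmup=1):
    """Time `func` over several runs and summarize the samples in seconds."""
    for _ in range(warmup):
        func()  # warm caches so the first measured run is not an outlier
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }
```

Running the same workload through benchmark before and after a single parameter change gives two comparable summaries; a wide gap between min and max suggests the environment is too noisy to draw conclusions.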
6. Ops Mindset
Control Your Mindset
During high‑pressure moments (e.g., before the end of a shift), stay calm and avoid critical operations if you’re stressed.
Take Responsibility for Data
Production data is not a toy; lack of backups leads to severe consequences.
Root‑Cause Analysis
After fixing an issue, investigate underlying causes (e.g., OOM kills due to insufficient memory) rather than applying temporary patches.
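For the OOM example, root-cause work usually starts by finding out which processes the kernel killed. The sketch below scans kernel log lines for the "Out of memory: Killed process" message (older kernels print "Kill process"); the function name oom_victims is hypothetical, and the exact message format can vary between kernel versions.

```python
import re

# Captures the PID and command name from the kernel's OOM killer message.
OOM = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_victims(lines):
    """Return (pid, command) pairs for each OOM kill found in the lines."""
    victims = []
    for line in lines:
        match = OOM.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims
```

If the same service shows up repeatedly, restarting it is only a temporary patch; the memory limit, the workload, or the instance size needs attention.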
Test vs. Production
Always confirm that you are on the intended machine before acting, and keep as few terminal windows open as possible during critical operations so that a command cannot land on the wrong server.
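One simple guard against running a command on the wrong machine is to make the operator type the hostname of the intended target and compare it with the actual one. The sketch below is an illustration only; confirm_target is a hypothetical helper, and the actual_name argument exists so the comparison can be tried without depending on the local hostname.

```python
import socket


def confirm_target(typed_name, actual_name=None):
    """Return True only when the typed hostname matches this machine."""
    if actual_name is None:
        actual_name = socket.gethostname()
    return typed_name.strip() == actual_name
```

A destructive script can prompt for the hostname and exit unless confirm_target returns True, which forces a moment of attention right before the risky step.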
ITPUB
