Essential Ops Lessons: Avoid Disasters with Backups, Monitoring, and Secure Practices
This guide collects hard-earned lessons from real-world server administration: test changes before applying them, confirm commands before running them, limit how many people work on a server at once, and back up configurations before editing. It also covers protecting data, hardening SSH and the firewall, monitoring systems and services, and tuning performance in a disciplined way so that services stay stable and reliable.
1. Test Before Use
When learning Linux, the author moved from virtual machines to a real server and, after receiving the root password, eagerly connected with XShell. An untested change to the SSH configuration then locked the author out, and access was recovered only by restoring sshd_config from a backup.
A second incident involved rsync: the source and destination were reversed, causing rapid deletion of production data with no backup, illustrating the severe impact of a simple mistake.
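One way to catch a reversed source and destination is to preview the transfer before running it. The following is a minimal sketch, not the author's procedure: it assumes rsync was run with --delete (the source does not say which options were used), and the function names are hypothetical.

```python
import subprocess


def build_rsync_command(source: str, destination: str, dry_run: bool = True) -> list[str]:
    """Build an rsync argument list; dry_run adds --dry-run so nothing is changed."""
    # --delete is an assumption here: it removes destination files missing from source.
    command = ["rsync", "--archive", "--delete", "--itemize-changes"]
    if dry_run:
        command.append("--dry-run")
    command += [source, destination]
    return command


def preview_then_sync(source, destination, confirm=input, run=subprocess.run):
    """Show what rsync would do, then apply it only after an explicit 'yes'."""
    # First pass: list the files that would be transferred or deleted.
    run(build_rsync_command(source, destination, dry_run=True), check=True)
    answer = confirm(f"Apply these changes from {source} to {destination}? [yes/no] ")
    if answer.strip().lower() != "yes":
        return False
    # Second pass: the real transfer.
    run(build_rsync_command(source, destination, dry_run=False), check=True)
    return True
```

Reading the dry-run output is the point of the exercise: a long list of deletions on the production side is the signal that the arguments are the wrong way round.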
2. Confirm Before Execution
Commands like rm -rf /var are easy to mistype, especially under pressure or slow network conditions. Experiencing such a mistake once makes the risk clear; it can happen to anyone.
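A confirmation step can be made harder to skip by requiring the operator to retype the exact path. This is a sketch under stated assumptions: the protected path list and the function name are illustrative, not taken from the source.

```python
import os
import shutil

# Illustrative list of paths that should never be removed wholesale.
PROTECTED_PATHS = {"/", "/etc", "/var", "/usr", "/home", "/boot"}


def confirm_and_remove(path, ask=input, remove=shutil.rmtree):
    """Remove a directory tree only after the operator retypes its resolved path."""
    target = os.path.realpath(path)  # resolve symlinks and relative components
    if target in PROTECTED_PATHS:
        raise ValueError(f"refusing to remove protected path: {target}")
    typed = ask(f"Type the full path to delete '{target}': ")
    if typed.strip() != target:
        return False  # mismatch: nothing is deleted
    remove(target)
    return True
```

Retyping the path forces a second look at what is about to be deleted, which is exactly what a slow or laggy terminal tends to take away.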
3. Avoid Multiple Operators
In a chaotic environment where the root password has passed through several operators, some of whom have since left, multiple people often end up debugging the same server at the same time. Their conflicting changes make it hard to pinpoint the true cause of an issue.
4. Backup Before Changes
Always back up configuration files (e.g., .conf files) before editing them. When changing an option, copy the original line, comment out the original, and modify the copy so the old value stays visible. Had a database backup existed, the rsync mistake would have been far less damaging.
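The backup step is easy to automate so that it is never skipped. A minimal sketch, assuming a timestamped copy next to the original is acceptable; the naming scheme and the function name are illustrative choices.

```python
import shutil
import time
from pathlib import Path


def backup_config(path: str) -> Path:
    """Copy a config file to a timestamped sibling and return the backup path."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves mtime and permission bits
    return backup
```

Calling backup_config on a file before opening it in an editor leaves a known-good copy to restore from if the change goes wrong.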
Data Handling
1. Use rm -rf Carefully
There are many accounts online of catastrophic deletions caused by rm -rf. If a deletion is truly required, proceed with extreme caution.
2. Backup Is Paramount
Frequent backups are essential. The author’s former company, handling third‑party payments and loan platforms, backed up payment data every two hours and loan data every twenty minutes.
3. Stability Over Speed
Prioritize stability over raw speed. Avoid putting software you have not tested thoroughly into production. The author's example is Nginx + PHP-FPM, where PHP crashed in production and the problem was resolved by switching to Apache.
4. Confidentiality Is Critical
Data leaks and router backdoors underscore the necessity of keeping sensitive data confidential.
Security Practices
1. SSH Hardening
Change the default port.
Disable root login.
Use a normal user with key authentication, sudo rules, IP restrictions, and user limits.
Deploy DenyHosts or a similar tool to block brute-force attempts.
Review the accounts in /etc/passwd and restrict which of them are allowed to log in.
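Several of the items above can be checked mechanically. The sketch below audits the text of an sshd_config file for a few of them. It is a simplified parser written for illustration: it skips Match blocks, keeps only the first value of each keyword, and the warning wording is my own.

```python
def audit_sshd_config(text: str) -> list[str]:
    """Return warnings for a few hardening items found in sshd_config text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd_config keywords are case-insensitive
        if keyword == "match":
            break  # simplification: conditional Match blocks are not evaluated
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(keyword, value)  # keep the first value seen

    warnings = []
    if settings.get("port", "22") == "22":
        warnings.append("SSH listens on the default port 22")
    if settings.get("permitrootlogin", "").lower() == "yes":
        warnings.append("PermitRootLogin is set to yes")
    if settings.get("passwordauthentication", "yes").lower() != "no":
        warnings.append("password authentication is not disabled")
    if "allowusers" not in settings and "allowgroups" not in settings:
        warnings.append("no AllowUsers or AllowGroups restriction")
    return warnings
```

Before reloading the real service, the sshd binary's own test mode (sshd -t) checks the file for syntax errors, and keeping an existing session open while testing a new login avoids the lockout described earlier.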
2. Firewall
Enable the firewall in production and follow the principle of least privilege: drop all traffic by default, then allow only required service ports.
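The default-drop approach can be sketched as a generated list of iptables commands. This assumes iptables is the firewall in use (the source does not name one) and covers only inbound TCP; the function name is hypothetical. The allow rules are emitted before the DROP policy so that an SSH session is not cut off halfway through.

```python
def build_firewall_commands(allowed_tcp_ports):
    """Return iptables commands: allow rules first, default DROP policy last."""
    commands = [
        # Always allow loopback traffic.
        ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
        # Allow replies to connections the server itself opened or accepted.
        ["iptables", "-A", "INPUT", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        commands.append(["iptables", "-A", "INPUT", "-p", "tcp",
                         "--dport", str(port), "-j", "ACCEPT"])
    # Set the default policy last: anything not explicitly allowed is dropped.
    commands.append(["iptables", "-P", "INPUT", "DROP"])
    return commands
```

Printing the result of build_firewall_commands([22, 443]) for review before running anything keeps the confirmation habit from the first section.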
3. Fine‑grained Permissions
Run services as non‑root whenever possible and restrict permissions to the minimum necessary.
4. Intrusion Detection & Log Monitoring
Use third-party tools to monitor critical files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf) for changes, and centralize log monitoring for /var/log/secure, system messages, FTP activity, and port scans. A detected scan can then automatically trigger a host deny rule. Good logging also makes post-incident analysis much easier.
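The idea behind file monitoring is simple enough to sketch: record a checksum of each critical file, then compare later. This is an illustration of the principle, not a replacement for a dedicated tool; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 of its contents (None if the file is missing)."""
    result = {}
    for path in paths:
        p = Path(path)
        digest = hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        result[str(p)] = digest
    return result


def changed_files(baseline, current):
    """Return the paths whose checksum differs from the baseline."""
    return sorted(path for path in baseline if baseline[path] != current.get(path))
```

A baseline taken right after a known-good deployment and stored off the server gives later comparisons something trustworthy to compare against.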
Daily Monitoring
1. System Monitoring
Monitor hardware usage (memory, disk, CPU, NIC) as well as OS login activity and changes to critical files. Regular monitoring helps anticipate hardware failures and provides the data needed for performance tuning.
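As a small example of a hardware check, disk usage can be read from the standard library and compared against a threshold. The 85% threshold and the function name are assumptions chosen for illustration.

```python
import shutil


def disk_alert(path: str = "/", threshold_percent: float = 85.0, usage=None):
    """Return a warning string if the filesystem at path is too full, else None."""
    # usage can be injected for testing; by default it is read from the OS.
    usage = usage or shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    if percent >= threshold_percent:
        return f"{path}: {percent:.1f}% used (threshold {threshold_percent:.0f}%)"
    return None
```

Run on a schedule and sent to whatever alerting channel is in place, a check like this gives warning before a full disk takes a service down.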
2. Service Monitoring
Track metrics for web servers, databases, load balancers, etc., to quickly identify performance bottlenecks.
3. Log Monitoring
This is similar to security log monitoring, but it focuses on error messages from hardware, the OS, and applications. These logs are essential when issues arise.
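A first pass over error logs can be as simple as counting lines that match a few patterns. The patterns and labels below are illustrative assumptions; real log formats vary by distribution and application.

```python
import re
from collections import Counter

# Illustrative patterns; adjust to the messages your systems actually emit.
PATTERNS = {
    "error": re.compile(r"\berror\b", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
}


def summarize_log(lines, patterns=PATTERNS):
    """Count how many lines match each labelled pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in patterns.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

A sudden jump in one of these counts between two runs is often the earliest visible sign of a problem.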
Performance Tuning
1. Understand Runtime Mechanisms
Understand in depth how software such as Nginx and Apache operates. Without this knowledge, tuning is merely guesswork.
2. Tuning Framework & Order
Analyze bottlenecks, review logs, define a tuning direction, then act. Prioritize hardware and OS adjustments before tweaking database configurations.
3. Change One Parameter at a Time
Modify a single setting per test to avoid confusion.
4. Benchmark Testing
Validate the effectiveness of tuning and the stability of new software versions with thorough benchmark tests. Refer to “High Performance MySQL” for methodology.
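The two rules above, one change per test and benchmark every change, fit together in a small before-and-after comparison. This is a generic timing sketch, not the methodology from “High Performance MySQL”; the function names are hypothetical, and the callables stand in for whatever workload is being measured.

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Return the median wall-clock seconds over several runs of func()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline, candidate, repeats=5):
    """Benchmark the workload before and after a single change."""
    before = benchmark(baseline, repeats)
    after = benchmark(candidate, repeats)
    speedup = before / after if after else float("inf")
    return {"before_s": before, "after_s": after, "speedup": speedup}
```

Using the median rather than a single run reduces the influence of one-off spikes, and changing only one parameter between baseline and candidate keeps the result attributable.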
Ops Mindset
1. Control Emotions
Avoid frantic commands near shift end; stay calm to prevent costly mistakes.
2. Responsibility for Data
Production data is not a toy; lack of backup leads to severe consequences.
3. Root‑Cause Investigation
Do not ignore recurring issues. Investigate underlying causes such as MyISAM bugs, MySQL bugs, OOM kills, or insufficient memory. In one case, upgrading physical memory resolved an OOM‑induced MySQL crash.
4. Test Before Production
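Test changes outside production first. Before a critical operation, confirm that you are on the intended machine, and keep as few terminal windows open as possible so that a command does not land on the wrong server.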
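Checking the target machine can be built into maintenance scripts themselves. A minimal sketch, assuming hostnames are distinct enough to tell servers apart; the hostnames in the usage note are hypothetical.

```python
import socket


def require_host(expected: str, actual: str = None) -> str:
    """Raise unless the script is running on the expected host."""
    # actual can be injected for testing; by default it is this machine's hostname.
    actual = actual or socket.gethostname()
    if actual != expected:
        raise RuntimeError(f"expected host {expected!r}, but running on {actual!r}")
    return actual
```

Placing require_host("db-staging") at the top of a destructive script makes it stop immediately if it is launched from the wrong terminal window.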
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
