Essential Ops Lessons from 3.5 Years of Real-World Crises
Drawing from three and a half years of operations work, this article shares hard‑earned best practices on testing, backups, security, monitoring, performance tuning, and the right mindset to avoid costly mistakes such as data loss, service outages, and security breaches.
1. Production Operation Standards
1. Test Before Use
When I first learned Linux on virtual machines, I became eager to try changes on a real server. I switched from PuTTY to Xshell and attempted to enable key-based login without testing it first, which locked me out after I restarted sshd. A backup of sshd_config saved the day.
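One way to avoid this kind of lockout is to validate a candidate config before it replaces the live file. The sketch below is a minimal illustration, not the author's procedure: `apply_config` is a hypothetical helper name, and the validation command is supplied by the caller. For OpenSSH, `sshd -t -f <file>` checks a config file without restarting the daemon.

```shell
# apply_config NEW TARGET VALIDATOR [ARGS...]
# Hypothetical helper, not part of any tool. It runs "VALIDATOR ARGS... NEW"
# and, only if that succeeds, backs up TARGET to TARGET.bak and installs NEW.
apply_config() {
    new=$1
    target=$2
    shift 2
    # Refuse to touch the live file unless the candidate passes validation.
    if ! "$@" "$new"; then
        echo "validation failed; $target left untouched" >&2
        return 1
    fi
    # Keep the previous version next to the live file for a quick rollback.
    if [ -e "$target" ]; then
        cp -p "$target" "$target.bak" || return 1
    fi
    cp "$new" "$target"
}
```

A call might look like `apply_config ./sshd_config.new /etc/ssh/sshd_config sshd -t -f`. Even after a clean validation, it is safer to keep the current session open and confirm that a second login works before closing it.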
2. Confirm Before Pressing Enter
An accidental rm -rf /var or a similar command is easy to run when you are in a hurry or typing over a slow connection. One mistake can cause irreversible data loss, so double-check every command before pressing Enter.
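The pause before pressing Enter can be made explicit for risky commands. The following sketch assumes a hypothetical wrapper named `confirm_run`; it prints the host, the working directory, and the exact command, and runs it only after the operator types "yes".

```shell
# confirm_run COMMAND [ARGS...]
# Hypothetical wrapper: shows where and what is about to run, then executes
# the command only if the answer read from stdin is exactly "yes".
confirm_run() {
    printf 'host: %s\ncwd:  %s\ncmd: ' "$(uname -n)" "$(pwd)"
    printf ' %s' "$@"
    printf '\nType "yes" to run: '
    read -r answer || answer=""
    if [ "$answer" = "yes" ]; then
        "$@"
    else
        echo "aborted" >&2
        return 1
    fi
}
```

For example, `confirm_run rm -rf /data/tmp/cache` (a hypothetical path) prints the host name first, which helps when several terminals are open on different machines.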
3. Avoid Multiple People Operating Simultaneously
In a chaotic environment where several admins share the root password, concurrent changes lead to conflicting configurations and wasted troubleshooting time. Coordinate changes and limit simultaneous access.
4. Backup Before Any Change
Always back up configuration files (e.g., .conf files) and databases before modifying them. When editing, comment out the original option instead of deleting it, and keep regular backups to prevent catastrophic loss.
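A timestamped copy taken right before an edit is the cheapest form of this habit. The sketch below assumes a hypothetical helper `backup_file` and an illustrative naming scheme (`FILE.YYYYmmdd-HHMMSS.bak`); it covers plain files only, not databases, which need a proper dump.

```shell
# backup_file FILE
# Copies FILE to FILE.YYYYmmdd-HHMMSS.bak, preserving mode and timestamps,
# and prints the path of the backup. The naming scheme is an assumption.
backup_file() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Running `backup_file /etc/my.cnf` before opening the file in an editor leaves a dated copy to diff against or restore from.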
2. Data‑Related Practices
1. Use rm -rf With Extreme Caution
Even a small typo in an rm -rf command can delete critical data. Verify the target path thoroughly before running destructive commands.
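Part of that verification can be automated. The sketch below assumes a hypothetical guard named `safe_rm_rf` that accepts only existing directories given as absolute paths, and refuses anything that is, or resolves to, `/` or a top-level directory such as `/var`. It is not exhaustive and does not replace reading the command before running it.

```shell
# safe_rm_rf DIR
# Hypothetical guard around rm -rf. Rejects empty and relative paths,
# non-directories, and anything that is (or resolves to) / or /name.
safe_rm_rf() {
    target=$1
    # Strip trailing slashes so "/var/" is treated like "/var".
    while [ "$target" != "/" ] && [ "${target%/}" != "$target" ]; do
        target=${target%/}
    done
    # Resolve symlinks and ".." so the depth check sees the real location.
    resolved=$(cd "$target" 2>/dev/null && pwd -P) || resolved=""
    for p in "$target" "$resolved"; do
        case $p in
            /*/*) ;;  # absolute and at least two levels deep
            *) echo "refusing to delete: '$1'" >&2; return 1 ;;
        esac
    done
    rm -rf -- "$resolved"
}
```

The guard catches the classic failure where an unset variable turns `rm -rf "$DIR/"` into `rm -rf /`, because the raw argument fails the depth check before anything is resolved or deleted.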
2. Backup Is Paramount
On a third-party payment platform we performed full backups every two hours; on a loan platform we backed up every 20 minutes. The shorter the interval, the less data a failure can cost.
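Intervals like these are usually driven by a scheduler plus a script that also prunes old copies. The sketch below is an illustration under stated assumptions: `snapshot_dir`, the archive naming, and the wrapper path in the crontab comments are all hypothetical, and it archives a directory rather than dumping a live database.

```shell
# snapshot_dir SRC_DIR BACKUP_DIR KEEP
# Writes BACKUP_DIR/backup-YYYYmmdd-HHMMSS.tar.gz from SRC_DIR, keeps only the
# newest KEEP archives, and prints the path of the archive it just created.
snapshot_dir() {
    src=$1
    out=$2
    keep=$3
    mkdir -p "$out" || return 1
    archive="$out/backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$src" . || return 1
    # The glob expands in sorted order, and timestamped names sort oldest first.
    set -- "$out"/backup-*.tar.gz
    while [ "$#" -gt "$keep" ]; do
        rm -f -- "$1"
        shift
    done
    printf '%s\n' "$archive"
}

# Example crontab entries for the two intervals mentioned above
# (/usr/local/bin/snapshot.sh is a hypothetical wrapper script):
#   0 */2 * * *    /usr/local/bin/snapshot.sh    # every two hours
#   */20 * * * *   /usr/local/bin/snapshot.sh    # every 20 minutes
```

A backup is only proven once a restore from it has been tested, so it is worth extracting an archive on a test machine from time to time.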
3. Prioritize Stability Over Speed
Never deploy untested software (e.g., new Nginx + PHP‑FPM versions) in production. Choose the most stable stack rather than the fastest.
4. Confidentiality Is Critical
Data leaks and back-doored routers are common; always enforce strict confidentiality measures for any sensitive data.
3. Security Measures
1. Harden SSH
Change the default port, disable root login, and log in as a normal user with key authentication. Add sudo rules, IP restrictions, and host-deny tools to block brute-force attempts.
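A few of these settings can be checked mechanically. The sketch below assumes two hypothetical helpers, `sshd_option` and `audit_sshd_config`, and inspects only three keywords (`Port`, `PermitRootLogin`, `PasswordAuthentication`) in a single file. It ignores `Match` blocks and included files, so treat it as a quick lint rather than a full audit.

```shell
# sshd_option FILE KEYWORD
# Prints the value of the first uncommented KEYWORD line (case-insensitive).
sshd_option() {
    awk -v k="$2" 'tolower($1) == tolower(k) { print $2; exit }' "$1"
}

# audit_sshd_config FILE
# Prints one line per finding and returns non-zero if any check fails.
audit_sshd_config() {
    cfg=$1
    rc=0
    [ "$(sshd_option "$cfg" PermitRootLogin)" = "no" ] ||
        { echo "PermitRootLogin is not 'no'"; rc=1; }
    [ "$(sshd_option "$cfg" PasswordAuthentication)" = "no" ] ||
        { echo "PasswordAuthentication is not 'no'"; rc=1; }
    port=$(sshd_option "$cfg" Port)
    { [ -n "$port" ] && [ "$port" != "22" ]; } ||
        { echo "Port is unset or still 22"; rc=1; }
    return "$rc"
}
```

Running `audit_sshd_config /etc/ssh/sshd_config` after each change gives a fast sanity check; `sshd -T` prints the effective configuration when a more complete view is needed.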
2. Enable a Minimal‑Rule Firewall
Apply a default‑deny policy and open only the ports required for services.
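As a rough illustration of a default-deny inbound policy, the sketch below prints iptables commands instead of applying them, so the rule set can be reviewed first. The helper name `print_firewall_rules` is hypothetical, the rules cover IPv4 inbound TCP only, and the port list is whatever the caller passes in.

```shell
# print_firewall_rules PORT...
# Prints iptables commands that allow loopback traffic, established
# connections, and the listed TCP ports, then drop all other inbound traffic.
# Nothing is applied; the output is meant to be reviewed before use.
print_firewall_rules() {
    for port in "$@"; do
        case $port in
            ""|*[!0-9]*) echo "not a port number: $port" >&2; return 1 ;;
        esac
    done
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # Set the DROP policy last so the allow rules are already in place.
    echo "iptables -P INPUT DROP"
}
```

For a web server reached over SSH on a non-default port, `print_firewall_rules 2222 80 443` would produce six commands. Setting the policy last reduces the chance of cutting off the session that applies the rules.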
3. Fine‑Grained Permissions
Run services with the least privileged accounts; avoid running daemons as root.
4. Intrusion Detection and Log Monitoring
Deploy third-party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf) and centralize logs such as /var/log/secure. Detect port scans and automatically block offending IPs.
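Dedicated tools do this far more thoroughly, but the two core ideas fit in a few lines. The sketch below assumes GNU `sha256sum` is available and that sshd writes its usual "Failed password ... from ADDRESS" lines; the function names are hypothetical. It only reports candidates, and leaves the blocking decision to the operator or to a separate tool.

```shell
# make_baseline FILE...   prints SHA-256 checksums to save as a baseline
# check_baseline BASELINE verifies the files and returns non-zero if any changed
make_baseline() {
    sha256sum "$@"
}
check_baseline() {
    sha256sum -c "$1"
}

# failed_login_sources THRESHOLD
# Reads sshd log lines on stdin and prints "COUNT ADDRESS" for every source
# with at least THRESHOLD "Failed password" entries.
failed_login_sources() {
    awk -v min="$1" '
        /Failed password/ {
            for (i = 1; i < NF; i++)
                if ($i == "from") { count[$(i + 1)]++; break }
        }
        END { for (ip in count) if (count[ip] >= min) print count[ip], ip }
    '
}
```

Typical use would be `make_baseline /etc/passwd /etc/my.cnf > baseline.sha256` once, `check_baseline baseline.sha256` on a schedule, and `failed_login_sources 5 < /var/log/secure` to list addresses worth blocking. The baseline file should live somewhere an intruder cannot rewrite it.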
4. Daily Monitoring
1. System Health Monitoring
Track hardware utilization (CPU, memory, disk, network) to anticipate hardware failures and guide tuning, and watch login activity and changes to key files.
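As a small example of one such check, the sketch below flags filesystems above a usage threshold. The helper `disk_alerts` is hypothetical; it parses the POSIX output format of `df -P` from stdin, and it assumes mount points contain no spaces.

```shell
# disk_alerts THRESHOLD
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem at or above THRESHOLD percent.
# Returns 1 if at least one filesystem was reported, 0 otherwise.
disk_alerts() {
    awk -v max="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= max) { print $6, $5; found = 1 }
        }
        END { exit (found ? 1 : 0) }
    '
}
```

A cron job could run `df -P | disk_alerts 90` and send a notification whenever the exit status is non-zero. A full monitoring system adds history and trends, which is what makes prediction possible.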
2. Service Monitoring
Monitor web, database, and load‑balancer metrics to quickly identify performance bottlenecks.
3. Log Monitoring
Collect and analyze OS, application, and hardware logs; proactive log monitoring prevents silent failures.
5. Performance Tuning
1. Understand Runtime Mechanisms
Know why Nginx outperforms Apache, study source code, and be able to explain the underlying principles.
2. Tuning Framework and Order
Identify bottlenecks via logs, then tune. Prioritize hardware and OS before adjusting database settings.
3. Change One Parameter at a Time
Isolate the impact of each tweak to avoid confusion.
4. Benchmark Testing
Use benchmark tests to validate tuning effects; refer to resources like "High Performance MySQL" for methodology.
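The simplest form of such a test is to run the same workload before and after a single parameter change and compare the timings. The sketch below assumes a hypothetical helper `time_runs`; it measures whole seconds only, so it suits longer workloads, and a dedicated benchmarking tool will give far better data.

```shell
# time_runs N COMMAND [ARGS...]
# Runs COMMAND N times, discarding its output, and prints the total
# wall-clock time in whole seconds. Stops at the first failing run.
time_runs() {
    n=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || { echo "run $i failed" >&2; return 1; }
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "$((end - start))"
}
```

Recording the result of something like `time_runs 20 ./replay_queries.sh` (a hypothetical workload script) before the change, then again after it, keeps the comparison tied to one parameter at a time.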
6. Ops Mindset
1. Control Your Emotions
Under pressure, avoid rash actions on critical data. If a deletion occurs, keep the database running, clone the disk with dd, and consider professional data recovery.
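The point of cloning is that every recovery attempt then runs against a copy rather than the original disk. The sketch below is a minimal wrapper under stated assumptions: `clone_image` is a hypothetical name, the device and image paths in the usage note are placeholders, and the image must be written to a different disk so the deleted blocks are not overwritten.

```shell
# clone_image SOURCE DEST
# Copies SOURCE (a block device or a file) to the image file DEST with dd.
# Refuses to overwrite an existing DEST so an earlier clone is never lost.
clone_image() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    # 1 MiB blocks; dd's transfer statistics on stderr are suppressed.
    dd if="$src" of="$dest" bs=1048576 2>/dev/null
}
```

A call might look like `clone_image /dev/sdX /mnt/spare/sdX.img`, where both paths are placeholders. Reading a raw device normally requires root, and the target filesystem needs at least as much free space as the source disk's size.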
2. Take Responsibility for Data
Production data is not a toy; lack of backups leads to severe consequences.
3. Pursue Root Cause Analysis
When recurring issues appear (e.g., session table corruption), investigate underlying causes such as MyISAM bugs, OOM kills, or insufficient memory.
4. Separate Test and Production Environments
Always verify operations on test machines and avoid opening multiple terminal windows on production.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.