Essential Ops Checklist: Avoid Disasters with Proven Practices
A seasoned operations engineer shares a comprehensive guide covering online operation standards, data handling, security hardening, daily monitoring, performance tuning, and the right mindset to prevent costly incidents and ensure stable, secure, and efficient production environments.
1. Online Operation Standards
1.1 Test Before Use
When first given server access, the author switched from PuTTY to Xshell and changed the SSH configuration without testing the change first. The result was a lockout that lasted until a backup of sshd_config was restored.
1.2 Confirm Before Enter
A careless rm -rf can delete critical data. In one incident, a single typo in an rsync command caused irreversible loss of production files because no backup existed. Re-read the full command before pressing Enter.
1.3 Avoid Multiple Operators
Sharing root passwords among many operators leads to configuration drift and confusion; the author describes a chaotic scenario where several team members edited the same server simultaneously, making it impossible to pinpoint the true cause of issues.
1.4 Backup Before Changes
Always back up configuration files (e.g., .conf) before modifying them: copy the file first, and when editing, comment out the original options instead of deleting them, so that a quick revert is possible.
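The backup step can be sketched in Python as follows. This is a minimal illustration, not the author's tooling: the function name backup_file and the timestamped ".bak" naming scheme are assumptions made for this example.

```python
import shutil
import time
from pathlib import Path


def backup_file(path: str) -> Path:
    """Copy `path` to a timestamped sibling file and return the backup's path.

    Hypothetical helper: the name and the ".<timestamp>.bak" suffix are
    illustrative choices, not a convention taken from the article.
    """
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"nothing to back up: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 copies the contents and also tries to preserve file metadata
    # such as the modification time and permission bits.
    shutil.copy2(source, backup)
    return backup
```

Calling backup_file on a configuration file before opening it in an editor leaves a dated copy next to the original, so reverting is a single copy back.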
2. Data Handling
2.1 Use rm -rf Cautiously
Blindly executing destructive commands leads to severe data loss; double‑check paths and necessity before running such commands.
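One way to make the "double-check the path" habit mechanical is a guard that refuses to return a delete target unless it sits strictly inside a directory the operator named in advance. The sketch below only validates a path and deletes nothing; check_delete_target, PROTECTED, and the allowed_root parameter are hypothetical names introduced for this example.

```python
from pathlib import Path

# Illustrative list of directories that should never be deleted wholesale.
PROTECTED = {Path(p) for p in ("/", "/etc", "/var", "/usr", "/home", "/boot")}


def check_delete_target(target: str, allowed_root: str) -> Path:
    """Return the resolved target if it is safe to delete, else raise ValueError.

    "Safe" here means: not a protected system directory, and strictly
    inside `allowed_root` after symlinks and ".." have been resolved.
    """
    resolved = Path(target).resolve()
    root = Path(allowed_root).resolve()
    if resolved in PROTECTED:
        raise ValueError(f"refusing to delete protected path: {resolved}")
    if resolved == root or root not in resolved.parents:
        raise ValueError(f"{resolved} is not strictly inside {root}")
    return resolved
```

Resolving the path first matters: a target such as "data/../.." looks harmless as text but can point well outside the intended directory.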
2.2 Backup Is Paramount
Regular backups are essential. The author cites examples where a third‑party payment platform backs up every two hours, while a loan platform backs up every 20 minutes, emphasizing that frequent backups dramatically reduce risk.
2.3 Stability Over Speed
Prioritize stability and availability over raw performance; avoid deploying untested software versions (e.g., new Nginx + PHP‑FPM stacks) in production without thorough validation.
2.4 Confidentiality Matters
Data must be kept confidential; the article notes the prevalence of leaked private images and router backdoors, underscoring the need for strict access controls.
3. Security
3.1 SSH Hardening
Change the default SSH port, disable direct root login, enforce key-based authentication, grant privileges through sudo rules, restrict logins by source IP, and use a tool such as DenyHosts to block addresses after repeated failed attempts.
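A few of these settings can be checked automatically. The sketch below audits the text of an sshd_config file for three of the items above. It rests on assumptions: the defaults it falls back to (Port 22, PermitRootLogin prohibit-password, PasswordAuthentication yes) are those of recent OpenSSH releases and may differ on a given system, only the whitespace-separated "Keyword value" form is parsed, and Match blocks and Include directives are ignored. The function names are hypothetical.

```python
def parse_sshd_config(text: str) -> dict:
    """Parse 'Keyword value' lines; keywords are case-insensitive.

    sshd uses the first value it finds for a keyword, so later
    duplicates are ignored here as well.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def audit_sshd_config(text: str) -> list:
    """Return a list of findings; an empty list means all three checks passed."""
    settings = parse_sshd_config(text)
    findings = []
    # Fallback values are assumed OpenSSH defaults, used when a keyword is absent.
    if settings.get("port", "22") == "22":
        findings.append("Port is still the default 22")
    if settings.get("permitrootlogin", "prohibit-password").lower() != "no":
        findings.append("PermitRootLogin is not 'no'")
    if settings.get("passwordauthentication", "yes").lower() != "no":
        findings.append("PasswordAuthentication is not 'no'")
    return findings
```

Running the audit against a candidate file before restarting sshd, while keeping an existing session open, is one way to avoid the lockout described in section 1.1.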
3.2 Firewall Configuration
Enable the firewall in production and follow the principle of least privilege: drop all traffic by default and explicitly allow only required service ports.
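As an illustration of default-deny, the sketch below builds (but does not run) a list of iptables commands. The article does not name a firewall tool, so iptables is an assumption here, as is the function name build_default_deny_rules. The ACCEPT rules are emitted before the DROP policy so that applying the commands in order over SSH does not cut off the session; IPv6 (ip6tables) is not covered.

```python
def build_default_deny_rules(allowed_tcp_ports) -> list:
    """Return iptables commands: allow the listed TCP ports, drop everything else."""
    rules = [
        # Keep loopback traffic and replies to existing connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # The default policy goes last: anything not explicitly accepted is dropped.
    rules.append("iptables -P INPUT DROP")
    return rules
```

Printing the result of build_default_deny_rules([22, 443]) gives a reviewable script, which fits the "confirm before Enter" habit better than typing firewall rules live.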
3.3 Fine‑Grained Permissions
Run services with the lowest possible privileges; avoid running daemons as root and limit each service’s access to only what it needs.
3.4 Intrusion Detection and Log Monitoring
Deploy third-party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf) for unauthorized changes, and use centralized log aggregation to monitor /var/log/secure, /var/log/messages, and FTP activity. Block scanning IPs via /etc/hosts.deny and retain logs for post-incident analysis.
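The core idea behind file-integrity watching is comparing cryptographic hashes against a known-good baseline. The sketch below shows that idea with SHA-256. It is a simplified stand-in for the third-party tools mentioned above, not a replacement for them, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths) -> dict:
    """Map each path to its SHA-256 hex digest, or None if the file is missing."""
    return {str(p): (sha256_of(p) if Path(p).is_file() else None) for p in paths}


def changed_files(baseline: dict, current: dict) -> list:
    """Return the paths whose hash differs from the baseline (including deletions)."""
    return sorted(p for p in baseline if baseline[p] != current.get(p))
```

A baseline taken with snapshot right after a reviewed change, stored somewhere an intruder cannot rewrite it, can then be compared with a fresh snapshot on a schedule.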
4. Daily Monitoring
4.1 System Health Monitoring
Track hardware utilization such as CPU, memory, disk, and network interfaces, as well as OS login events and critical file integrity, to predict hardware failures and guide performance tuning.
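A minimal sketch of one such check, disk utilization, using only the Python standard library. The warning and critical thresholds are parameters the operator must choose; the article does not prescribe values, and the function names are hypothetical.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_threshold(name: str, value: float, warn: float, crit: float) -> str:
    """Classify a metric as OK, WARNING, or CRITICAL against two thresholds."""
    if value >= crit:
        level = "CRITICAL"
    elif value >= warn:
        level = "WARNING"
    else:
        level = "OK"
    return f"{level} {name}={value:.1f}%"
```

For example, check_threshold("disk", disk_usage_percent("/"), warn=80, crit=90) yields a one-line status that a scheduler or monitoring agent could collect; recording the value over time is what makes trend-based prediction possible.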
4.2 Service Health Monitoring
Monitor key metrics of web servers, databases, load balancers, etc., so performance bottlenecks can be detected and addressed promptly.
4.3 Log Monitoring
Collect and analyze OS and application error logs; while logs may seem unnecessary during stable operation, they become vital when issues arise.
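As one illustration of log analysis, the sketch below counts failed SSH password attempts per source address, the kind of pattern that shows up in the authentication log discussed in section 3.4. It assumes the usual OpenSSH message wording ("Failed password for ... from ... port ..."); other daemons and versions may log differently, and the threshold of 5 is an arbitrary example value.

```python
import re
from collections import Counter

# Assumed OpenSSH wording, e.g.
# "Failed password for invalid user admin from 203.0.113.5 port 4242 ssh2"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def failed_logins_by_ip(lines) -> Counter:
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_ips(lines, threshold: int = 5) -> list:
    """Return addresses with at least `threshold` failed attempts, sorted."""
    counts = failed_logins_by_ip(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The function accepts any iterable of lines, so it can be fed an open log file directly and the file is read one line at a time.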
5. Performance Tuning
5.1 Understand Runtime Mechanisms
Deep knowledge of software internals (e.g., why Nginx can process requests faster than Apache under some workloads) is required before adjusting parameters; reading source code may be necessary.
5.2 Tuning Framework and Order
Identify the bottleneck first by analyzing logs, then define a tuning direction. Work through the layers in order: address OS and hardware issues first, and leave database configuration as the last step.
5.3 Change One Parameter at a Time
Modify a single setting per iteration to avoid confusion and isolate the impact of each change.
5.4 Benchmark Testing
Perform baseline benchmarks to verify whether tuning improves performance or stability; reference materials such as "High Performance MySQL" can guide testing methodology.
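The before-and-after comparison can be sketched as below: time the same workload several times, take the median to dampen outliers, and report the relative change. The workload callable, the repeat count, and the function names are assumptions for illustration; a real benchmark would exercise the actual service under realistic load.

```python
import statistics
import time


def benchmark(workload, repeats: int = 5) -> float:
    """Run `workload` several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def relative_change(baseline: float, tuned: float) -> float:
    """Percentage change from baseline; negative means the tuned run was faster."""
    return (tuned - baseline) / baseline * 100.0
```

Recording the baseline before touching anything, then changing a single parameter and measuring again, ties each result to exactly one change, as section 5.3 recommends.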
6. Ops Mindset
6.1 Control Emotions
Stay calm during critical incidents; avoid making hasty changes when under pressure, especially when dealing with destructive commands.
6.2 Responsibility for Data
Production data is not a playground; always ensure backups exist to mitigate severe consequences.
6.3 Root‑Cause Analysis
Investigate recurring failures thoroughly; the author describes MySQL crashes caused by OOM kills due to insufficient memory and lack of swap, which were resolved by adding physical RAM.
6.4 Test in Non‑Production
Validate all critical operations on test machines before applying them to production, and minimize the number of open terminal windows to reduce accidental mistakes.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
