Essential Ops Playbook: 6 Key Practices to Prevent Disasters
Drawing from a year‑and‑a‑half of ops experience, this guide outlines six practical categories—online operation standards, data handling, security, daily monitoring, performance tuning, and mindset—to help engineers avoid costly mistakes and maintain stable, secure systems.
1. Online Operation Standards
1.1 Test Before Use
While learning Linux on virtual machines, I got into the habit of trying changes directly on the machine in front of me, and I carried that habit over to production servers. It once locked me out after I restarted sshd without a backup of sshd_config. A similar mistake with rsync deleted production data because the source and destination were reversed. Try every change in a test environment first.
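One way to catch a reversed source and destination before it costs data is to preview the transfer first. The sketch below only builds an rsync argument list, with --dry-run enabled by default so that rsync reports what it would copy or delete without doing it. The function names and the example paths are illustrative assumptions, not part of the original incident.

```python
import shlex


def rsync_command(source: str, destination: str, dry_run: bool = True) -> list[str]:
    """Build an rsync argument list; nothing is executed here."""
    # -a keeps permissions and timestamps, --delete removes extra files on
    # the destination, --itemize-changes lists every planned change.
    cmd = ["rsync", "-a", "--delete", "--itemize-changes"]
    if dry_run:
        # With --dry-run rsync only reports what it would do.
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


def preview(source: str, destination: str) -> str:
    """Return the dry-run command as a shell-quoted string for review."""
    return shlex.join(rsync_command(source, destination))


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the output.
    print(preview("/srv/staging/site/", "/srv/www/site/"))
```

Reading the dry-run output before repeating the command with dry_run=False makes a swapped pair of paths visible as a long list of unexpected deletions.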
1.2 Confirm Before Enter
Commands like rm -rf /var can wipe critical data in an instant, and a single slip is enough. Reread the whole command, especially the path, before pressing Enter.
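The "confirm before Enter" habit can be backed by a small guard in any in-house cleanup script. The following is a minimal sketch, assuming absolute paths and a hand-picked set of protected directories. The set itself is an assumption and should be adapted to the system.

```python
import posixpath

# Assumed list of directories that a cleanup script must never target.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def is_protected(target: str) -> bool:
    """Return True if an absolute path resolves to a protected directory."""
    normalized = posixpath.normpath(target)
    # normpath keeps a leading "//", so collapse it to a single slash.
    if normalized.startswith("//"):
        normalized = "/" + normalized.lstrip("/")
    return normalized in PROTECTED


if __name__ == "__main__":
    for candidate in ["/var", "/var/", "/var/tmp/build", "/var/log/../../etc"]:
        print(candidate, is_protected(candidate))
```

The check is purely textual: it does not follow symbolic links and it ignores relative paths, so it complements a careful read of the command rather than replacing it.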
1.3 Avoid Multiple Operators
When many people edit the same server, conflicting changes become inevitable, leading to confusion and wasted effort.
1.4 Backup Before Change
Always back up configuration files (e.g., .conf) before editing, and comment out original options rather than overwriting them.
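A timestamped copy next to the original is one simple way to follow this rule. The sketch below uses only the standard library. The ".bak" suffix and the timestamp layout are assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_file(path: str) -> Path:
    """Copy a file to '<name>.<timestamp>.bak' in the same directory."""
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    # copy2 also preserves modification time and permission bits.
    shutil.copy2(src, dest)
    return dest
```

Calling backup_file on a configuration file before opening it in an editor leaves a copy that can be restored if the edited version fails.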
2. Data Handling
2.1 Use rm -rf with Extreme Caution
A tiny mistake with a recursive delete can cause massive loss; double‑check any destructive command.
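One hedge against an irreversible delete is to move the target into a holding directory and purge it later. This is a swapped-in technique, not something the original text prescribes. The function name and the holding directory are hypothetical.

```python
import shutil
import time
from pathlib import Path


def quarantine(path, trash_dir) -> Path:
    """Move a file or directory into trash_dir instead of deleting it."""
    src = Path(path)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    # A timestamp suffix keeps repeated moves of the same name apart.
    dest = trash / f"{src.name}.{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.move(str(src), str(dest))
    return dest
```

If the move turns out to be a mistake, the data is still on disk and can be moved back. Real deletion then becomes a separate, deliberate step.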
2.2 Backup Is Paramount
In my previous company, third‑party payment services were backed up every two hours, while a loan platform backed up every 20 minutes. Frequent backups are essential.
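A backup schedule is only useful if someone notices when it stops running. The sketch below checks whether the latest backup is older than the agreed interval. The two intervals come from the examples above; the function name and the timestamps in the usage example are made up.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the newest backup is older than the allowed interval."""
    return now - last_backup > max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)            # illustrative clock value
    last = datetime(2024, 1, 1, 11, 30)          # illustrative last backup
    print(backup_is_stale(last, now, timedelta(minutes=20)))  # loan platform
    print(backup_is_stale(last, now, timedelta(hours=2)))     # payment service
```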
2.3 Stability Over Speed
Prioritize a stable environment over the newest software; untested upgrades (e.g., switching from Apache to Nginx) often introduce more problems.
2.4 Confidentiality Is Critical
With data leaks commonplace, protecting sensitive information must be a top priority.
3. Security
3.1 SSH Hardening
Change the default port, disable root login, use regular users with key authentication, enforce sudo rules, restrict IPs, and employ host‑deny mechanisms to block repeated attacks.
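Some of these settings can be checked mechanically. The sketch below reads the text of an sshd_config file and reports three of the points above using the standard Port, PermitRootLogin, and PasswordAuthentication keywords. It is a simplified reader, assumed to be good enough for a quick audit: it handles only whitespace-separated "Keyword value" lines and stops at the first Match block.

```python
def parse_sshd_config(text: str) -> dict[str, str]:
    """Collect global 'Keyword value' pairs; keywords are case-insensitive."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip()
        if key == "match":
            break  # conditional blocks are not evaluated in this sketch
        # sshd uses the first value it finds for each keyword.
        settings.setdefault(key, value)
    return settings


def audit(text: str) -> list[str]:
    """Return a list of findings; an empty list means all three checks pass."""
    s = parse_sshd_config(text)
    findings = []
    if s.get("port", "22") == "22":
        findings.append("SSH listens on the default port 22")
    if s.get("permitrootlogin") != "no":
        findings.append("PermitRootLogin is not set to no")
    if s.get("passwordauthentication") != "no":
        findings.append("PasswordAuthentication is not set to no")
    return findings
```

A missing keyword is reported as a finding on purpose, so that the result does not depend on the defaults of a particular OpenSSH version.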
3.2 Firewall
Enable firewalls in production and follow the principle of least privilege: drop everything by default and open only required ports.
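The default-deny idea can be sketched as a generator of iptables commands. The code below only builds command strings for review and never runs them. It covers inbound IPv4 TCP only, and the port list is an assumption supplied by the caller. Accept rules are emitted before the policy change so that applying the list in order does not cut off an existing SSH session.

```python
def default_deny_rules(open_tcp_ports) -> list[str]:
    """Return iptables commands: allow listed TCP ports, drop everything else."""
    rules = [
        # Keep loopback traffic and already established connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(open_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Switch the default policy last, after all accept rules are in place.
    rules.append("iptables -P INPUT DROP")
    return rules


if __name__ == "__main__":
    # Hypothetical port list: SSH on a non-default port plus HTTP and HTTPS.
    print("\n".join(default_deny_rules([2222, 80, 443])))
```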
3.3 Fine‑Grained Permissions
Run services with non‑root users whenever possible and limit each service to the minimum necessary privileges.
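A service written in-house can enforce this rule itself by refusing to start as root. The following is a minimal sketch for Unix-like systems. The function name is hypothetical, and the optional argument exists only to make the check testable.

```python
import os
import sys


def running_as_root(euid=None) -> bool:
    """Return True when the effective user ID is 0 (root)."""
    euid = os.geteuid() if euid is None else euid
    return euid == 0


if __name__ == "__main__":
    if running_as_root():
        sys.exit("Refusing to run as root; start this service as a dedicated user.")
    print("Running as an unprivileged user.")
```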
3.4 Intrusion Detection & Log Monitoring
Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf) and centralize logs such as /var/log/secure and /var/log/messages for real‑time alerts.
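The core of file watching is comparing checksums against a known-good baseline. The sketch below shows that idea with SHA-256 from the standard library. It is a simplified stand-in for a dedicated intrusion detection tool, and the function names are illustrative.

```python
import hashlib


def sha256_of(path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths) -> dict[str, str]:
    """Map each path to its current SHA-256 digest."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return baseline paths whose digest differs or is missing now."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

A baseline taken right after a trusted deployment and stored off the server gives later snapshots something reliable to be compared with.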
4. Daily Monitoring
4.1 System Health
Monitor hardware metrics—CPU, memory, disk, network—as well as OS login activity and key file changes to predict failures.
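As a small illustration of one such metric, the sketch below computes disk usage for a mount point and compares it with an alert threshold. The 90 percent threshold is an assumption, and the percentage can differ slightly from what df reports because of blocks reserved for root.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return used space as a percentage of total space for a mount point."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def over_threshold(percent: float, threshold: float = 90.0) -> bool:
    """Return True when usage has reached the alert threshold."""
    return percent >= threshold


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full, alert: {over_threshold(percent)}")
```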
4.2 Service Availability
Track web, database, and load‑balancer metrics so performance bottlenecks are detected early.
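The most basic availability signal is whether a service still accepts TCP connections. The sketch below shows such a probe. The host, port, and timeout values in the usage example are assumptions, and a successful connection says nothing yet about response time or correctness.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical targets: a local web server and a local MySQL instance.
    for name, port in [("web", 80), ("mysql", 3306)]:
        print(name, port_is_open("127.0.0.1", port))
```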
4.3 Log Surveillance
Beyond security logs, monitor application and hardware error logs; without them, troubleshooting becomes reactive.
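A first step toward log surveillance is counting how often each severity appears. The sketch below assumes that log lines contain an uppercase level word such as ERROR or WARNING. That format is an assumption and will not match every application.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to the log format actually in use.
LEVEL = re.compile(r"\b(ERROR|WARN(?:ING)?|CRITICAL|FATAL)\b")


def count_levels(lines) -> Counter:
    """Count lines per severity, treating WARN and WARNING as the same level."""
    counts: Counter = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            level = "WARN" if match.group(1).startswith("WARN") else match.group(1)
            counts[level] += 1
    return counts
```

Feeding the counts into an alerting system turns a sudden rise in errors into a signal that arrives before users complain.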
5. Performance Tuning
5.1 Understand Underlying Mechanics
Before tweaking parameters, understand why a piece of software (e.g., Nginx) performs faster than the alternatives; reading the source code may be necessary.
5.2 Structured Tuning Process
Identify bottlenecks via logs, define a tuning direction, and adjust layers in order: hardware → OS → application → database.
5.3 Change One Parameter at a Time
Isolating each change prevents confusion about which adjustment produced the effect.
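A simple way to enforce this rule is to compare the configuration before and after a tuning step and confirm that exactly one key differs. The sketch below treats a configuration as a plain dictionary. The Nginx directive names in the usage example are only sample keys.

```python
_MISSING = object()


def changed_keys(before: dict, after: dict) -> list[str]:
    """Return the sorted keys whose value was added, removed, or modified."""
    keys = set(before) | set(after)
    return sorted(
        k for k in keys
        if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


if __name__ == "__main__":
    before = {"worker_processes": 2, "worker_connections": 1024}
    after = {"worker_processes": 4, "worker_connections": 1024}
    diff = changed_keys(before, after)
    print(diff, "ok" if len(diff) == 1 else "more than one change")
```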
5.4 Benchmarking
Use realistic benchmark tests to verify improvements and ensure they match business workloads.
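For a quick before-and-after comparison, a small timing harness can be enough. The sketch below repeats a callable and reports the median wall-clock time, which is less sensitive to a single slow run than the mean. The workload passed in is up to the caller and should resemble real business traffic. The function name is illustrative.

```python
import statistics
import time


def median_seconds(func, repeats: int = 5) -> float:
    """Run func several times and return the median duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder workload; replace with a request or query from production.
    print(median_seconds(lambda: sum(range(1_000_000))))
```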
6. Ops Mindset
6.1 Control Your Emotions
When under pressure (e.g., accidental rm -rf minutes before shift end), stay calm, avoid touching critical data, and consider worst‑case recovery steps.
6.2 Take Responsibility for Data
Production data is not a playground; lack of backups leads to severe consequences.
6.3 Investigate Root Causes
After fixing an issue, dig deeper. For example, a MySQL crash may turn out to be an OOM kill caused by insufficient memory and missing swap.
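For the swap part of that example, the kernel exposes the relevant numbers in /proc/meminfo on Linux. The sketch below parses that format and checks whether any swap is configured. The function names are illustrative, and the parser keeps only numeric values, which the kernel reports in kB for the memory fields.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse 'Name:   value kB' lines into a dict of integers."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def has_swap(meminfo: dict[str, int]) -> bool:
    """Return True when SwapTotal is present and greater than zero."""
    return meminfo.get("SwapTotal", 0) > 0


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    print("MemTotal kB:", info.get("MemTotal"), "swap configured:", has_swap(info))
```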
6.4 Separate Test and Production
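Before running a command, confirm which environment the terminal is connected to, and keep as few sessions open as possible so that a command meant for a test machine is not typed into a production window.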
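A script can perform this check itself by comparing the machine it runs on with the machine the operator intended. The following is a minimal sketch. The host names in the usage example are hypothetical, and only the short host name (the part before the first dot) is compared.

```python
import socket


def on_expected_host(expected: str, actual=None) -> bool:
    """Return True when the short host name matches the expected one."""
    actual = socket.gethostname() if actual is None else actual
    return actual.split(".")[0].lower() == expected.split(".")[0].lower()


if __name__ == "__main__":
    # Hypothetical target; a maintenance script would abort on a mismatch.
    target = "test-db-01"
    if not on_expected_host(target):
        raise SystemExit(f"This is not {target}; aborting.")
    print("Host confirmed.")
```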
Source: https://segmentfault.com/a/1190000010242487
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
