Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring
Drawing from three and a half years of operations experience, this guide outlines practical online operation standards, data protection strategies, security measures, daily monitoring, performance tuning tips, and the right mindset to avoid costly incidents and ensure stable, secure systems.
Quick Ops Guidelines for Everyone
Preface: After three and a half years in operations, I have encountered data loss, website hijacking, accidental database deletions, and hacker attacks.
1. Online Operation Standards
1. Test Before Use
When learning Linux, I practiced everything on virtual machines. The habits formed there led me to try risky changes directly on a production server: after switching from PuTTY to Xshell, I modified SSH settings without testing them and was locked out until a backup of sshd_config was restored.
Another incident involved rsync: a mistaken source and destination order deleted a large amount of data, and no backup existed to restore it.
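A swapped source and destination is the kind of mistake a dry run exposes before any file is touched. The following is a minimal sketch, assuming rsync is installed; the helper name preview_sync is hypothetical and not part of the original incident.

```shell
# preview_sync SRC DST
# Lists what "rsync -a --delete" WOULD do, without changing anything.
# preview_sync is a hypothetical helper name used for illustration.
preview_sync() {
    local src="${1%/}" dst="${2%/}"
    # --dry-run: make no changes.
    # --itemize-changes: print one line per pending change; files that would
    # be removed from DST show up as "*deleting <name>".
    rsync -a --delete --dry-run --itemize-changes "$src/" "$dst/"
}
```

Piping the output through grep deleting gives a quick list of what the destination would lose. If that list is long or unexpected, the arguments are probably in the wrong order.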
2. Confirm Before Pressing Enter
Commands like rm -rf /var can easily destroy data, especially when you are rushed or working over a slow connection. One mistake is enough to teach the importance of caution.
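One way to force a pause before a destructive command is a small wrapper that shows exactly what will be removed and on which host. This is only a sketch: confirm_rm is a hypothetical name, and it assumes GNU coreutils for realpath.

```shell
# confirm_rm PATH...
# Prints the fully resolved targets and deletes only after an explicit "yes".
# confirm_rm is a hypothetical wrapper name, not a standard tool.
confirm_rm() {
    local target answer
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_rm PATH..." >&2
        return 2
    fi
    echo "Host: $(uname -n)"
    echo "About to delete recursively:"
    for target in "$@"; do
        # realpath shows where a relative path or symlink really points.
        printf '  %s\n' "$(realpath -- "$target")"
    done
    printf 'Type "yes" to continue: '
    read -r answer || answer=""
    if [ "$answer" = "yes" ]; then
        rm -rf -- "$@"
    else
        echo "Aborted; nothing was deleted."
        return 1
    fi
}
```

Requiring the full word "yes" rather than a single keystroke is deliberate: it is harder to confirm by reflex.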
3. Avoid Multiple Operators
In a chaotic environment where many people share the root password, simultaneous edits lead to conflicting changes and confusion, making troubleshooting difficult.
4. Backup Before Changing Anything
Always back up configuration files (e.g., .conf files) before editing. Within the file, comment out the original option and modify a copy of that line, so the old value stays visible. Regular backups would also have made the earlier rsync disaster recoverable.
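The backup step is easy to skip unless it takes one command. A minimal sketch follows; the helper name backup_file and the .bak.<timestamp> naming scheme are assumptions for illustration.

```shell
# backup_file FILE
# Copies FILE to FILE.bak.<timestamp> and prints the backup path.
# backup_file is a hypothetical helper name.
backup_file() {
    local file="$1"
    local stamp backup
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    stamp=$(date +%Y%m%d-%H%M%S)
    backup="$file.bak.$stamp"
    # -p keeps mode and timestamps (and ownership, when permitted).
    cp -p -- "$file" "$backup" && echo "$backup"
}
```

For SSH specifically, running sshd -t after editing checks the configuration for errors before the daemon is reloaded, which would have caught the lockout described earlier.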
2. Data Concerns
1. Use rm -rf Sparingly
A small mistake can cause huge loss; always double-check before deleting.
2. Backup Is Paramount
In my previous company, third-party payment services were backed up every two hours, while a loan platform was backed up every 20 minutes. Frequent backups are essential.
3. Stability Over Speed
Prioritize a stable environment over the fastest one. Avoid deploying untested software (e.g., new Nginx+PHP-FPM combos) in production.
4. Confidentiality Is Critical
With data leaks common, protecting sensitive information is non-negotiable.
3. Security Measures
1. SSH Hardening
Change the default port.
Disable root login.
Use regular users with key authentication, sudo rules, IP restrictions, and user limits.
Deploy intrusion‑prevention tools that block repeated failed attempts.
Audit /etc/passwd for unauthorized users.
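Two of the checks in the list above can be scripted. The sketch below is a simplification: both function names are hypothetical, and the sshd_config check ignores Match blocks and Include directives.

```shell
# list_uid0_users [PASSWD_FILE]
# Prints every account with UID 0; normally only "root" should appear.
list_uid0_users() {
    awk -F: '$3 == 0 { print $1 }' "${1:-/etc/passwd}"
}

# check_sshd_option FILE KEYWORD EXPECTED
# Succeeds if the first uncommented KEYWORD line in FILE is set to EXPECTED.
# sshd uses the first value it finds for a keyword, so only that one counts.
# Simplification: Match blocks and Include directives are not evaluated.
check_sshd_option() {
    awk -v key="$2" -v want="$3" '
        tolower($1) == tolower(key) {
            found = 1
            ok = (tolower($2) == tolower(want))
            exit
        }
        END { exit !(found && ok) }
    ' "$1"
}
```

For example, check_sshd_option /etc/ssh/sshd_config PermitRootLogin no fails when root login has not been explicitly disabled, and the same call with PasswordAuthentication no verifies that only key authentication is accepted.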
2. Firewall
Enable a firewall in production and follow the principle of least privilege: drop all traffic by default and allow only required ports.
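A default-drop policy can be sketched as a short rule generator. This assumes iptables and IPv4 only (systems using nftables or firewalld need a different approach), and the function name print_firewall_rules is hypothetical. It prints the commands for review instead of applying them.

```shell
# print_firewall_rules PORT...
# Prints (does not run) iptables commands for a default-drop inbound policy
# that allows only the listed TCP ports. Review the output, then run it as root.
print_firewall_rules() {
    local port
    # Loopback traffic and replies to connections the host already has.
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # The DROP policy comes last so an existing SSH session is not cut off
    # before its ACCEPT rule is in place.
    echo "iptables -P INPUT DROP"
}
```

Calling print_firewall_rules 22 80 443 would cover SSH and a web server; the port list is only an example and should match the services actually exposed.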
3. Fine-Grained Permissions
Run services with the least privileged user possible; never run them as root.
4. Intrusion Detection and Log Monitoring
Use third-party tools to watch critical files and configuration changes, centralize log monitoring (e.g., /var/log/secure, /var/log/messages), and block scanning IPs via /etc/hosts.deny. Strong logging aids post-incident analysis.
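Scanning IPs usually stand out in the authentication log. The sketch below counts failed SSH password attempts per source address, assuming OpenSSH's usual "Failed password for ... from <ip> port <n>" message format; the function name is hypothetical.

```shell
# failed_ssh_logins_by_ip [LOGFILE]
# Prints "<count> <ip>" for each source of failed SSH password attempts,
# most active first. Defaults to /var/log/secure.
failed_ssh_logins_by_ip() {
    grep 'Failed password for' "${1:-/var/log/secure}" |
        # Scan from the end of the line so a user literally named "from"
        # cannot be mistaken for the keyword before the address.
        awk '{ for (i = NF - 1; i >= 1; i--) if ($i == "from") { print $(i + 1); break } }' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

The addresses at the top of the output are the candidates for blocking; where to set the threshold is a judgment call that depends on normal traffic.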
4. Daily Monitoring
1. System Monitoring
Track hardware usage (CPU, memory, disk, network) and OS metrics (login activity, critical file changes). Regular monitoring helps anticipate hardware failures and guides tuning.
2. Service Monitoring
Monitor web, database, and load-balancer services to quickly detect performance bottlenecks.
3. Log Monitoring
Observe application and OS logs for errors and alerts; without logs, incident response is purely reactive.
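As one concrete instance of the hardware checks under System Monitoring, disk usage can be compared against a threshold. This is a sketch, not a replacement for a monitoring system: the function name is hypothetical, and mount points containing spaces are not handled.

```shell
# disks_over_threshold PERCENT
# Reads `df -P` output on stdin and prints "<mount> <used>%" for every
# filesystem at or above PERCENT. Reading stdin keeps the function testable.
disks_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        used = $5              # Capacity column, e.g. "95%"
        sub(/%/, "", used)
        if (used + 0 >= limit + 0) print $6, used "%"
    }'
}
```

Typical use is df -P | disks_over_threshold 90, run from cron or a monitoring agent; the 90 percent figure is an arbitrary example.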
5. Performance Tuning
1. Understand Underlying Mechanisms
Before tweaking parameters, understand why a piece of software (e.g., Nginx vs. Apache) performs the way it does; reading source code may be necessary.
2. Tuning Framework and Order
Identify bottlenecks via logs, then address them. Hardware and OS optimizations precede database tuning.
3. Change One Parameter at a Time
Adjusting multiple settings simultaneously makes it impossible to tell which change had which effect.
4. Benchmarking
Use benchmark tests to verify the impact of changes, and make sure the tests match real-world workloads; refer to resources like "High Performance MySQL" for guidance.
6. Ops Mindset
1. Control Your Emotions
When under pressure (e.g., an accidental rm -rf near the end of a shift), stay calm, consider the worst-case scenario, and avoid hasty actions.
2. Take Responsibility for Data
Production data is not a toy; lack of backups leads to severe consequences.
3. Dig Deep
After fixing an issue, investigate root causes. For example, repeated MySQL crashes turned out to be OOM kills caused by insufficient memory and missing swap.
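The MySQL case suggests two quick checks when a process keeps dying: whether the kernel logged an OOM kill, and whether any swap is configured. A minimal sketch, assuming a Linux host; both function names are hypothetical, and the exact kernel message wording varies by kernel version.

```shell
# oom_kills [LOGFILE]
# Prints kernel log lines that record an OOM kill. With no argument it reads
# the kernel ring buffer via dmesg (which may need root on some systems).
oom_kills() {
    if [ "$#" -gt 0 ]; then
        grep -E 'Out of memory: Kill(ed)? process' "$1"
    else
        dmesg | grep -E 'Out of memory: Kill(ed)? process'
    fi
}

# swap_total_kb [MEMINFO_FILE]
# Prints the SwapTotal value in kB; 0 means no swap is configured.
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}
```

A matching line that names the crashed process, together with a swap total of 0, points at memory pressure rather than a bug in the service itself.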
4. Separate Test and Production
Always verify that you are operating on the correct machine, and avoid opening multiple terminals for critical tasks.
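Scripts that make destructive changes can verify the machine themselves instead of relying on the operator to notice which terminal is active. A minimal sketch; require_host is a hypothetical helper name.

```shell
# require_host EXPECTED
# Succeeds only when the current machine's name matches EXPECTED, so a
# destructive script aborts if it was started in the wrong terminal.
require_host() {
    local actual
    actual=$(uname -n)
    if [ "$actual" != "$1" ]; then
        echo "refusing to run: this is '$actual', expected '$1'" >&2
        return 1
    fi
}
```

Placing a line such as require_host db-test-01 || exit 1 at the top of a cleanup script (the hostname here is made up) turns a wrong-window mistake into a harmless error message.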
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.