Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring
Drawing on a year and a half of sysadmin experience, this guide covers practical standards for working on live systems, data protection habits, security hardening, daily monitoring, performance tuning, and the mindset needed to keep production environments stable and resilient.
1. Online Operation Guidelines
1. Test Before Using – The author first learned Linux on virtual machines, where frequent snapshots made careless changes easy to undo and encouraged risky habits. Later, with root access to a real server, the author tried to switch from PuTTY to Xshell and changed SSH settings without testing them first, and was locked out of the machine. The lesson: test changes and back up before touching a live system.
2. Double‑Check Before Enter – A mistyped command such as rm -rf /var can cause catastrophic data loss, and typos are more likely on a slow connection, where the terminal lags behind the keyboard. In one incident, an rsync run with source and destination reversed deleted production data. Read the whole command back before pressing Enter.
3. Avoid Multiple People Editing Simultaneously – In a chaotic environment where many operators share the root password, concurrent edits lead to conflicting configuration changes and confusion. Limiting simultaneous access prevents such chaos.
4. Backup Before Modifying – Always back up configuration files (e.g., .conf) before changing them. When editing, comment out the original options instead of deleting them, and keep a copy of the original file. Regular backups would have mitigated the earlier rsync disaster.
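The habit in item 4 can be scripted so that it takes no extra thought. The following is a minimal Python sketch, not the author's tooling: the helper name backup_before_edit and the timestamped .bak naming scheme are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_before_edit(path: str) -> Path:
    """Copy `path` to a timestamped sibling file and return the backup's path."""
    src = Path(path)
    if not src.is_file():
        raise FileNotFoundError(f"not a regular file: {src}")
    # Hypothetical naming scheme: <name>.<YYYYmmdd-HHMMSS>.bak next to the original.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dst)  # copy2 also preserves mtime and permission bits
    return dst
```

Calling backup_before_edit on a configuration file right before editing it leaves a dated copy beside the original, so a bad change can be reverted by copying the backup back into place.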
2. Data Handling
1. Use rm -rf Sparingly – Deleting critical directories or databases with rm -rf can cause irreparable damage; extreme caution is required.
2. Backup Is Paramount – The author’s current employer backs up a third‑party payment site every two hours and a lending platform every 20 minutes. Frequent backups dramatically reduce risk.
3. Stability Over Speed – Prioritize a stable, reliable environment over the fastest solution. Test new software (e.g., Nginx vs. Apache) in a non‑production setting before deployment.
4. Confidentiality Is Critical – With data breaches common, protecting sensitive data and preventing leaks is essential.
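One way to make "use rm -rf sparingly" concrete is to route deletions through a guard that refuses obviously dangerous targets. The sketch below is only an illustration under stated assumptions: the function name is_safe_to_delete, the PROTECTED list, and the depth threshold are hypothetical choices, and the check assumes POSIX-style paths.

```python
import os
from pathlib import PurePosixPath

# Assumed deny-list for illustration; extend it for your own hosts.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/root", "/usr", "/var"}


def is_safe_to_delete(target: str, min_depth: int = 4) -> bool:
    """Return False for protected or suspiciously shallow paths (POSIX only)."""
    # Check the path both as typed and with symlinks resolved, so a link
    # that points at a protected directory is refused as well.
    for candidate in (os.path.abspath(target), os.path.realpath(target)):
        if candidate in PROTECTED:
            return False
        # parts counts the leading "/" too: "/var/log" has 3 parts.
        if len(PurePosixPath(candidate).parts) < min_depth:
            return False
    return True
```

A wrapper script would call is_safe_to_delete first and only then remove the directory. The guard does not replace backups; it only catches the shallow, high-impact paths that a typo tends to produce.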
3. Security Measures
1. SSH Hardening – Change the default port, disable root login, and have administrators log in as regular users with key authentication and sudo rules. Restrict access by source IP, and use host‑deny tools to block repeated attack attempts.
2. Firewall – Enable a firewall on every production host and apply the principle of least privilege: drop everything by default and allow only the ports that are needed.
3. Fine‑Grained Permissions – Run services with the lowest possible privileges; avoid running anything as root unless absolutely required.
4. Intrusion Detection & Log Monitoring – Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf, web server configs) and centralize logs (e.g., /var/log/secure, /var/log/messages) to detect anomalies and scans.
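The file-watching idea in item 4 boils down to recording a hash of each critical file and comparing it later. The Python sketch below shows that core step only. It is not a substitute for a real intrusion detection tool, and the function names sha256_of, snapshot, and changed_files are assumptions introduced here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths) -> dict:
    """Map each existing regular file in `paths` to its current digest."""
    return {p: sha256_of(p) for p in paths if Path(p).is_file()}


def changed_files(baseline: dict, current: dict) -> list:
    """Paths whose digest differs, or that appear in only one snapshot."""
    return sorted(
        p for p in baseline.keys() | current.keys()
        if baseline.get(p) != current.get(p)
    )
```

In practice the baseline would be taken on a known-good system and stored somewhere an intruder cannot rewrite it. Any path returned by changed_files is then a prompt to investigate.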
4. Daily Monitoring
1. System Health Monitoring – Track hardware utilization (CPU, memory, disk, network) and OS metrics (login activity, critical file changes) to predict failures and guide tuning.
2. Service Monitoring – Observe key services (web, database, load balancers) and their performance indicators to quickly spot bottlenecks.
3. Log Monitoring – Beyond security logs, monitor application and hardware logs to react promptly when issues arise.
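A dedicated monitoring system is the usual way to implement the items above, but the underlying check is simple: collect a metric and compare it against a limit. The following Python sketch illustrates that for disk usage. The names disk_used_percent and breached, the metric keys, and the threshold values are all assumptions chosen for the example.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def breached(metrics: dict, limits: dict) -> list:
    """Names of metrics whose value exceeds its configured limit."""
    return sorted(
        name for name, value in metrics.items()
        if name in limits and value > limits[name]
    )
```

For example, breached({"disk_pct": disk_used_percent("/")}, {"disk_pct": 90.0}) returns ["disk_pct"] once the root filesystem is more than 90% full, and an empty list otherwise. Recording the same metrics over time is what makes it possible to predict failures rather than only react to them.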
5. Performance Tuning
1. Understand Underlying Mechanics – Before tweaking parameters, grasp how software (e.g., Nginx vs. Apache) works internally; source‑code insight can prevent blind adjustments.
2. Structured Tuning Process – Identify bottlenecks, analyze logs, define a tuning direction, then adjust. Prioritize hardware and OS optimizations before touching database settings.
3. Change One Parameter at a Time – Isolating each change avoids confusion and makes impact assessment clear.
4. Benchmarking – Use realistic benchmark tests to verify that tuning improves performance and stability; reference resources like "High Performance MySQL" for guidance.
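Items 3 and 4 together suggest a simple loop: measure, change one parameter, measure again, and compare. The Python sketch below shows one way to take the measurements. It is an illustration only: the helpers median_runtime and improvement_percent are hypothetical, and a real benchmark would drive the actual service with realistic load rather than time a local function.

```python
import statistics
import time


def median_runtime(fn, repeats: int = 5) -> float:
    """Median wall-clock seconds of calling `fn()` over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def improvement_percent(before: float, after: float) -> float:
    """Positive when `after` is faster than `before`."""
    return 100.0 * (before - after) / before
```

Using the median instead of a single run reduces the effect of one-off noise. If only one parameter changed between the two measurements, the sign of improvement_percent can be attributed to that parameter.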
6. Ops Mindset
1. Control Your Emotions – High‑stress moments (e.g., an accidental rm -rf near the end of a shift) call for calm; avoid critical actions when frustrated.
2. Take Responsibility for Data – Production data is not a playground; lack of backups leads to severe consequences.
3. Pursue Root Causes – When an issue recurs, dig deeper rather than applying a superficial fix. For example, if MySQL keeps getting killed, check whether the kernel's OOM killer is terminating it because the server is short on memory.
4. Separate Test and Production – Always verify operations on the correct environment and limit simultaneous sessions to reduce mistakes.
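One way to enforce item 4 is to have scripts verify where they are running before they do anything. The Python sketch below is a minimal illustration: the function name confirm_host is an assumption, and any hostname passed to it would be whatever naming scheme your own environment uses.

```python
import socket
from typing import Optional


def confirm_host(expected: str, actual: Optional[str] = None) -> None:
    """Raise RuntimeError unless this machine's hostname equals `expected`."""
    # `actual` can be passed explicitly for testing; by default ask the OS.
    actual = socket.gethostname() if actual is None else actual
    if actual != expected:
        raise RuntimeError(
            f"refusing to run: on {actual!r}, expected {expected!r}"
        )
```

Placing a call such as confirm_host("db-test-01") (a hypothetical hostname) at the top of a maintenance script makes the script stop immediately when it is started in the wrong terminal session.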
MaGe Linux Operations
