Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

Drawing from three and a half years of operations experience, this guide outlines practical online operation standards, data protection strategies, security measures, daily monitoring, performance tuning tips, and the right mindset to avoid costly incidents and ensure stable, secure systems.

Ops Development Stories

Quick Ops Guidelines for Everyone

Preface: After three and a half years in operations, I have encountered data loss, website hijacking, accidental database deletions, and hacker attacks.

1. Online Operation Standards

1. Test Before Use

When I was learning Linux, I practiced everything on virtual machines. That habit carried over to production: I once switched from PuTTY to Xshell and changed SSH settings on a live server without testing them first, which locked me out until a backup of sshd_config was restored.

Another incident involved rsync: the source and destination were given in the wrong order, which deleted a large amount of data, and no backup existed to restore it from.
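One way to rehearse a sync before it touches data is to run rsync in dry-run mode first and read its report. The sketch below is a minimal illustration, not the author's procedure: the function names are hypothetical, and the flag set (--archive, --delete, --itemize-changes) is an assumption about a typical mirroring job.

```python
import subprocess
from typing import List


def build_rsync_command(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build an rsync argument list with the source and destination named explicitly."""
    # Assumed flags for a mirroring job; adjust to the real task.
    command = ["rsync", "--archive", "--delete", "--itemize-changes"]
    if dry_run:
        # --dry-run makes rsync report what it would change without modifying files.
        command.append("--dry-run")
    command.extend([source, destination])
    return command


def preview_sync(source: str, destination: str) -> str:
    """Run the dry-run variant and return rsync's report for a human to review."""
    result = subprocess.run(
        build_rsync_command(source, destination, dry_run=True),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because dry_run defaults to True, the destructive form has to be requested explicitly with dry_run=False after the preview has been read.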

2. Confirm Before Pressing Enter

Commands like rm -rf /var can easily destroy data, especially when you are rushed or working over a slow connection. One mistake is enough to teach the importance of caution.
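The "confirm before Enter" habit can also be enforced in tooling. The following is a rough sketch of a guard that refuses well-known system directories and otherwise requires the operator to retype the full path. The PROTECTED_PATHS list and both function names are hypothetical, and POSIX-style paths are assumed.

```python
import os

# Hypothetical deny-list; extend it to match the directories that matter on your hosts.
PROTECTED_PATHS = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def normalize(path: str) -> str:
    """Return an absolute path with '..', trailing and duplicate slashes collapsed."""
    absolute = os.path.abspath(path)
    # abspath can leave exactly two leading slashes in place; collapse them too.
    return "/" + absolute.lstrip("/")


def is_protected(path: str) -> bool:
    """True if the path resolves to one of the protected directories."""
    return normalize(path) in PROTECTED_PATHS


def confirm_delete(path: str, typed_confirmation: str) -> bool:
    """Allow deletion only for unprotected paths whose full name was retyped exactly."""
    target = normalize(path)
    if target in PROTECTED_PATHS:
        return False
    return typed_confirmation == target
```

Requiring the full path instead of a single "y" forces a second look at exactly what is about to be removed.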

3. Avoid Multiple Operators

In a chaotic environment where many people share the root password, simultaneous edits lead to conflicting changes and confusion, making troubleshooting difficult.

4. Backup Before Changing Anything

Always back up configuration files (e.g., .conf files) before editing them. When changing an option, comment out the original line and edit a copy of it, so the previous value stays visible. Regular backups would also have made the earlier rsync disaster recoverable.
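A timestamped copy taken just before an edit is cheap insurance. This minimal sketch uses only the Python standard library. The function name and the ".bak" naming scheme are illustrative assumptions, not a convention from the article.

```python
import shutil
import time
from pathlib import Path


def backup_file(path: str) -> Path:
    """Copy a file next to itself with a timestamp suffix and return the backup path."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 preserves the modification time and permission bits where it can.
    shutil.copy2(source, backup)
    return backup
```

For example, calling backup_file("/etc/ssh/sshd_config") before editing leaves a dated copy in the same directory that can be moved back into place if the change locks you out.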

2. Data Concerns

1. Use rm -rf Sparingly

A small mistake can cause huge loss; always double-check before deleting.

2. Backup Is Paramount

At my previous company, the third-party payment service was backed up every two hours, and the loan platform every 20 minutes. Frequent backups are essential.
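A backup schedule is only useful if something notices when it stops running. As a hedged sketch, the check below compares the newest backup timestamp against a maximum allowed age, such as the 20-minute interval mentioned above. How the timestamps are collected (file mtimes, a job log) is left open, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def newest_backup_age(backup_times: Iterable[datetime], now: datetime) -> Optional[timedelta]:
    """Return the age of the most recent backup, or None if there are no backups."""
    times = list(backup_times)
    if not times:
        return None
    return now - max(times)


def backup_is_fresh(backup_times: Iterable[datetime], now: datetime, max_age: timedelta) -> bool:
    """True only if at least one backup exists and the newest is within max_age."""
    age = newest_backup_age(backup_times, now)
    return age is not None and age <= max_age
```

A monitoring job could call backup_is_fresh with max_age=timedelta(minutes=20) and raise an alert when it returns False.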

3. Stability Over Speed

Prioritize a stable environment over the fastest one. Avoid deploying untested software (e.g., new Nginx+PHP-FPM combos) in production.

4. Confidentiality Is Critical

With data leaks common, protecting sensitive information is non-negotiable.

3. Security Measures

1. SSH Hardening

Change the default port.

Disable root login.

Use regular users with key authentication, sudo rules, IP restrictions, and user limits.

Deploy intrusion‑prevention tools that block repeated failed attempts.

Audit /etc/passwd for unauthorized users.
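The /etc/passwd audit in the last item can be partly automated. The sketch below parses the standard seven-field passwd format and flags two conditions: a non-root account with UID 0, and an account with a login shell that is not on an allow-list. The allow-list and the function name are assumptions for illustration, and the shell check is a heuristic that will also flag system accounts such as sync.

```python
from typing import Iterable, List, Tuple


def find_suspicious_accounts(passwd_text: str, allowed_login_users: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (username, reason) pairs for accounts that deserve a closer look."""
    allowed = set(allowed_login_users)
    findings = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # not a well-formed passwd entry
        name, _password, uid, _gid, _gecos, _home, shell = fields
        if uid == "0" and name != "root":
            findings.append((name, "uid 0 but not root"))
        elif not shell.endswith(("nologin", "false")) and name not in allowed:
            findings.append((name, "login shell but not in allow-list"))
    return findings
```

In practice the text would come from reading /etc/passwd, and the allow-list from whatever inventory records which users are supposed to log in.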

2. Firewall

Enable a firewall in production and follow the principle of least privilege: drop all traffic by default and allow only required ports.

3. Fine-Grained Permissions

Run each service as the least-privileged user possible; never run services as root.

4. Intrusion Detection and Log Monitoring

Use third-party tools to watch critical files and configuration changes, centralize log monitoring (e.g., /var/log/secure, /var/log/messages), and block scanning IPs via /etc/hosts.deny. Strong logging aids post-incident analysis.
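To find candidate IPs for blocking, the authentication log can be scanned for repeated failures. This is a small sketch assuming the common OpenSSH "Failed password for ... from IP port N" message and IPv4 addresses only. The threshold and function names are hypothetical, and dedicated intrusion-prevention tools do this far more robustly.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches typical sshd failure lines, with or without the "invalid user" prefix.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\d{1,3}(?:\.\d{1,3}){3}) port \d+"
)


def count_failed_logins(log_lines: Iterable[str]) -> Counter:
    """Count failed SSH password attempts per source IP."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def ips_over_threshold(counts: Counter, threshold: int) -> List[str]:
    """Return the IPs with at least `threshold` failures, sorted for stable output."""
    return sorted(ip for ip, failures in counts.items() if failures >= threshold)
```

The resulting list is a starting point for review, not something to write into a deny file blindly, since a mistyped password from a colleague looks the same as a scan at low counts.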

4. Daily Monitoring

1. System Monitoring

Track hardware usage (CPU, memory, disk, network) and OS metrics (login activity, critical file changes). Regular monitoring helps anticipate hardware failures and guides tuning.
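As a minimal illustration of one such check, the sketch below measures disk usage with the standard library and maps the value to a severity level. The thresholds and function names are assumptions; a real deployment would normally rely on a monitoring system instead of ad hoc scripts.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric to 'ok', 'warning', or 'critical' using two thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

The same classify helper can be reused for CPU or memory figures once they are collected, so every metric is judged against explicit, documented thresholds.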

2. Service Monitoring

Monitor web, database, and load-balancer services to quickly detect performance bottlenecks.

3. Log Monitoring

Watch application and OS logs for errors and alerts; without logs, you can only react to incidents after the damage is done.

5. Performance Tuning

1. Understand Underlying Mechanisms

Before tweaking parameters, know why a piece of software (e.g., Nginx vs. Apache) performs the way it does; reading source code may be necessary.

2. Tuning Framework and Order

Identify bottlenecks via logs, then address them. Tune hardware and the OS before tuning the database.

3. Change One Parameter at a Time

Adjusting multiple settings simultaneously makes it impossible to tell which change had which effect.

4. Benchmarking

Use benchmark tests to verify the impact of changes and ensure they match real-world workloads; refer to resources like "High Performance MySQL" for guidance.
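The "one change at a time, then measure" loop can be sketched as a tiny timing harness: take a baseline, apply a single change, measure again, and compare. The names below are hypothetical, and the workload is any callable you supply. A harness like this only shows relative change on a synthetic workload and is no substitute for a benchmark that mirrors production traffic.

```python
import statistics
import time
from typing import Callable


def measure(workload: Callable[[], object], repeats: int = 5) -> float:
    """Run the workload several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow run than the mean.
    return statistics.median(samples)


def relative_change(baseline: float, candidate: float) -> float:
    """Return the fractional change from baseline; negative means the candidate is faster."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline
```

Recording the baseline before each change keeps the comparison honest: if two parameters were changed between measurements, the result cannot be attributed to either one.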

6. Ops Mindset

1. Control Your Emotions

When under pressure (e.g., after an accidental rm -rf near the end of a shift), stay calm, consider worst-case scenarios, and avoid hasty actions.

2. Take Responsibility for Data

Production data is not a toy; lack of backups leads to severe consequences.

3. Dig Deep

After fixing an issue, investigate root causes. For example, repeated MySQL crashes were due to OOM kills caused by insufficient memory and missing swap.
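When a process keeps dying without an obvious error, checking the kernel log for OOM-killer activity is a reasonable early step. The sketch below searches log lines for the kernel's "Out of memory: Killed process" message. The exact wording varies across kernel versions, so the pattern is an assumption that covers two common variants, and the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Covers both "Kill process" and "Killed process" wordings seen in kernel logs.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the given log lines."""
    kills = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

If the database process shows up here repeatedly, the fix lies in memory sizing or swap configuration, not in restarting the service.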

4. Separate Test and Production

Always confirm you are on the intended machine before running a command, and avoid keeping multiple terminals open during critical tasks.
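Scripts that perform risky operations can check which machine they are on before doing anything. This is a minimal sketch with hypothetical names; it assumes that hostnames reliably distinguish test from production, which may not hold in every environment.

```python
import socket
from typing import Optional


def assert_expected_host(expected: str, actual: Optional[str] = None) -> str:
    """Raise if the current hostname differs from the one the operator intended."""
    # `actual` can be passed in for testing; by default the real hostname is used.
    current = socket.gethostname() if actual is None else actual
    if current != expected:
        raise RuntimeError(
            f"refusing to continue: running on {current!r}, expected {expected!r}"
        )
    return current
```

Calling assert_expected_host with the intended hostname at the top of a maintenance script turns "wrong terminal" mistakes into an immediate, harmless failure.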

Monitoring, operations, performance tuning, security, backup, system administration
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
