
Essential Ops Lessons: Avoid Disasters with Backups, Permissions, and Monitoring

This article shares hard-earned operational guidelines for Linux servers: testing changes safely, using rm -rf with caution, keeping backups, enforcing strict access control, hardening SSH, writing firewall rules, detecting intrusions, monitoring systematically, tuning performance, and staying calm under pressure to prevent costly incidents.

ITPUB

Jun 20, 2019


1. Online Operation Norms

When learning Linux on virtual machines, it’s easy to develop risky habits that become dangerous on real servers; always test changes before applying them.

Example: the author switched from PuTTY to XShell with key authentication without testing first and was locked out of the server. Recovery was possible only because a backup of sshd_config existed.

Another example: an rsync job run in the wrong direction inadvertently deleted the source directory and caused data loss, which highlights the critical need for backups.
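One way to guard against a reversed or destructive sync is to preview it first. The sketch below builds an rsync command that defaults to --dry-run and refuses a missing or empty source directory. The function name, the choice of -a --delete, and the paths in the usage note are assumptions for illustration, not the author's actual setup.

```python
import os
from typing import List


def build_rsync_command(source: str, dest: str, apply: bool = False) -> List[str]:
    """Build an rsync command that previews changes unless apply=True."""
    if not os.path.isdir(source):
        raise ValueError(f"source is not a directory: {source}")
    if not os.listdir(source):
        # With --delete, an empty source would wipe the destination.
        raise ValueError(f"source is empty, refusing to sync: {source}")
    cmd = ["rsync", "-a", "--delete"]
    if not apply:
        cmd.append("--dry-run")  # show what would change, change nothing
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", dest]
    return cmd
```

For example, build_rsync_command("/srv/site", "/backup/site/") yields a preview command; pass apply=True only after reviewing the dry-run output and confirming which side is the source.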

Before executing destructive commands like rm -rf /var, double‑check the command; a single mistake can cause severe downtime.
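The double-check can also be partly automated. The following is a minimal sketch of a pre-deletion guard that rejects empty, relative, and top-level system paths; the PROTECTED set and the function name are hypothetical and would need adapting to a real environment.

```python
import os

# Hypothetical denylist: top-level paths that should never be removed recursively.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def is_safe_to_delete(target: str) -> bool:
    """Return False for empty, relative, or protected paths."""
    if not target or not target.strip():
        return False
    if not os.path.isabs(target):
        # Relative paths depend on the current directory, so reject them.
        return False
    # Resolve "." and ".." textually, then collapse any doubled leading slash.
    normalized = "/" + os.path.normpath(target).lstrip("/")
    return normalized not in PROTECTED
```

A wrapper script could call this check before handing the path to any recursive delete, so that a typo such as a stray space or a trailing ".." is caught before it reaches the filesystem.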

Multiple people operating the same server leads to configuration drift and confusion; always coordinate changes and avoid simultaneous edits.

Always back up configuration files (e.g., .conf) before modifying them.

Copy the file first, and when changing an option, comment out the original line rather than overwriting it.

Regular database backups can mitigate accidental rsync deletions.

Even a single backup can prevent catastrophic data loss.
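Backing up a configuration file before editing it can be a one-line habit. Here is a minimal sketch that makes a timestamped copy next to the original; the function name and the ".bak" naming scheme are assumptions for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_before_edit(config_path: str) -> Path:
    """Copy a config file to '<name>.<timestamp>.bak' in the same directory."""
    src = Path(config_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    # copy2 also tries to preserve permissions and timestamps.
    shutil.copy2(src, dest)
    return dest
```

Running backup_before_edit on sshd_config before a change leaves a known-good copy to restore from if the new settings lock you out.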

2. Data Handling

Never use rm -rf lightly; many incidents involve accidental deletion of critical databases.

Backups are indispensable: some companies perform full backups every two hours, others every 20 minutes.
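Whatever interval is chosen, it helps to verify that backups are actually being produced on schedule. The sketch below flags a backup directory whose newest file is older than the allowed interval; the function names and the idea of judging freshness by file modification time are assumptions, not something the article prescribes.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the newest file, or None if there are no files."""
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    current = time.time() if now is None else now
    return current - newest_mtime


def backup_is_stale(backup_dir: str, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """A directory with no backups at all counts as stale."""
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age_seconds
```

With a two-hour schedule the limit would be 7200 seconds, and with a 20-minute schedule 1200 seconds, usually with some slack added for the time the backup itself takes.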

Data confidentiality is also vital; with frequent leaks and router backdoors, sensitive data must be protected.

3. Security Practices

SSH Hardening

Change the default port (though scanners can still find it).

Disable root login.

Log in as regular users with key authentication, and constrain them with sudo rules, source-IP restrictions, and limits on which users may connect.

Deploy brute-force protection tools (e.g., DenyHosts) to block repeated login attempts.

Audit /etc/passwd to confirm that only authorized accounts are able to log in.
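The /etc/passwd audit from the list above can be sketched as a small parser that lists accounts whose shell permits login. The set of non-login shells below is an assumption covering common defaults; a real system may use other paths.

```python
from typing import List

# Assumed set of shells that prevent interactive login.
NON_LOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}


def login_capable_users(passwd_text: str) -> List[str]:
    """Return account names whose shell field allows an interactive login."""
    users = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip malformed entries
        name, shell = fields[0], fields[6]
        # An empty shell field falls back to /bin/sh, so it counts as login-capable.
        if shell not in NON_LOGIN_SHELLS:
            users.append(name)
    return users
```

Comparing this list against the accounts that are supposed to have access makes unexpected entries stand out quickly.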

Firewall

Enable a firewall in production and follow the principle of least privilege: drop all traffic by default and allow only required service ports.
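As a rough illustration of default-deny, the sketch below generates iptables commands for an allowlist of TCP ports. It only builds the command lists and does not execute them; the function name and the example ports are assumptions. The DROP policy is deliberately emitted last so that the allow rules exist before anything is blocked.

```python
from typing import Iterable, List


def default_drop_rules(allowed_tcp_ports: Iterable[int]) -> List[List[str]]:
    """Build iptables commands: allow listed TCP ports, then drop everything else."""
    rules = [
        # Keep loopback and already-established connections working.
        ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
        ["iptables", "-A", "INPUT", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        rules.append(["iptables", "-A", "INPUT", "-p", "tcp",
                      "--dport", str(port), "-j", "ACCEPT"])
    # Set the default policy last to avoid cutting off the current session.
    rules.append(["iptables", "-P", "INPUT", "DROP"])
    return rules
```

For a web server reachable over SSH and HTTPS, default_drop_rules([22, 443]) produces five commands, which should be reviewed and tested on a non-production machine before use.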

Fine‑grained Permissions

Run services with the least privileged user possible; avoid using root for services that can operate under a normal account.

Intrusion Detection & Log Monitoring

Use third‑party tools to monitor critical system and service configuration files for changes.
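The idea behind such tools is simple enough to sketch: record a hash of each watched file, then compare later snapshots against that baseline. The code below is a simplified stand-in for a real integrity checker, with hypothetical function names, and it does not protect the baseline itself from tampering.

```python
import hashlib
import os
from typing import Dict, Iterable, List, Optional


def file_digest(path: str) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    if not os.path.isfile(path):
        return None
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each watched path to its current digest."""
    return {path: file_digest(path) for path in paths}


def changed_paths(baseline: Dict[str, Optional[str]],
                  current: Dict[str, Optional[str]]) -> List[str]:
    """Return paths that were modified, added, or removed since the baseline."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))
```

A scheduled job could take a snapshot of files such as sshd_config, compare it with a baseline stored off the server, and alert on any non-empty result.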

Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.

Block IPs that perform port scans and log these events for post‑incident analysis.
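As an example of what log monitoring on /var/log/secure can extract, the sketch below counts failed SSH password attempts per source address. The regular expression assumes the common OpenSSH "Failed password for ... from <ip> port <n>" message and handles IPv4 only; the function names and threshold are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Assumed OpenSSH log format; adjust if your sshd logs differently.
FAILED_RE = re.compile(
    r"Failed password for (?:invalid user )?\S+ "
    r"from (\d{1,3}(?:\.\d{1,3}){3}) port \d+"
)


def failed_logins_by_ip(log_text: str) -> Counter:
    """Count failed password attempts per source IPv4 address."""
    return Counter(match.group(1) for match in FAILED_RE.finditer(log_text))


def ips_over_threshold(log_text: str, threshold: int) -> List[str]:
    """Return addresses with at least `threshold` failures, sorted for stable output."""
    counts = failed_logins_by_ip(log_text)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The resulting list is a candidate set for blocking and should still be logged, so the events remain available for later analysis.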

4. Daily Monitoring

System Monitoring

Track hardware usage (memory, disk, CPU, network interfaces) as well as OS login activity.
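A full monitoring stack is beyond a short example, but the core loop is to collect a metric and compare it with a limit. The sketch below does this for disk usage and load average using only the standard library; the metric names and limits are assumptions, and os.getloadavg is available on Unix-like systems only.

```python
import os
import shutil
from typing import Dict, List


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def collect_metrics(path: str = "/") -> Dict[str, float]:
    """Gather a small set of host metrics (Unix-like systems only)."""
    load_1m, _, _ = os.getloadavg()
    return {"disk_percent": disk_used_percent(path), "load_1m": load_1m}


def check_thresholds(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return one alert string per metric that exceeds its limit."""
    alerts = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts
```

In practice the alerts would be sent to whatever notification channel the team already uses, and memory, network, and login activity would be added as further metrics.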

Service Monitoring

Monitor web, database, and load‑balancer services to quickly detect performance bottlenecks.

Log Monitoring

Collect and analyze logs from hardware, OS, and applications; lack of monitoring makes incident response passive.

5. Performance Tuning

Understand Runtime Mechanisms

Before tuning, grasp how software like Nginx or Apache works, why Nginx is fast, and be able to read source code if needed.

Tuning Framework & Order

Analyze bottlenecks, review logs, define tuning direction, then adjust parameters; prioritize hardware and OS before database configuration.

Change One Parameter at a Time

Isolating each change prevents confusion about its impact.
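One way to enforce this discipline is to generate test configurations mechanically, each differing from the baseline in exactly one setting. The sketch below does that; the function name is hypothetical, and the parameter names and values in the usage note are placeholders rather than recommendations.

```python
from typing import Any, Dict, Iterator, Tuple


def single_change_variants(
    baseline: Dict[str, Any], candidates: Dict[str, Any]
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (changed_key, config) pairs, each differing from baseline in one key."""
    for key, new_value in candidates.items():
        if baseline.get(key) == new_value:
            continue  # nothing to test: value is already the baseline
        variant = dict(baseline)  # copy so the baseline is never mutated
        variant[key] = new_value
        yield key, variant
```

Given a baseline such as {"worker_processes": 2, "worker_connections": 1024} and two candidate values, this yields two configurations to measure separately, so any change in results can be attributed to a single parameter.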

Benchmark Testing

Validate tuning effectiveness and software stability with comprehensive benchmarks, referencing resources like "High Performance MySQL".
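At its simplest, validating a tuning change means timing the same workload before and after and comparing the results. The sketch below is a deliberately small harness, not a substitute for a real benchmark suite; the function names are assumptions, and the workload passed in stands for whatever operation is being tuned.

```python
import statistics
import time
from typing import Callable


def benchmark(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds over several runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to one-off spikes than the mean.
    return statistics.median(samples)


def relative_change(before: float, after: float) -> float:
    """Return the fractional change; negative means the new setting is faster."""
    return (after - before) / before
```

A result such as relative_change(2.0, 1.5), which equals -0.25, would indicate a 25% reduction in run time, provided both measurements were taken under comparable load.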

6. Ops Mindset

Control Your Mindset

During high‑pressure moments (e.g., before the end of a shift), stay calm and avoid critical operations if you’re stressed.

Take Responsibility for Data

Production data is not a toy; lack of backups leads to severe consequences.

Root‑Cause Analysis

After fixing an issue, investigate underlying causes (e.g., OOM kills due to insufficient memory) rather than applying temporary patches.
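For the OOM example, a first step toward the root cause is finding out which processes the kernel has been killing and how often. The sketch below scans kernel log text for OOM-killer messages; the pattern assumes the common "Out of memory: Kill process" and "Out of memory: Killed process" wordings, which may differ between kernel versions.

```python
import re
from collections import Counter

# Assumed kernel message wording; verify against your own logs.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_kills_by_process(log_text: str) -> Counter:
    """Count OOM-killer events per process name."""
    return Counter(match.group(2) for match in OOM_RE.finditer(log_text))
```

If the same service shows up repeatedly, restarting it is only a temporary patch; the next questions are whether its memory limits are configured sensibly and whether the host simply has too little memory for its workload.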

Test vs. Production

Always verify that you are on the correct machine before acting, and keep as few terminal windows open as possible before critical operations so that a command cannot land in the wrong session.
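Scripts that perform critical operations can make this check explicit. The sketch below aborts unless the current hostname matches the one the operator expects; the function name is hypothetical, and the hostnames in the usage note are placeholders.

```python
import socket
from typing import Optional


def confirm_target_host(expected: str, actual: Optional[str] = None) -> None:
    """Raise RuntimeError unless the current host is the expected one."""
    # `actual` can be injected for testing; by default ask the OS.
    current = socket.gethostname() if actual is None else actual
    if current != expected:
        raise RuntimeError(
            f"refusing to continue: on '{current}', expected '{expected}'"
        )
```

Placing confirm_target_host("db-test-01") at the top of a maintenance script means that the same script, run by mistake in a production session, stops before it changes anything.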
