
Essential Ops Lessons from 3.5 Years of Real-World Crises

Drawing from three and a half years of operations work, this article shares hard‑earned best practices on testing, backups, security, monitoring, performance tuning, and the right mindset to avoid costly mistakes such as data loss, service outages, and security breaches.


1. Online Operation Standards

1. Test Before Use

When I first learned Linux on virtual machines, I grew eager to try changes on a real server. I switched from PuTTY to Xshell and attempted to enable key‑based login without testing it first, which locked me out after restarting sshd. A backup of `sshd_config` saved the day.
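
A minimal sketch of the safer workflow (demonstrated here on a throwaway copy; on a real host the file would be `/etc/ssh/sshd_config` and you would keep your current session open until a second login succeeds):

```shell
# Work on a temporary copy for illustration; in production, CONF=/etc/ssh/sshd_config
CONF=$(mktemp)
printf 'PasswordAuthentication yes\n' > "$CONF"

# 1. Timestamped backup before touching anything
cp -p "$CONF" "$CONF.bak.$(date +%Y%m%d%H%M%S)"

# 2. Make the change (switch to key-only login)
sed -i 's/^PasswordAuthentication yes$/PasswordAuthentication no/' "$CONF"

# 3. On a real host, validate syntax BEFORE restarting the daemon:
#    sshd -t && systemctl reload sshd
grep '^PasswordAuthentication' "$CONF"
```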

2. Confirm Before Pressing Enter

An accidental `rm -rf /var` or similar command is easy to fire off in a hurry or over a laggy connection. One mistake can cause irreversible data loss, so always double‑check a command before pressing Enter.
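
One way to make the double-check mechanical is a dry-run wrapper. This `safe_rm` function is a hypothetical sketch, not a standard tool: it lists the target first and only deletes when `--force` is passed explicitly:

```shell
# Hypothetical wrapper: show what would be deleted; require an explicit flag.
safe_rm() {
    if [ "$1" = "--force" ]; then
        shift
        rm -rf -- "$@"
    else
        echo "Would delete:"
        ls -ld -- "$@"
        echo "(re-run with --force to actually delete)"
    fi
}

# Demo against a scratch directory:
dir=$(mktemp -d)
touch "$dir/file"
safe_rm "$dir"          # dry run: only lists the target
safe_rm --force "$dir"  # actually deletes it
```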

3. Avoid Multiple People Operating Simultaneously

In a chaotic environment where several admins share the root password, concurrent changes lead to conflicting configurations and wasted troubleshooting time. Coordinate changes and limit simultaneous access.

4. Backup Before Any Change

Always back up configuration files (e.g., `.conf` files) and databases before modifying them. Comment out the original options before editing, and keep regular backups to prevent catastrophic loss.
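
A small helper makes the habit cheap. This is a sketch, demonstrated on a throwaway file; on a real host you would call it as `backup_conf /etc/nginx/nginx.conf` before editing:

```shell
# Never edit a config file without a timestamped copy next to it.
backup_conf() {
    local f=$1
    cp -p "$f" "$f.$(date +%Y%m%d%H%M%S).bak" || return 1
}

# Demo on a scratch file:
conf=$(mktemp)
echo "worker_processes 1;" > "$conf"
backup_conf "$conf"
ls "$conf".*.bak
```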

2. Data‑Related Practices

1. Use rm -rf With Extreme Caution

Even a small typo with `rm -rf` can delete critical data. Verify the target path thoroughly before running destructive commands.
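
A classic variant of this typo is an unset variable: `rm -rf "$APP_DIR/cache"` with an empty `$APP_DIR` becomes `rm -rf /cache`. Bash's `${var:?}` expansion aborts instead of expanding to nothing (variable name here is illustrative):

```shell
APP_DIR=""   # simulate a variable that was never set properly

# ${var:?} makes the (sub)shell exit with an error before rm ever runs:
( rm -rf "${APP_DIR:?APP_DIR is not set}/cache" ) 2>/dev/null
guard_status=$?
echo "guard exit status: $guard_status"   # non-zero: nothing was deleted
```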

2. Backup Is Paramount

At a third‑party payment platform we performed full backups every two hours; at a loan platform, every 20 minutes. Frequent backups dramatically reduce risk.
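
A cron schedule is the usual way to enforce such an interval. A hypothetical crontab fragment for the two‑hour case (paths and dump flags are illustrative, not from the original setup):

```
# Full dump every 2 hours, timestamped so a bad backup never
# overwrites a good one. Note: % must be escaped as \% in crontab.
0 */2 * * * /usr/bin/mysqldump --single-transaction --all-databases | gzip > /backup/mysql/full-$(date +\%Y\%m\%d\%H\%M).sql.gz
```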

3. Prioritize Stability Over Speed

Never deploy untested software (e.g., new Nginx + PHP‑FPM versions) in production. Choose the most stable stack rather than the fastest.

4. Confidentiality Is Critical

Data leaks and backdoored routers are common; always enforce strict confidentiality measures for any sensitive data.

3. Security Measures

1. Harden SSH

Change the default port, disable root login, use normal users with key authentication, sudo rules, IP restrictions, and host‑deny tools to block brute‑force attempts.
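
Most of this hardening boils down to a handful of `sshd_config` directives. A sketch, where the port number, user name, and address pattern are placeholders to adapt, not recommendations:

```
# /etc/ssh/sshd_config -- example values only
Port 22022                  # move off the default port 22
PermitRootLogin no          # no direct root login
PasswordAuthentication no   # key authentication only
PubkeyAuthentication yes
AllowUsers deploy@10.0.0.*  # restrict who, and from where
MaxAuthTries 3
```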

2. Enable a Minimal‑Rule Firewall

Apply a default‑deny policy and open only the ports required for services.
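
With nftables, default‑deny plus an explicit allow list fits in one small ruleset. A sketch (the ports shown are examples matching a hardened SSH port and a web stack):

```
# /etc/nftables.conf -- drop everything inbound except what is listed
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport { 22022, 80, 443 } accept
        icmp type echo-request accept
    }
}
```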

3. Fine‑Grained Permissions

Run services with the least privileged accounts; avoid running daemons as root.

4. Intrusion Detection and Log Monitoring

Deploy third‑party tools to watch critical files (e.g., `/etc/passwd`, `/etc/my.cnf`) and centralize logs such as `/var/log/secure`. Detect port scans and automatically block offending IPs.
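
The core of the "block offending IPs" loop is just log parsing. A self-contained sketch using an inline sample; on a real host the input would be `/var/log/secure` (or `auth.log` on Debian-family systems), and tools like fail2ban or denyhosts automate exactly this:

```shell
# Pull the worst offending IP out of sshd "Failed password" lines.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 10 03:12:01 web1 sshd[4242]: Failed password for root from 203.0.113.9 port 51022 ssh2
Jan 10 03:12:03 web1 sshd[4242]: Failed password for root from 203.0.113.9 port 51023 ssh2
Jan 10 03:12:07 web1 sshd[4243]: Failed password for invalid user admin from 198.51.100.7 port 40100 ssh2
EOF

# Grab the field after "from", count occurrences, take the most frequent.
top_ip=$(awk '/Failed password/ {for (i=1;i<=NF;i++) if ($i=="from") print $(i+1)}' "$log" \
    | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "top offender: $top_ip"
# Such IPs can then be fed into /etc/hosts.deny or a firewall set.
```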

4. Daily Monitoring

1. System Health Monitoring

Track hardware utilization—CPU, memory, disk, network—as well as login activity and key‑file changes to predict hardware failures and guide tuning.
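
The simplest possible health probe compares load average against core count; thresholds and alerting are left as assumptions here:

```shell
# 1-minute load average vs. CPU count, straight from the kernel.
load1=$(awk '{print $1}' /proc/loadavg)
cpus=$(nproc)

# Load is a float, so compare in awk rather than shell arithmetic.
overloaded=$(awk -v l="$load1" -v c="$cpus" 'BEGIN { if (l > c) print 1; else print 0 }')
echo "load=$load1 cpus=$cpus overloaded=$overloaded"
```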

2. Service Monitoring

Monitor web, database, and load‑balancer metrics to quickly identify performance bottlenecks.

3. Log Monitoring

Collect and analyze OS, application, and hardware logs; proactive log monitoring prevents silent failures.

5. Performance Tuning

1. Understand Runtime Mechanisms

Know why Nginx outperforms Apache, study source code, and be able to explain the underlying principles.

2. Tuning Framework and Order

Identify bottlenecks via logs, then tune. Prioritize hardware and OS before adjusting database settings.

3. Change One Parameter at a Time

Isolate the impact of each tweak to avoid confusion.

4. Benchmark Testing

Use benchmark tests to validate tuning effects; refer to resources like "High Performance MySQL" for methodology.
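
For quick before/after comparisons, even a tiny repeat-and-average harness beats eyeballing a single run. A sketch (the `bench` helper is hypothetical; `sleep` stands in for the command under test):

```shell
# Run a command N times and report mean wall time in milliseconds.
# Re-run after each single-parameter change and compare the means.
bench() {
    local n=$1; shift
    local total=0 start end
    for _ in $(seq "$n"); do
        start=$(date +%s%N)
        "$@" > /dev/null
        end=$(date +%s%N)
        total=$(( total + (end - start) / 1000000 ))
    done
    echo $(( total / n ))
}

mean_ms=$(bench 3 sleep 0.05)
echo "mean: ${mean_ms} ms"
```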

6. Ops Mindset

1. Control Your Emotions

Under pressure, avoid rash actions on critical data. If a deletion occurs, keep the database running, clone the disk with `dd`, and consider professional data recovery.
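
The point of the `dd` clone is to freeze a byte-identical image before any recovery attempt can make things worse. Demonstrated here on a scratch file; on a real host the command would look like `dd if=/dev/sda of=/mnt/rescue/sda.img bs=4M status=progress` (device and destination are examples):

```shell
# Image a "disk" (a scratch file here) and verify the copy is identical.
src=$(mktemp); img=$(mktemp)
head -c 1M /dev/urandom > "$src"

dd if="$src" of="$img" bs=64K 2>/dev/null
cmp -s "$src" "$img" && echo "image is byte-identical"
```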

2. Take Responsibility for Data

Production data is not a toy; lack of backups leads to severe consequences.

3. Pursue Root Cause Analysis

When recurring issues appear (e.g., session table corruption), investigate underlying causes such as MyISAM bugs, OOM kills, or insufficient memory.

4. Separate Test and Production Environments

Always verify operations on test machines and avoid opening multiple terminal windows on production.

Tags: monitoring, operations, performance-tuning, Linux, security, sysadmin, backup
Written by Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
