Essential Ops Lessons from 3.5 Years of Real-World Crises

Drawing from three and a half years of operations work, this article shares hard‑earned best practices on testing, backups, security, monitoring, performance tuning, and the right mindset to avoid costly mistakes such as data loss, service outages, and security breaches.

Open Source Linux

May 5, 2023

1. Online Operation Standards

1. Test Before Use

After learning Linux on virtual machines, I was eager to try changes on a real server. I switched from PuTTY to Xshell and enabled key-based login without testing it first; as soon as I restarted sshd, I was locked out. A backup copy of sshd_config saved the day.
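One way to make "test before use" mechanical is to validate a candidate config before it replaces the live file. The sketch below is a hypothetical helper, not the procedure used in the incident above: `apply_if_valid` and its arguments are illustrative names, and the validation command is supplied by the caller (for OpenSSH, `sshd -t -f <file>` checks a config file without starting the daemon).

```python
import os
import shutil
import subprocess
import tempfile


def apply_if_valid(live_path, new_text, validate_cmd):
    """Write new_text to a temp file, validate it, then replace live_path.

    validate_cmd is a list of arguments; the candidate file path is appended
    as the last argument. Returns True if the live file was replaced.
    """
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, candidate = tempfile.mkstemp(dir=directory, suffix=".candidate")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(new_text)
        result = subprocess.run(validate_cmd + [candidate])
        if result.returncode != 0:
            return False  # validation failed, live file is untouched
        shutil.copy2(live_path, live_path + ".bak")  # keep a way back
        shutil.copymode(live_path, candidate)        # keep the original mode
        os.replace(candidate, live_path)             # atomic on the same filesystem
        return True
    finally:
        if os.path.exists(candidate):
            os.remove(candidate)
```

For sshd this could be called as `apply_if_valid("/etc/ssh/sshd_config", text, ["sshd", "-t", "-f"])`. Even then, keep an existing session open while restarting the service, so a bad change can still be reverted.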

2. Confirm Before Pressing Enter

It is easy to run rm -rf /var or a similar command by accident when you are in a hurry or typing over a slow connection. One such mistake can cause irreversible data loss, so always re-read a command before you press Enter.

3. Avoid Multiple People Operating Simultaneously

In a chaotic environment where several admins share the root password, concurrent changes lead to conflicting configurations and wasted troubleshooting time. Coordinate changes and limit simultaneous access.
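Coordination can be as simple as an advisory lock that each admin takes before touching a server. The following is a minimal sketch under that assumption; the function names and the idea of a shared lock file are illustrative, and it only helps if everyone agrees to use it.

```python
import os


def acquire_change_lock(lock_path, owner):
    """Create lock_path atomically; return True if we now hold the lock."""
    try:
        # O_EXCL makes the call fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner + "\n")
    return True


def current_holder(lock_path):
    """Return the recorded owner, or None if nobody holds the lock."""
    try:
        with open(lock_path) as handle:
            return handle.read().strip()
    except FileNotFoundError:
        return None


def release_change_lock(lock_path, owner):
    """Remove the lock only if it is held by owner."""
    if current_holder(lock_path) != owner:
        return False
    os.remove(lock_path)
    return True
```

If `acquire_change_lock` returns False, `current_holder` tells you whom to talk to before making any change.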

4. Backup Before Any Change

Always back up configuration files (e.g., .conf files) and databases before modifying them. When editing an option, comment out the original line instead of overwriting it, so the previous value stays visible. Keep regular backups as well to prevent catastrophic loss.
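A timestamped copy taken right before an edit is cheap insurance. The helper below is a sketch; the name `backup_file` and the `.bak` naming scheme are assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(path, now=None):
    """Copy path to <name>.<YYYYmmdd-HHMMSS>.bak and return the backup path."""
    source = Path(path)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.{stamp}.bak")
    if target.exists():
        # Never overwrite an earlier backup silently.
        raise FileExistsError(f"backup already exists: {target}")
    shutil.copy2(source, target)  # copy2 also preserves mtime and mode
    return target
```

Because the timestamp is part of the name, repeated edits leave a trail of earlier versions rather than a single overwritten copy.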

2. Data‑Related Practices

1. Use rm -rf With Extreme Caution

Even a small typo with rm -rf can delete critical data. Verify the target path thoroughly before running destructive commands.
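Path verification can be automated in cleanup scripts. The sketch below assumes a hypothetical rule: deletions are only allowed strictly inside one designated directory. It resolves the target first, so a stray "..", an empty variable, or a symlink cannot widen the blast radius.

```python
from pathlib import Path


def safe_delete_target(target, allowed_root):
    """Return the resolved target if it lies strictly inside allowed_root.

    Raises ValueError otherwise. Nothing is deleted here; the caller decides
    what to do with the vetted path.
    """
    root = Path(allowed_root).resolve()
    if root == Path(root.anchor):
        raise ValueError("refusing to treat the filesystem root as allowed_root")
    resolved = Path(target).resolve()
    if resolved == root or root not in resolved.parents:
        raise ValueError(f"{resolved} is not strictly inside {root}")
    return resolved
```

A script would call this before `shutil.rmtree`, so that a command built from an empty variable fails loudly instead of deleting the parent directory.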

2. Backup Is Paramount

In a third‑party payment platform we performed full backups every two hours; a loan platform backed up every 20 minutes. Frequent backups dramatically reduce risk.
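The backup interval is also the worst-case amount of data you can lose, so it is worth alerting when a backup is late. A minimal sketch, assuming you can obtain the timestamp of the last successful backup; the function name and the five-minute grace period are illustrative choices.

```python
from datetime import datetime, timedelta


def backup_is_overdue(last_backup, interval, now, grace=timedelta(minutes=5)):
    """True if more than interval (plus a grace period) has passed."""
    return now - last_backup > interval + grace
```

With a two-hour schedule, a last backup taken three hours ago is overdue; with a 20-minute schedule, one taken 15 minutes ago is not.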

3. Prioritize Stability Over Speed

Never deploy untested software (e.g., new Nginx + PHP‑FPM versions) in production. Choose the most stable stack rather than the fastest.

4. Confidentiality Is Critical

Data leaks and routers with back doors are common; enforce strict confidentiality controls for all sensitive data.

3. Security Measures

1. Harden SSH

Change the default port, disable root login, and log in as a normal user with key authentication. Add sudo rules, restrict source IPs, and use host-deny tools to block brute-force attempts.
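Several of these settings can be checked automatically. The sketch below audits the text of an sshd_config for a few standard OpenSSH keywords (Port, PermitRootLogin, PasswordAuthentication, AllowUsers/AllowGroups). The function names and the choice of checks are assumptions; it handles only whitespace-separated "Keyword value" lines and stops at the first Match block.

```python
def parse_sshd_config(text):
    """Return {lowercased keyword: value}, keeping the first occurrence.

    sshd uses the first value it finds for a keyword, so later duplicates
    are ignored here as well.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip()
        if key == "match":
            break  # conditional blocks are out of scope for this sketch
        settings.setdefault(key, value)
    return settings


def audit_sshd(text):
    """Return a list of human-readable findings; an empty list means no findings."""
    settings = parse_sshd_config(text)
    findings = []
    if settings.get("permitrootlogin", "unset").lower() != "no":
        findings.append("PermitRootLogin is not explicitly 'no'")
    if settings.get("passwordauthentication", "unset").lower() != "no":
        findings.append("PasswordAuthentication is not explicitly 'no'")
    if settings.get("port", "22") == "22":
        findings.append("Port is still the default 22")
    if "allowusers" not in settings and "allowgroups" not in settings:
        findings.append("no AllowUsers/AllowGroups restriction")
    return findings
```

The audit is deliberately strict: a keyword that is missing is reported, so the result does not depend on which defaults a particular OpenSSH version ships with.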

2. Enable a Minimal‑Rule Firewall

Apply a default‑deny policy and open only the ports required for services.
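A default-deny rule set is short enough to generate from a list of ports. This sketch assumes iptables is the firewall in use and only builds the command strings; it does not run them. The function name is illustrative.

```python
def default_deny_rules(tcp_ports):
    """Return iptables commands that accept only the given inbound TCP ports."""
    rules = [
        # Loopback traffic and replies to existing connections must keep working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # The DROP policy goes last so the accept rules are already in place.
    rules.append("iptables -P INPUT DROP")
    return rules
```

Setting the DROP policy last matters when you apply the rules over SSH: the accept rule for the SSH port must exist before everything else is denied.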

3. Fine‑Grained Permissions

Run services with the least privileged accounts; avoid running daemons as root.

4. Intrusion Detection and Log Monitoring

Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf) and centralize logs such as /var/log/secure. Detect port scans and automatically block offending IPs.
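The core of watching critical files is comparing hashes against a known-good baseline. The sketch below shows that idea with SHA-256; it is a simplified stand-in for a dedicated integrity tool, and the function names are assumptions.

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 of a file, or None if it does not exist."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    except FileNotFoundError:
        return None
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its current hash."""
    return {path: sha256_of(path) for path in paths}


def changed_files(baseline, current):
    """Return the sorted paths whose hash differs between two snapshots."""
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

Take the baseline when the system is in a known-good state, store it somewhere an intruder cannot rewrite, and compare on a schedule; a deleted file shows up as a change too.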

4. Daily Monitoring

1. System Health Monitoring

Track hardware utilization (CPU, memory, disk, network) as well as login activity and changes to key files, so that you can anticipate hardware failures and guide tuning.
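As a small illustration of utilization tracking, the sketch below checks disk usage against a threshold using only the standard library. The 90% default and the function names are assumptions; a real setup would feed such numbers into a monitoring system instead of polling by hand.

```python
import shutil


def used_percent(total, used):
    """Percentage of capacity in use; 0.0 for an empty or virtual filesystem."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def disk_alerts(paths, threshold=90.0):
    """Return (path, percent) for every mount at or above the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        percent = used_percent(usage.total, usage.used)
        if percent >= threshold:
            alerts.append((path, round(percent, 1)))
    return alerts
```

Calling `disk_alerts(["/", "/var"])` from a scheduled job gives early warning before a full disk takes a service down.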

2. Service Monitoring

Monitor web, database, and load‑balancer metrics to quickly identify performance bottlenecks.
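The simplest service check is whether a port accepts connections and how long the connect takes. A minimal sketch with an illustrative function name; it tests reachability only and says nothing about whether the service answers correctly.

```python
import socket
import time


def check_tcp(host, port, timeout=2.0):
    """Return (is_up, seconds) for a single TCP connect attempt."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, time.perf_counter() - start
```

Recording the connect time along with the up/down result makes slow degradation visible before it becomes an outage.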

3. Log Monitoring

Collect and analyze OS, application, and hardware logs; proactive log monitoring prevents silent failures.
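Log analysis often starts with counting a known failure pattern. The sketch below counts failed SSH password attempts per source address. It assumes the usual OpenSSH message wording ("Failed password for ... from ADDRESS port N"), which may differ on your system; the function names and the threshold are illustrative.

```python
import re
from collections import Counter

# Assumed OpenSSH wording; adjust the pattern to your own log format.
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def failed_logins_by_ip(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def offenders(lines, threshold=5):
    """Return the sorted addresses with at least threshold failures."""
    counts = failed_logins_by_ip(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The resulting list is what a blocking tool would act on; reviewing it by hand first helps avoid locking out a colleague who mistyped a password.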

5. Performance Tuning

1. Understand Runtime Mechanisms

Understand why Nginx often outperforms Apache under high concurrency: study the source code and be able to explain the underlying principles.

2. Tuning Framework and Order

Identify bottlenecks via logs, then tune. Prioritize hardware and OS before adjusting database settings.

3. Change One Parameter at a Time

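Change a single parameter, measure the result, and only then move on to the next one. If several settings change at once, you cannot tell which tweak caused which effect.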
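The same discipline can be expressed as a test plan in which every trial differs from the baseline in exactly one setting. A sketch; the function name is hypothetical and the parameter names in the usage note are examples only.

```python
def one_at_a_time(baseline, candidates):
    """Yield (name, value, config) trials that each change a single key.

    baseline is the current configuration; candidates maps a parameter name
    to the values you want to try for it.
    """
    for name, values in candidates.items():
        if name not in baseline:
            raise KeyError(f"unknown parameter: {name}")
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, nothing to learn
            config = dict(baseline)  # copy, so the baseline stays unchanged
            config[name] = value
            yield name, value, config
```

For a baseline of `{"worker_processes": 4, "keepalive_timeout": 65}` and candidates `{"worker_processes": [4, 8], "keepalive_timeout": [30]}`, this yields two trials: one with only `worker_processes` changed to 8, and one with only `keepalive_timeout` changed to 30.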

4. Benchmark Testing

Use benchmark tests to validate tuning effects; refer to resources like "High Performance MySQL" for methodology.
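The principle behind any benchmark is to repeat the measurement and compare before and after under the same conditions. The sketch below times a Python callable as a stand-in for the real workload. The helper names are assumptions, and a proper database or web benchmark would use a dedicated load tool.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Run func several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples)


def relative_change(before, after):
    """Fractional change in runtime; negative means the tuned run is faster."""
    return (after - before) / before
```

A tuning step counts as an improvement only if `relative_change` is clearly negative across repeated runs, not on the strength of one lucky measurement.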

6. Ops Mindset

1. Control Your Emotions

Under pressure, avoid rash actions on critical data. If a deletion occurs, keep the database running, clone the disk with dd, and consider professional data recovery.

2. Take Responsibility for Data

Production data is not a toy; lack of backups leads to severe consequences.

3. Pursue Root Cause Analysis

When recurring issues appear (e.g., session table corruption), investigate underlying causes such as MyISAM bugs, OOM kills, or insufficient memory.
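To confirm or rule out OOM kills as a cause, you can search the kernel log for kill records. The sketch below assumes the kernel logs lines containing "Killed process PID (name)"; the exact wording varies between kernel versions, so treat the pattern and the function name as assumptions to adapt.

```python
import re
from collections import Counter

# Assumed kernel wording; verify against your own /var/log/messages or dmesg.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_kills(lines):
    """Count OOM kill records per process name."""
    counts = Counter()
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

If the database process shows up here around the times the table was found corrupted, memory pressure is a stronger suspect than the storage engine.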

4. Separate Test and Production Environments

Always verify operations on a test machine first, and avoid keeping several terminal windows open on production, where it is easy to type a command into the wrong one.

