Essential Ops Playbook: 6 Key Practices to Prevent Disasters

Drawing from a year‑and‑a‑half of ops experience, this guide outlines six practical categories—online operation standards, data handling, security, daily monitoring, performance tuning, and mindset—to help engineers avoid costly mistakes and maintain stable, secure systems.

Efficient Ops

Aug 2, 2017

1. Online Operation Standards

1.1 Test Before Use

While learning Linux on virtual machines, I got used to trying changes directly, and I carried that habit onto production servers. It once locked me out of a server: I restarted sshd after editing sshd_config without keeping a backup of the file. A similar mistake with rsync deleted production data because the source and destination were reversed. Try every change in a test environment first, and only then apply it to production.
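As a minimal sketch of "test before use" for the sshd case, the check below runs sshd in its test mode (`sshd -t -f <file>`) against an edited config before any restart. The function name and the injectable `run` parameter are assumptions made for illustration; on a real host the command usually needs root and may need the full path to the sshd binary.

```python
import subprocess


def sshd_config_is_valid(config_path, run=subprocess.run):
    """Return True when `sshd -t -f config_path` exits with status 0.

    `run` is injectable (hypothetical design choice) so the logic can be
    exercised without a real sshd binary.
    """
    result = run(["sshd", "-t", "-f", config_path], capture_output=True)
    return result.returncode == 0
```

Only restart the service when the check passes. The same idea applies to rsync, whose `--dry-run` option reports what would be transferred or deleted without changing anything.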

1.2 Confirm Before Enter

Commands like rm -rf /var can wipe critical data in an instant, and a single slip is enough. Reread the whole command, especially the path, before pressing Enter.
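One way to automate part of that confirmation is a guard that refuses obviously dangerous targets. This is a sketch under stated assumptions: the `PROTECTED_PATHS` set and the function name are hypothetical, paths are POSIX-style, and the check is purely lexical (it does not follow symlinks).

```python
import os

# Hypothetical list of directories that should never be deleted wholesale.
PROTECTED_PATHS = {"/", "/bin", "/boot", "/etc", "/home", "/usr", "/var"}


def is_protected(path):
    """Return True if `path` normalizes to one of the protected directories."""
    # abspath also collapses "..", "." and trailing slashes.
    return os.path.abspath(path) in PROTECTED_PATHS
```

A wrapper script could call `is_protected` on every argument and abort before a recursive delete ever runs.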

1.3 Avoid Multiple Operators

When many people edit the same server, conflicting changes become inevitable, leading to confusion and wasted effort.

1.4 Backup Before Change

Always back up configuration files (e.g., .conf) before editing, and comment out original options rather than overwriting them.
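A minimal sketch of the backup step, assuming a timestamped `.bak.<time>` suffix as the naming convention (the suffix and function name are illustrative, not from the original):

```python
import shutil
import time


def backup_file(path):
    """Copy `path` to `path.bak.<timestamp>` and return the backup path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_path = f"{path}.bak.{stamp}"
    # copy2 copies the content and also tries to preserve file metadata
    # such as the modification time.
    shutil.copy2(path, backup_path)
    return backup_path
```

Calling `backup_file` right before opening the file in an editor leaves a known-good copy next to it to roll back to.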

2. Data Handling

2.1 Use rm -rf with Extreme Caution

A tiny mistake with a recursive delete can cause massive loss; double‑check any destructive command.
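One common way to soften a recursive delete is to move the target into a holding directory and purge it later. The sketch below assumes a hypothetical `trash_dir` chosen by the operator; it is an illustration of the idea, not a substitute for backups.

```python
import os
import shutil
import time


def soft_delete(path, trash_dir):
    """Move `path` into `trash_dir` instead of deleting it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    name = os.path.basename(os.path.normpath(path))
    # The timestamp suffix keeps repeated deletions of the same name apart.
    target = os.path.join(trash_dir, f"{name}.{stamp}")
    shutil.move(path, target)
    return target
```

If the deletion turns out to be a mistake, the data can be moved back from the returned location.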

2.2 Backup Is Paramount

In my previous company, third‑party payment services were backed up every two hours, while a loan platform backed up every 20 minutes. Frequent backups are essential.
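A backup schedule is only useful if someone notices when it stops running. The sketch below checks whether the last backup is older than the allowed interval; the function name is hypothetical, and the 20-minute value in the usage example mirrors the loan-platform interval mentioned above.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup, max_age, now=None):
    """Return True if more than `max_age` has passed since `last_backup`."""
    now = now or datetime.now()
    return now - last_backup > max_age


# Example: a 20-minute schedule, last backup taken 25 minutes ago.
last = datetime(2017, 8, 2, 10, 0)
print(backup_is_stale(last, timedelta(minutes=20), now=datetime(2017, 8, 2, 10, 25)))
```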

2.3 Stability Over Speed

Prioritize a stable environment over the newest software; untested upgrades and migrations (e.g., switching from Apache to Nginx) often introduce more problems than they solve.

2.4 Confidentiality Is Critical

With data leaks commonplace, protecting sensitive information must be a top priority.

3. Security

3.1 SSH Hardening

Change the default port, disable root login, use regular users with key authentication, enforce sudo rules, restrict IPs, and employ host‑deny mechanisms to block repeated attacks.
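Some of these points can be checked mechanically. The sketch below scans sshd_config text for three of them (`Port`, `PermitRootLogin`, `PasswordAuthentication`, all real sshd_config keywords). The function name and warning messages are assumptions, and the parser is deliberately simple: it ignores `Match` blocks and multiple `Port` lines.

```python
def audit_sshd_config(text):
    """Return a list of warnings for a few settings found in sshd_config text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out options
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it obtains for a keyword,
            # and keywords are case-insensitive.
            settings.setdefault(parts[0].lower(), parts[1].strip())

    warnings = []
    if settings.get("port", "22") == "22":
        warnings.append("Port is still the default 22")
    if settings.get("permitrootlogin") != "no":
        warnings.append("PermitRootLogin is not explicitly set to no")
    if settings.get("passwordauthentication") != "no":
        warnings.append("PasswordAuthentication is not explicitly set to no")
    return warnings
```

An empty list means these three checks passed; it says nothing about sudo rules or IP restrictions, which need their own review.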

3.2 Firewall

Enable firewalls in production and follow the principle of least privilege: drop everything by default and open only required ports.
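As an illustration of "drop by default, open only what is needed", the sketch below generates iptables commands as plain strings without executing them. It assumes iptables is the firewall in use (the original does not name one), and the function name and example ports are hypothetical. The DROP policy is emitted last so that the accept rules are already in place when it takes effect.

```python
def build_input_rules(allowed_tcp_ports):
    """Return iptables commands for a default-drop INPUT chain (not executed)."""
    rules = [
        # Keep loopback and already-established connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the default policy only after the accept rules exist.
    rules.append("iptables -P INPUT DROP")
    return rules


for rule in build_input_rules([2222, 443]):
    print(rule)
```

Review the generated list by hand, and test it on a non-production host before applying it anywhere that matters.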

3.3 Fine‑Grained Permissions

Run services with non‑root users whenever possible and limit each service to the minimum necessary privileges.

3.4 Intrusion Detection & Log Monitoring

Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf) and centralize logs such as /var/log/secure and /var/log/messages for real‑time alerts.
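The core idea behind file-integrity tools can be sketched with the standard library: record a hash of each watched file, then compare later. This is a stand-in to illustrate the mechanism, not a replacement for a dedicated tool, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Record the current hash of every path."""
    return {p: sha256_of(p) for p in paths}


def changed_files(baseline, paths):
    """Return paths whose current hash differs from (or is absent in) the baseline."""
    return [p for p in paths if baseline.get(p) != sha256_of(p)]
```

In practice the baseline has to be stored somewhere an intruder cannot rewrite it, which is one reason dedicated tools exist.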

4. Daily Monitoring

4.1 System Health

Monitor hardware metrics—CPU, memory, disk, network—as well as OS login activity and key file changes to predict failures.
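As one small example of a health check, the sketch below computes disk usage with `shutil.disk_usage` and compares it to a threshold. The 90% default and the function names are assumptions; a real setup would feed such values into a monitoring system rather than a script.

```python
import shutil


def disk_used_percent(path="/"):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on pseudo-filesystems
    return usage.used / usage.total * 100


def needs_alert(percent, threshold=90.0):
    """Return True when usage has reached the (assumed) alert threshold."""
    return percent >= threshold
```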

4.2 Service Availability

Track web, database, and load‑balancer metrics so performance bottlenecks are detected early.
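The simplest availability probe is a TCP connection attempt. The sketch below only shows that a port accepts connections, not that the service behind it is healthy; the function name and the 2-second timeout are assumptions.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A web or database check would normally go further, for example by issuing a request or a trivial query and measuring the response time.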

4.3 Log Surveillance

Beyond security logs, monitor application and hardware error logs; without them, troubleshooting becomes reactive.
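A minimal sketch of log surveillance: count lines by severity so that a sudden rise in errors becomes visible. The level names matched here (ERROR, CRITICAL, FATAL) are an assumption about the log format, and real application logs may need a different pattern.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to the application's actual log format.
LEVEL_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")


def count_error_levels(lines):
    """Return a Counter of severity keywords found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object works as well as a list, since both yield lines.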

5. Performance Tuning

5.1 Understand Underlying Mechanics

Before tweaking parameters, understand why a piece of software (e.g., Nginx) performs better than its alternatives; this may require reading the source code.

5.2 Structured Tuning Process

Identify bottlenecks via logs, define a tuning direction, and adjust layers in order: hardware → OS → application → database.

5.3 Change One Parameter at a Time

Isolating each change prevents confusion about which adjustment produced the effect.
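The rule can be enforced when generating test configurations: each trial starts from the same baseline and changes exactly one key. In this sketch the function name and the example values are hypothetical; `worker_processes` and `keepalive_timeout` are used only as familiar Nginx directive names.

```python
def one_at_a_time(baseline, candidates):
    """Yield (name, config) pairs where config differs from baseline in one key."""
    for name, value in candidates.items():
        if baseline.get(name) == value:
            continue  # nothing to test, the value is unchanged
        trial = dict(baseline)  # copy, so the baseline is never mutated
        trial[name] = value
        yield name, trial


baseline = {"worker_processes": 2, "keepalive_timeout": 65}
for name, config in one_at_a_time(baseline, {"worker_processes": 4, "keepalive_timeout": 30}):
    print(name, config)
```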

5.4 Benchmarking

Use realistic benchmark tests to verify improvements and ensure they match business workloads.
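A minimal sketch of measuring before and after a change: run the workload several times, take the median to dampen outliers, and express the difference as a percentage. The function names and the default of five repeats are assumptions, and the workload passed in should resemble real business traffic for the result to mean anything.

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls to func()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def improvement_percent(before, after):
    """Return how much faster `after` is than `before`, in percent of `before`."""
    return (before - after) / before * 100
```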

6. Ops Mindset

6.1 Control Your Emotions

When under pressure (e.g., accidental rm -rf minutes before shift end), stay calm, avoid touching critical data, and consider worst‑case recovery steps.

6.2 Take Responsibility for Data

Production data is not a playground; lack of backups leads to severe consequences.

6.3 Investigate Root Causes

After fixing an issue, dig deeper to find out why it happened. For example, a MySQL crash may turn out to be an OOM kill, caused by insufficient memory and missing swap.
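For the missing-swap part of that example, the sketch below reads the `SwapTotal` field from text in the `/proc/meminfo` format. The function names are hypothetical, and the check only confirms that no swap is configured; it does not prove that an OOM kill occurred.

```python
def meminfo_kb(meminfo_text, field):
    """Return the value (in kB) of `field` from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field and rest.split():
            return int(rest.split()[0])
    return None


def swap_is_missing(meminfo_text):
    """Return True when SwapTotal is present and equal to zero."""
    return meminfo_kb(meminfo_text, "SwapTotal") == 0
```

On a Linux host the input would come from reading `/proc/meminfo`, for example `swap_is_missing(open("/proc/meminfo").read())`.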

6.4 Separate Test and Production

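Before running any command, verify that you are connected to the intended environment, and keep the number of open sessions small so that test and production terminals are not confused.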

Source: https://segmentfault.com/a/1190000010242487
