Essential Ops Checklist: Avoid Disasters with Proven Practices

A seasoned operations engineer shares a comprehensive guide covering online operation standards, data handling, security hardening, daily monitoring, performance tuning, and the right mindset to prevent costly incidents and ensure stable, secure, and efficient production environments.

ITPUB

Apr 21, 2018

1. Online Operation Standards

1.1 Test Before Use

Shortly after first gaining server access, the author switched from PuTTY to Xshell and changed the SSH configuration without testing it, and was locked out until a backup of sshd_config was restored.
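One way to apply this rule to SSH changes is to validate the edited file before restarting the daemon. The sketch below relies on OpenSSH's test mode (sshd -t, with -f naming the file to check). The function name and the injectable run parameter are illustrative assumptions and do not come from the original article.

```python
import subprocess


def sshd_config_is_valid(path="/etc/ssh/sshd_config", run=subprocess.run):
    """Return True if `sshd -t -f <path>` accepts the configuration file.

    `run` is injectable so the check can be exercised without a real sshd.
    """
    result = run(["sshd", "-t", "-f", path], capture_output=True, text=True)
    return result.returncode == 0
```

Even when the check passes, it is safer to keep the current session open and confirm that a second login works before disconnecting.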

1.2 Confirm Before Pressing Enter

A careless rm -rf can delete critical data. In one case, a single typo in an rsync command caused irreversible loss of production files because no backup existed.

1.3 Avoid Multiple Operators

Sharing root passwords among many operators leads to configuration drift and confusion; the author describes a chaotic scenario where several team members edited the same server simultaneously, making it impossible to pinpoint the true cause of issues.

1.4 Backup Before Changes

Always back up configuration files (e.g., .conf) before modifying them: copy the file first, and comment out the original options instead of deleting them, so that a quick revert is possible.
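A minimal sketch of the copy-before-edit habit, using only the Python standard library. The function names and the timestamped .bak naming scheme are assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_config(path, suffix_format="%Y%m%d-%H%M%S"):
    """Copy `path` to `<name>.<timestamp>.bak` next to it and return the backup path."""
    source = Path(path)
    stamp = time.strftime(suffix_format)
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves mtime and permission bits
    return backup


def restore_config(backup, path):
    """Overwrite `path` with a previously saved backup."""
    shutil.copy2(backup, path)
```

Calling backup_config before each edit leaves a dated copy beside the original, so a revert is a single restore_config call.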

2. Data Handling

2.1 Use rm -rf Cautiously

Blindly executing destructive commands leads to severe data loss; double‑check paths and necessity before running such commands.
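The double-check can be partly automated. The sketch below wraps a recursive delete with two guards: the target must sit strictly inside an expected root, and nothing is removed unless dry_run is switched off. The PROTECTED set and the function name are hypothetical examples, not a complete safety net.

```python
import shutil
from pathlib import Path

# Hypothetical list of paths that must never be removed by this helper.
PROTECTED = {Path("/"), Path("/etc"), Path("/usr"), Path("/var"), Path("/home")}


def guarded_rmtree(target, allowed_root, dry_run=True):
    """Delete `target` only if it resolves to a path strictly inside `allowed_root`."""
    path = Path(target).resolve()
    root = Path(allowed_root).resolve()
    if path in PROTECTED:
        raise ValueError(f"refusing to delete protected path: {path}")
    if path == root or root not in path.parents:
        raise ValueError(f"{path} is not inside {root}")
    if dry_run:
        return f"would delete {path}"
    shutil.rmtree(path)
    return f"deleted {path}"
```

Defaulting to a dry run means the resolved path is shown first and deletion requires an explicit second call.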

2.2 Backup Is Paramount

Regular backups are essential. The author cites examples where a third‑party payment platform backs up every two hours, while a loan platform backs up every 20 minutes, emphasizing that frequent backups dramatically reduce risk.

2.3 Stability Over Speed

Prioritize stability and availability over raw performance; avoid deploying untested software versions (e.g., new Nginx + PHP‑FPM stacks) in production without thorough validation.

2.4 Confidentiality Matters

Data must be kept confidential; the article notes the prevalence of leaked private images and router backdoors, underscoring the need for strict access controls.

3. Security

3.1 SSH Hardening

Change the default SSH port, disable direct root login, enforce key‑based authentication, apply sudo rules, restrict login by IP, and use tools like DenyHosts to block repeated failed attempts.
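Some of these settings can be checked mechanically. The sketch below audits the text of an sshd_config for three of the points above using the standard directives Port, PermitRootLogin, and PasswordAuthentication. The RECOMMENDED table and the function name are assumptions. The parser is deliberately simplified: it ignores Match blocks, and it reports an option that is not set explicitly as None, without knowing the daemon's built-in default.

```python
RECOMMENDED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def audit_sshd_config(text):
    """Return a list of findings for an sshd_config given as a string."""
    seen = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        # sshd uses the first value it reads for each keyword, so keep the first.
        seen.setdefault(key.lower(), value.strip().lower())
    findings = []
    for key, wanted in RECOMMENDED.items():
        actual = seen.get(key)
        if actual != wanted:
            findings.append(f"{key}: expected {wanted!r}, found {actual!r}")
    if seen.get("port", "22") == "22":
        findings.append("port: still on the default 22")
    return findings
```

An empty list means none of the three checks fired. It does not mean the host is fully hardened.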

3.2 Firewall Configuration

Enable the firewall in production and follow the principle of least privilege: drop all traffic by default and explicitly allow only required service ports.
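As an illustration of default-drop, the sketch below builds a list of iptables commands for the INPUT chain: loopback and established connections are accepted, each listed TCP port is opened, and the DROP policy is set last so that an existing SSH session is not cut off midway. The function name is an assumption, and the commands are only generated as strings, never executed.

```python
def default_drop_rules(allowed_tcp_ports):
    """Build iptables commands that allow only the given TCP ports inbound."""
    rules = [
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the default policy last, after every allow rule is in place.
    rules.append("iptables -P INPUT DROP")
    return rules
```

For example, default_drop_rules([22, 443]) yields five commands ending with the DROP policy. Review the output before applying it on a real host.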

3.3 Fine‑Grained Permissions

Run services with the lowest possible privileges; avoid running daemons as root and limit each service’s access to only what it needs.

3.4 Intrusion Detection and Log Monitoring

Deploy third‑party tools to watch critical files (e.g., /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf) for unauthorized changes, and use centralized log aggregation to monitor /var/log/secure, /var/log/messages, and FTP activity. Block scanning IPs via /etc/hosts.deny and retain logs for post‑incident analysis.
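The idea behind file-integrity monitoring can be sketched in a few lines: record a SHA-256 digest of each critical file, then compare later. This is a simplified illustration with assumed function names. It is not a substitute for a dedicated tool, and it only helps if the baseline is stored where an intruder cannot rewrite it.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def take_baseline(paths):
    """Map each path to its current digest."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(baseline):
    """Return the paths that are missing or whose content no longer matches."""
    changed = []
    for path, expected in baseline.items():
        if not Path(path).exists() or sha256_of(path) != expected:
            changed.append(path)
    return changed
```

A scheduled job could call changed_files against a stored baseline for files such as /etc/passwd and raise an alert whenever the list is not empty.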

4. Daily Monitoring

4.1 System Health Monitoring

Track hardware utilization such as CPU, memory, disk, and network interfaces, as well as OS login events and critical file integrity, to predict hardware failures and guide performance tuning.
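A minimal sketch of threshold-based health checks, assuming a Unix host. It samples disk usage and the 1-minute load average with the standard library and compares them against limits. The THRESHOLDS values and metric names are hypothetical and should be tuned per system.

```python
import os
import shutil

# Hypothetical alert limits; real values depend on the workload.
THRESHOLDS = {"disk_used_pct": 85.0, "load_per_cpu": 1.5}


def collect_metrics(mount="/"):
    """Sample disk usage and the 1-minute load average (Unix only)."""
    usage = shutil.disk_usage(mount)
    load1, _, _ = os.getloadavg()
    return {
        "disk_used_pct": 100.0 * usage.used / usage.total,
        "load_per_cpu": load1 / (os.cpu_count() or 1),
    }


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )
```

Recording the output of collect_metrics over time, not only the alerts, is what makes trend-based prediction possible.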

4.2 Service Health Monitoring

Monitor key metrics of web servers, databases, load balancers, etc., so performance bottlenecks can be detected and addressed promptly.

4.3 Log Monitoring

Collect and analyze OS and application error logs; while logs may seem unnecessary during stable operation, they become vital when issues arise.
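As a rough illustration of log analysis, the sketch below counts lines by severity keyword. The keyword list and function name are assumptions. Real log formats vary, and only the first keyword on each line is counted.

```python
import re
from collections import Counter

LEVEL = re.compile(r"\b(ERROR|WARN(?:ING)?|CRITICAL|FATAL)\b", re.IGNORECASE)


def summarize_levels(lines):
    """Count log lines by the first severity keyword they contain."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            level = match.group(1).upper()
            counts["WARNING" if level == "WARN" else level] += 1
    return counts
```

A sudden jump in these counts between two collection intervals is often a useful first signal, even before anyone reads the individual messages.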

5. Performance Tuning

5.1 Understand Runtime Mechanisms

Deep knowledge of software internals (e.g., why Nginx processes requests faster than Apache) is required before adjusting parameters; reading source code may be necessary.

5.2 Tuning Framework and Order

Identify the bottleneck first, analyze logs, define a tuning direction, then address OS/hardware issues before moving to database configuration, which should be the last step.

5.3 Change One Parameter at a Time

Modify a single setting per iteration to avoid confusion and isolate the impact of each change.

5.4 Benchmark Testing

Perform baseline benchmarks to verify whether tuning improves performance or stability; reference materials such as "High Performance MySQL" can guide testing methodology.
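The before-and-after comparison can be sketched as follows: time the same workload several times, take the median to dampen outliers, and express the tuned result relative to the baseline. The function names are illustrative assumptions. A real benchmark would also control warm-up, caching, and concurrent load.

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Return the median wall-clock time of `func()` over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_change(baseline_seconds, tuned_seconds):
    """Negative values mean the tuned run was faster than the baseline."""
    return (tuned_seconds - baseline_seconds) / baseline_seconds
```

Combined with changing one parameter per iteration, each relative_change value can be attributed to a single setting.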

6. Ops Mindset

6.1 Control Emotions

Stay calm during critical incidents; avoid making hasty changes when under pressure, especially when dealing with destructive commands.

6.2 Responsibility for Data

Production data is not a playground; always ensure backups exist to mitigate severe consequences.

6.3 Root‑Cause Analysis

Investigate recurring failures thoroughly; the author describes MySQL crashes caused by OOM kills due to insufficient memory and lack of swap, which were resolved by adding physical RAM.
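To confirm that crashes like these are OOM kills and not application bugs, the kernel log can be searched for the OOM killer's messages. The sketch below assumes the common "Out of memory: Kill process ..." and "Out of memory: Killed process ..." wording. The exact text varies between kernel versions, so treat the pattern and the function name as assumptions.

```python
import re
from collections import Counter

# Matches both the "Kill process" and "Killed process" variants.
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_victims(log_lines):
    """Count OOM kills per process name found in kernel log lines."""
    victims = Counter()
    for line in log_lines:
        match = OOM_LINE.search(line)
        if match:
            victims[match.group(2)] += 1
    return victims
```

If the database process dominates the counts, the root cause is likely memory pressure, which points toward the kind of fix the author describes.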

6.4 Test in Non‑Production

Validate all critical operations on test machines before applying them to production, and minimize the number of open terminal windows to reduce accidental mistakes.
