Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

This guide outlines critical operational practices for Linux server management, emphasizing thorough testing, cautious command execution, regular backups, strict access controls, comprehensive monitoring, performance tuning, and a disciplined mindset to avoid costly incidents and ensure system stability.

Efficient Ops

Jan 2, 2019

1. Online Operation Guidelines

1. Test Before Use: Many people learn Linux on virtual machines, and habits formed there can lead to reckless actions on real servers. The author recounts two mistakes: changing SSH settings without testing them, which led to being locked out of the server, and a costly rsync run with the source and destination directories reversed, which deleted production data.
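
The rsync incident is the kind of mistake a dry run can surface before any data changes. The following is a minimal sketch, not the author's procedure: build_rsync_command is a hypothetical helper and the paths are illustrative, while -a, --delete, and --dry-run are standard rsync options.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list (hypothetical helper for illustration).

    dry_run defaults to True, so the destructive run must be requested explicitly.
    """
    command = ["rsync", "-a", "--delete"]
    if dry_run:
        # --dry-run reports what would be copied or deleted without changing anything.
        command.append("--dry-run")
    command += [source, destination]
    return command


if __name__ == "__main__":
    # Illustrative paths only: review the dry-run output before the real run.
    print(shlex.join(build_rsync_command("/srv/app/", "backup-host:/srv/app/")))
```

Reading the dry-run output makes a swapped source and destination visible while it is still harmless.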

2. Confirm Before Pressing Enter: Commands like rm -rf /var can cause severe damage, especially when typed quickly or over a slow connection.

When you realize the command has run, your heart will feel half‑frozen.

One mistake teaches you that operational accidents can happen to anyone.
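
One way to build the "confirm first" habit into tooling is a guard that refuses protected paths and requires the operator to retype the target. This is a sketch under assumptions: the PROTECTED_PATHS set and both function names are hypothetical, not something the article prescribes.

```python
import os

# Assumed set of directories that must never be removed recursively; adjust per system.
PROTECTED_PATHS = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def is_protected(path):
    """Return True if the normalized absolute path is a protected directory."""
    normalized = os.path.normpath(os.path.abspath(path))
    return normalized in PROTECTED_PATHS


def confirm_delete(path, typed_confirmation):
    """Allow deletion only for unprotected paths that the operator retyped exactly."""
    if is_protected(path):
        return False
    return typed_confirmation == path


if __name__ == "__main__":
    print(confirm_delete("/var", "/var"))                      # protected, refused
    print(confirm_delete("/var/tmp/cache", "/var/tmp/cache"))  # retyped correctly
```

The guard only covers exact matches against the set; it does not replace reading the command before running it.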

3. Avoid Multiple People Operating Simultaneously: When several operators share the root password, their changes can conflict, making it hard to identify the true cause of an issue.

4. Back Up Before Modifying: Always back up configuration files (e.g., .conf) before editing them. When changing an option, comment out the original line and edit a copy of it, so the previous value stays visible. A database backup would have mitigated the rsync incident.
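
Backing up a configuration file before touching it can be reduced to one call. The sketch below assumes a hypothetical helper name and a timestamped .bak naming scheme; neither comes from the article.

```python
import shutil
import time
from pathlib import Path


def backup_before_edit(path):
    """Copy path to a timestamped sibling file and return the backup's Path."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 copies the content and also preserves timestamps and permission bits.
    shutil.copy2(source, backup)
    return backup
```

Calling backup_before_edit on a file named app.conf would leave a copy such as app.conf.20190102-101500.bak next to it, so a bad edit can be reverted by copying the backup back.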

2. Data Handling

1. Use rm -rf with Extreme Caution: Accidental deletions can cause massive data loss; always double-check the path before deleting.

2. Backup Is Paramount: The author's former company ran full backups every two hours for a payment platform and every 20 minutes for a lending platform.
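
A backup schedule is only useful if someone notices when it stops. The sketch below checks whether the latest backup is overdue, using the two intervals quoted in the article; the platform keys and the function name are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Intervals quoted in the article; the platform keys are illustrative labels.
BACKUP_INTERVALS = {
    "payment": timedelta(hours=2),
    "lending": timedelta(minutes=20),
}


def backup_is_overdue(platform, last_backup, now):
    """Return True if more than one full interval has passed since last_backup."""
    return now - last_backup > BACKUP_INTERVALS[platform]


if __name__ == "__main__":
    now = datetime.now()
    # A lending backup taken 25 minutes ago is past its 20-minute interval.
    print(backup_is_overdue("lending", now - timedelta(minutes=25), now))
```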

3. Stability Over Speed: Prioritize stability and availability; do not put untested software into production (e.g., switching from Apache to Nginx without testing).

4. Confidentiality Is Critical: Data must be protected against leaks and backdoors.

3. Security Measures

1. SSH Hardening

Change the default port.

Disable root login.

Log in as regular users with key authentication, and apply sudo rules and IP restrictions.

Deploy intrusion‑prevention tools that block repeated failed attempts.

Audit /etc/passwd to confirm that every account on the system is legitimate.
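
The first items in this list can be checked mechanically against the text of sshd_config. This is a simplified sketch: Port, PermitRootLogin, and PasswordAuthentication are real sshd_config directives, but the function names are hypothetical and the parser ignores Include files and Match blocks.

```python
def parse_sshd_config(text):
    """Return a dict of lowercase directive name to its first value, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it reads for a keyword, so keep the first one.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def hardening_findings(text):
    """List the hardening items from this section that the config does not satisfy."""
    settings = parse_sshd_config(text)
    findings = []
    if settings.get("port", "22") == "22":
        findings.append("SSH still listens on the default port 22")
    if settings.get("permitrootlogin", "unset").lower() != "no":
        findings.append("PermitRootLogin is not set to no")
    if settings.get("passwordauthentication", "yes").lower() != "no":
        findings.append("PasswordAuthentication is not disabled")
    return findings


if __name__ == "__main__":
    sample = "Port 2222\nPermitRootLogin no\n#PasswordAuthentication no\n"
    for finding in hardening_findings(sample):
        print(finding)
```

An empty result means only that these three settings look right; sudo rules, IP restrictions, and account audits still need their own checks.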

2. Firewall: Enable a firewall on production servers and follow the principle of least privilege, dropping all traffic by default and allowing only the ports you need.
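
A default-drop policy can be expressed as a short, ordered rule list. The sketch below generates iptables commands as strings and does not run them; the helper name and the example ports are assumptions, and the same idea applies to other firewall front ends.

```python
def build_iptables_rules(allowed_tcp_ports):
    """Return iptables commands for a default-drop inbound policy (strings only)."""
    rules = [
        # Keep loopback traffic and replies to already established connections working.
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the DROP policy last, so the ACCEPT rules exist before the default changes.
    rules.append("iptables -P INPUT DROP")
    return rules


if __name__ == "__main__":
    # Example ports are assumptions: a relocated SSH port plus HTTP and HTTPS.
    for rule in build_iptables_rules([2222, 80, 443]):
        print(rule)
```

Putting the policy change last reduces the chance of cutting off the SSH session that is applying the rules.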

3. Fine-Grained Permissions: Run services as non-root users whenever possible and limit permissions to the minimum required.

4. Intrusion Detection and Log Monitoring

Use third‑party tools to monitor critical system files and service configurations.

Centralize log monitoring for /var/log/secure, /var/log/messages, FTP activity, etc.

Block IPs that perform port scans; log analysis helps during post‑incident forensics.
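
As a small illustration of log-driven blocking, the sketch below counts failed SSH password attempts per source address in lines from a log such as /var/log/secure. The regular expression follows the usual sshd "Failed password" message; the threshold and function names are assumptions, and dedicated intrusion-prevention tools do this far more robustly.

```python
import re
from collections import Counter

# Matches sshd lines such as:
#   Failed password for invalid user admin from 203.0.113.5 port 50022 ssh2
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def failed_logins_by_ip(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def ips_to_block(lines, threshold=5):
    """Return addresses with at least threshold failures (threshold is an assumed value)."""
    counts = failed_logins_by_ip(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The output is a candidate list for review or for feeding into firewall rules, and the same counts are useful later during forensics.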

Basic security work dramatically improves system safety.

4. Daily Monitoring

1. System Health Monitoring: Track hardware usage (CPU, memory, disk, network) and OS metrics (login activity, critical file changes). Regular monitoring helps predict hardware failures and informs tuning.
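
Disk usage is one of the simplest of these metrics to check. The sketch below uses the standard library's shutil.disk_usage; the 80 percent warning threshold and the function names are assumptions, and a real deployment would normally rely on a monitoring system.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 when total is zero)."""
    return 100.0 * used / total if total else 0.0


def check_disk(path="/", warn_at=80.0):
    """Return (percent_used, should_warn) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free in bytes
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= warn_at


if __name__ == "__main__":
    pct, warn = check_disk("/")
    print(f"/ is {pct:.1f}% full", "(warning)" if warn else "")
```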

2. Service Monitoring: Monitor web, database, load balancer, and other services to quickly detect performance bottlenecks.
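
The most basic service check is whether the port accepts connections at all. The sketch below uses a plain TCP connect; the host, port, and function name are illustrative, and a successful connect says nothing yet about response time or correctness.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Illustrative target: a MySQL server on its conventional port.
    print(tcp_port_open("127.0.0.1", 3306))
```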

3. Log Monitoring: Beyond security logs, monitor application and hardware error logs to catch problems early.

5. Performance Tuning

1. Understand Underlying Mechanisms: Before tweaking parameters, understand why a piece of software (e.g., Nginx vs. Apache) performs the way it does; reading the source code may be necessary.

2. Structured Tuning Process: Identify bottlenecks via logs, define a tuning direction, then adjust. Typically, hardware and OS tuning come first, followed by database tuning.

3. Change One Parameter at a Time: Isolating each change makes its effect clear.

4. Benchmarking: Use realistic benchmark tests to verify that tuning improves performance and meets business needs.
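
Changing one parameter at a time pays off only if each change is measured the same way. The sketch below times a callable several times and compares medians; the function names and the 5 percent minimum gain are assumptions, and a real benchmark should replay realistic workloads.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Run func repeats times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def improved(before, after, min_gain=0.05):
    """Return True if after is at least min_gain (as a fraction) faster than before."""
    return after <= before * (1.0 - min_gain)
```

Measure a baseline with median_runtime, change a single parameter, measure again, and keep the change only if improved(baseline, new_value) holds.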

6. Operational Mindset

1. Control Your Emotions: Avoid making critical changes when stressed. Postpone the work if you can, and delegate it if it cannot wait.

If a database is deleted by accident, do not stop the MySQL process; clone the disk with dd and attempt recovery before contacting a data-recovery service.
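
The article does not say why the MySQL process should stay alive. One relevant fact on Linux is that a deleted file remains readable through /proc/<pid>/fd for as long as some process still holds it open. The sketch below demonstrates that mechanism only; the function name is hypothetical, it is Linux-specific, and it is not a replacement for cloning the disk.

```python
import os
import shutil


def recover_open_deleted_file(fd, destination, pid=None):
    """Copy a deleted-but-still-open file out of /proc/<pid>/fd (Linux only).

    Returns False when the /proc entry is not available.
    """
    pid = os.getpid() if pid is None else pid
    proc_path = f"/proc/{pid}/fd/{fd}"
    if not os.path.exists(proc_path):
        return False
    with open(proc_path, "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```

For another process, pid would be that process's ID and fd one of the descriptor numbers listed under its /proc/<pid>/fd directory, which usually requires root.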

2. Take Responsibility for Data: Production data is not a toy; operating without backups leads to severe consequences.

3. Investigate Root Causes: Do not settle for quick fixes. Trace each issue to its source (e.g., MySQL crashing because the system ran out of memory) and address the underlying resource problem.
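
For the OOM example, the kernel log usually records which process the OOM killer terminated. The sketch below scans log lines for that message; the regular expression targets the common "Killed process <pid> (<name>)" wording, which can vary between kernel versions, and the function name is an assumption.

```python
import re

# Matches kernel lines such as:
#   Out of memory: Killed process 4321 (mysqld) total-vm:8123456kB, anon-rss:7000000kB
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_victims(kernel_log_lines):
    """Return (pid, process_name) tuples for each OOM kill found in the lines."""
    victims = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims
```

Finding mysqld among the victims confirms the symptom; the root cause is whatever consumed the memory, which still has to be traced and fixed.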

4. Separate Test and Production Environments: Always verify changes on test machines first, and keep as few terminals open as possible during critical operations so that a command is not typed into the wrong window.

Source: http://www.cnblogs.com/yihr/p/9593795.html?from=groupmessage Author: 油腻克斯
