
10 Must‑Know Ops Pitfalls and How to Avoid Them

This guide covers the ten most common operations mishaps, from accidental rm -rf deletions to firewall rule errors. For each one it walks through a real-world case, the exact impact, step-by-step remediation commands, and preventive checklists, scripts, and monitoring setups that keep a production environment safe.

MaGe Linux Operations

Overview

This guide collects the ten most frequent operational mistakes made by junior engineers and, for each, provides a concrete reproducible example, the exact impact, a complete remediation procedure, and a set of preventive measures. All commands were verified on Ubuntu 22.04/24.04 and Rocky Linux 9.

1. Accident Scenarios

rm -rf with a stray space (deleting /var)

Scenario: A junior intends to run rm -rf /var/log/app/ but types rm -rf /var /log/app/ ; the stray space makes /var a separate argument, wiping the entire /var directory. Process: The command deletes system logs, caches, and database files. On Ubuntu the deletion finishes in 3-5 seconds, leaving the system unusable. Consequences: Loss of /var/lib/mysql, /var/spool, PID files, and a P0-level outage. Correct Approach:

Install trash-cli and use trash-put instead of rm.

Optionally protect critical paths with safe-rm, or define an interactive alias: alias rm='rm -i'.

Preventive Measures:

Deploy trash-cli on all servers.

Enforce a rm -i alias for non‑root users.

Maintain regular backups of /var and critical databases.

# Install trash-cli
sudo apt install -y trash-cli   # Ubuntu/Debian
sudo dnf install -y trash-cli   # Rocky Linux
# Safe delete example
trash-put /var/log/old-app.log
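The alias policy above can be enforced fleet-wide with a profile drop-in; a minimal sketch (the file path and the root exemption are assumptions):

```shell
# /etc/profile.d/safe-rm.sh -- interactive delete for non-root login shells
# (root is exempted so automation running as root is not prompted)
if [ "$(id -u)" -ne 0 ]; then
    alias rm='rm -i'
fi
```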

Direct production config edit without backup (nginx)

Scenario: An engineer edits /etc/nginx/nginx.conf directly, introduces a typo (proyx_pass instead of proxy_pass), and restarts nginx. Process: The broken configuration causes nginx to fail to start, leaving the service down. Consequences: All HTTP traffic fails (clients see 502s or connection errors), causing a service outage. Correct Approach:

Copy the original file before editing:

cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak.$(date +%F_%H%M%S)


Validate syntax with nginx -t before reloading.

Use etckeeper to version‑control /etc and revert changes easily.

Preventive Measures:

Enable etckeeper on all servers.

Require a successful nginx -t check in the change‑approval workflow.

Store configuration backups in a central Git repository.

# Backup before edit
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak.$(date +%F_%H%M%S)
# Syntax check
nginx -t
# Reload only after successful test
systemctl reload nginx
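etckeeper, mentioned above, puts all of /etc under version control so a bad edit can be diffed and reverted; a sketch assuming Git as the backing VCS:

```shell
# Install and initialise (apt shown; use dnf on Rocky Linux)
sudo apt install -y etckeeper
sudo etckeeper init
sudo etckeeper commit "baseline before nginx change"
# After a bad edit, inspect and roll back the file
sudo git -C /etc diff nginx/nginx.conf
sudo git -C /etc checkout -- nginx/nginx.conf
```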

Firewall rule order mistake (iptables)

Scenario: A new rule iptables -A INPUT -s 10.0.1.100 -p tcp --dport 3306 -j ACCEPT is appended after a REJECT all rule, so the accept rule is never reached. Process: The packet matches the earlier REJECT rule and is dropped. Consequences: Database port 3306 is inaccessible, causing application failures and service outages. Correct Approach:

Insert the rule before the REJECT rule:

iptables -I INPUT 5 -s 10.0.1.100 -p tcp --dport 3306 -j ACCEPT


Persist rules with iptables-save and load them on boot.

Prefer firewalld or nftables which handle ordering automatically.

Preventive Measures:

Document required ports per service and review rule order before applying changes.

Test connectivity from a remote host after each modification.

Version‑control firewall rules (e.g., store iptables-save output in Git).

# Insert rule before REJECT
iptables -I INPUT 5 -s 10.0.1.100 -p tcp --dport 3306 -j ACCEPT
# Save persistent rules
iptables-save > /etc/iptables/rules.v4
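With nftables, suggested above as the safer alternative, ordering lives in a declarative ruleset: rules match top to bottom in the file, so the accept line simply sits before the drop policy. A sketch of an /etc/nftables.conf fragment (the chain layout is an assumption):

```
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    ip saddr 10.0.1.100 tcp dport 3306 accept
    tcp dport 22 accept
  }
}
```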

Disk full caused by unchecked logs

Scenario: Log files under /var/log grow unchecked; a 24 GB debug log fills the root partition. Process: The system cannot write new data, causing services and SSH to become unresponsive. Consequences: All applications crash, database writes fail, and the server becomes inaccessible. Correct Approach:

Configure logrotate for all application logs.

Set size limits in journald (SystemMaxUse=2G).

Deploy a monitoring script that alerts when usage exceeds 80%.

Preventive Measures:

Enforce log rotation policies in CI/CD pipelines.

Mount /var/log on a separate partition.

Run regular du scans and clean up old files.

# Example logrotate entry for a Java app
/opt/app/logs/*.log {
    daily
    rotate 30
    compress
    missingok
    notifempty
    copytruncate
    size 500M
    dateext
    dateformat -%Y%m%d
}
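The journald cap mentioned above (SystemMaxUse=2G) can be applied with a standard drop-in rather than editing /etc/systemd/journald.conf directly; the drop-in file name below is illustrative:

```shell
sudo mkdir -p /etc/systemd/journald.conf.d
cat <<'EOF' | sudo tee /etc/systemd/journald.conf.d/size.conf
[Journal]
SystemMaxUse=2G
EOF
sudo systemctl restart systemd-journald
```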

SSH hardening without proper preparation

Scenario: An admin changes /etc/ssh/sshd_config to use port 2222, disables root login and password authentication, then restarts sshd. Process: All remote access is lost because the new port is not allowed through the firewall and no key is configured. Consequences: Complete loss of SSH access; recovery may require console or VNC access. Correct Approach:

Validate the configuration with sshd -t before restarting.

Keep a second SSH session open as a fallback.

Add the new port to the firewall before applying the change.

Use /etc/ssh/sshd_config.d/ to keep the main file untouched.

Preventive Measures:

Include a pre‑change checklist that verifies firewall rules and key presence.

Document the exact steps in a runbook and require peer review.

Automate the change with Ansible to ensure idempotence.

# Verify syntax first
sshd -t && systemctl reload sshd   # the unit is named 'ssh' on Ubuntu/Debian
# Add firewall rule before change
firewall-cmd --permanent --add-port=2222/tcp && firewall-cmd --reload
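The /etc/ssh/sshd_config.d/ approach above keeps the main file untouched; a minimal sketch (the drop-in file name is an assumption, and the firewall rule must be in place first):

```shell
cat <<'EOF' | sudo tee /etc/ssh/sshd_config.d/10-hardening.conf
Port 2222
PermitRootLogin no
PasswordAuthentication no
EOF
# Validate before reloading, and keep a second session open as a fallback
sudo sshd -t && sudo systemctl reload sshd
```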

Time‑zone and NTP misconfiguration

Scenario: Different servers run with mixed time zones (UTC, UTC+8, UTC‑5) and no NTP synchronization. Process: Log timestamps cannot be correlated, making incident analysis difficult. Consequences: Longer MTTR, false alerts, and potential TLS certificate validation failures. Correct Approach:

Set a unified time zone (e.g., Asia/Shanghai or UTC) with timedatectl set-timezone.

Install and configure chrony with reliable NTP servers.

Enable chronyd service and verify synchronization.

Preventive Measures:

Include time‑zone and NTP configuration in the server provisioning playbook.

Monitor chronyc tracking output and alert if offset > 100 ms.

# Set timezone
timedatectl set-timezone Asia/Shanghai
# Install chrony (Ubuntu example)
sudo apt install -y chrony
# /etc/chrony.conf
server ntp.aliyun.com iburst
server ntp.tencent.com iburst
makestep 1.0 3
# Enable and start (the service is 'chrony' on Ubuntu/Debian, 'chronyd' on Rocky)
systemctl enable --now chronyd
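The offset alert described in the preventive measures can be scripted by parsing chronyc tracking; a sketch (the 100 ms threshold and the alerting hookup are assumptions):

```shell
#!/bin/bash
# ntp-offset-check.sh -- warn when the chrony offset exceeds a threshold
set -euo pipefail
THRESHOLD_MS=100

# Parse `chronyc tracking` on stdin and print the absolute "Last offset"
# value in milliseconds.
offset_ms_from_tracking() {
  awk -F': ' '/^Last offset/ {
    split($2, a, " ")
    o = a[1] + 0
    if (o < 0) o = -o
    printf "%.3f", o * 1000
  }'
}

# Only probe when chronyd is actually reachable
if command -v chronyc >/dev/null 2>&1 && chronyc tracking >/dev/null 2>&1; then
  ms=$(chronyc tracking | offset_ms_from_tracking)
  if awk -v o="$ms" -v t="$THRESHOLD_MS" 'BEGIN { exit !(o > t) }'; then
    echo "[WARN] NTP offset ${ms} ms exceeds ${THRESHOLD_MS} ms"
  else
    echo "[OK] NTP offset ${ms} ms"
  fi
fi
```

Hook the script into cron or a systemd timer and route the WARN line to your alerting channel.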

Service restart order causing dependency failures

Scenario: An operator restarts services in the order nginx → app → redis → mysql without waiting for database readiness. Process: The application attempts to connect to MySQL and Redis before they are fully up, resulting in startup failures. Consequences: Application health checks fail, load balancer removes the node, and traffic is shifted to other instances. Correct Approach:

Define dependencies in the systemd unit files using After= and Requires=.

Use scripts that poll mysqladmin ping and redis-cli ping before starting the app.

Preventive Measures:

Document the dependency graph and store it in the wiki.

Prefer systemd ordering over manual sequencing.

Run a health‑check script after each restart.

# Example app-service unit
[Unit]
Description=Application Service
After=network.target mysql.service redis.service
Requires=mysql.service redis.service

[Service]
ExecStart=/opt/app/bin/start.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
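The readiness polling mentioned above can be wrapped in a small helper that retries a probe command; a sketch (hosts, attempt counts, and the start command are illustrative):

```shell
#!/bin/bash
# start-with-deps.sh -- poll dependencies before launching the app
set -euo pipefail

# Retry a probe command every 2 s until it succeeds or attempts run out.
wait_for() {
  local name=$1 attempts=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "$name is ready"
      return 0
    fi
    sleep 2
  done
  echo "$name not ready after $attempts attempts" >&2
  return 1
}

# Example wiring (uncomment in a real start wrapper):
# wait_for mysql 30 mysqladmin ping -h 127.0.0.1
# wait_for redis 30 redis-cli -h 127.0.0.1 ping
# exec /opt/app/bin/start.sh
```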

Permissions set to 777 on sensitive files

Scenario: A developer runs chmod -R 777 /opt/app to fix a permission error. Process: All files become world‑readable and writable, exposing database credentials. Consequences: Sensitive information can be exfiltrated; compliance audits fail. Correct Approach:

Identify the service user (e.g., app) and set ownership accordingly.

Apply least‑privilege permissions: chmod 750 for directories, chmod 640 for files, chmod 600 for secrets.

Use ACLs for fine‑grained access when needed.

Preventive Measures:

Enforce a default umask of 027 via /etc/profile.d/umask.sh.

Run a periodic script that flags world-writable or otherwise over-permissive files.

Add a code‑review rule that blocks commits containing chmod 777.

# Fix ownership and permissions
chown -R app:app /opt/app
find /opt/app -type d -exec chmod 750 {} \;
find /opt/app -type f -exec chmod 640 {} \;
chmod 600 /opt/app/config/credentials.yaml
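The periodic permission scan from the preventive measures can be as simple as a find over the world-write bit; a sketch (the audit root is an example):

```shell
#!/bin/bash
# perm-audit.sh -- list world-writable files and directories under a root
set -euo pipefail

audit_world_writable() {
  # -perm -0002: the world-write bit is set; symlinks are skipped
  find "$1" -xdev ! -type l -perm -0002 -print
}

# Example: audit_world_writable /opt/app
```

Run it from cron and alert on any non-empty output.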

Blind troubleshooting without checking logs

Scenario: A service fails to start; the engineer repeatedly restarts it, changes ports, and redeploys without looking at logs. Process: Time is wasted while the real cause (e.g., port already in use) remains hidden. Consequences: Increased MTTR, possible data loss, and loss of confidence in the engineer. Correct Approach:

Always inspect journalctl -u <service> --since "10 min ago" first.

Check the application’s own log files with tail -n 100.

Verify system resources (disk, memory, ports) before retrying.

Preventive Measures:

Publish a standard troubleshooting checklist in the wiki.

Train new hires on log‑first debugging.

Automate common checks with a script (see Section 2).

# Standard log‑first check
systemctl status myservice
journalctl -u myservice --since "5 min ago" -n 50
tail -n 100 /opt/myservice/log/app.log
ss -tlnp | grep 8080

2. Supporting Scripts and Automation

This section provides ready-to-use Bash scripts for safe deletion, configuration backup, disk monitoring, and a one-click health check.

Safe Delete (trash‑cli wrapper)

#!/bin/bash
# safe-delete.sh – move files to trash instead of rm
set -euo pipefail
if ! command -v trash-put &>/dev/null; then
  echo "Install trash-cli first"
  exit 1
fi
PROTECTED=(/ /bin /boot /dev /etc /home /lib /lib64 /proc /root /sbin /sys /usr /var /opt)
for target in "$@"; do
  abs=$(realpath "$target" 2>/dev/null || echo "$target")
  for p in "${PROTECTED[@]}"; do
    [[ "$abs" == "$p" ]] && { echo "Refusing to delete protected path $abs"; continue 2; }
  done
  if [[ ! -e "$target" ]]; then
    echo "Warning: $target does not exist"
    continue
  fi
  trash-put "$target"
  echo "Moved $target to trash"
done

Configuration Backup (etckeeper wrapper)

#!/bin/bash
set -euo pipefail
FILE=$1
mkdir -p /var/backups/config
TS=$(date +%Y%m%d%H%M%S)
cp -p "$FILE" "/var/backups/config/$(basename "$FILE").$TS"
# Optional: commit with etckeeper
etckeeper commit "Backup $FILE at $TS"

Disk Monitor and Auto‑Cleanup

#!/bin/bash
WARN=80
CRIT=90
AUTO=95
LOG=/var/log/disk-monitor.log
while read usage mount; do
  u=${usage%\%}
  if (( u >= AUTO )); then
    echo "[CRITICAL] $mount at $usage% – cleaning" | tee -a $LOG
    # Example cleanup actions
    apt clean
    journalctl --vacuum-size=1G
    find /var/log -type f -name "*.log" -size +200M -exec truncate -s 0 {} \;
  elif (( u >= CRIT )); then
    echo "[CRITICAL] $mount at $usage%" | tee -a $LOG
  elif (( u >= WARN )); then
    echo "[WARN] $mount at $usage%" | tee -a $LOG
  fi
done < <(df -h --output=pcent,target | tail -n +2)

One‑Click Health Check

#!/bin/bash
echo "=== System Summary ==="
uname -a
uptime
echo "=== CPU ==="
top -bn1 | head -5
echo "=== Memory ==="
free -h
echo "=== Disk ==="
df -h | head -10
echo "=== Services Failed ==="
systemctl list-units --state=failed

3. Best Practices and Checklist

Always back up configuration files before editing.

Validate syntax (nginx -t, sshd -t, etc.) before reloading services.

Maintain a change‑management checklist (see Appendix A).

Use version control (etckeeper, Git) for /etc and application configs.

Deploy log rotation (logrotate, journald size limits) on every host.

Set a unified time zone and enable NTP (chrony).

Define service dependencies in systemd units.

Enforce least‑privilege file permissions; avoid chmod 777.

Adopt a log‑first troubleshooting workflow.

Monitor disk, memory, and service health with Prometheus + node_exporter.

4. Monitoring Quick‑Start (Prometheus + node_exporter + Grafana)

Install node_exporter on each server, add the targets to prometheus.yml, and import the “Node Exporter Full” dashboard (ID 1860) in Grafana. Example alert rules for disk, memory, CPU, and instance down are provided in the article.
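The alert rules are referenced but not reproduced here; a representative disk-usage and instance-down pair might look like this (thresholds, label names, and the group name are assumptions):

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: DiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }}"
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```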

5. Conclusion

By following the concrete examples, scripts, and checklists presented, operators can transform recurring operational blunders into repeatable, safe procedures. The combination of proper backups, syntax validation, ordered service restarts, strict permission policies, and proactive monitoring dramatically reduces MTTR, prevents outages, and raises overall system reliability.

Tags: Monitoring, Operations, DevOps, Linux, System Administration
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
