
10 Rookie Ops Mistakes You Must Avoid – A Complete Checklist

This guide walks ops newcomers through the ten most common pitfalls—from accidental rm‑rf deletions and mis‑configured firewalls to unsafe chmod usage—and provides concrete remediation steps, ready‑to‑run shell scripts, best‑practice checklists, and monitoring setups to keep production environments stable and secure.

Raymond Ops

Overview

The article presents a practical, step‑by‑step analysis of the ten high‑frequency mistakes that junior operations engineers make, explains why each error occurs, shows real‑world incident details, and offers concrete corrective actions and preventive measures.

Top 10 Rookie Ops Pitfalls

Accidental rm -rf with a stray space – a typo turned rm -rf /var/log/ into rm -rf /var /log/, wiping the entire /var tree and causing a P0 outage.

Editing production configs without backup – changing /etc/nginx/nginx.conf without a copy led to a broken configuration and a 502 cascade.

Incorrect firewall rule order – an iptables REJECT rule placed before the intended ACCEPT for MySQL prevented the replica from connecting.

Disk‑space exhaustion – unchecked log growth filled the root partition, stopping all services.

SSH hardening without fallback – changing the port, disabling root login and password auth simultaneously locked the admin out.

Missing time‑zone/NTP configuration – servers running different time zones produced confusing cross‑service logs.

Improper service restart order – restarting the application before MySQL and Redis caused connection failures and health‑check errors.

Log files without rotation – a Java app’s app.log grew to 45 GB, making troubleshooting impossible.

Over‑permissive file permissions (chmod 777) – exposing configuration files with passwords to any user or process.

Skipping log inspection during failures – repeated restarts and a ticket for a new server wasted hours that could have been solved by journalctl -u <service>.

Correct Practices

Always create a backup before editing a file (e.g., cp -p /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak.$(date +%F)).

Validate syntax before reloading (nginx -t, sshd -t, systemd-analyze verify).

Use a safe delete tool such as trash-cli or safe‑rm to avoid irreversible deletions.

Maintain a version‑controlled copy of /etc with etckeeper or Git.

Apply firewall changes with firewalld or nftables and verify the rule order before relying on it (iptables -L -n --line-numbers, or nft list ruleset on nftables hosts).

Synchronize time across all hosts using chrony and set the timezone to a common value (e.g., Asia/Shanghai).

Define explicit service dependencies in systemd unit files (After=, Requires=) and use health‑check loops before starting dependent services.

Configure logrotate or application‑level rotation (size‑based, time‑based) for all log directories.

Enforce least‑privilege file permissions (e.g., 600 for secrets, avoid 777).

Adopt a standard troubleshooting workflow: check service status, view journalctl, inspect relevant logs, verify resource usage, then act.
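The health‑check loop mentioned above can be sketched as a small retry helper. This is a minimal illustration, not the author's script: the function name wait_for, the retry counts, and the service name myapp are assumptions made for the example.

```shell
#!/bin/bash
# wait_for RETRIES DELAY CMD [ARGS...]
# Runs CMD until it succeeds or RETRIES attempts are used up.
wait_for() {
  local retries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical usage: start the application only once MySQL and Redis answer.
# wait_for 30 2 mysqladmin ping --silent \
#   && wait_for 30 2 redis-cli ping \
#   && systemctl start myapp
```

A loop like this complements After= and Requires= rather than replacing them: systemd ordering only guarantees that a unit was started, not that it is already accepting connections.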

Sample Scripts

Safe‑Delete Script (trash‑cli wrapper)

#!/bin/bash
set -euo pipefail
# Ensure trash‑cli is installed
if ! command -v trash-put &>/dev/null; then
  echo "trash‑cli not installed" >&2
  exit 1
fi
PROTECTED=(/ /bin /boot /dev /etc /home /lib /lib64 /proc /root /sbin /sys /usr /var /opt)
for target in "$@"; do
  abs=$(realpath "$target" 2>/dev/null || echo "$target")
  for p in "${PROTECTED[@]}"; do
    if [[ "$abs" == "$p" ]]; then
      echo "Refusing to delete protected path $abs" >&2
      continue 2
    fi
  done
  if [[ ! -e "$target" ]]; then
    echo "File not found: $target" >&2
    continue
  fi
  trash-put "$target"
  echo "Moved $target to trash"
done
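One caveat with an exact‑match check on the resolved path: on distributions with a merged /usr, /bin is a symlink, so realpath /bin returns /usr/bin and the comparison against /bin never matches. The sketch below is a hypothetical variant (the name is_protected is not from the original) that compares both the path as typed and its resolved form. It assumes GNU realpath for the -m flag and falls back to the typed path otherwise.

```shell
#!/bin/bash
# is_protected PATH: succeeds if PATH, as typed or after resolving
# symlinks, is one of the protected top-level directories.
is_protected() {
  local raw abs p
  raw=${1%/}               # drop one trailing slash: /var/ -> /var
  [ -n "$raw" ] || raw=/   # "/" itself becomes empty above
  # -m resolves the path even if it does not exist (GNU realpath)
  abs=$(realpath -m -- "$1" 2>/dev/null || printf '%s' "$raw")
  for p in / /bin /boot /dev /etc /home /lib /lib64 /proc /root /sbin /sys /usr /var /opt; do
    if [ "$raw" = "$p" ] || [ "$abs" = "$p" ]; then
      return 0
    fi
  done
  return 1
}

# The mistyped command "rm -rf /var /log/" passes two separate arguments:
for arg in /var /log/; do
  if is_protected "$arg"; then
    echo "refuse: $arg"
  else
    echo "allow:  $arg"
  fi
done
```

With the stray‑space typo, the first argument is refused because it is exactly /var, which is the case a wrapper like this is meant to catch.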

Configuration Backup & Edit Helper

#!/bin/bash
set -euo pipefail
BACKUP_DIR="/var/backups/config-history"
mkdir -p "$BACKUP_DIR"
cfg="${1:-}"
if [[ -z "$cfg" ]]; then echo "Usage: $0 <config-file>" >&2; exit 1; fi
if [[ ! -f "$cfg" ]]; then echo "Config $cfg not found" >&2; exit 1; fi
ts=$(date +%Y%m%d%H%M%S)
base=$(basename "$cfg")
dir=$(dirname "$cfg" | tr '/' '_')
backup="$BACKUP_DIR/${dir}_${base}.$ts"
cp -p "$cfg" "$backup"
chmod 600 "$backup"
echo "Backed up $cfg → $backup"
${EDITOR:-vim} "$cfg"
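# Optional syntax check for known services
case "$base" in
  nginx.conf) nginx -t;;
  sshd_config) sshd -t;;
  # Other files have no validator defined here; check them with their own tool
  *) echo "No syntax check defined for $base" >&2;;
esac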
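A natural extension of the helper is to restore the backup automatically when the syntax check fails, so a broken file never stays in place. The following is a sketch under that assumption; the function name apply_or_rollback is hypothetical and not part of the original script.

```shell
#!/bin/bash
# apply_or_rollback CONFIG BACKUP CHECK_CMD [ARGS...]
# Runs CHECK_CMD; if it fails, copies BACKUP back over CONFIG.
apply_or_rollback() {
  local cfg="$1" backup="$2"
  shift 2
  if "$@"; then
    echo "Syntax check passed; keeping $cfg"
    return 0
  fi
  echo "Syntax check failed; restoring $backup" >&2
  cp -p "$backup" "$cfg"
  return 1
}

# Hypothetical usage after editing:
# apply_or_rollback /etc/nginx/nginx.conf "$backup" nginx -t \
#   && systemctl reload nginx
```

Because the function returns non‑zero after a rollback, chaining the reload with && means the service is only reloaded when the edited file validated.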

Disk‑Space Monitor & Auto‑Clean

#!/bin/bash
set -euo pipefail
WARN=80
CRIT=90
AUTO=95
LOG="/var/log/disk-monitor.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
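while read -r usage mount; do
  usage=${usage%\%}
  # Skip pseudo-filesystems that report "-" instead of a percentage
  [[ "$usage" =~ ^[0-9]+$ ]] || continue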
  if (( usage >= AUTO )); then
    log "CRITICAL $mount usage $usage% – running auto‑clean"
    # Example clean‑up actions
    apt clean &>/dev/null || true
    find /tmp -type f -mtime +7 -delete &>/dev/null || true
    journalctl --vacuum-size=1G &>/dev/null || true
  elif (( usage >= CRIT )); then
    log "CRITICAL $mount usage $usage%"
  elif (( usage >= WARN )); then
    log "WARNING $mount usage $usage%"
  fi
done < <(df -h --output=pcent,target | tail -n +2)
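Before any automatic clean‑up runs, it usually helps to see what is actually consuming the space. The helper below is a small sketch for that step; the name largest_entries and the default of ten results are assumptions for illustration.

```shell
#!/bin/bash
# largest_entries DIR [COUNT]
# Lists DIR and its immediate subdirectories by disk usage in KiB, largest first.
largest_entries() {
  local dir="$1" count="${2:-10}"
  # -x stays on one filesystem, -k reports KiB, -d 1 limits the depth to one level
  du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Hypothetical usage on a full root partition:
# largest_entries /var 5
```

The first line of the output is the total for the directory itself; the lines after it show which subdirectories account for most of that total.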

System Health‑Check Script

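#!/bin/bash
set -euo pipefail
echo "=== System Information ==="
# PRETTY_NAME is quoted in /etc/os-release; strip the quotes for display
os=$(grep '^PRETTY_NAME=' /etc/os-release | cut -d'=' -f2 | tr -d '"')
printf 'Host: %s\nUptime: %s\nOS: %s\nKernel: %s\n' "$(hostname)" "$(uptime -p)" "$os" "$(uname -r)"

printf '\n=== CPU & Memory ===\n'
# Top five processes by CPU; "|| true" absorbs the SIGPIPE that head can cause under pipefail
ps aux --sort=-%cpu | head -6 || true
free -h

printf '\n=== Disk Usage ===\n'
df -h --output=target,size,used,avail,pcent | tail -n +2

printf '\n=== Listening Ports ===\n'
ss -tlnp | awk '{print $4,$NF}'

printf '\n=== Failed Services ===\n'
# systemctl exits 0 even when nothing has failed, so test the output instead
failed=$(systemctl --failed --no-legend)
if [[ -n "$failed" ]]; then echo "$failed"; else echo "All services running"; fi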

Case Studies

Case 1 – rm -rf disaster : A trainee meant to type rm -rf /var/log/, but a stray space turned the command into rm -rf /var /log/, which deleted the whole /var tree. Recovery required an LVM snapshot; without snapshots the team had to reinstall core packages.

Case 2 – Firewall rule order : An engineer flushed iptables and added a DROP policy before inserting an ACCEPT for MySQL (3306). The replica lost connectivity, triggering a 40‑minute outage until the rule order was corrected.

Case 3 – SSH lockout : Changing the SSH port to 2222, disabling root login and password auth, then restarting sshd left no valid login method. The fix involved verifying syntax with sshd -t, creating a new user, adding a key, and updating the firewall.
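The rule‑order problem in Case 2 comes from first‑match evaluation: iptables walks a chain from top to bottom and stops at the first rule with a terminating target. The toy model below illustrates only that ordering logic. It does not call iptables, and the rule format ("ACTION PORT" per line) and the function name first_match are invented for the example.

```shell
#!/bin/bash
# first_match PORT
# Reads rules from stdin, one "ACTION PORT" pair per line (PORT may be "any"),
# and prints the action of the first rule that matches PORT.
first_match() {
  local want="$1" action port
  while read -r action port; do
    if [ "$port" = "$want" ] || [ "$port" = "any" ]; then
      echo "$action"
      return 0
    fi
  done
  echo "POLICY"   # no rule matched; the chain's default policy would apply
}

broken='REJECT any
ACCEPT 3306'
fixed='ACCEPT 3306
REJECT any'

printf '%s\n' "$broken" | first_match 3306   # prints REJECT
printf '%s\n' "$fixed"  | first_match 3306   # prints ACCEPT
```

On a real host, iptables -A appends a rule to the end of the chain while iptables -I inserts it at a given position, which is why a late ACCEPT added with -A can end up behind a catch‑all REJECT.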

Best‑Practice Checklist

Create a change ticket and obtain approval.

Back up configuration files (etckeeper, Git, or manual copy).

Validate syntax before applying changes.

Maintain a second SSH session as a fallback.

Document the exact steps and expected outcomes.

Verify service health after the change (status, logs, metrics).

Update monitoring alerts for new thresholds.

Review and close the ticket only after confirmation.

Monitoring Setup (Prometheus + node_exporter + Grafana)

Install node_exporter on each host, expose it on port 9100, and add a node job to prometheus.yml. Typical alert rules cover disk usage (>85 % warning, >95 % critical), memory (>90 % warning), sustained CPU (>90 %), and instance down. Grafana can import the “Node Exporter Full” dashboard (ID 1860) to visualise metrics.
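As a starting point, the thresholds above could be written out as a Prometheus rules file. The script below is a sketch, not the article's original rule set: the function name, the group and alert names, the "for" durations, and the filesystem filter are all assumptions. The metric names are the standard node_exporter ones, and the job label node matches the job described above.

```shell
#!/bin/bash
# write_node_alert_rules FILE
# Writes an example Prometheus alerting rules file to FILE.
write_node_alert_rules() {
  local out="$1"
  # Quoted heredoc: nothing inside is expanded by the shell
  cat > "$out" <<'EOF'
groups:
  - name: node-basics
    rules:
      - alert: DiskUsageWarning
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
      - alert: DiskUsageCritical
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
      - alert: MemoryUsageHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
      - alert: CpuUsageHigh
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels:
          severity: warning
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
EOF
}

# Hypothetical usage:
# write_node_alert_rules /etc/prometheus/rules/node.yml
# promtool check rules /etc/prometheus/rules/node.yml
```

The durations are deliberately conservative guesses; they should be tuned to how quickly each condition needs a human response.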

Conclusion

By understanding the root causes of common operational mistakes, applying the prescribed remediation steps, and automating safety nets with scripts and monitoring, junior operators can avoid costly outages, preserve system integrity, and accelerate their professional growth.
