10 Rookie Ops Mistakes You Must Avoid – A Complete Checklist
This guide walks ops newcomers through the ten most common pitfalls, from accidental rm -rf deletions and misconfigured firewalls to unsafe chmod usage, and provides concrete remediation steps, ready‑to‑run shell scripts, best‑practice checklists, and monitoring setups to keep production environments stable and secure.
Overview
The article presents a practical, step‑by‑step analysis of the ten high‑frequency mistakes that junior operations engineers make, explains why each error occurs, shows real‑world incident details, and offers concrete corrective actions and preventive measures.
Top 10 Rookie Ops Pitfalls
Accidental rm -rf with a stray space – a typo turned rm -rf /var/log/ into what was effectively rm -rf /var (for instance rm -rf /var /log/), wiping the entire /var tree and causing a P0 outage.
Editing production configs without backup – changing /etc/nginx/nginx.conf without a copy led to a broken configuration and a 502 cascade.
Incorrect firewall rule order – an iptables REJECT rule placed before the intended ACCEPT for MySQL prevented the replica from connecting.
Disk‑space exhaustion – unchecked log growth filled the root partition, stopping all services.
SSH hardening without fallback – changing the port, disabling root login and password auth simultaneously locked the admin out.
Missing time‑zone/NTP configuration – servers running different time zones produced confusing cross‑service logs.
Improper service restart order – restarting the application before MySQL and Redis caused connection failures and health‑check errors.
Log files without rotation – a Java app’s app.log grew to 45 GB, making troubleshooting impossible.
Over‑permissive file permissions (chmod 777) – exposing configuration files with passwords to any user or process.
Skipping log inspection during failures – repeated restarts and a ticket for a new server wasted hours that could have been solved by journalctl -u <service>.
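To see why the first pitfall is so destructive, it can help to print how the shell splits the arguments before anything is deleted. The following is a minimal sketch, assuming GNU realpath is available; preview_targets is a hypothetical helper name and not part of any tool mentioned in this article.

```shell
#!/bin/bash
# preview_targets: hypothetical helper that shows how the shell split the
# arguments, one resolved path per line. It never deletes anything.
preview_targets() {
    local arg
    printf '%d target(s):\n' "$#"
    for arg in "$@"; do
        # realpath -m resolves the path even when it does not exist (GNU coreutils)
        printf '  %s\n' "$(realpath -m -- "$arg")"
    done
}
```

Called as preview_targets /var /log/, it reports two targets rather than one, which makes the stray space visible before rm ever runs.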
Correct Practices
Always create a backup before editing a file (e.g., cp -p /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak.$(date +%F)).
Validate syntax before reloading (nginx -t, sshd -t, systemd-analyze verify).
Use a safe delete tool such as trash-cli or safe‑rm to avoid irreversible deletions.
Maintain a version‑controlled copy of /etc with etckeeper or Git.
Apply firewall changes with firewalld or nftables and verify rule order with iptables -L -n --line-numbers (or nft list ruleset on nftables hosts).
Synchronize time across all hosts using chrony and set the timezone to a common value (e.g., Asia/Shanghai).
Define explicit service dependencies in systemd unit files (After=, Requires=) and use health‑check loops before starting dependent services.
Configure logrotate or application‑level rotation (size‑based, time‑based) for all log directories.
Enforce least‑privilege file permissions (e.g., 600 for secrets, avoid 777).
Adopt a standard troubleshooting workflow: check service status, view journalctl, inspect relevant logs, verify resource usage, then act.
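The health‑check loop mentioned for service ordering can be sketched as a small retry helper. This is an illustration only: wait_until is a hypothetical name, and the probe commands in the comments (mysqladmin ping, redis-cli ping) assume those clients are installed and able to reach and authenticate against the local services.

```shell
#!/bin/bash
# wait_until <timeout-seconds> <command...>
# Hypothetical helper: retries the command once per second until it succeeds
# or the timeout expires. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local deadline=$(( $(date +%s) + timeout ))
    until "$@" >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Example usage (assumed probes and unit name, adjust to your environment):
#   wait_until 30 mysqladmin ping --silent || exit 1
#   wait_until 30 redis-cli ping           || exit 1
#   systemctl start myapp.service   # myapp.service is a placeholder
```

Probing with a real client command rather than a fixed sleep means the application only starts once its dependencies actually answer.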
Sample Scripts
Safe‑Delete Script (trash‑cli wrapper)
#!/bin/bash
set -euo pipefail
# Ensure trash-cli is installed
if ! command -v trash-put &>/dev/null; then
    echo "trash-cli not installed" >&2
    exit 1
fi
# Top-level system paths that must never be trashed
PROTECTED=(/ /bin /boot /dev /etc /home /lib /lib64 /proc /root /sbin /sys /usr /var /opt)
for target in "$@"; do
    # Resolve symlinks and trailing slashes before comparing
    abs=$(realpath "$target" 2>/dev/null || echo "$target")
    for p in "${PROTECTED[@]}"; do
        if [[ "$abs" == "$p" ]]; then
            echo "Refusing to delete protected path $abs" >&2
            continue 2
        fi
    done
    if [[ ! -e "$target" ]]; then
        echo "File not found: $target" >&2
        continue
    fi
    trash-put "$target"
    echo "Moved $target to trash"
done
Configuration Backup & Edit Helper
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/var/backups/config-history"
mkdir -p "$BACKUP_DIR"
cfg="${1:?Usage: $0 <config-file>}"
if [[ ! -f "$cfg" ]]; then
    echo "Config $cfg not found" >&2
    exit 1
fi
# Work with the absolute path so the checks below can match on directories
cfg=$(realpath "$cfg")
ts=$(date +%Y%m%d%H%M%S)
base=$(basename "$cfg")
dir=$(dirname "$cfg" | tr '/' '_')
backup="$BACKUP_DIR/${dir}_${base}.$ts"
cp -p "$cfg" "$backup"
chmod 600 "$backup"
echo "Backed up $cfg → $backup"
${EDITOR:-vim} "$cfg"
# Optional syntax check for known services only; an arbitrary *.conf file
# is not necessarily an nginx config
case "$cfg" in
    */nginx/*.conf) nginx -t ;;
    */sshd_config)  sshd -t -f "$cfg" ;;
    *) echo "No syntax check known for $base; verify manually" ;;
esac
Disk‑Space Monitor & Auto‑Clean
#!/bin/bash
set -euo pipefail
WARN=80
CRIT=90
AUTO=95
LOG="/var/log/disk-monitor.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
while read -r usage mount; do
    usage=${usage%\%}   # strip the trailing percent sign
    # Skip entries where df reports "-" instead of a number
    [[ "$usage" =~ ^[0-9]+$ ]] || continue
    if (( usage >= AUTO )); then
        log "CRITICAL $mount usage $usage% – running auto-clean"
        # Example clean-up actions
        apt-get clean &>/dev/null || true
        find /tmp -type f -mtime +7 -delete &>/dev/null || true
        journalctl --vacuum-size=1G &>/dev/null || true
    elif (( usage >= CRIT )); then
        log "CRITICAL $mount usage $usage%"
    elif (( usage >= WARN )); then
        log "WARNING $mount usage $usage%"
    fi
done < <(df -h --output=pcent,target | tail -n +2)
System Health‑Check Script
#!/bin/bash
set -euo pipefail
echo "=== System Information ==="
printf "Host: %s\nUptime: %s\nOS: %s\nKernel: %s\n" "$(hostname)" "$(uptime -p)" "$(grep PRETTY_NAME /etc/os-release | cut -d'=' -f2 | tr -d '"')" "$(uname -r)"
printf '\n=== CPU & Memory ===\n'
# head closes the pipe early; "|| true" keeps pipefail from aborting the script
ps aux --sort=-%cpu | head -n 6 || true
free -h
printf '\n=== Disk Usage ===\n'
df -h --output=target,size,used,avail,pcent | tail -n +2
printf '\n=== Listening Ports ===\n'
ss -tlnp | awk '{print $4,$NF}'
printf '\n=== Failed Services ===\n'
# systemctl --failed exits 0 even when nothing has failed, so test its output
failed=$(systemctl --failed --no-legend)
if [[ -n "$failed" ]]; then
    echo "$failed"
else
    echo "No failed services"
fi
Case Studies
Case 1 – rm -rf disaster : A trainee meant to type rm -rf /var/log/, but a stray space made the command act on /var and delete the whole tree. Recovery required an LVM snapshot; without snapshots the team had to reinstall core packages.
Case 2 – Firewall rule order : An engineer flushed iptables and added a DROP policy before inserting an ACCEPT for MySQL (3306). The replica lost connectivity, triggering a 40‑minute outage until the rule order was corrected.
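A quick way to catch rule-order mistakes of this kind is to scan the rule listing before relying on it. The sketch below is a simplified heuristic, not a full iptables parser: accept_before_block is a hypothetical function that reads iptables -S style output on standard input and treats any REJECT or DROP rule without a --dport match as a blanket block. It checks rule order only and does not cover the window in which a DROP policy is active before the ACCEPT rule exists.

```shell
#!/bin/bash
# accept_before_block <port>
# Hypothetical helper: reads "iptables -S <chain>" output on stdin and reports
# whether an ACCEPT for the port is evaluated before a blanket REJECT/DROP.
# Exit status: 0 = ACCEPT comes first, 1 = shadowed, 2 = no ACCEPT found.
accept_before_block() {
    local port=$1
    awk -v port="$port" '
        $1 != "-A" { next }    # skip policy (-P) and other non-rule lines
        /-j (REJECT|DROP)/ && !/--dport/ { blocked = NR; exit }
        /-j ACCEPT/ && $0 ~ ("--dport " port "( |$)") { accepted = NR; exit }
        END {
            if (accepted) { print "ok: ACCEPT for port " port " is evaluated first"; exit 0 }
            if (blocked)  { print "shadowed: a blanket REJECT/DROP precedes any ACCEPT for port " port; exit 1 }
            print "no ACCEPT rule found for port " port; exit 2
        }'
}

# Example usage (requires root):
#   iptables -S INPUT | accept_before_block 3306
```

Running a check like this right after a firewall change, while the original session is still open, gives a chance to fix the order before replication breaks.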
Case 3 – SSH lockout : Changing the SSH port to 2222, disabling root login and password auth, then restarting sshd left no valid login method. The fix involved verifying syntax with sshd -t, creating a new user, adding a key, and updating the firewall.
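One way to reduce the risk of this kind of lockout is to edit a copy of the config and install it only after validation passes. The following sketch illustrates that idea; install_if_valid is a hypothetical helper, the file paths in the usage comment are examples, and the service name may be ssh rather than sshd depending on the distribution.

```shell
#!/bin/bash
# install_if_valid <candidate> <target> <validator...>
# Hypothetical helper: runs the validator with the candidate file as its last
# argument and replaces the target only if validation succeeds.
install_if_valid() {
    local candidate=$1 target=$2
    shift 2
    if ! "$@" "$candidate"; then
        echo "Validation failed; $target left untouched" >&2
        return 1
    fi
    # Keep a timestamped copy of the live file before replacing it
    if [ -e "$target" ]; then
        cp -p -- "$target" "$target.bak.$(date +%Y%m%d%H%M%S)"
    fi
    cp -- "$candidate" "$target"
}

# Example usage (run as root, keep a second SSH session open throughout):
#   install_if_valid /root/sshd_config.new /etc/ssh/sshd_config sshd -t -f
#   systemctl reload sshd
#   # then confirm a fresh login works before closing the original session
```

Making one change at a time (port, then root login, then password auth) and testing a new login after each step keeps a working path back in.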
Best‑Practice Checklist
Create a change ticket and obtain approval.
Backup configuration files (etckeeper, Git, or manual copy).
Validate syntax before applying changes.
Maintain a second SSH session as a fallback.
Document the exact steps and expected outcomes.
Verify service health after the change (status, logs, metrics).
Update monitoring alerts for new thresholds.
Review and close the ticket only after confirmation.
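The post-change verification step in the checklist can be partly scripted. As a rough sketch, the hypothetical count_error_lines function below counts log lines that look like errors; the keyword list is an assumption and will need tuning per service.

```shell
#!/bin/bash
# count_error_lines: hypothetical helper that reads log text on stdin and
# prints how many lines mention an error-like keyword (case-insensitive).
count_error_lines() {
    # grep -c exits 1 when the count is 0; "|| true" keeps the status clean
    grep -Eic 'error|fail|fatal|panic' || true
}

# Example usage after a change (nginx is only an example unit name):
#   systemctl is-active nginx
#   journalctl -u nginx --since "10 minutes ago" --no-pager | count_error_lines
```

A non-zero count is a prompt to read the matching lines, not a verdict by itself.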
Monitoring Setup (Prometheus + node_exporter + Grafana)
Install node_exporter on each host, expose it on port 9100, and add a node job to prometheus.yml. Alert rules should cover disk usage (>85% warning, >95% critical), memory (>90% warning), sustained CPU (>90%), and instance down. Grafana can import the “Node Exporter Full” dashboard (ID 1860) to visualise metrics.
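As an illustration of the disk and instance-down thresholds, the sketch below writes a small Prometheus rules file from a shell function. The function name write_disk_alert_rules, the group name, the for: durations, the excluded filesystem types, and the job label node are all assumptions; only the metric names come from node_exporter.

```shell
#!/bin/bash
# write_disk_alert_rules <file>
# Hypothetical helper: writes example alert rules matching the thresholds above.
write_disk_alert_rules() {
    local out=$1
    # Quoted heredoc delimiter so the shell does not expand $labels
    cat > "$out" <<'EOF'
groups:
  - name: node-basic
    rules:
      - alert: DiskUsageWarning
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
      - alert: DiskUsageCritical
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage above 95% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to scrapes"
EOF
}

# Example usage (the path is an assumption; reference it under rule_files in prometheus.yml):
#   write_disk_alert_rules /etc/prometheus/rules/node-basic.yml
#   promtool check rules /etc/prometheus/rules/node-basic.yml
```

Memory and CPU rules follow the same pattern with their own expressions and thresholds.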
Conclusion
By understanding the root causes of common operational mistakes, applying the prescribed remediation steps, and automating safety nets with scripts and monitoring, junior operators can avoid costly outages, preserve system integrity, and accelerate their professional growth.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
