How to Tame a Rogue Backup Script That Crushed CPU on a Production Server
A production engineer receives a P2 CPU‑load alert, identifies the offending Python backup script from top and ps output, discovers it is compressing a 350 GB log directory, and resolves the incident with a forced kill, closing with post‑mortem best‑practice advice.
Grafana raised a P2 alert: user-profile-service on an 8‑core host had a 5‑minute CPU load average > 10.
Step 1 – Identify the offending process
SSH to the server and run top:
ssh [email protected]
top
In the process table, PID 21588, running as user app_dev, is consuming 100 % of a CPU core with 12.5 GB virtual memory and 1.1 GB resident memory:
PID   USER     PR  NI  VIRT   RES   SHR   S  %CPU   %MEM  TIME+    COMMAND
21588 app_dev  20  0   12.5g  1.1g  2212  R  100.0  3.5   2:15.88  /usr/bin/python3 /opt/scripts/dev_backup.sh
Step 2 – Confirm details with ps
Exit top (press q) and run:
ps aux | grep 21588
The output confirms the full command line and shows that the process was started from an interactive SSH session (pts/0), not as a daemon.
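Note that ps aux | grep 21588 will usually also match the grep process itself. On Linux, the same confirmation can be done by reading /proc directly, which avoids that quirk. A minimal sketch (the helper name is my own, and this is Linux‑specific):

```python
import os

def proc_cmdline(pid):
    """Return the argv of a running process from /proc (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        raw = f.read()
    # argv entries are NUL-separated; drop the trailing empty entry
    return [arg.decode() for arg in raw.split(b"\0") if arg]
```

For the incident above, proc_cmdline(21588) would return the full python3 command line seen in top.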
Step 3 – Inspect the script
Display the script /opt/scripts/dev_backup.sh:
#!/usr/bin/python3
import os, tarfile, time

print("--- Starting user-data backup (temporary) ---")
source_dir = "/var/log/app/user_profile"
target_file = f"/tmp/backup-{int(time.time())}.tar.gz"
try:
    with tarfile.open(target_file, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    print(f"--- Backup complete: {target_file} ---")
except Exception as e:
    print(f"Backup failed: {e}")
The script uses the Python tarfile module in w:gz mode to compress the directory /var/log/app/user_profile. Compression with gzip is CPU‑intensive.
Step 4 – Determine data size
Check the size of the source directory:
du -sh /var/log/app/user_profile
The command returns 350 GB, far more than the visible log files suggested. Compressing 350 GB explains the sustained 100 % CPU usage.
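A backup script can run this preflight check itself and refuse to start on an unexpectedly large tree. A rough Python equivalent of du (my own helper; note it sums apparent file sizes rather than disk blocks, so it can differ slightly from du):

```python
import os

def dir_size_bytes(path):
    """Sum regular-file sizes under path, skipping symlinks (rough `du -s`)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # avoid double-counting link targets
                total += os.path.getsize(fp)
    return total
```

Guarding the incident script with something like dir_size_bytes(source_dir) < 10 * 2**30 before opening the tarball would have stopped the 350 GB compression before it started.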
Step 5 – Terminate the runaway process
A normal kill 21588 (SIGTERM) does not stop the script, most likely because it is blocked in uninterruptible kernel‑level I/O. Force termination with SIGKILL:
sudo kill -9 21588
The process disappears from top and the load average drops rapidly.
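The polite‑then‑forceful escalation can be automated: send SIGTERM, wait a grace period, then SIGKILL only if the process is still alive. A sketch (the helper name is mine; os.kill(pid, 0) is the standard existence probe and delivers no signal):

```python
import os
import signal
import time

def terminate(pid, grace_seconds=10.0):
    """Ask pid to exit with SIGTERM; escalate to SIGKILL after grace_seconds."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0: existence check only
        except ProcessLookupError:
            return "terminated"  # process exited cleanly
        time.sleep(0.2)
    os.kill(pid, signal.SIGKILL)  # last resort, no cleanup for the process
    return "killed"
```

Caveat: for your own child processes the existence probe keeps succeeding while the child is a zombie, so pair this with os.waitpid or the subprocess API to reap the exit status.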
Step 6 – Post‑mortem recommendations
Never run untested scripts on production servers.
Use kill -9 only as a last resort, since it gives the process no chance to clean up.
Run heavy, temporary jobs with low CPU priority using nice:
nice -n 19 python3 /opt/scripts/dev_backup.sh
This sequence demonstrates a systematic approach to diagnosing high CPU load, interpreting top and ps output, verifying script behavior, and safely terminating a misbehaving process.
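The same deprioritization can be applied when launching a job from Python: raise the child's niceness between fork and exec so the parent is unaffected. A sketch (POSIX‑only, since it relies on preexec_fn; the wrapper name is mine):

```python
import os
import subprocess

def run_low_priority(cmd):
    """Run cmd in a child whose niceness is raised to 19 (lowest CPU priority).

    preexec_fn runs in the child after fork() and before exec(), so
    os.nice(19) adds 19 to the child's niceness without touching the parent.
    """
    return subprocess.run(cmd, preexec_fn=lambda: os.nice(19))
```

For example, run_low_priority(["python3", "/opt/scripts/dev_backup.sh"]) would start the backup at the lowest scheduling priority, letting production traffic win every CPU contention.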
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.