
How to Tame a Rogue Backup Script That Crushed CPU on a Production Server

A production engineer receives a P2 CPU‑load alert, diagnoses the offending Python backup script by inspecting top and ps outputs, discovers it was compressing a 350 GB log directory, and resolves the issue with a forced kill and post‑mortem best‑practice advice.

Open Source Linux

Grafana raised a P2 alert: user-profile-service on an 8‑core host had a 5‑minute CPU load average > 10.
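The load average the alert is based on can be confirmed directly on the host before digging into individual processes:

```shell
# /proc/loadavg: 1-, 5- and 15-minute load averages, then
# running/total task counts and the most recent PID (Linux-specific)
cat /proc/loadavg

# uptime reports the same three averages in human-readable form
uptime
```

On an 8-core machine, a sustained 5-minute average above 10 means more runnable tasks than cores, which matches the alert threshold.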

Step 1 – Identify the offending process

SSH to the server and run top:

ssh [email protected]
top

The process list in top shows a process with PID 21588 consuming 100 % of a CPU core, with 12.5 GB of virtual memory and 1.1 GB resident, running as user app_dev:

PID USER   PR  NI   VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND
21588 app_dev 20   0 12.5g 1.1g 2212 R 100.0 3.5  2:15.88 /usr/bin/python3 /opt/scripts/dev_backup.sh
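When capturing evidence for a ticket, a one-shot ps listing sorted by CPU gives the same information as top without the interactive screen:

```shell
# Non-interactive: top CPU consumers, highest %CPU first
# (--sort is a GNU/procps option)
ps -eo pid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -n 5
```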

Step 2 – Confirm details with ps

Exit top (press q) and confirm the details with ps:

ps aux | grep 21588

The output confirms the full command line and shows that the process was started from an interactive SSH session (pts/0), not as a daemon.
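Querying the PID directly with ps -p avoids the classic problem of grep matching its own process. A sketch (a stand-in PID is used below; in the incident it was 21588 from top):

```shell
# -p selects the PID, -o picks the columns of interest:
# parent PID, owner, controlling terminal, state, elapsed time, command
pid=$$   # stand-in; substitute the PID reported by top
ps -p "$pid" -o pid,ppid,user,tty,stat,etime,cmd
```

A real tty such as pts/0 in the TTY column confirms an interactive session; a daemon would show `?`.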

Step 3 – Inspect the script

Display the script /opt/scripts/dev_backup.sh:

#!/usr/bin/python3
import os, tarfile, time

print("--- Starting user-data backup (temporary) ---")

# Source directory to archive, and a timestamped target file in /tmp
source_dir = "/var/log/app/user_profile"
target_file = f"/tmp/backup-{int(time.time())}.tar.gz"

try:
    # "w:gz" writes a gzip-compressed tar archive
    with tarfile.open(target_file, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    print(f"--- Backup finished: {target_file} ---")
except Exception as e:
    print(f"Backup failed: {e}")

The script uses the Python tarfile module in w:gz mode to compress the directory /var/log/app/user_profile. Compression with gzip is CPU‑intensive.
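If a gzip archive really must be built on a loaded host, tarfile accepts a compresslevel argument: level 1 trades compression ratio for far less CPU than the default level 9 used by plain "w:gz". A minimal sketch, using a throwaway stand-in directory instead of the article's /var/log/app/user_profile:

```python
import os
import tarfile
import tempfile
import time

# Stand-in source directory; in the incident this was
# /var/log/app/user_profile
source_dir = tempfile.mkdtemp()
with open(os.path.join(source_dir, "app.log"), "w") as f:
    f.write("example log line\n" * 1000)

target_file = f"/tmp/backup-{int(time.time())}.tar.gz"

# compresslevel=1 is gzip's fastest setting; the tarfile default
# for "w:gz" is level 9, which burns far more CPU per byte
with tarfile.open(target_file, "w:gz", compresslevel=1) as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

print(tarfile.is_tarfile(target_file))  # → True
```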

Step 4 – Determine data size

Check the size of the target directory:

du -sh /var/log/app/user_profile

The command reports 350 GB, far larger than the log files visible at first glance. Compressing 350 GB with gzip explains the sustained 100 % CPU usage.
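To see where such a total actually lives, break it down one directory level at a time (a stand-in path is used below; in the incident it would be /var/log/app/user_profile):

```shell
# Per-subdirectory totals, human-readable, largest last
# (--max-depth is GNU du; BSD/macOS du uses -d 1 instead)
dir=/var/log   # stand-in; substitute the directory being archived
du -h --max-depth=1 "$dir" 2>/dev/null | sort -h
```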

Step 5 – Terminate the runaway process

A plain kill 21588 (SIGTERM) does not stop the script, which is busy in kernel-level I/O. Force termination with SIGKILL:

sudo kill -9 21588

The process disappears from top and the load average drops rapidly.
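Before escalating to SIGKILL, the process state column is worth a look: a process in uninterruptible I/O wait (state D) will not respond even to SIGKILL until the kernel operation completes. A hedged sketch using a throwaway process in place of PID 21588:

```shell
# Throwaway process standing in for the runaway script
sleep 300 &
pid=$!

# STAT column: R = running, S = sleeping, D = uninterruptible I/O wait
ps -o stat= -p "$pid"

kill "$pid"            # polite SIGTERM first
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"     # escalate only if it survived SIGTERM
fi
```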

Step 6 – Post‑mortem recommendations

Never run untested scripts on production servers.

Use kill -9 only as a last resort, since it gives the process no chance to clean up.

Run heavy, temporary jobs at low CPU priority with nice:

nice -n 19 python3 /opt/scripts/dev_backup.sh

This sequence demonstrates a systematic approach: diagnosing high CPU load, interpreting top and ps output, verifying script behavior, and safely terminating a misbehaving process.
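As a quick sanity check that the reduced priority actually takes effect, nice invoked with no command prints the current niceness; a stand-in echo is used here in place of the real script:

```shell
# -n 19 is the lowest CPU priority; `nice` with no arguments
# prints the niceness the command runs under
nice -n 19 sh -c 'echo "backup would run at niceness $(nice)"'
# In the incident, the real invocation would be:
#   nice -n 19 python3 /opt/scripts/dev_backup.sh
```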

Tags: Linux, CPU, process, shell
Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
