Step‑by‑Step Investigation of a High‑Load Production Server
During a mid‑year promotion, an e‑commerce platform experienced a sudden spike in load average and response latency. This article walks through a systematic, command‑driven investigation that identifies an I/O bottleneck caused by mis‑configured log rotation and excessive debug logging, and presents both immediate and long‑term remediation steps.
Problem Background
In June 2025, an e‑commerce platform’s web servers (24‑core CPU, 32 GB RAM) triggered a CPU load alert at 14:30 during a mid‑year promotion. Load average jumped from 5‑8 to over 42, while P99 response time rose from 200 ms to 3800 ms, causing timeouts and 502 errors.
1. First Reaction
Run uptime to see load averages of 42.35 / 28.17 / 15.42 (1‑, 5‑ and 15‑minute). On a 24‑core machine the 1‑minute value works out to ~1.76 runnable tasks per core, and the fact that it is well above the 5‑ and 15‑minute values means load is still climbing: the system is saturated.
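As a quick sanity check, the load figures can be put next to the core count directly on the box (the output below is illustrative, not the incident's exact capture):
$ uptime
 14:32:01 up 86 days,  3:12,  2 users,  load average: 42.35, 28.17, 15.42
$ nproc   # number of online cores to divide the load average by
24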
2. Global Scan – Qualitative Bottleneck Identification
2.1 top – CPU time distribution
$ top
...
%Cpu(s): 8.5 us, 5.2 sy, 32.1 id, 53.8 wa
The key finding is I/O wait (wa) = 53.8 %, while idle (id) = 32.1 % and user+system account for only 13.7 %, showing an I/O bottleneck.
2.2 vmstat – Confirm blocking
$ vmstat 1 5
...
b 38 # blocked (uninterruptible) processes
r 3  # run queue length
38 processes are blocked on I/O while the CPU run queue is short, confirming that the bottleneck lies in the I/O subsystem.
2.3 free – Memory and swap
$ free -h -w
Mem: 31G total, 14G used, 13G available
Swap: 4G total, 3G used
Memory is sufficient, but swap usage indicates some pressure.
2.4 First‑stage Summary
Bottleneck type: I/O (wa = 53.8 %, b = 38)
Resource status: CPU has headroom, memory OK, modest swap
Next step: Identify which process generates the heavy I/O.
3. Locating the I/O Source
3.1 iostat – Disk activity
$ iostat -xdm 1 3
Device w/s wkB/s w_await %util
vda     4500.3   360000   112.3     98.7
~4500 writes per second, an average write latency of 112 ms, and 98.7 % utilization indicate a saturated SSD.
3.2 iotop – Process‑level I/O
$ iotop -o
... java 320 MB/s write, 95.2 % I/O wait
... nginx 3.84 MB/s read
The Java process accounts for almost all disk writes.
3.3 pidstat -d – Per‑process I/O
$ pidstat -d 1 3
PID kB_wr/s Command
3456   320000   java
3.4 File‑level verification
Using lsof shows the Java process keeps file descriptors open for app.log and app.log.1. The latter is a rotated log that is still being written to because the process holds the old file handle.
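A minimal sketch of that file‑level check, assuming the Java PID is 3456 (as pidstat reported) and an illustrative log path; the columns follow standard lsof output:
$ lsof -p 3456 | grep app.log
java  3456  app  4w  REG  253,1   18963804160  131074  /data/logs/app.log
java  3456  app  5w  REG  253,1  412316860416  131075  /data/logs/app.log.1
The 5w entry is the write descriptor still pointing at the rotated file.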
4. Root‑Cause Confirmation
4.1 Findings
Debug logging was enabled during the promotion, increasing log volume from ~20 MB/min to ~20 GB/min.
logrotate used the create option, which renames the current log and creates a new file, leaving the Java process writing to the renamed app.log.1.
The massive write volume plus double‑file writes saturated the disk, causing system‑wide I/O blockage and load spikes.
4.2 Immediate Recovery Steps
Force the Java process to release its handle to the rotated log, e.g., by triggering a log4j2 reconfiguration so the appender reopens app.log, or by truncating the descriptor directly via : > /proc/<PID>/fd/<FD_NUMBER> (see the sketch after this list).
Switch the log level back to WARN/INFO.
Verify I/O normalization with iostat and uptime.
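A sketch of those recovery steps under the same assumptions (PID 3456; read the actual descriptor number from lsof rather than guessing it):
# Find which descriptor still points at the rotated log
$ lsof -p 3456 | grep 'app.log.1'
# Truncate that descriptor in place (here fd 5) to free the space without restarting the JVM
$ : > /proc/3456/fd/5
# Confirm that write pressure and load are dropping
$ iostat -xdm 1 3
$ uptime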
4.3 Long‑Term Fixes
Change logrotate to use copytruncate so the process continues writing to the same file (see the config sketch after this list).
Enable asynchronous logging (log4j2 AsyncAppender or logback AsyncAppender).
Separate /var/log onto its own partition.
Formalize log‑level change approvals and automatic rollback.
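A minimal logrotate stanza illustrating the copytruncate change; the path, rotation schedule, and retention are assumptions rather than the platform's actual policy:
# /etc/logrotate.d/app (illustrative)
/data/logs/app.log {
    daily
    rotate 7
    compress
    delaycompress
    copytruncate    # copy the log, then truncate it in place, so the JVM's fd stays valid
    missingok
    notifempty
}
The trade‑off of copytruncate is a small window in which lines written between the copy and the truncate can be lost, which is usually acceptable for application logs.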
5. Additional Production‑Level Cases
Case A – MySQL double‑write causing disk saturation
High‑load MySQL with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 leads to >70 % I/O wait; temporary mitigation by lowering durability, long‑term by using higher‑IOPS disks and group commit.
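A hedged sketch of the temporary mitigation for Case A, assuming it is acceptable to lose up to roughly one second of redo/binlog durability during the emergency:
$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"   # write redo log at each commit, fsync once per second
$ mysql -e "SET GLOBAL sync_binlog = 100;"                    # fsync the binlog every 100 commit groups instead of every commit
Both variables are dynamic, so no restart is needed; revert them once the disk or commit pattern has been fixed.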
Case B – TIME_WAIT accumulation
Short‑lived connections leave ≈62 000 sockets in TIME_WAIT, exhausting local ports and causing new connections to be refused; immediate relief by enabling tcp_tw_reuse and reducing tcp_fin_timeout, long‑term by using connection pooling.
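A sketch of the immediate relief for Case B (values illustrative; persist them in /etc/sysctl.conf only after validation, and note that tcp_tw_reuse only helps outbound connections):
$ ss -ant state time-wait | wc -l        # count sockets currently in TIME_WAIT
$ sysctl -w net.ipv4.tcp_tw_reuse=1
$ sysctl -w net.ipv4.tcp_fin_timeout=30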
Case C – unattended‑upgrades I/O spike on a small VM
Ubuntu’s unattended-upgrades triggers heavy disk reads; stop the timer or disable the service to prevent I/O saturation.
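A sketch of the Case C mitigation on Ubuntu (unit names as shipped on recent releases; confirm them on the host first):
$ systemctl list-timers | grep apt                       # confirm which apt timers are active
$ systemctl stop apt-daily.timer apt-daily-upgrade.timer
$ systemctl disable --now unattended-upgrades.service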
6. Netflix “60‑Second Rule” in Practice
The investigation follows the Netflix performance‑engineering “60‑second rule”: run a predefined set of commands within the first minute after login to obtain a complete system snapshot. The command list is: uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, top, ss.
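A commonly used concrete form of that checklist, with typical arguments filled in (the exact flags are a convention, not something the rule mandates):
$ uptime
$ dmesg | tail
$ vmstat 1
$ mpstat -P ALL 1
$ pidstat 1
$ iostat -xz 1
$ free -m
$ sar -n DEV 1
$ top
$ ss -s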
7. Production‑Level Investigation Guidelines
Do not blindly reboot a high‑load system; preserve evidence.
Collect all relevant metrics before taking corrective actions.
Base every judgment on concrete data (e.g., wa = 53.8 %).
Apply changes gradually (canary / gray release) to a subset of nodes first.
Back up configuration files before modification.
Record who performed what operation and when.
8. Conclusion
Effective load‑high troubleshooting is about pinpointing the true bottleneck—CPU, I/O, memory, or network. In this case, mis‑configured log rotation combined with unchecked debug logging created an I/O bottleneck that cascaded to system‑wide latency. Incorporating the described SOP and the 60‑second rule helps teams resolve similar incidents quickly.