Step‑by‑Step Investigation of a High‑Load Production Server
During a mid‑year promotion, an e‑commerce platform experienced a sudden spike in load average and response latency. This article walks through a systematic, command‑driven investigation that identifies an I/O bottleneck caused by mis‑configured log rotation and excessive debug logging, and presents both immediate and long‑term remediation steps.
Problem Background
In June 2025, an e‑commerce platform’s web servers (24‑core CPU, 32 GB RAM) triggered a CPU load alert at 14:30 during a mid‑year promotion. Load average jumped from 5‑8 to over 42, while P99 response time rose from 200 ms to 3800 ms, causing timeouts and 502 errors.
1. First Reaction
Run uptime to see load averages of 42.35 / 28.17 / 15.42 (1‑, 5‑ and 15‑minute). On a 24‑core machine the 1‑minute value works out to ~1.76 runnable tasks per core, and the fact that it is well above the 5‑ and 15‑minute values means load is still climbing: the system is saturated.
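As a quick sanity check, the load figures can be put next to the core count directly on the box (the output below is illustrative, not the incident's exact capture):
$ uptime
 14:32:01 up 86 days,  3:12,  2 users,  load average: 42.35, 28.17, 15.42
$ nproc   # number of online cores to divide the load average by
24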
2. Global Scan – Qualitative Bottleneck Identification
2.1 top – CPU time distribution
$ top
...
%Cpu(s): 8.5 us, 5.2 sy, 32.1 id, 53.8 wa
The key finding is I/O wait (wa) = 53.8 %, while idle (id) = 32.1 % and user+system account for only 13.7 %, showing an I/O bottleneck.
2.2 vmstat – Confirm blocking
$ vmstat 1 5
...
b 38 # blocked (uninterruptible) processes
r 3  # run queue length
38 processes are blocked on I/O while the CPU run queue is short, confirming that the bottleneck lies in the I/O subsystem.
2.3 free – Memory and swap
$ free -h -w
Mem: 31G total, 14G used, 13G available
Swap: 4G total, 3G used
Memory is sufficient, but swap usage indicates some pressure.
2.4 First‑stage Summary
Bottleneck type: I/O (wa = 53.8 %, b = 38)
Resource status: CPU has headroom, memory OK, modest swap
Next step: Identify which process generates the heavy I/O.
3. Locating the I/O Source
3.1 iostat – Disk activity
$ iostat -xdm 1 3
Device w/s wkB/s w_await %util
vda     4500.3   360000   112.3     98.7
~4500 writes per second, an average write latency of 112 ms, and 98.7 % utilization indicate a saturated SSD.
3.2 iotop – Process‑level I/O
$ iotop -o
... java 320 MB/s write, 95.2 % I/O wait
... nginx 3.84 MB/s read
The Java process accounts for almost all disk writes.
3.3 pidstat -d – Per‑process I/O
$ pidstat -d 1 3
PID kB_wr/s Command
3456   320000   java
3.4 File‑level verification
Using lsof shows the Java process keeps file descriptors open for app.log and app.log.1. The latter is a rotated log that is still being written to because the process holds the old file handle.
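A minimal sketch of that file‑level check, assuming the Java PID is 3456 (as pidstat reported) and an illustrative log path; the columns follow standard lsof output:
$ lsof -p 3456 | grep app.log
java  3456  app  4w  REG  253,1   18963804160  131074  /data/logs/app.log
java  3456  app  5w  REG  253,1  412316860416  131075  /data/logs/app.log.1
The 5w entry is the write descriptor still pointing at the rotated file.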
4. Root‑Cause Confirmation
4.1 Findings
Debug logging was enabled during the promotion, increasing log volume from ~20 MB/min to ~20 GB/min.
logrotate used the create option, which renames the current log and creates a new file, leaving the Java process writing to the renamed app.log.1.
The massive write volume plus double‑file writes saturated the disk, causing system‑wide I/O blockage and load spikes.
4.2 Immediate Recovery Steps
Force the Java process to release its handle to the rotated log, e.g., by triggering a log4j2 reconfiguration so the appender reopens app.log, or by truncating the descriptor directly via : > /proc/<PID>/fd/<FD_NUMBER> (see the sketch after this list).
Switch the log level back to WARN/INFO.
Verify I/O normalization with iostat and uptime.
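A sketch of those recovery steps under the same assumptions (PID 3456; read the actual descriptor number from lsof rather than guessing it):
# Find which descriptor still points at the rotated log
$ lsof -p 3456 | grep 'app.log.1'
# Truncate that descriptor in place (here fd 5) to free the space without restarting the JVM
$ : > /proc/3456/fd/5
# Confirm that write pressure and load are dropping
$ iostat -xdm 1 3
$ uptime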
4.3 Long‑Term Fixes
Change logrotate to use copytruncate so the process continues writing to the same file (see the config sketch after this list).
Enable asynchronous logging (log4j2 AsyncAppender or logback AsyncAppender).
Separate /var/log onto its own partition.
Formalize log‑level change approvals and automatic rollback.
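A minimal logrotate stanza illustrating the copytruncate change; the path, rotation schedule, and retention are assumptions rather than the platform's actual policy:
# /etc/logrotate.d/app (illustrative)
/data/logs/app.log {
    daily
    rotate 7
    compress
    delaycompress
    copytruncate    # copy the log, then truncate it in place, so the JVM's fd stays valid
    missingok
    notifempty
}
The trade‑off of copytruncate is a small window in which lines written between the copy and the truncate can be lost, which is usually acceptable for application logs.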
5. Additional Production‑Level Cases
Case A – MySQL double‑write causing disk saturation
High‑load MySQL with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 leads to >70 % I/O wait; temporary mitigation by lowering durability, long‑term by using higher‑IOPS disks and group commit.
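A hedged sketch of the temporary mitigation for Case A, assuming it is acceptable to lose up to roughly one second of redo/binlog durability during the emergency:
$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"   # write redo log at each commit, fsync once per second
$ mysql -e "SET GLOBAL sync_binlog = 100;"                    # fsync the binlog every 100 commit groups instead of every commit
Both variables are dynamic, so no restart is needed; revert them once the disk or commit pattern has been fixed.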
Case B – TIME_WAIT accumulation
Short‑lived connections leave ≈62 000 sockets in TIME_WAIT, exhausting local ports and causing new connections to be refused; immediate relief by enabling tcp_tw_reuse and reducing tcp_fin_timeout, long‑term by using connection pooling.
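A sketch of the immediate relief for Case B (values illustrative; persist them in /etc/sysctl.conf only after validation, and note that tcp_tw_reuse only helps outbound connections):
$ ss -ant state time-wait | wc -l        # count sockets currently in TIME_WAIT
$ sysctl -w net.ipv4.tcp_tw_reuse=1
$ sysctl -w net.ipv4.tcp_fin_timeout=30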
Case C – unattended‑upgrades I/O spike on a small VM
Ubuntu’s unattended-upgrades triggers heavy disk reads; stop the timer or disable the service to prevent I/O saturation.
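A sketch of the Case C mitigation on Ubuntu (unit names as shipped on recent releases; confirm them on the host first):
$ systemctl list-timers | grep apt                       # confirm which apt timers are active
$ systemctl stop apt-daily.timer apt-daily-upgrade.timer
$ systemctl disable --now unattended-upgrades.service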
6. Netflix “60‑Second Rule” in Practice
The investigation follows the Netflix performance‑engineering “60‑second rule”: run a predefined set of commands within the first minute after login to obtain a complete system snapshot. The command list is: uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, top, ss.
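A commonly used concrete form of that checklist, with typical arguments filled in (the exact flags are a convention, not something the rule mandates):
$ uptime
$ dmesg | tail
$ vmstat 1
$ mpstat -P ALL 1
$ pidstat 1
$ iostat -xz 1
$ free -m
$ sar -n DEV 1
$ top
$ ss -s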
7. Production‑Level Investigation Guidelines
Do not blindly reboot a high‑load system; preserve evidence.
Collect all relevant metrics before taking corrective actions.
Base every judgment on concrete data (e.g., wa = 53.8 %).
Apply changes gradually (canary / gray release) to a subset of nodes first.
Back up configuration files before modification.
Record who performed what operation and when.
8. Conclusion
Effective load‑high troubleshooting is about pinpointing the true bottleneck—CPU, I/O, memory, or network. In this case, mis‑configured log rotation combined with unchecked debug logging created an I/O bottleneck that cascaded to system‑wide latency. Incorporating the described SOP and the 60‑second rule helps teams resolve similar incidents quickly.