Master Linux Performance Troubleshooting in the First 60 Seconds
When a Linux server has a performance problem, the first minute matters most. This guide walks through the ten standard command-line tools that Netflix’s performance team runs in the first 60 seconds to assess system health and narrow down the cause.
We first examine metrics related to errors and resource saturation, then look at utilization, following the USE method (Utilization, Saturation, Errors) across CPU, memory, disk, and network.
1. uptime
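$ uptime
Shows the load averages over the last 1, 5, and 15 minutes. On Linux these count tasks that are running or waiting for a CPU, plus tasks blocked in uninterruptible I/O (usually disk I/O). A 1-minute value well above the 15-minute value suggests the load is rising and the problem is recent; a much lower 1-minute value suggests the load has already dropped.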
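The comparison between the 1-minute and 15-minute values can be sketched in Python. This is a minimal illustration, assuming the text comes from `/proc/loadavg` (three load averages followed by other fields); the function name `load_trend` and the 1.5x ratio are assumptions chosen for the example, not values from the original guide.

```python
def load_trend(loadavg_text, ratio=1.5):
    """Classify the load trend from the contents of /proc/loadavg."""
    one, _five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    if one > fifteen * ratio:
        return "rising"    # recent load is well above the long-term average
    if fifteen > one * ratio:
        return "falling"   # load was higher earlier and has dropped
    return "steady"

# Made-up sample in /proc/loadavg format, for illustration only.
sample = "30.02 26.43 19.02 33/3052 11541"
print(load_trend(sample))
```

On a real host the input could be read with `open("/proc/loadavg").read()`.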
2. dmesg | tail
$ dmesg | tail
Displays the most recent kernel messages, if there are any. Look for OOM-killer events, device errors, or dropped TCP requests that could affect performance. Do not skip this step: dmesg is always worth checking.
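Scanning kernel messages for the kinds of events mentioned above can be sketched as a simple filter. The pattern list and the sample log lines below are illustrative assumptions, written in the style of kernel messages; real hosts will need patterns tuned to the errors that matter locally.

```python
import re

# Illustrative patterns only; extend them for your own environment.
SUSPECT = re.compile(
    r"out of memory|oom-killer|i/o error|possible syn flooding",
    re.IGNORECASE,
)

def suspicious_lines(dmesg_text):
    """Return the kernel log lines that match a known trouble pattern."""
    return [line for line in dmesg_text.splitlines() if SUSPECT.search(line)]

# Made-up sample lines, for illustration only.
sample = """\
[1000.100000] perl invoked oom-killer: gfp_mask=0x280da, order=0
[1000.200000] Out of memory: Kill process 1234 (perl) score 246
[2000.300000] TCP: Possible SYN flooding on port 7001. Dropping request.
[2000.400000] eth0: link up
"""

for line in suspicious_lines(sample):
    print(line)
```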
3. vmstat 1
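$ vmstat 1
Prints virtual memory, CPU, and I/O statistics once per second. The first line of output shows averages since boot rather than the last second. Key fields:
r : tasks running on a CPU or waiting for one. A value greater than the CPU count indicates CPU saturation.
free : free memory in KB.
si, so : memory swapped in from and out to disk. Non-zero values indicate memory pressure.
us, sy, id, wa, st : percentage of CPU time spent in user, system, idle, I/O wait, and steal.
A consistently high wa points to an I/O bottleneck; us + sy above 90% shows the CPUs are busy.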
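The checks on r, si/so, and us + sy can be sketched as follows. This is a minimal illustration that assumes the header and one data row of `vmstat` output are passed in as strings; the function name `check_vmstat` and the sample values are made up for the example.

```python
def check_vmstat(header, row, cpu_count):
    """Apply the r, si/so, and us+sy checks to one vmstat data row."""
    fields = dict(zip(header.split(), (int(v) for v in row.split())))
    findings = []
    if fields["r"] > cpu_count:
        findings.append("CPU saturation: r exceeds the CPU count")
    if fields["si"] or fields["so"]:
        findings.append("memory pressure: swapping is active")
    if fields["us"] + fields["sy"] > 90:
        findings.append("CPUs busy: us + sy above 90%")
    return findings

# Made-up sample for a 32-CPU host, for illustration only.
header = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"

for finding in check_vmstat(header, row, cpu_count=32):
    print(finding)
```

Mapping values by header name rather than by position keeps the sketch tolerant of extra columns. On a real host, `os.cpu_count()` could supply the CPU count.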
4. mpstat -P ALL 1
$ mpstat -P ALL 1
Prints per-CPU utilization each second. An uneven distribution, such as a single CPU pegged while the others sit idle, can be evidence of a single-threaded workload.
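Spotting a single hot CPU can be sketched as below. The input is assumed to be a mapping from CPU number to busy percentage (100 minus %idle), already extracted from the `mpstat` output; the function name `hot_cpus` and both thresholds are illustrative assumptions.

```python
from statistics import mean

def hot_cpus(busy_by_cpu, hot=90.0, quiet=30.0):
    """Return CPUs at or above `hot` percent busy while the rest average below `quiet`."""
    result = []
    for cpu, busy in busy_by_cpu.items():
        others = [b for c, b in busy_by_cpu.items() if c != cpu]
        if busy >= hot and others and mean(others) < quiet:
            result.append(cpu)
    return result

# Made-up busy percentages per CPU, for illustration only.
sample = {0: 99.0, 1: 4.0, 2: 6.5, 3: 3.0}
print(hot_cpus(sample))
```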
5. pidstat 1
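$ pidstat 1
Shows CPU usage per process at one-second intervals as rolling output, which makes it useful for spotting runaway processes. %CPU is summed across all CPUs, so a Java process reported at several hundred percent is consuming several CPUs.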
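Because %CPU is summed across CPUs, converting it into a CPU count is simple arithmetic. The sketch below uses a made-up %CPU figure and CPU count; the function name `cpu_footprint` is hypothetical.

```python
def cpu_footprint(percent_cpu, cpu_count):
    """Convert a summed %CPU value into CPUs consumed and share of the host."""
    cpus = percent_cpu / 100.0
    return cpus, cpus / cpu_count

# Made-up example: a process reported at 1591% on a 32-CPU host.
cpus, share = cpu_footprint(1591.0, 32)
print(f"{cpus:.2f} CPUs, {share:.0%} of the host")
```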
6. iostat -xz 1
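$ iostat -xz 1
Provides detailed block-device statistics. Important metrics:
r/s, w/s, rkB/s, wkB/s : reads and writes per second, and kilobytes read and written per second.
await : average I/O response time in milliseconds, including time spent queued. Values higher than expected for the device can indicate saturation or a device problem.
avgqu-sz : average number of requests queued or in service at the device. Values above 1 can indicate saturation, although devices that serve requests in parallel can tolerate more.
%util : percentage of time the device was busy. Values above 60% often come with degraded performance.
For a logical device backed by several physical disks, a high %util does not necessarily mean saturation: it only shows that some I/O was in flight most of the time, and the back-end disks may still have capacity.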
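The queue-length and utilization thresholds above can be sketched as a small check. The input is assumed to be a dictionary of one device's `iostat -xz` columns that has already been parsed; the function name `check_device` and the sample values are made up. The lookup also accepts `aqu-sz`, the name newer sysstat releases use for the queue-length column.

```python
def check_device(stats, queue_limit=1.0, util_limit=60.0):
    """Flag a block device whose queue length or utilization exceeds the limits."""
    findings = []
    # Older sysstat prints avgqu-sz; newer releases call the column aqu-sz.
    queue = stats.get("avgqu-sz", stats.get("aqu-sz", 0.0))
    if queue > queue_limit:
        findings.append(f"queue length {queue} above {queue_limit}")
    util = stats.get("%util", 0.0)
    if util > util_limit:
        findings.append(f"utilization {util}% above {util_limit}%")
    return findings

# Made-up sample values for one device, for illustration only.
sample = {"r/s": 120.0, "w/s": 35.0, "await": 18.4, "avgqu-sz": 2.7, "%util": 83.5}

for finding in check_device(sample):
    print(finding)
```

The limits are parameters because, as noted above, a device that serves requests in parallel can tolerate a longer queue.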
7. free -m
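$ free -m
Shows memory usage in megabytes. Check the “buffers” and “cached” columns: buffers hold block-device I/O data and cached is the file system page cache, and both can be reclaimed when applications need memory. The “-/+ buffers/cache” line subtracts them from used memory, which gives a more accurate view of what applications actually use. Newer versions of free drop that line and report an “available” column instead.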
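The arithmetic behind the “-/+ buffers/cache” line can be sketched as follows, assuming the older `free` layout in which the used column includes buffers and cache. The function name `app_memory_view` and the sample numbers are made up.

```python
def app_memory_view(used, free, buffers, cached):
    """Return (used by applications, effectively free) in the same unit as the inputs."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable

# Made-up values in megabytes for a 16000 MB host, for illustration only.
app_used, effectively_free = app_memory_view(used=12000, free=4000, buffers=500, cached=6500)
print(app_used, effectively_free)
```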
8. sar -n DEV 1
$ sar -n DEV 1
Reports per-interface network throughput (rxkB/s, txkB/s) and utilization (%ifutil) each second. Throughput near the interface limit, or %ifutil near 100%, indicates a network bottleneck.
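When %ifutil is not reported, utilization can be estimated from the throughput columns and the link speed. The sketch below assumes that 1 kB in the output means 1024 bytes, and that a full-duplex link is judged by the busier direction while a half-duplex link is judged by the sum. The function name `ifutil_percent` and the sample figures are made up.

```python
def ifutil_percent(rx_kb_s, tx_kb_s, speed_mbit, full_duplex=True):
    """Estimate interface utilization as a percentage of link speed."""
    kb_s = max(rx_kb_s, tx_kb_s) if full_duplex else rx_kb_s + tx_kb_s
    bits_per_second = kb_s * 1024 * 8  # assumption: 1 kB = 1024 bytes
    return 100.0 * bits_per_second / (speed_mbit * 1_000_000)

# Made-up sample: about 22000 kB/s received on a 1 Gbit/s link.
print(round(ifutil_percent(21999.10, 482.56, speed_mbit=1000), 2))
```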
9. sar -n TCP,ETCP 1
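$ sar -n TCP,ETCP 1
Shows key TCP metrics: active/s (connections opened locally per second), passive/s (connections accepted from remote hosts per second), and retrans/s (retransmitted segments per second). Frequent retransmissions point to a network or server problem, such as an unreliable link or an overloaded server dropping packets.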
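One way to put retrans/s in context is to compare it with oseg/s, the segments sent per second that sar reports in the same TCP output. This ratio is an illustrative suggestion rather than a step from the original checklist; the function name `retransmit_percent` and the sample values are made up.

```python
def retransmit_percent(retrans_per_s, oseg_per_s):
    """Return retransmitted segments as a percentage of segments sent."""
    if oseg_per_s <= 0:
        return 0.0  # nothing sent in this interval, so no meaningful ratio
    return 100.0 * retrans_per_s / oseg_per_s

# Made-up sample: 5 retransmits per second against 1000 segments sent.
print(retransmit_percent(5.0, 1000.0))
```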
10. top
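$ top
Aggregates many of the above metrics in real time, including process-level CPU and memory usage. Use it to check whether anything looks very different from the earlier commands, which would mean the load is variable. Because top redraws the screen, patterns over time are harder to see than in the rolling output of vmstat or pidstat.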
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
