
Master Linux Performance Troubleshooting in the First 60 Seconds

This guide walks through the ten Linux command-line tools that Netflix's performance team runs to assess system health. It looks at error and saturation metrics before utilization, so you can form a high-level picture of a server problem within the first minute.

Efficient Ops

Jul 12, 2022

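When a Linux server has a performance problem, the first minute of investigation matters most. These are the ten standard command-line tools that Netflix's performance team runs in the first 60 seconds. Several of them (mpstat, pidstat, iostat, sar) come from the sysstat package, which may need to be installed first.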

The checks look first at error and saturation metrics, which are easy to interpret, and then at utilization. This follows the USE method (Utilization, Saturation, Errors), applied to each resource in turn: CPU, memory, disk, and network.

1. uptime

$ uptime

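Shows the load averages over the last 1, 5, and 15 minutes. On Linux, the load average counts tasks that are running or waiting for a CPU as well as tasks blocked in uninterruptible I/O (usually disk I/O), so it indicates demand on the system rather than CPU usage alone. Comparing the three values shows the trend: a 1-minute value well above the 15-minute value suggests the problem started recently, while a 1-minute value well below it suggests the load is already subsiding.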
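The comparison of the 1-minute and 15-minute values can be sketched as a small shell function. This is only an illustration: the function name `load_trend` and the 1.5x ratio are assumptions chosen for the example, not thresholds from the original article.

```shell
# Classify the load trend from the three load averages.
# Input (stdin): a line in /proc/loadavg format, e.g. "3.20 1.10 0.50 2/345 6789".
# The 1.5x ratio is an arbitrary, illustrative threshold.
load_trend() {
  awk '{
    if ($1 > $3 * 1.5)      print "rising"    # 1-min well above 15-min
    else if ($3 > $1 * 1.5) print "falling"   # 1-min well below 15-min
    else                    print "steady"
  }'
}

# Usage on a live system:
#   load_trend < /proc/loadavg
```

A "rising" result corresponds to the case described above, where the problem likely started within the last few minutes.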

2. dmesg | tail

$ dmesg | tail

Displays the last 10 kernel messages, if there are any. Look for OOM-killer events, device errors, or TCP messages such as dropped requests, any of which can explain a performance problem. This check is quick and always worth running.
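On a busy host the last ten lines may not contain the interesting messages, so it can help to filter the kernel log first. The sketch below assumes a hypothetical helper name, `kernel_red_flags`, and a short pattern list picked for illustration; it is not an exhaustive set of kernel error strings.

```shell
# Keep only kernel log lines that often accompany performance problems.
# Input (stdin): kernel log text, such as the output of dmesg.
# The pattern list is an illustrative assumption, not a complete catalogue.
kernel_red_flags() {
  grep -i -E 'out of memory|oom-killer|i/o error|syn flooding'
}

# Usage on a live system (dmesg may require root on some distributions):
#   dmesg | kernel_red_flags | tail
```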

3. vmstat 1

$ vmstat 1

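Shows virtual memory, CPU, and I/O statistics, printing one summary line per second. The first line of output shows averages since boot; the following lines are per-second samples. Key fields:

r : tasks running on a CPU or waiting for one. A value consistently greater than the CPU count indicates CPU saturation.

free : free memory in KB.

si, so : memory swapped in from and out to disk per second. Non-zero values mean the system is short of memory.

us, sy, id, wa, st : percentage of CPU time spent in user, system, idle, I/O wait, and steal.

A consistently high wa points to a disk bottleneck, since the CPUs sit idle while waiting on I/O. When us + sy exceeds about 90%, the CPUs are busy; check r to see whether they are also saturated.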
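The two saturation checks described for vmstat (r compared with the CPU count, and non-zero si/so) can be sketched as a filter over its output. The function name `vmstat_flags` is hypothetical, and the field positions assume the common column order shown in the comment; other vmstat options or versions may lay the columns out differently.

```shell
# Flag saturation in `vmstat 1` output read from stdin.
# $1 = number of CPUs (for example, the output of nproc).
# Assumes the common column order:
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_flags() {
  awk -v ncpu="$1" '
    $1 ~ /^[0-9]+$/ {            # data rows only; header rows start with text
      n++
      msg = ""
      if ($1 > ncpu)        msg = msg " cpu-saturated(r=" $1 ")"
      if ($7 > 0 || $8 > 0) msg = msg " swapping(si=" $7 ",so=" $8 ")"
      if (msg == "")        msg = " ok"
      print "sample " n ":" msg
    }'
}

# Usage on a live system (five one-second samples):
#   vmstat 1 5 | vmstat_flags "$(nproc)"
```

Remember that the first sample reported by vmstat is the since-boot average, so a flag on sample 1 says little about the current state.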

4. mpstat -P ALL 1

$ mpstat -P ALL 1

Prints a CPU time breakdown for each CPU every second. A single CPU that is much busier than the rest can indicate a single-threaded application.

5. pidstat 1

$ pidstat 1

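Shows per-process CPU usage as a rolling summary every second, which makes patterns over time easier to see than top's screen refresh. The %CPU column is summed across all CPUs, so a multithreaded process such as a Java application can show well over 100%; a value of 1600%, for example, means the process is using about 16 CPUs.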

6. iostat -xz 1

$ iostat -xz 1

Provides detailed block‑device statistics. Important metrics:

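r/s, w/s, rkB/s, wkB/s : reads, writes, read KB, and write KB delivered to the device per second. Use these to characterize the workload.

await : average time per I/O in milliseconds, including time queued and time being serviced. Values higher than expected for the device indicate saturation or a device problem.

avgqu-sz : average number of requests issued to the device (named aqu-sz in newer sysstat releases). Values above 1 can indicate saturation, although devices that front multiple back-end disks can service requests in parallel.

%util : percentage of time the device was busy. Values above 60% typically lead to worse performance, and values near 100% usually indicate saturation.

For a logical device that fronts many back-end disks, 100% utilization may only mean that some I/O was in flight all the time; the underlying disks may still be far from saturated.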
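The %util threshold can be sketched as a filter over iostat output. The function name `busy_devices` is an assumption for this example. Because the column layout differs between sysstat versions, the sketch finds the %util column from the header line instead of using a fixed position.

```shell
# Print devices whose %util exceeds a limit (default 60).
# Input (stdin): output of `iostat -xz`.
# The %util column is located from the "Device" header line because
# column positions differ between sysstat versions.
busy_devices() {
  awk -v limit="${1:-60}" '
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    col && NF >= col && $col + 0 > limit { print $1, $col }
  '
}

# Usage on a live system (three one-second reports):
#   iostat -xz 1 3 | busy_devices 60
```

A device listed here is only a candidate for investigation; as noted above, a logical device can report high utilization without its back-end disks being saturated.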

7. free -m

$ free -m

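Shows memory usage in megabytes. Check that the “buffers” (block device I/O cache) and “cached” (page cache) values are not near zero, since very small caches can lead to more disk I/O. On older versions of free, the “-/+ buffers/cache” line gives a less confusing view of used and free memory, because Linux reclaims cache memory when applications need it. Newer versions drop that line and show “buff/cache” and “available” columns instead.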
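The arithmetic behind the “-/+ buffers/cache” line is simply used minus buffers minus cached. The sketch below, with the hypothetical name `app_used_mb`, applies that to the older free layout and passes the used column through unchanged for the newer layout, on the assumption that newer versions already exclude buffers and cache from used. Since it works from the rounded megabyte figures, the result can differ by 1 MB from what free itself prints.

```shell
# Estimate memory used by applications, excluding buffers and cache.
# Input (stdin): output of `free -m`.
# Older layout has "buffers" and "cached" columns; newer layout has
# "buff/cache" and "available", and its "used" is assumed to exclude cache.
app_used_mb() {
  awk '
    NR == 1 {                      # header row has no leading row label,
      for (i = 1; i <= NF; i++)    # so data column = header position + 1
        col[$i] = i + 1
      next
    }
    $1 == "Mem:" {
      if ("buffers" in col)
        print $(col["used"]) - $(col["buffers"]) - $(col["cached"])
      else
        print $(col["used"])
    }'
}

# Usage on a live system:
#   free -m | app_used_mb
```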

8. sar -n DEV 1

$ sar -n DEV 1

Reports network interface throughput (rxkB/s and txkB/s) as a measure of workload, along with %ifutil, the interface utilization. Throughput near the link's capacity, or %ifutil near 100%, indicates that the interface is a bottleneck.

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1

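Summarizes key TCP metrics: active/s, the number of locally initiated connections per second (via connect()); passive/s, the number of remotely initiated connections per second (via accept()); and retrans/s, the number of retransmitted segments per second. Retransmits are a sign of a network or server problem, such as an unreliable network or an overloaded server dropping packets.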

10. top

$ top

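Shows many of the metrics checked earlier, including load averages, CPU states, memory usage, and per-process CPU and memory. Run it last to see whether anything looks very different from the earlier commands, which would suggest the load is variable. Because top redraws the screen, patterns over time are harder to spot than with the rolling output of vmstat or pidstat.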
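The ten commands run indefinitely when given only an interval, which is fine interactively but awkward in a script. One possible way to capture the whole checklist, sketched below, is to give each interval command a sample count and run top in batch mode. The function name `sixty_second_checklist` and the default of 5 samples are assumptions for this example; the function only prints the commands so they can be reviewed before being run.

```shell
# Print the ten checklist commands with a bounded sample count.
# $1 = samples per interval command (default 5, an arbitrary choice).
# The commands are printed, not executed, so they can be reviewed first.
sixty_second_checklist() {
  n="${1:-5}"
  cat <<EOF
uptime
dmesg | tail
vmstat 1 $n
mpstat -P ALL 1 $n
pidstat 1 $n
iostat -xz 1 $n
free -m
sar -n DEV 1 $n
sar -n TCP,ETCP 1 $n
top -b -n 1
EOF
}

# Usage: review the list, then pipe it to a shell and save the output.
#   sixty_second_checklist 5
#   sixty_second_checklist 5 | sh > checklist.txt 2>&1
```

Running the commands this way requires the sysstat package for mpstat, pidstat, iostat, and sar, and dmesg may need root privileges on some distributions.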

