How Netflix Engineers Diagnose Linux Performance Issues in the First 60 Seconds

When a Linux server shows performance problems, Netflix's performance team recommends using ten standard CLI tools within the first minute to capture key utilization, saturation, and error metrics, enabling rapid root‑cause identification before deeper analysis is needed.

Liangxu Linux

Nov 3, 2024

When you suspect a Linux server is under-performing, the first 60 seconds are critical for gathering high-level metrics that reveal whether the problem lies in CPU, memory, disk, or network. Netflix's performance engineering team outlines a repeatable workflow using ten standard command-line utilities. Most ship with any Linux distribution; mpstat, pidstat, iostat, and sar come from the sysstat package, which may need to be installed first.

1. uptime

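The uptime command shows the system's load averages for the past 1, 5, and 15 minutes. A load consistently higher than the number of CPU cores suggests more demand than the CPUs can serve, although on Linux the figure also counts tasks blocked in uninterruptible I/O, so it is a prompt to look further rather than proof of CPU saturation. Comparing the 1-minute value with the 15-minute value shows whether load is rising or falling.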

$ uptime
 14:23:01 up 12 days,  3:45,  2 users,  load average: 1.32, 0.97, 0.78
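
As a rough sketch of the comparison described above, the following shell function divides the 1-minute load by a core count and reports the trend against the 15-minute value. The name `load_summary` is hypothetical, and the parsing assumes the Linux `load average:` wording shown in the sample.

```shell
# Hypothetical helper, not part of uptime itself.
# Reads one line of Linux `uptime` output on stdin; $1 is the CPU core count.
# Prints the 1-minute load per core and whether load is rising or not
# relative to the 15-minute average.
load_summary() {
    awk -v cores="$1" -F'load average: ' '
        NF > 1 {
            split($2, avg, /, */)   # avg[1]=1 min, avg[2]=5 min, avg[3]=15 min
            trend = (avg[1] + 0 > avg[3] + 0) ? "rising" : "falling-or-flat"
            printf "%.2f %s\n", avg[1] / cores, trend
        }'
}

# Usage on a live host (nproc is from GNU coreutils):
#   uptime | load_summary "$(nproc)"
```

A per-core value well above 1.00 together with "rising" is a reason to move straight on to vmstat and mpstat.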

2. dmesg | tail

Shows the most recent kernel messages, useful for spotting OOM kills, hardware errors, or network anomalies.

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.
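
When the kernel log is noisy, a small filter can surface the kinds of messages mentioned above. This is a minimal sketch: `kernel_red_flags` is a hypothetical name, and the pattern list is an illustrative assumption rather than a complete catalogue of kernel errors.

```shell
# Hypothetical filter: keeps kernel log lines on stdin that match a few
# well-known trouble signatures. The pattern list is an illustrative
# assumption, not an exhaustive catalogue.
kernel_red_flags() {
    grep -E -i 'oom-killer|out of memory|syn flooding|i/o error|hardware error'
}

# Usage on a live host (may need root, depending on kernel.dmesg_restrict):
#   dmesg | tail -n 50 | kernel_red_flags
```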

3. vmstat 1

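Provides per-second snapshots of virtual memory, CPU, and I/O statistics. Important fields include r (tasks running or waiting for a CPU), free (free memory in KB), si/so (swap-ins and swap-outs; nonzero values mean the system is short of memory), and the CPU time breakdown (us, sy, id, wa, st). The first line of output reports averages since boot, so focus on the lines that follow.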

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
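
The two checks that matter most here, run queue versus CPU count and any swap traffic, can be sketched as a filter over vmstat output. `vmstat_flags` is a hypothetical name, and the column positions assume the 17-column layout in the sample above.

```shell
# Hypothetical helper: reads `vmstat 1` output on stdin; $1 is the CPU count.
# For each data row it reports whether the run queue (r, column 1) exceeds the
# CPU count and whether any swap-in/swap-out (si/so, columns 7 and 8) occurred.
# Column positions assume the classic 17-column layout.
vmstat_flags() {
    awk -v cpus="$1" '
        $1 ~ /^[0-9]+$/ {            # data rows only; skips the two header lines
            sat  = ($1 + 0 > cpus + 0) ? "yes" : "no"
            swap = ($7 + $8 > 0)       ? "yes" : "no"
            printf "r=%d cpu_saturated=%s swapping=%s\n", $1, sat, swap
        }'
}

# Usage on a live host:
#   vmstat 1 5 | vmstat_flags "$(nproc)"
```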

4. mpstat -P ALL 1

Displays CPU usage per core every second, helping to identify uneven load or a single‑threaded bottleneck.

$ mpstat -P ALL 1
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
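
To spot a single hot CPU quickly, the per-CPU rows can be reduced to the one with the least idle time. This is a sketch under stated assumptions: `busiest_cpu` is a hypothetical name, the timestamp uses the 12-hour layout from the sample (so the CPU id is the third field), and %idle is taken to be the last column.

```shell
# Hypothetical helper: reads `mpstat -P ALL 1` output on stdin and prints the
# busiest individual CPU as "<cpu> <busy%>", where busy% = 100 - %idle.
# Assumes the 12-hour timestamp layout from the sample (time, AM/PM, CPU id in
# field 3) and that %idle is the last column.
busiest_cpu() {
    awk '
        $3 ~ /^[0-9]+$/ {                 # per-CPU rows only; skips "all" and headers
            busy = 100 - $NF
            if (!seen || busy > max) { max = busy; cpu = $3; seen = 1 }
        }
        END { if (seen) printf "%s %.2f\n", cpu, max }'
}

# Usage on a live host:
#   mpstat -P ALL 1 1 | busiest_cpu
```

If one CPU sits near 100% while the "all" row is low, a single-threaded bottleneck is a likely suspect.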

5. pidstat 1

Shows per-process CPU usage over time, similar to top but as a rolling log rather than a screen that clears on each refresh. %CPU is summed across all CPUs, so it can exceed 100% on multi-core systems: the java process below, at 1598%, is consuming almost 16 CPUs.

$ pidstat 1
07:41:02 PM   UID   PID   %usr %system %CPU   Command
07:41:03 PM   0   6521 1596.23  1.89 1598.11 java
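
The conversion from %CPU to a CPU count is a division by 100. The sketch below applies it to each row; `cores_used` is a hypothetical name, the %CPU column is located from the header row, and the command is assumed to be the last field and to contain no spaces.

```shell
# Hypothetical helper: reads `pidstat 1` output on stdin and prints
# "<command> <cores>" for each process, where cores = %CPU / 100.
# The %CPU column is located from the header row; the command is assumed to be
# the last field and to contain no spaces.
cores_used() {
    awk '
        /%CPU/ {                          # header row: remember where %CPU is
            for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i
            next
        }
        col && $col ~ /^[0-9.]+$/ { printf "%s %.1f\n", $NF, $col / 100 }'
}

# Usage on a live host:
#   pidstat 1 5 | cores_used
```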

6. iostat -xz 1

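Reports block-device statistics. Key metrics are r/s and w/s (reads and writes per second), rkB/s and wkB/s (throughput), await (average I/O time in milliseconds, including time spent queued), avgqu-sz (average queue length), and %util (the percentage of time the device was busy). Sustained %util above roughly 60% usually shows up as rising await, and values near 100% generally indicate saturation.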

$ iostat -xz 1
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda   0.00   0.23 0.21 0.18 4.52 2.08 34.37   0.00   9.98   13.80   5.42 2.44 0.09
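
A threshold check on %util can be sketched as follows. `busy_disks` is a hypothetical name, the threshold is whatever you pass in, and the function assumes %util is the last column, as in the sample layout.

```shell
# Hypothetical helper: reads `iostat -xz 1` output on stdin; $1 is a %util
# threshold. Prints "<device> <%util>" for rows at or above the threshold.
# Assumes %util is the last column, which holds for the sample layout.
busy_disks() {
    awk -v limit="$1" '
        $1 ~ /^[0-9.]+$/   { next }   # numeric first field: the avg-cpu values row
        $NF !~ /^[0-9.]+$/ { next }   # non-numeric last field: header, banner, blank
        $NF + 0 >= limit + 0 { print $1, $NF }'
}

# Usage on a live host:
#   iostat -xz 1 5 | busy_disks 60
```

With the sample row above (xvda at 0.09%), the function prints nothing, which is the expected result for an idle disk.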

7. free -m

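Shows memory usage in megabytes. In older versions of free, the -/+ buffers/cache line gives a more accurate view of memory actually used by applications, because Linux uses otherwise idle RAM for buffers and page cache and releases it when applications need it. Newer versions drop that line and report an available column instead.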

$ free -m
              total   used   free   shared  buffers  cached
Mem:          245998  24545 221453     83      59    541
-/+ buffers/cache: 23944 222053
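
The arithmetic behind the -/+ buffers/cache line can be reproduced from the Mem: row. This sketch only applies to the older layout with separate buffers and cached columns; `app_memory_mb` is a hypothetical name.

```shell
# Hypothetical helper: reads old-style `free -m` output on stdin (the layout
# with separate "buffers" and "cached" columns) and prints two numbers:
#   used by applications = used - buffers - cached
#   available to apps    = free + buffers + cached
app_memory_mb() {
    awk '$1 == "Mem:" { print $3 - $6 - $7, $4 + $6 + $7 }'
}

# Usage on a host with the older free layout:
#   free -m | app_memory_mb
```

For the sample above this gives 24545 - 59 - 541 = 23945 and 221453 + 59 + 541 = 222053. The first figure is 1 MB away from the 23944 that free printed, most likely because free rounds each value to megabytes separately.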

8. sar -n DEV 1

Monitors network interface throughput. Columns rxkB/s and txkB/s indicate inbound/outbound traffic; %ifutil shows interface utilization.

$ sar -n DEV 1
12:16:49 AM IFACE   rxpck/s txpck/s  rxkB/s  txkB/s %ifutil
12:16:49 AM eth0    18763   5032   20686.42  478.30   0.00
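
Interface limits are usually quoted in bits per second, so it helps to convert the kB/s columns. The sketch below does that per interface; `iface_mbit` is a hypothetical name, and the conversion assumes 1 kB = 1000 bytes, which makes the result approximate if your sysstat build counts 1024-byte kilobytes.

```shell
# Hypothetical helper: reads `sar -n DEV 1` output on stdin and prints
# "<iface> <rx Mbit/s> <tx Mbit/s>". Columns are located from the header row.
# Conversion assumes 1 kB = 1000 bytes, so Mbit/s = kB/s * 8 / 1000; treat the
# result as approximate if your sysstat build uses 1024-byte kilobytes.
iface_mbit() {
    awk '
        /IFACE/ {                         # header row: find the columns we need
            for (i = 1; i <= NF; i++) {
                if ($i == "IFACE")  ifc = i
                if ($i == "rxkB/s") rx  = i
                if ($i == "txkB/s") tx  = i
            }
            next
        }
        ifc && rx && tx && $rx ~ /^[0-9.]+$/ {
            printf "%s %.2f %.2f\n", $ifc, $rx * 8 / 1000, $tx * 8 / 1000
        }'
}

# Usage on a live host:
#   sar -n DEV 1 5 | iface_mbit
```

For the eth0 row above, 20686.42 kB/s inbound works out to roughly 165 Mbit/s, comfortably below a 1 Gbit/s link.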

9. sar -n TCP,ETCP 1

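Provides TCP-level statistics: active/s (locally initiated connections per second), passive/s (remotely initiated connections per second), and, in the ETCP section that the abbreviated sample below omits, retrans/s (retransmissions per second). Active and passive rates give a rough measure of outbound and inbound load; retransmissions point to network problems or to an overloaded server dropping packets.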

$ sar -n TCP,ETCP 1
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:20 AM   1.00    0.00   10233   18846

10. top

Combines many of the above metrics in a single dynamic view: load average, CPU breakdown, memory and swap usage, and per-process CPU and memory consumption. It is useful for a quick sanity check, but because each refresh overwrites the previous screen, transient spikes and trends are easier to miss than in the rolling output of vmstat or pidstat.

$ top
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 2.7 id, 0.1 wa
KiB Mem: 25190241+ total, 24921688 used, 22698073+ free

The checklist is a quick pass through the USE method: for each resource, check utilization, saturation, and errors. uptime and top give the overall picture of load; vmstat, mpstat, pidstat, iostat, free, and sar show utilization and saturation for CPU, memory, disk, and network; dmesg and the TCP retransmission counters surface errors. By running these commands within the first minute, engineers can quickly narrow the investigation scope and decide which deeper diagnostic steps are required.
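
For repeatable use, the ten commands can be wrapped in a small script. This is a sketch rather than anything Netflix publishes: the function names are hypothetical, each streaming tool is given an arbitrary count of 5 samples so that it exits on its own, and top runs once in batch mode.

```shell
# Hypothetical wrapper around the ten commands. Each streaming tool is given a
# sample count (5 here, an arbitrary choice) so it exits on its own, and top
# runs once in batch mode.
checklist() {
    cat <<'EOF'
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -n 20
EOF
}

# Runs every command in order, printing a banner before each one.
# stdin is redirected so no command can swallow the rest of the list.
run_checklist() {
    checklist | while IFS= read -r cmd; do
        printf '\n===== %s =====\n' "$cmd"
        sh -c "$cmd" </dev/null 2>&1
    done
}

# Usage on a live host (mpstat, pidstat, iostat, and sar need sysstat):
#   run_checklist > "perf-$(date +%Y%m%d-%H%M%S).txt"
```

Saving the output to a timestamped file keeps a record of the first minute that can be compared against later runs.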
