How to Diagnose and Fix Slow LLM Inference: A Full‑Stack Performance Guide
This article presents a comprehensive, step‑by‑step methodology for troubleshooting and optimizing large‑language‑model inference performance, covering GPU, CPU, memory, network, configuration, and application layers, with concrete benchmark scripts, diagnostic commands, and real‑world case studies.
Overview
The guide explains why inference latency can degrade even when GPUs appear idle, emphasizing a full‑stack diagnostic flow that starts from establishing a performance baseline and then drills down through six layers: GPU, CPU, memory, network, configuration, and application.
Baseline Methodology
Before any investigation, run a reproducible benchmark (vLLM benchmark_baseline.py) to capture key metrics such as TTFT P50/P99, TPS, GPU utilization, memory usage, and KV‑Cache occupancy. The baseline for a Qwen3.5‑35B‑A3B‑FP8 model on two H100 80GB cards is documented in a JSON table and serves as the reference for all later comparisons.
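The measurement core of such a benchmark can be sketched as follows. This is a minimal illustration, not the actual benchmark_baseline.py: the real script also records GPU utilization and KV-Cache occupancy, and the names run_benchmark and send_request are assumptions for this sketch.

```python
import asyncio
import math
import time

def percentile(values, q):
    """Nearest-rank percentile: smallest value covering q% of the sample."""
    xs = sorted(values)
    k = max(1, math.ceil(q / 100 * len(xs)))
    return xs[k - 1]

def summarize(ttfts, total_gen_tokens, wall_seconds):
    """Reduce raw measurements to the baseline metrics used in the guide."""
    return {
        "ttft_p50_s": percentile(ttfts, 50),
        "ttft_p99_s": percentile(ttfts, 99),
        "tps": total_gen_tokens / wall_seconds,
    }

async def run_benchmark(send_request, n_requests=32, concurrency=8):
    """send_request() -> (ttft_seconds, generated_tokens) is a user-supplied hook."""
    sem = asyncio.Semaphore(concurrency)
    ttfts, tokens = [], 0

    async def one():
        nonlocal tokens
        async with sem:
            ttft, n = await send_request()
            ttfts.append(ttft)
            tokens += n

    t0 = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(n_requests)))
    return summarize(ttfts, tokens, time.perf_counter() - t0)
```

Recording the JSON output of summarize() at a known load gives the reference point that every later comparison is made against.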
Step‑by‑Step Investigation
1. GPU Layer
Use nvidia-smi and dcgmi to check SM utilization, memory bandwidth, power draw, clock throttling, ECC errors, and PCIe link generation/width.
Key commands: nvidia-smi dmon -s pucvmet -d 1, nvidia-smi -q -d PERFORMANCE, nvidia-smi topo -m, dcgmi diag -r 1.
Detect thermal or power throttling (HW Thermal Slowdown, HW Power Brake) and PCIe downgrades (e.g., Gen3 x8 instead of Gen5 x16).
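The PCIe and throttle checks above are easy to automate. The sketch below parses nvidia-smi query output; note that the query field names (e.g. clocks_throttle_reasons.hw_thermal_slowdown) vary across driver generations, so treat the QUERY list as an assumption to verify against nvidia-smi --help-query-gpu on your system.

```python
import subprocess

QUERY = ",".join([
    "index",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.hw_power_brake_slowdown",
])

def parse_gpu_health(csv_text, min_gen=5, min_width=16):
    """Flag PCIe downgrades and active HW throttling in nvidia-smi CSV output."""
    problems = []
    for line in csv_text.strip().splitlines():
        idx, gen, width, thermal, power = [f.strip() for f in line.split(",")]
        if int(gen) < min_gen or int(width) < min_width:
            problems.append(
                f"GPU {idx}: PCIe Gen{gen} x{width} (expected Gen{min_gen} x{min_width})"
            )
        if thermal == "Active":
            problems.append(f"GPU {idx}: HW Thermal Slowdown active")
        if power == "Active":
            problems.append(f"GPU {idx}: HW Power Brake active")
    return problems

def check_gpus():
    """Run nvidia-smi and return a list of human-readable problem strings."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_health(out)
```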
2. CPU Layer
Profile tokenizer and preprocessing hot spots with sudo perf top -p $(pgrep -f vllm.entrypoints).
Check NUMA affinity with numactl --hardware and nvidia-smi topo -m, then bind the vLLM process to the NUMA node local to its GPUs.
Monitor per-core CPU usage with mpstat -P ALL 1 5 to spot single-thread bottlenecks caused by the GIL or heavy preprocessing.
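The GIL effect behind the single-core symptom is easy to demonstrate: CPU-bound Python work does not parallelize in a thread pool, which is why mpstat often shows one saturated core while the rest idle. The snippet below uses a hypothetical fake_tokenize stand-in for real preprocessing; real tokenizers (e.g. Rust-backed ones) may release the GIL and behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_tokenize(text):
    """Stand-in for CPU-bound preprocessing (hypothetical; purely illustrative)."""
    return sum(len(w) for w in text.split() * 2000)

def compare(batch, workers=4):
    """Time sequential vs. threaded execution of the same CPU-bound work."""
    t0 = time.perf_counter()
    seq = [fake_tokenize(t) for t in batch]
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        thr = list(pool.map(fake_tokenize, batch))
    t_thr = time.perf_counter() - t0
    return seq, thr, t_seq, t_thr
```

Under the GIL the threaded run is rarely much faster than the sequential one for pure-Python work, which is the signature to look for before reaching for --tokenizer-pool-size or process-level parallelism.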
3. Memory Layer
Verify KV-Cache configuration: --gpu-memory-utilization, --max-model-len, and --max-num-seqs. Mis-configuration causes either OOM errors or excessive preemption.
Inspect KV-Cache usage via curl http://localhost:8000/metrics | grep vllm_gpu_cache_usage_perc. Frequent “Preempted … sequences” warnings indicate cache pressure and fragmentation.
Ensure system memory bandwidth is adequate (e.g., 8-channel DDR5-4800 ≈ 300 GB/s) using mbw -n 5 256, and disable swap (sudo swapoff -a).
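For intuition on the KV-Cache budget, the per-token cost follows directly from the model shape: two tensors (K and V) per layer, one head_dim vector per KV head, at the cache dtype's width. The helper below is a back-of-the-envelope sketch; the layer and head counts in the test are illustrative, not the actual Qwen3.5 configuration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV-Cache consumed per token: K and V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(gpu_mem_bytes, gpu_memory_utilization, weight_bytes, per_token):
    """Tokens that fit in the budget vLLM carves out of --gpu-memory-utilization
    after model weights are loaded (activation overhead ignored for simplicity)."""
    budget = gpu_mem_bytes * gpu_memory_utilization - weight_bytes
    return int(budget // per_token)
```

Dividing max_cached_tokens by --max-model-len gives a rough ceiling on concurrent full-length sequences; if --max-num-seqs is set far above that, frequent preemption is the expected outcome.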
4. Network Layer
Measure model-load throughput (local SSD vs. NFS) with dd if=… of=/dev/null bs=1M count=1024.
Check inter-GPU communication bandwidth with NCCL tests (nccl-tests/all_reduce_perf) and PCIe throughput with bandwidthTest.
Validate client-to-service latency with hping3 and curl -w "tcp=%{time_connect}s, ttfb=%{time_starttransfer}s" http://host:8000/health.
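When running that curl probe on a schedule, splitting TCP connect time from time-to-first-byte separates network latency from server-side latency. A small parser for the -w format shown above (the format string is the one from this guide; adapt the regex if you change it):

```python
import re

CURL_W = re.compile(r"tcp=(?P<tcp>[\d.]+)s, ttfb=(?P<ttfb>[\d.]+)s")

def parse_curl_timings(line):
    """Parse 'tcp=0.002s, ttfb=0.150s' into connect, TTFB, and server-side time."""
    m = CURL_W.fullmatch(line.strip())
    if not m:
        raise ValueError(f"unexpected curl -w output: {line!r}")
    tcp, ttfb = float(m["tcp"]), float(m["ttfb"])
    # server_s is the time spent after the TCP handshake completed
    return {"tcp_s": tcp, "ttfb_s": ttfb, "server_s": ttfb - tcp}
```

A rising server_s with flat tcp_s points at the inference service; the reverse points at the network path.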
5. Configuration Layer
Common pitfalls: an overly large --max-num-seqs, a mismatched --tensor-parallel-size, wrong quantization flags, or missing chunked prefill.
Recommended defaults for the example model: --gpu-memory-utilization=0.92, --max-model-len=16384, --max-num-seqs=64, enable --enable-chunked-prefill, and set --tokenizer-pool-size if tokenization is a bottleneck.
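Put together, these defaults correspond to a launch command along the following lines. Exact flag names depend on the vLLM version in use, and the model path is illustrative:

```shell
# Assumed launch command for the two-GPU example deployment;
# verify flag names against `vllm serve --help` for your version.
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --enable-chunked-prefill
```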
6. Application Layer
Analyze request patterns (prompt length distribution, max_tokens) from logs; sudden spikes in long prompts can inflate TTFT.
Implement rate‑limiting, request pre‑warming, or speculative decoding to mitigate burst traffic.
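A minimal token-bucket limiter of the kind suggested here can be sketched in a few lines. This is an in-process illustration only; a production deployment would typically enforce limits at the gateway (e.g. in Nginx) rather than in application code.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: smooths burst traffic to a sustained rate."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s        # tokens replenished per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self, now=None):
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic() if now is None else now
        # Replenish proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Requests rejected by allow() can be queued or answered with HTTP 429, keeping the batch scheduler inside vLLM at a load level where TTFT stays predictable.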
Practical Scripts
Several ready-to-run Bash and Python scripts are provided:
gpu_baseline_check.sh – quick health check of GPUs, clocks, PCIe, ECC, and topology.
benchmark_baseline.py – Python asyncio benchmark that records TTFT, TPS, and latency percentiles, and writes results to JSON.
diagnose_inference.sh – one-click collector that gathers GPU, CPU, memory, network, and vLLM metrics plus recent logs into a single report.
Case Studies
Case 1: PCIe Bottleneck
GPU 0 was operating at PCIe Gen3 x8, causing TTFT P99 to rise from 1.2 s to 3.5 s despite 80 % SM utilization. Re‑seating the card restored Gen5 x16, bringing latency back to baseline.
Case 2: KV‑Cache Fragmentation
After five days of continuous load, TPS dropped from 610 tokens/s to 420 tokens/s while TTFT stayed stable. Logs showed repeated “Preempted … sequences” warnings. Restarting vLLM cleared fragmentation; a weekly restart or enabling swap‑mode prevented recurrence.
Best Practices & Recommendations
Prefer FP8 on Hopper GPUs for minimal accuracy loss and highest throughput; use AWQ or GPTQ on older GPUs.
Enable --enable-chunked-prefill for online services to reduce TTFT at the cost of a small TPS drop.
Consider speculative decoding with a small draft model for tasks with predictable token patterns.
Deploy multiple vLLM instances behind an Nginx least_conn upstream for load balancing and redundancy.
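A matching Nginx upstream might look like the fragment below; addresses, ports, and timeouts are placeholders to adapt.

```nginx
upstream vllm_backend {
    least_conn;                                   # route to the least-loaded instance
    server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backend;
        proxy_read_timeout 300s;   # streamed generations can be long-lived
        proxy_buffering off;       # do not buffer token streams (SSE)
    }
}
```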
Monitor key Prometheus metrics: vllm_time_to_first_token_seconds_bucket, vllm_generation_tokens_total, vllm_gpu_cache_usage_perc, vllm_preemption_count_total, and hardware metrics from DCGM.
Set dynamic alert rules comparing current values to a 7‑day offset to catch regressions early.
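One possible shape for such a week-over-week rule, assuming Prometheus scrapes the vLLM metrics listed above; the 1.5× threshold and 10-minute window are illustrative:

```yaml
groups:
  - name: vllm-regression
    rules:
      - alert: TTFTP99Regression
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[10m])))
            > 1.5 *
          histogram_quantile(0.99, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[10m] offset 7d)))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "TTFT P99 is 50% above the same window one week ago"
```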
Capacity Planning
Use the empirical sizing rule
required_instances = ceil(target_QPS × (1 + headroom) / per_instance_QPS); Little's law (concurrency = QPS × avg_E2E_latency) then bounds the in-flight requests each instance must hold, which informs max-num-seqs. For the example model, a single TP=2 instance (two H100s) handles ~14 QPS at P99 < 3 s. Covering 100 QPS therefore takes 8 instances (16 GPUs) with ~12 % slack; a full 25 % headroom rounds up to 9 instances. Alternatively, use TP=4 instances for fewer but larger servers.
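The sizing rule translates directly into code. The function names and the 25 % default headroom are choices made for this sketch, not part of the guide's scripts:

```python
import math

def required_instances(target_qps, per_instance_qps, headroom=0.25):
    """Instances needed to serve target_qps with a fractional headroom on top."""
    return math.ceil(target_qps * (1 + headroom) / per_instance_qps)

def required_gpus(target_qps, per_instance_qps, gpus_per_instance, headroom=0.25):
    """Total GPUs: instance count times the tensor-parallel degree per instance."""
    return required_instances(target_qps, per_instance_qps, headroom) * gpus_per_instance
```

With the example numbers (100 QPS target, ~14 QPS per TP=2 instance), zero headroom rounds up to 8 instances, while a strict 25 % buffer rounds up to 9; ceiling effects like this are worth checking explicitly rather than eyeballing.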
Summary
The guide demonstrates that inference slowdown is rarely a single‑component issue; a systematic, layered approach—starting from a solid baseline, inspecting hardware health, profiling CPU work, validating memory configuration, checking network paths, and finally reviewing application patterns—enables rapid identification and remediation of performance regressions.