How to Diagnose and Fix Slow LLM Inference: A Full‑Stack Performance Guide

This article presents a comprehensive, step‑by‑step methodology for troubleshooting and optimizing large‑language‑model inference performance, covering GPU, CPU, memory, network, configuration, and application layers, with concrete benchmark scripts, diagnostic commands, and real‑world case studies.

Overview

The guide explains why inference latency can degrade even when GPUs appear idle, emphasizing a full‑stack diagnostic flow that starts from establishing a performance baseline and then drills down through six layers: GPU, CPU, memory, network, configuration, and application.

Baseline Methodology

Before any investigation, run a reproducible benchmark (the article's benchmark_baseline.py, pointed at the vLLM endpoint) to capture key metrics: TTFT P50/P99, TPS, GPU utilization, memory usage, and KV-Cache occupancy. The baseline for a Qwen3.5-35B-A3B-FP8 model on two H100 80GB cards is recorded as a JSON table and serves as the reference for all later comparisons.
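
A baseline run can be captured along these lines (a minimal sketch: the flags on benchmark_baseline.py are assumptions about the article's script, and port 8000 is the vLLM default):

# Record GPU utilization alongside the benchmark so the baseline captures both sides.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > gpu_during_baseline.csv &
SMI_PID=$!
# Run the article's benchmark script against the local vLLM endpoint
# (argument names here are placeholders; check the script's actual CLI).
python benchmark_baseline.py --url http://localhost:8000 --concurrency 16 --output baseline.json
kill "$SMI_PID"
# Snapshot KV-Cache occupancy at the end of the run.
curl -s http://localhost:8000/metrics | grep gpu_cache_usage_perc >> baseline_notes.txt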

Step‑by‑Step Investigation

1. GPU Layer

Use nvidia‑smi and dcgmi to check SM utilization, memory bandwidth, power draw, clock throttling, ECC errors, and PCIe link generation/width.

Key commands: nvidia-smi dmon -s pucvmet -d 1, nvidia-smi -q -d PERFORMANCE, nvidia-smi topo -m, dcgmi diag -r 1.

Detect thermal or power throttling (HW Thermal Slowdown, HW Power Brake) and PCIe downgrade (e.g., Gen3 x8 instead of Gen5 x16).
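
Taken together, a quick GPU-layer pass looks roughly like the following sketch (standard nvidia-smi and dcgmi options, run on the inference host):

# Live counters: power, utilization, clocks, throttle violations, memory, ECC, PCIe throughput.
nvidia-smi dmon -s pucvmet -d 1
# Throttle reasons: look for "HW Thermal Slowdown" or "HW Power Brake Slowdown" marked Active.
nvidia-smi -q -d PERFORMANCE
# Current PCIe link speed/width per GPU; a Gen3 x8 reading in a Gen5 x16 slot is a red flag.
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
# Topology and NUMA affinity between GPUs, NICs, and CPUs.
nvidia-smi topo -m
# Quick DCGM diagnostic (level 1).
dcgmi diag -r 1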

2. CPU Layer

Profile the tokenizer and preprocessing work with perf top, e.g. sudo perf top -p $(pgrep -f vllm.entrypoints).

Check NUMA affinity with numactl --hardware and nvidia-smi topo -m, then bind the vLLM process to the correct NUMA node.

Monitor CPU usage per core with mpstat -P ALL 1 5 to spot single‑thread bottlenecks caused by the GIL or heavy preprocessing.
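
A typical remediation is to pin the server to the GPUs' NUMA node, sketched below (node 0 is an assumption; read the real node off nvidia-smi topo -m first):

# Confirm which NUMA node the GPUs and their local CPUs belong to.
numactl --hardware
nvidia-smi topo -m          # the "NUMA Affinity" column shows each GPU's node
# Launch vLLM bound to that node's cores and memory (node 0 assumed here).
numactl --cpunodebind=0 --membind=0 \
  python -m vllm.entrypoints.openai.api_server \
    --model <model> --tensor-parallel-size 2 --port 8000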

3. Memory Layer

Verify KV-Cache configuration: gpu-memory-utilization, max-model-len, and max-num-seqs. Misconfigurations cause either OOM or excessive preemption.

Inspect KV-Cache usage via curl http://localhost:8000/metrics | grep vllm_gpu_cache_usage_perc. Frequent “Preempted … sequences” warnings indicate cache fragmentation.

Ensure system memory bandwidth is adequate (e.g., DDR5-4800 ≈ 300 GB/s) using mbw -n 5 256, and disable swap (sudo swapoff -a).
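
A short memory-layer check might look like this (port 8000 and the journald unit name vllm are assumptions about the deployment):

# KV-Cache occupancy: values pinned near 1.0 mean the cache is saturated and preemption is likely.
curl -s http://localhost:8000/metrics | grep gpu_cache_usage_perc
# Count recent preemption warnings in the service logs.
journalctl -u vllm --since "1 hour ago" --no-pager | grep -c "Preempted"
# Host memory bandwidth and swap status.
mbw -n 5 256
swapon --show               # list active swap devices
sudo swapoff -a             # disable swap so inference memory is never paged out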

4. Network Layer

Measure model‑load latency (local SSD vs NFS) with dd if=… of=/dev/null bs=1M count=1024.

Check inter-GPU communication bandwidth with NCCL tests (nccl-tests/all_reduce_perf) and PCIe throughput with bandwidthTest.

Validate client-to-service latency using hping3 and curl -w "tcp=%{time_connect}s, ttfb=%{time_starttransfer}s" http://host:8000/health.
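
The three paths can be checked with something like the sketch below (the model shard path, the nccl-tests build location, and the hostname are placeholders):

# Storage: sequential read speed of a model shard (compare local SSD vs. NFS mounts).
dd if=/path/to/model/shard-00001.safetensors of=/dev/null bs=1M count=1024
# Inter-GPU: all-reduce bandwidth across the two GPUs used by TP=2.
./nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 2
# Client path: TCP connect time vs. time-to-first-byte against the health endpoint.
curl -s -o /dev/null -w "tcp=%{time_connect}s ttfb=%{time_starttransfer}s\n" http://host:8000/health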

5. Configuration Layer

Common pitfalls: overly large max_num_seqs, a mismatched tensor-parallel-size, wrong quantization flags, or missing chunked prefill.

Recommended defaults for the example model: gpu-memory-utilization=0.92, max-model-len=16384, max-num-seqs=64, pass --enable-chunked-prefill, and set --tokenizer-pool-size if tokenization becomes a bottleneck.
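
Applied to the example deployment, the launch command would look roughly like this (the model path is a placeholder; flag names follow vLLM's OpenAI-compatible server CLI):

python -m vllm.entrypoints.openai.api_server \
  --model <path-or-hf-id-of-the-FP8-model> \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --enable-chunked-prefill \
  --port 8000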

6. Application Layer

Analyze request patterns (prompt length distribution, max_tokens) from logs; sudden spikes in long prompts can inflate TTFT.

Implement rate‑limiting, request pre‑warming, or speculative decoding to mitigate burst traffic.
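
As a rough sketch, prompt lengths can be pulled out of recent request logs like this (the log path and the prompt_tokens field are assumptions; adapt the extraction to your actual log format):

# Extract prompt token counts, then print rough p50/p99/max to spot a shift toward long prompts.
grep -o 'prompt_tokens=[0-9]*' /var/log/vllm/access.log \
  | cut -d= -f2 \
  | sort -n \
  | awk '{v[NR]=$1} END {if (NR) print "p50:", v[int(NR*0.50)+1], "p99:", v[int(NR*0.99)+1], "max:", v[NR]}'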

Practical Scripts

Several ready-to-run Bash and Python scripts are provided:

gpu_baseline_check.sh – quick health check of GPUs, clocks, PCIe, ECC, and topology.

benchmark_baseline.py – Python asyncio benchmark that records TTFT, TPS, latency percentiles, and writes results to JSON.

diagnose_inference.sh – one-click collector that gathers GPU, CPU, memory, network, and vLLM metrics plus recent logs into a single report.
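
The collector follows a simple pattern; a minimal sketch of the idea (not the article's actual script; the systemd unit name and port 8000 are assumptions) is:

#!/usr/bin/env bash
# Collect GPU, CPU, memory, network, and vLLM state into one timestamped report.
REPORT="inference_report_$(date +%Y%m%d_%H%M%S).txt"
{
  echo "== GPU ==";          nvidia-smi -q -d PERFORMANCE,MEMORY; nvidia-smi topo -m
  echo "== CPU ==";          mpstat -P ALL 1 3
  echo "== Memory ==";       free -h; numactl --hardware
  echo "== Network ==";      ss -s
  echo "== vLLM metrics =="; curl -s http://localhost:8000/metrics | grep -E 'cache_usage|time_to_first_token'
  echo "== Recent logs ==";  journalctl -u vllm --since "30 min ago" --no-pager | tail -n 200
} > "$REPORT" 2>&1
echo "Report written to $REPORT"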

Case Studies

Case 1: PCIe Bottleneck

GPU 0 was operating at PCIe Gen3 x8, causing TTFT P99 to rise from 1.2 s to 3.5 s despite 80 % SM utilization. Re‑seating the card restored Gen5 x16, bringing latency back to baseline.

Case 2: KV‑Cache Fragmentation

After five days of continuous load, TPS dropped from 610 tokens/s to 420 tokens/s while TTFT stayed stable. Logs showed repeated “Preempted … sequences” warnings. Restarting vLLM cleared fragmentation; a weekly restart or enabling swap‑mode prevented recurrence.

Best Practices & Recommendations

Prefer FP8 on Hopper GPUs for minimal accuracy loss and highest throughput; use AWQ or GPTQ on older GPUs.

Enable --enable-chunked-prefill for online services to reduce TTFT at the cost of a small TPS drop.

Consider speculative decoding with a small draft model for tasks with predictable token patterns.

Deploy multiple vLLM instances behind an Nginx least_conn upstream for load balancing and redundancy.
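
A minimal least_conn drop-in could be sketched as follows (backend addresses and the listen port are placeholders, written here as a conf.d file created from the shell):

cat > /etc/nginx/conf.d/vllm_upstream.conf <<'EOF'
upstream vllm_backends {
    least_conn;                    # send each request to the instance with the fewest active connections
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}
server {
    listen 8080;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;   # long generations should not hit the default 60 s read timeout
    }
}
EOF
nginx -t && nginx -s reload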

Monitor key Prometheus metrics: vllm_time_to_first_token_seconds_bucket, vllm_generation_tokens_total, vllm_gpu_cache_usage_perc, vllm_preemption_count_total, and hardware metrics from DCGM.

Set dynamic alert rules comparing current values to a 7‑day offset to catch regressions early.
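
One way to express that comparison is an ad-hoc query against the Prometheus HTTP API (the Prometheus address and the 1.5x threshold are assumptions; the metric name follows the list above):

# Alert-style check: is P99 TTFT more than 1.5x what it was at the same time last week?
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))
         / histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m] offset 7d))
         > 1.5'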

Capacity Planning

Use the empirical formula required_GPUs = (target_QPS × avg_E2E_latency) / per_GPU_QPS. For the example model, a single TP=2 instance (two H100s) handles ~14 QPS at P99 < 3 s. To sustain 100 QPS, allocate 8 instances (16 GPUs) with 25% headroom, or use TP=4 instances for fewer but larger servers.
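
Reproducing the worked example with shell arithmetic (the 14 QPS per-instance figure is the article's measured baseline; rounding up to whole instances is the only assumption):

target_qps=100
per_instance_qps=14                                                        # one TP=2 instance (two H100s)
instances=$(( (target_qps + per_instance_qps - 1) / per_instance_qps ))    # ceiling division -> 8
gpus=$(( instances * 2 ))                                                  # TP=2 -> 16 GPUs
echo "instances=$instances gpus=$gpus"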

Summary

The guide demonstrates that inference slowdown is rarely a single‑component issue; a systematic, layered approach—starting from a solid baseline, inspecting hardware health, profiling CPU work, validating memory configuration, checking network paths, and finally reviewing application patterns—enables rapid identification and remediation of performance regressions.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.