How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.


Overview

When upgrading a large‑model inference service, our team found that eight H800 GPUs running Llama‑3‑70B with vLLM could only achieve 180 tokens/s, far below the 300 tokens/s SLA. After switching to SGLang, throughput jumped to 420 tokens/s and latency dropped by 40 %, prompting a deep dive into SGLang’s design.

Key Advantages of SGLang

RadixAttention: Uses a radix tree to share KV‑cache across requests with common prefixes, dramatically speeding up multi‑turn dialogue and few‑shot scenarios (a request sketch after this list illustrates the idea).

Aggressive continuous batch scheduling: Handles prefill and decode in a single forward pass instead of alternating between the two phases as vLLM's scheduler does.

FlashInfer backend: Deeply integrates the FlashInfer CUDA kernel library, delivering 15‑20% speed‑ups over FlashAttention‑2 on SM90 (H100/H800) architectures.
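
To see what RadixAttention buys you, consider a minimal sketch: two requests that share a long instruction prefix, sent to SGLang's native /generate endpoint (the server address assumes the deployment configured later in this guide). After the first request, the shared prefix's KV entries sit in the radix tree, so the second request only prefills its unique suffix.

# Two requests with an identical long prefix; RadixAttention serves the
# shared prefix of the second request from cache instead of recomputing it.
PREFIX="You are a senior Linux SRE. Answer concisely and cite exact commands. "
curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"${PREFIX}How do I check NVLink status?\", \"sampling_params\": {\"max_new_tokens\": 64}}"
curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"${PREFIX}How do I monitor GPU memory?\", \"sampling_params\": {\"max_new_tokens\": 64}}"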

Environment Requirements

Before installing, ensure the following minimum versions (recommended versions in parentheses):

CUDA >= 12.1 (12.4 recommended)
Python >= 3.9 (3.11 recommended)
PyTorch >= 2.1.0 (2.4.0 recommended)
NVIDIA driver >= 525.x (550.x recommended)
SGLang >= 0.2.0 (latest 0.3.x recommended)

Note that the H800’s 3.35 TB/s HBM3 bandwidth is fully utilized only with driver 550.x; using 525.x caps bandwidth at ~2.8 TB/s.
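
Before installing anything heavier, it is worth confirming each layer of the stack from the command line; the checks below assume only that the driver, CUDA toolkit, and Python environment are on the PATH, and that your SGLang build exposes __version__ as recent releases do.

# Driver, CUDA toolkit, PyTorch, and SGLang versions at a glance
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
nvcc --version | grep -i release
python -c "import torch; print('torch', torch.__version__, '| cuda', torch.version.cuda)"
python -c "import sglang; print('sglang', sglang.__version__)"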

Detailed Deployment Steps

1. Environment Preparation

We recommend using Docker in production, but a Conda environment works for testing:

# Create a dedicated Conda env
conda create -n sglang python=3.11 -y
conda activate sglang

# Install PyTorch for CUDA 12.4
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# Install SGLang (the [all] extra already pulls in FlashInfer)
pip install "sglang[all]"

# Reinstall FlashInfer only if you need to pin the wheel to your CUDA/PyTorch build
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

Be careful to match FlashInfer wheels with the exact CUDA and PyTorch versions; mismatches cause kernel‑load failures.

2. Verify GPU Status

# List GPUs
nvidia-smi -L

# Show NVLink topology (required for H800 multi‑GPU)
nvidia-smi topo -m

# Check GPU clock frequencies
nvidia-smi -q -d CLOCK

If topo -m shows PIX or PHB between GPU pairs instead of NVLink entries (NV8 on H800), peer traffic is falling back to PCIe, which throttles tensor‑parallel all‑reduces; fix the topology before going further.
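
Beyond the topology matrix, per‑link NVLink state and framework‑level peer access deserve a quick check; a short sketch, assuming PyTorch from the environment above:

# Per-GPU NVLink link status; each H800 should report its links as Active
nvidia-smi nvlink -s

# Confirm peer-to-peer access between a GPU pair from PyTorch
python -c "import torch; print('P2P 0<->1:', torch.cuda.can_device_access_peer(0, 1))"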

3. Launch SGLang Service

Basic launch command:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 8 \
    --port 30000 \
    --host 0.0.0.0

For production, use the following tuned parameters (flag names follow SGLang 0.3.x; 0.2.x releases used --enable-flashinfer in place of the --attention-backend/--sampling-backend pair):

python -m sglang.launch_server \
    --model-path /data/models/Meta-Llama-3-70B-Instruct \
    --tp 8 \
    --port 30000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.88 \
    --max-running-requests 256 \
    --max-total-tokens 131072 \
    --chunked-prefill-size 8192 \
    --schedule-policy lpm \
    --attention-backend flashinfer \
    --sampling-backend flashinfer \
    --enable-torch-compile \
    --dtype bfloat16 \
    --trust-remote-code

Explanation of critical flags:

--mem-fraction-static 0.88: Reserve 88% of GPU memory for the static pool, which holds both model weights and KV‑cache. Each H800 has 80 GB; Llama‑3‑70B in bfloat16 consumes ~17.5 GB of weights per card at TP=8, and the remainder of the pool goes to KV‑cache. Setting this too high causes OOM, too low wastes memory.

--max-running-requests 256: Upper bound on concurrently scheduled requests; a power of two is a convenient convention here rather than a hard CUDA requirement.

--chunked-prefill-size 8192: Splits long prompts into 8K‑token chunks so a single long prefill cannot monopolize a scheduling step, keeping first‑token latency low under concurrent load.

--schedule-policy lpm: Longest Prefix Match works best with RadixAttention.
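
Once the server is up, a quick smoke test confirms end‑to‑end generation. SGLang exposes an OpenAI‑compatible API on the same port alongside its native endpoints; the model name below simply mirrors the --model-path passed at launch.

# Minimal end-to-end smoke test via the OpenAI-compatible API
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3-70B-Instruct",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 32
        }'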

4. Production‑Ready Scripts

Example Bash script (sglang_server.sh) sets environment variables, launches the server, and logs output:

#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_DEBUG=WARN
export NCCL_NVLS_ENABLE=1
export NCCL_ALGO=Ring
export NCCL_BUFFSIZE=8388608
export SGLANG_FLASHINFER_NUM_STREAMS=4

MODEL_PATH="/data/models/Meta-Llama-3-70B-Instruct"
LOG_DIR="/var/log/sglang"
mkdir -p "$LOG_DIR"

python -m sglang.launch_server \
    --model-path "$MODEL_PATH" \
    --tp 8 \
    --port 30000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.88 \
    --max-running-requests 256 \
    --max-total-tokens 131072 \
    --chunked-prefill-size 8192 \
    --schedule-policy lpm \
    --schedule-conservativeness 0.8 \
    --attention-backend flashinfer \
    --sampling-backend flashinfer \
    --kv-cache-dtype fp8_e5m2 \
    --enable-mixed-chunk \
    --enable-torch-compile \
    --dtype bfloat16 \
    --trust-remote-code \
    --log-level info 2>&1 | tee -a "$LOG_DIR/sglang_$(date +%Y%m%d).log"

A Systemd unit file can manage the service, and an Nginx reverse‑proxy configuration exposes the HTTP API with long timeouts for streaming generation.
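
As a minimal sketch of the Systemd side (the script location /opt/sglang/sglang_server.sh is an assumption; point ExecStart at wherever the script above actually lives):

sudo tee /etc/systemd/system/sglang.service <<'EOF'
[Unit]
Description=SGLang inference server
After=network-online.target

[Service]
Type=simple
# Assumed install path for the launch script shown above
ExecStart=/bin/bash /opt/sglang/sglang_server.sh
Restart=always
RestartSec=10
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now sglang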

Troubleshooting & Monitoring

Common Issues

FlashInfer version mismatch: Reinstall the wheel that matches your CUDA and PyTorch versions.

# Check CUDA version
nvcc --version
# Check PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"
# Reinstall correct FlashInfer wheel
pip uninstall flashinfer -y
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

OOM: Reduce --mem-fraction-static (e.g., from 0.88 to 0.80) and watch memory usage with nvidia-smi dmon -s mu -d 1.

Low GPU utilization: Increase --max-running-requests (e.g., to 512) or check for CPU bottlenecks with top -H -p $(pgrep -f sglang).

NCCL timeout: Export a larger timeout (export NCCL_TIMEOUT=1800) and verify the network interfaces NCCL binds to.

Metrics & Alerts

SGLang exposes a Prometheus endpoint at /metrics. Example scrape config:

scrape_configs:
  - job_name: 'sglang'
    static_configs:
      - targets: ['localhost:30000']
    metrics_path: /metrics
    scrape_interval: 15s

Key metrics to watch:

sglang_num_running_requests: Should stay below 90% of max-running-requests.

sglang_token_throughput: Should be above 80% of the expected throughput.

sglang_avg_latency_ms: Must meet SLA requirements.

sglang_cache_hit_rate: Should stay above 0.5; low values indicate RadixAttention isn't effective.
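
Exact metric names vary between SGLang releases, so grep the endpoint once to see what your build actually exports before writing alert rules:

# List every SGLang metric the server currently exposes
curl -s http://localhost:30000/metrics | grep -i sglang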

Performance Comparison

Benchmarks were run on 8 × NVIDIA H800 (80 GB) with Llama‑3‑70B‑Instruct, 512‑token prompts, and 256‑token generations.

Framework    Config         Throughput (tokens/s)   P50 Latency (ms)   P99 Latency (ms)
---------------------------------------------------------------------------------------
vLLM 0.5.x   Default        185                     892                2340
vLLM 0.5.x   Tuned          278                     645                1820
SGLang       Default        312                     520                1450
SGLang       This article   425                     380                980
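
For anyone reproducing these numbers, SGLang ships a serving benchmark; the sketch below approximates the workload above, though the flag names follow recent sglang.bench_serving versions and may differ on older releases.

# Random-prompt benchmark approximating the 512-in / 256-out workload
python -m sglang.bench_serving --backend sglang \
    --host 127.0.0.1 --port 30000 \
    --dataset-name random --num-prompts 512 \
    --random-input-len 512 --random-output-len 256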

With RadixAttention enabled, few‑shot throughput rose from 245 tokens/s to 580 tokens/s and first‑token latency dropped from 850 ms to 180 ms (78 % reduction).

Conclusion

SGLang proves to be one of the fastest open‑source LLM inference frameworks on H800 GPUs. Its main performance drivers are:

RadixAttention – shared KV‑cache for common prefixes.

FlashInfer backend – hardware‑aware CUDA kernels.

FP8 KV‑cache – doubles effective context length with negligible accuracy loss.

Careful tuning – batch‑size and scheduling parameters matched to workload characteristics.

Future work includes exploring speculative decoding, multi‑node deployments, and LoRA hot‑loading.

Tags: LLM inference, GPU Optimization, SGLang, FlashInfer, H800, RadixAttention