Deploy Large Language Models with vLLM and Quantization for Low Latency
This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit and 4‑bit quantization with the bitsandbytes library (including NF4) to achieve faster inference on limited‑GPU hardware.
Introduction
Large language models (LLMs) such as ChatGPT, GPT‑4, Claude, Llama, Mistral, Falcon and Gemma are increasingly used across education, healthcare, art and business. Deploying these models can be challenging because they are computationally intensive and require powerful GPUs for real‑time inference.
Latency and Throughput
Model performance is typically measured by latency (time to generate a response) and throughput (tokens generated per second). Larger models need more GPU resources, and both metrics depend on model size, GPU hardware, and input length.
Latency: time required for the model to produce a response, measured in seconds or milliseconds.
Throughput: number of tokens generated per unit time, typically reported in tokens per second; for example, 100 tokens generated in 2.5 s corresponds to 40 tokens/s.
Required Packages
```
pip3 install transformers
pip3 install accelerate
```

What is Phi‑2?
Phi‑2 is a 2.7‑billion‑parameter foundation model from Microsoft, trained on diverse data sources ranging from code to textbooks.
Benchmarking with Hugging Face Transformers
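The article's original benchmark script is not reproduced here, so the following is a minimal sketch of the measurement; the model id (microsoft/phi-2), the prompt wording, and the 100‑token generation budget are assumptions.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: microsoft/phi-2 in float16 on a single GPU.
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write a Python function that sums a list of numbers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time a single generation call end to end.
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
latency = time.time() - start

# Throughput = newly generated tokens / latency (prompt tokens excluded).
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Latency: {latency} seconds")
print(f"Throughput: {new_tokens / latency} tokens/second")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```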
Running the benchmark produces output like this:

```
Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
```

```python
def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_list([1, 2, 3, 4, 5]))
```

The benchmark loads the Phi‑2 model, prompts it to generate Python code that sums a list of numbers, measures the wall‑clock response time (latency), and computes throughput by dividing the generated token count by that latency. Running on an A1000 (16 GB) GPU, the model achieves about 2.7 s latency and 32 tokens/s throughput.
Using vLLM for Faster Inference
vLLM is an open‑source LLM serving library that provides low latency and high throughput by introducing a memory‑efficient attention mechanism called PagedAttention. It is especially useful for large models on limited GPU resources.
Install vLLM
```
pip3 install vllm==0.3.3
```

Run Phi‑2 with vLLM
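As with the Transformers benchmark, the exact script isn't shown, so here is a minimal vLLM equivalent under the same assumed prompt; greedy decoding and the 100‑token budget are choices made for this sketch.

```python
import time
from vllm import LLM, SamplingParams

# Assumed setup: same model and prompt as the Transformers benchmark.
llm = LLM(model="microsoft/phi-2")
params = SamplingParams(temperature=0.0, max_tokens=100)

prompt = "Write a Python function that sums a list of numbers."

start = time.time()
outputs = llm.generate([prompt], params)
latency = time.time() - start

# Each RequestOutput holds the generated completions for one prompt.
generated = outputs[0].outputs[0]
print(f"Latency: {latency} seconds")
print(f"Throughput: {len(generated.token_ids) / latency} tokens/second")
print(generated.text)
```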
```
Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
```

```python
def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))
```

Using the same prompt on the same GPU, vLLM cuts latency by more than half to 1.2 s and nearly doubles throughput to 63 tokens/s, while producing an equivalent result.
Real‑time Benchmarking Considerations
In chat‑based systems, latency includes the time to generate the first token and the time per subsequent token. Factors such as input sequence length, expected output length, and model size all affect these metrics. Batching multiple user requests can improve throughput.
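To separate time‑to‑first‑token from the cost of subsequent tokens, you can stream the output and timestamp each chunk. Below is a minimal sketch using Transformers' TextIteratorStreamer; this helper is not part of the original article, and the prompt and token budget are assumptions.

```python
import threading
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Assumed setup: same Phi-2 model as the earlier benchmarks.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto")

inputs = tokenizer("Write a short poem about GPUs.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# generate() blocks, so run it in a thread and consume the stream here.
thread = threading.Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64),
)
start = time.time()
thread.start()

first_token_time = None
n_chunks = 0
for _ in streamer:  # each item is a decoded chunk of text
    if first_token_time is None:
        first_token_time = time.time() - start  # time to first token
    n_chunks += 1
total = time.time() - start
thread.join()

print(f"Time to first token: {first_token_time:.3f} s")
print(f"Avg time per chunk after the first: {(total - first_token_time) / max(n_chunks - 1, 1):.4f} s")
```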
Quantization for GPU Efficiency
Quantization reduces model weights to lower‑precision formats (8‑bit or 4‑bit), decreasing memory usage and enabling inference on smaller GPUs. Libraries such as bitsandbytes provide the quantization routines.
Install bitsandbytes
```
pip3 install bitsandbytes
```

Quantizing Mistral‑7B (7B parameters)
8‑bit quantization loads the model with load_in_8bit=True. 4‑bit quantization uses load_in_4bit=True and introduces additional parameters such as bnb_4bit_compute_dtype (bfloat16) for faster inference.
Set load_in_8bit or load_in_4bit accordingly.
For 4‑bit, specify bnb_4bit_quant_type='nf4' and enable double quantization with bnb_4bit_use_double_quant=True, as in the sketch below.
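A minimal sketch of both configurations, assuming the mistralai/Mistral-7B-v0.1 checkpoint from the Hugging Face Hub (the model id is an assumption, not named in the original):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

# 8-bit: a single flag is enough.
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: NF4 quantization, double quantization, bfloat16 compute.
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,  # or config_8bit
    device_map="auto",
)
```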
NF4 and Double Quantization
NF4 (Normal Float 4) is a 4‑bit data type optimized for normally distributed weights that, combined with double quantization, yields better accuracy than standard 4‑bit quantization. Double quantization applies a second quantization step not to the weights but to the quantization constants produced in the first pass, shrinking memory use further.
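Back‑of‑envelope arithmetic makes the savings concrete for a 7B‑parameter model; the roughly 0.4 bits per parameter saved by double quantization follows the QLoRA paper's estimate, and exact figures vary by block size.

```python
params = 7e9  # Mistral-7B parameter count (approximate)

fp16_gb = params * 2.0 / 1e9   # 16-bit weights: ~14 GB
int8_gb = params * 1.0 / 1e9   # 8-bit weights:  ~7 GB
nf4_gb = params * 0.5 / 1e9    # 4-bit weights:  ~3.5 GB

# Double quantization compresses the per-block quantization constants,
# saving roughly 0.4 bits per parameter on top of NF4 (QLoRA estimate).
nf4_dq_gb = nf4_gb - params * 0.4 / 8 / 1e9

print(f"fp16 ~{fp16_gb:.1f} GB | int8 ~{int8_gb:.1f} GB | "
      f"nf4 ~{nf4_gb:.1f} GB | nf4+dq ~{nf4_dq_gb:.2f} GB")
```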
Conclusion
This article provides a step‑by‑step method to measure LLM performance, explains how vLLM works, and demonstrates how quantization techniques enable large models to run efficiently on modest GPU hardware.