Deploy Large Language Models with vLLM and Quantization for Low Latency
This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit and 4‑bit quantization with the bitsandbytes library (including NF4) to achieve faster inference on limited‑GPU hardware.
Introduction
Large language models (LLMs) such as ChatGPT, GPT‑4, Claude, Llama, Mistral, Falcon and Gemma are increasingly used across education, healthcare, art and business. Deploying these models can be challenging because they are computationally intensive and require powerful GPUs for real‑time inference.
Latency and Throughput
Model performance is typically measured by latency (time to generate a response) and throughput (tokens generated per second). Larger models need more GPU resources, and both metrics depend on model size, GPU hardware, and input length.
Latency: time required for the model to produce a response, measured in seconds or milliseconds.
Throughput: number of tokens generated per unit time, typically reported in tokens per second; for example, 100 tokens generated in 2.5 s corresponds to 40 tokens/s.
Required Packages
```
pip3 install transformers
pip3 install accelerate
```

What is Phi‑2?
Phi‑2 is a 2.7‑billion‑parameter foundation model from Microsoft, trained on diverse data sources ranging from code to textbooks.
Benchmarking with Hugging Face Transformers
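The article's original benchmark script is not reproduced here, so the following is a minimal sketch of the measurement; the model id (microsoft/phi-2), the prompt wording, and the 100‑token generation budget are assumptions.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: microsoft/phi-2 in float16 on a single GPU.
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write a Python function that sums a list of numbers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time a single generation call end to end.
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
latency = time.time() - start

# Throughput = newly generated tokens / latency (prompt tokens excluded).
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Latency: {latency} seconds")
print(f"Throughput: {new_tokens / latency} tokens/second")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```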
Running the benchmark produces output like this:

```
Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
```

```python
def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_list([1, 2, 3, 4, 5]))
```

The benchmark loads the Phi‑2 model, prompts it to generate Python code that sums a list of numbers, measures the wall‑clock response time (latency), and computes throughput by dividing the generated token count by that latency. Running on an A1000 (16 GB) GPU, the model achieves about 2.7 s latency and 32 tokens/s throughput.
Using vLLM for Faster Inference
vLLM is an open‑source LLM serving library that provides low latency and high throughput by introducing a memory‑efficient attention mechanism called PagedAttention. It is especially useful for large models on limited GPU resources.
Install vLLM
```
pip3 install vllm==0.3.3
```

Run Phi‑2 with vLLM
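As with the Transformers benchmark, the exact script isn't shown, so here is a minimal vLLM equivalent under the same assumed prompt; greedy decoding and the 100‑token budget are choices made for this sketch.

```python
import time
from vllm import LLM, SamplingParams

# Assumed setup: same model and prompt as the Transformers benchmark.
llm = LLM(model="microsoft/phi-2")
params = SamplingParams(temperature=0.0, max_tokens=100)

prompt = "Write a Python function that sums a list of numbers."

start = time.time()
outputs = llm.generate([prompt], params)
latency = time.time() - start

# Each RequestOutput holds the generated completions for one prompt.
generated = outputs[0].outputs[0]
print(f"Latency: {latency} seconds")
print(f"Throughput: {len(generated.token_ids) / latency} tokens/second")
print(generated.text)
```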
```
Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
```

```python
def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))
```

Using the same prompt on the same GPU, vLLM cuts latency by more than half to 1.2 s and nearly doubles throughput to 63 tokens/s, while producing an equivalent result.
Real‑time Benchmarking Considerations
In chat‑based systems, latency includes the time to generate the first token and the time per subsequent token. Factors such as input sequence length, expected output length, and model size all affect these metrics. Batching multiple user requests can improve throughput.
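To separate time‑to‑first‑token from the cost of subsequent tokens, you can stream the output and timestamp each chunk. Below is a minimal sketch using Transformers' TextIteratorStreamer; this helper is not part of the original article, and the prompt and token budget are assumptions.

```python
import threading
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Assumed setup: same Phi-2 model as the earlier benchmarks.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto")

inputs = tokenizer("Write a short poem about GPUs.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# generate() blocks, so run it in a thread and consume the stream here.
thread = threading.Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64),
)
start = time.time()
thread.start()

first_token_time = None
n_chunks = 0
for _ in streamer:  # each item is a decoded chunk of text
    if first_token_time is None:
        first_token_time = time.time() - start  # time to first token
    n_chunks += 1
total = time.time() - start
thread.join()

print(f"Time to first token: {first_token_time:.3f} s")
print(f"Avg time per chunk after the first: {(total - first_token_time) / max(n_chunks - 1, 1):.4f} s")
```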
Quantization for GPU Efficiency
Quantization reduces model weights to lower‑precision formats (8‑bit or 4‑bit), decreasing memory usage and enabling inference on smaller GPUs. Libraries such as bitsandbytes provide the quantization routines.
Install bitsandbytes
```
pip3 install bitsandbytes
```

Quantizing Mistral‑7B (7B parameters)
8‑bit quantization loads the model with load_in_8bit=True. 4‑bit quantization uses load_in_4bit=True and introduces additional parameters such as bnb_4bit_compute_dtype (bfloat16) for faster inference.
Set load_in_8bit or load_in_4bit accordingly.
For 4‑bit, specify bnb_4bit_quant_type='nf4' and enable double quantization with bnb_4bit_use_double_quant=True, as in the sketch below.
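A minimal sketch of both configurations, assuming the mistralai/Mistral-7B-v0.1 checkpoint from the Hugging Face Hub (the model id is an assumption, not named in the original):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

# 8-bit: a single flag is enough.
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: NF4 quantization, double quantization, bfloat16 compute.
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,  # or config_8bit
    device_map="auto",
)
```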
NF4 and Double Quantization
NF4 (Normal Float 4) is a 4‑bit data type optimized for normally distributed weights that, combined with double quantization, yields better accuracy than standard 4‑bit quantization. Double quantization applies a second quantization step not to the weights but to the quantization constants produced in the first pass, shrinking memory use further.
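Back‑of‑envelope arithmetic makes the savings concrete for a 7B‑parameter model; the roughly 0.4 bits per parameter saved by double quantization follows the QLoRA paper's estimate, and exact figures vary by block size.

```python
params = 7e9  # Mistral-7B parameter count (approximate)

fp16_gb = params * 2.0 / 1e9   # 16-bit weights: ~14 GB
int8_gb = params * 1.0 / 1e9   # 8-bit weights:  ~7 GB
nf4_gb = params * 0.5 / 1e9    # 4-bit weights:  ~3.5 GB

# Double quantization compresses the per-block quantization constants,
# saving roughly 0.4 bits per parameter on top of NF4 (QLoRA estimate).
nf4_dq_gb = nf4_gb - params * 0.4 / 8 / 1e9

print(f"fp16 ~{fp16_gb:.1f} GB | int8 ~{int8_gb:.1f} GB | "
      f"nf4 ~{nf4_gb:.1f} GB | nf4+dq ~{nf4_dq_gb:.2f} GB")
```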
Conclusion
This article provides a step‑by‑step method to measure LLM performance, explains how vLLM works, and demonstrates how quantization techniques enable large models to run efficiently on modest GPU hardware.