Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies
This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.
Background
Enterprises are eager to embed powerful LLMs into products, but inference often demands prohibitive compute and memory resources. Faster inference reduces operational costs and improves user experience, making optimization a critical engineering challenge.
Key Acceleration Techniques
Use lower‑precision arithmetic (float16, bfloat16) to cut memory roughly in half and increase token throughput by ~20% (see the sketch after this list).
Apply 8‑bit or 4‑bit quantization to shrink the memory footprint to roughly half or a quarter of the 16‑bit baseline, at the cost of a possible drop in prediction quality.
Fine‑tune with adapters (LoRA, QLoRA) and then combine with quantization for better accuracy on specific data.
Employ tensor parallelism to spread large models across multiple GPUs.
Leverage inference‑focused libraries such as Text Generation Inference, DeepSpeed, or vLLM, which integrate tensor parallelism, quantization, and continuous batch scheduling.
Run preliminary tests to verify library stability and performance on target hardware.
Prepare representative test datasets for rapid evaluation of any optimization.
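For orientation, here is the reduced‑precision idea expressed with plain Hugging Face transformers (a minimal sketch; the benchmarks below use Lit‑GPT's CLI instead):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes a GPU with roughly 14 GB free for Falcon-7B in 16-bit
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,  # half the memory of float32
    device_map="auto",
)
inputs = tokenizer("I am so fast that I can", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))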
Model Used for Experiments
The open‑source Falcon model (7B and 70B variants) from the Technology Innovation Institute serves as the primary benchmark. Falcon’s architecture resembles GPT‑3 and LLaMA, featuring multi‑query attention for efficiency.
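Multi‑query attention shares a single key/value head across all query heads, which shrinks the KV cache that dominates generation memory. A toy module to illustrate the idea (simplified: causal masking omitted, and not Falcon's actual implementation):
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Many query heads, one shared key/value head (causal mask omitted)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * self.d_head)  # one K/V head total
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv(x).split(self.d_head, dim=-1)
        k, v = k.unsqueeze(1), v.unsqueeze(1)  # broadcast over query heads
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)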
Low‑Precision Inference
Switching from 32‑bit to 16‑bit precision reduces GPU memory consumption by half and speeds up token generation. Example with Lit‑GPT:
python generate/base.py \
--prompt "I am so fast that I can" \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--max_new_tokens 50 \
--precision "16-true"
# Time for inference: 1.19 sec total, 42.03 tokens/sec
# Memory used: 14.50 GB

Mixed precision ("16‑mixed") retains 32‑bit precision for numerically sensitive operations while using 16‑bit elsewhere, achieving comparable speed at the cost of higher memory usage.
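With the same Lit‑GPT script, mixed precision is selected by changing only the precision flag (a sketch; timings will depend on hardware):
python generate/base.py \
--prompt "I am so fast that I can" \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--max_new_tokens 50 \
--precision "16-mixed"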
Bfloat16 (Brain Float)
Bfloat16, originally designed for TPUs, is now supported on many NVIDIA GPUs. Verify support with:
python -c "import torch; print(torch.cuda.is_bf16_supported())"When available, inference with bfloat16 yields similar speed to float16 while preserving a wider dynamic range:
python generate/base.py \
--prompt "I am so fast that I can" \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--max_new_tokens 50 \
--precision "bf16-true"
# Time for inference: 1.18 sec total, 42.47 tokens/sec
# Memory used: 14.50 GB

Quantization
Two post‑training quantization approaches are discussed:
PTQ (Post‑Training Quantization): convert weights to 8‑bit or 4‑bit after training; low cost, but quality may drop.
QAT (Quantization‑Aware Training): incorporate quantization during fine‑tuning for better accuracy at higher computational expense.
Example using Lit‑LLaMA to quantize a LLaMA‑7B model to int8:
python generate.py \
--prompt "I am so fast that I can" \
--quantize llm.int8
# Time for inference: 2.01 sec total, 24.83 tokens/sec
# Memory used: 13.54 GB
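An equivalent post‑training quantization sketch with Hugging Face transformers and bitsandbytes (the model name and settings are illustrative, not from the benchmark above):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are converted to int8 at load time; no retraining required (PTQ)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)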
Adapter‑Based Fine‑Tuning
Adapters (e.g., LoRA, QLoRA) add lightweight trainable layers while freezing the original model weights. This reduces fine‑tuning memory and compute while preserving most of the base model's knowledge.
python finetune/adapter_v2.py \
--data_dir data/alpaca \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--out_dir out/adapter/alpaca
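Outside Lit‑GPT, a comparable LoRA setup with Hugging Face PEFT might look like this (a minimal sketch; the target module name is Falcon's fused attention projection, and the hyperparameters are illustrative):
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # Falcon's attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)  # base weights stay frozen
model.print_trainable_parameters()     # typically well under 1% trainable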
Pruning
Structured pruning methods such as LLM‑Pruner and Wanda remove less important weights or connections, shrinking the model with little or no retraining. Wanda scores each weight by combining its magnitude with the norm of its input activations and prunes per output row, achieving up to 50% reduction with minimal accuracy loss.
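Wanda's scoring rule is simple enough to sketch (an illustrative version, not the official implementation; acts stands in for calibration activations):
import torch

def wanda_prune_layer(weight: torch.Tensor, acts: torch.Tensor, sparsity: float = 0.5):
    """Zero the lowest-scoring weights within each output row.

    weight: (out_features, in_features); acts: (num_tokens, in_features).
    """
    act_norm = acts.norm(p=2, dim=0)    # per-input-channel activation norm
    score = weight.abs() * act_norm     # |W| scaled by activation norm
    k = int(weight.shape[1] * sparsity)
    idx = score.topk(k, dim=1, largest=False).indices
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)          # drop the k weakest weights per row
    return weight * mask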
Continuous Batch Inference
Batching multiple prompts together makes better use of GPU compute and memory bandwidth. Continuous (or iterative) batching inserts new prompts as soon as a sequence finishes, improving utilization compared to static batches; a toy scheduling loop follows the list below. Libraries supporting this include:
Text Generation Inference
vLLM (uses PagedAttention, offering 14‑24× higher throughput than standard HF Transformers)
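The scheduling idea behind continuous batching can be sketched in a few lines (a toy illustration; step_fn is a hypothetical function that decodes one token and reports completion, while real engines manage this at the KV‑cache level):
from collections import deque

def continuous_batching(step_fn, prompts, max_batch=8):
    """Refill free batch slots as soon as any sequence finishes."""
    waiting, active, finished = deque(prompts), [], []
    while waiting or active:
        # Admit new prompts immediately rather than waiting for the batch to drain
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        stepped = [step_fn(seq) for seq in active]  # one decode step per sequence
        active = [seq for seq, done in stepped if not done]
        finished.extend(seq for seq, done in stepped if done)
    return finished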
Multi‑GPU Deployment
Fully‑Sharded Data Parallel (FSDP) distributes model shards across GPUs, enabling inference of models that exceed a single card’s memory (e.g., Falcon‑40B on two A6000 GPUs). Example command:
python generate/base.py \
--checkpoint_dir checkpoints/tiiuae/falcon-40b \
--strategy fsdp \
--devices 2 \
--prompt "I am so fast that I can"
# Time for inference: 83.40 sec total, 0.60 tokens/sec
# Memory used: 46.10 GB

vLLM can also achieve multi‑GPU speedups by setting tensor_parallel_size=2:
from vllm import LLM, SamplingParams
prompts = ["I am so fast that I can", "The capital of France is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="huggyllama/llama-30b", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
# ~0.14 s for three prompts

Serving LLMs
For production deployment, the article recommends Docker‑based Text Generation Inference (TGI) with optional bitsandbytes quantization:
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.8 \
--model-id tiiuae/falcon-40b --num-shard 1 --quantize bitsandbytes

Alternative serving stacks include Accelerate (CPU offloading), DeepSpeed Inference, DeepSpeed‑MII (gRPC endpoint), OpenLLM, and Aviary.
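Once the container is running, it exposes a REST endpoint on the mapped port; a minimal Python client sketch (assuming the port mapping from the command above):
import requests

# The TGI container above maps port 80 in the container to 8080 on the host
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "I am so fast that I can",
        "parameters": {"max_new_tokens": 50, "temperature": 0.8},
    },
)
print(response.json()["generated_text"])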
Conclusion
LLM inference optimization is a rapidly evolving field with many promising techniques, yet not all methods guarantee speed gains without sacrificing quality. Practitioners should benchmark each approach on their specific hardware and workloads, balancing software optimizations with model architecture considerations to achieve efficient, reliable deployment.