Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batching, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

AI Waka

Background

Enterprises are eager to embed powerful LLMs into products, but inference often demands prohibitive compute and memory resources. Faster inference reduces operational costs and improves user experience, making optimization a critical engineering challenge.

Key Acceleration Techniques

Use lower‑precision arithmetic (float16, bfloat16) to halve memory use and increase token throughput by roughly 20%.

Apply 8‑bit or 4‑bit quantization to cut memory usage by a further 2–4×, at the cost of a possible drop in prediction quality.

Fine‑tune with adapters (LoRA, QLoRA), optionally combined with quantization, to improve accuracy on domain‑specific data.

Employ tensor parallelism to spread large models across multiple GPUs.

Leverage inference‑focused libraries such as Text Generation Inference, DeepSpeed, or vLLM, which integrate tensor parallelism, quantization, and continuous batch scheduling.

Run preliminary tests to verify library stability and performance on target hardware.

Prepare representative test datasets for rapid evaluation of any optimization.
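Such a preliminary test can be as simple as a small timing harness. The sketch below uses hypothetical names; `generate_fn` stands in for any real inference call that returns the number of tokens it produced:

```python
import time

def benchmark(generate_fn, prompts, warmup=1):
    """Time a generation callable and report aggregate tokens/sec.

    generate_fn: callable taking a prompt and returning the number of
    new tokens it produced (a stand-in for any real inference call).
    """
    for p in prompts[:warmup]:          # warm-up pass (compilation, caches)
        generate_fn(p)
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed       # tokens per second

# Usage with a dummy "model" that always emits 50 tokens:
tps = benchmark(lambda prompt: 50, ["I am so fast that I can"] * 10)
```

Running the same harness before and after each optimization, on the same representative prompts, makes the comparisons in the sections below reproducible.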

Model Used for Experiments

The open‑source Falcon model (7B and 70B variants) from the Technology Innovation Institute serves as the primary benchmark. Falcon’s architecture resembles GPT‑3 and LLaMA, featuring multi‑query attention for efficiency.

Low‑Precision Inference

Switching from 32‑bit to 16‑bit precision reduces GPU memory consumption by half and speeds up token generation. Example with Lit‑GPT:

python generate/base.py \
  --prompt "I am so fast that I can" \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b \
  --max_new_tokens 50 \
  --precision "16-true"
# Time for inference: 1.19 sec total, 42.03 tokens/sec
# Memory used: 14.50 GB
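The reported 14.50 GB is consistent with a back‑of‑envelope estimate: roughly 7 billion parameters at 2 bytes each in 16‑bit precision, with activations and the KV cache accounting for the remainder:

```python
params = 7e9          # approximate parameter count of Falcon-7B
bytes_per_param = 2   # float16 / bfloat16
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)     # 14.0 GB for the weights alone
```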

Mixed‑precision ("16‑mixed") keeps 32‑bit precision for numerically sensitive operations and uses 16‑bit elsewhere, achieving comparable speed but with higher peak memory usage than pure 16‑bit.
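Outside Lit‑GPT, PyTorch exposes the same idea through `torch.autocast`, which runs matrix multiplications in a low‑precision dtype while keeping numerically sensitive operations in float32. A minimal sketch (using the CPU backend with bfloat16 so it runs anywhere; on a GPU you would pass `device_type="cuda"`):

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a transformer block
x = torch.randn(8, 512)

# Inside the autocast region, matmul-heavy ops run in bfloat16 while
# precision-sensitive ops remain in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # low-precision output from the autocast region
```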

Bfloat16 (Brain Float)

Bfloat16, originally designed for TPUs, is now supported on many NVIDIA GPUs. Verify support with:

python -c "import torch; print(torch.cuda.is_bf16_supported())"

When available, inference with bfloat16 yields similar speed to float16 while preserving a wider dynamic range:

python generate/base.py \
  --prompt "I am so fast that I can" \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b \
  --max_new_tokens 50 \
  --precision "bf16-true"
# Time for inference: 1.18 sec total, 42.47 tokens/sec
# Memory used: 14.50 GB

Quantization

Two quantization approaches are discussed:

PTQ (Post‑Training Quantization): Convert weights to 8‑bit or 4‑bit after training; low cost but may reduce quality.

QAT (Quantization‑Aware Training): Incorporate quantization during fine‑tuning for better accuracy at higher computational expense.
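At its core, PTQ is a scale‑and‑round step. This toy symmetric int8 round trip in NumPy (illustrative only; the llm.int8 scheme additionally keeps outlier features in 16‑bit) shows where the quality loss comes from:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0        # map the largest weight to ±127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()  # small but nonzero rounding error
```

The rounding error is bounded by half a quantization step per weight; it is this error, accumulated over billions of weights, that causes the quality drop noted above.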

Example using Lit‑LLaMA to quantize a LLaMA‑7B model to int8:

python generate.py \
  --prompt "I am so fast that I can" \
  --quantize llm.int8
# Time for inference: 2.01 sec total, 24.83 tokens/sec
# Memory used: 13.54 GB

Adapter‑Based Fine‑Tuning

Adapters (e.g., LoRA, QLoRA) add lightweight trainable layers, freezing the original model weights. This reduces fine‑tuning memory and compute while preserving most of the base model’s knowledge.

python finetune/adapter_v2.py \
  --data_dir data/alpaca \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b \
  --out_dir out/adapter/alpaca
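The idea behind LoRA itself fits in a few lines of PyTorch: the pretrained weight stays frozen, and a trainable low‑rank update B·A is added on top. This is an illustrative sketch, not the Lit‑GPT or PEFT implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # W x + (B A) x * scaling; only A and B receive gradients.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

Because B is initialized to zero, the adapted layer starts out exactly equal to the base layer, and only the ~8K adapter parameters (versus ~262K in the base weight) are updated during fine‑tuning.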

Pruning

Pruning methods such as LLM‑Pruner (structured) and Wanda (unstructured) remove less important weights or connections, shrinking the model with little or no retraining. Wanda scores each weight by its magnitude multiplied by the norm of the corresponding input activation, pruning per output, and reaches up to 50% sparsity with minimal accuracy loss.
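Wanda's importance score is simply the weight's magnitude times its input activation norm. A NumPy sketch of 50% per‑output‑row pruning (illustrative; `act_norm` stands for activation norms that would be recorded on calibration data):

```python
import numpy as np

np.random.seed(0)

def wanda_prune(w, act_norm, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row.

    w:        (out_features, in_features) weight matrix
    act_norm: (in_features,) L2 norm of each input activation channel
    """
    score = np.abs(w) * act_norm                  # Wanda importance metric
    k = int(w.shape[1] * sparsity)                # weights to drop per row
    drop = np.argsort(score, axis=1)[:, :k]       # k lowest scores per row
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

w = np.random.randn(4, 8)
pruned = wanda_prune(w, act_norm=np.abs(np.random.randn(8)))
```

Weighting by activation norms is what distinguishes Wanda from plain magnitude pruning: a small weight that multiplies a consistently large activation is kept, while a large weight on a near‑dead input channel is dropped.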

Continuous Batch Inference

Batching multiple prompts together amortizes the cost of reading model weights from GPU memory across many requests, raising utilization. Continuous (or iterative) batching goes further: it inserts a new prompt as soon as any sequence in the batch finishes, rather than waiting for the whole batch. Libraries supporting this include:

Text Generation Inference

vLLM (uses PagedAttention, offering 14‑24× higher throughput than standard HF Transformers)
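The scheduling difference is easy to see in a toy simulator: static batching waits for the longest sequence in each batch, while continuous batching back‑fills a freed slot immediately. This is a pure‑Python sketch; real engines such as vLLM schedule per token over paged KV‑cache blocks:

```python
def static_batch_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its longest sequence ends."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    return sum(max(b) for b in batches)

def continuous_batch_steps(lengths, batch_size):
    """Total decode steps when a finished slot is refilled immediately."""
    queue = list(lengths)
    active = []          # remaining tokens for sequences currently in the batch
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:   # back-fill free slots
            active.append(queue.pop(0))
        steps += 1                                  # one decode step for the batch
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [10, 1, 1, 1]   # one long request and three short ones
print(static_batch_steps(lengths, 2), continuous_batch_steps(lengths, 2))  # 11 10
```

The gap widens as request lengths become more varied, which is exactly the regime where vLLM and TGI report their largest throughput gains.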

Multi‑GPU Deployment

Fully‑Sharded Data Parallel (FSDP) distributes model shards across GPUs, enabling inference of models that exceed a single card’s memory (e.g., Falcon‑40B on two A6000 GPUs). Example command:

python generate/base.py \
  --checkpoint_dir checkpoints/tiiuae/falcon-40b \
  --strategy fsdp \
  --devices 2 \
  --prompt "I am so fast that I can"
# Time for inference: 83.40 sec total, 0.60 tokens/sec
# Memory used: 46.10 GB

vLLM can also achieve multi‑GPU speedups by setting tensor_parallel_size=2:

from vllm import LLM, SamplingParams
prompts = ["I am so fast that I can", "The capital of France is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="huggyllama/llama-30b", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
# ~0.14 s for three prompts

Serving LLMs

For production deployment, the article recommends Docker‑based Text Generation Inference (TGI) with optional bitsandbytes quantization:

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.8 \
  --model-id tiiuae/falcon-40b --num-shard 1 --quantize bitsandbytes
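Once the container is up, clients talk to TGI's REST API; the main endpoint is `POST /generate` with an `inputs` string and a `parameters` object. A standard‑library sketch (the network call is left commented out because it needs the running server):

```python
import json
from urllib import request

# The container above exposes TGI's REST API on port 8080.
payload = {
    "inputs": "I am so fast that I can",
    "parameters": {"max_new_tokens": 50, "temperature": 0.8},
}
req = request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, this returns {"generated_text": "..."}:
# print(json.loads(request.urlopen(req).read())["generated_text"])
```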

Alternative serving stacks include Accelerate (CPU offloading), DeepSpeed Inference, DeepSpeed‑MII (GRPC endpoint), OpenLLM, and Aviary.

Conclusion

LLM inference optimization is a rapidly evolving field with many promising techniques, yet not all methods guarantee speed gains without sacrificing quality. Practitioners should benchmark each approach on their specific hardware and workloads, balancing software optimizations with model architecture considerations to achieve efficient, reliable deployment.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
