Boosting Large Model Inference: High‑Performance Optimization Techniques

This article explains the background, challenges, and high‑performance optimization methods for deploying large language and multimodal models, covering inference workflow analysis, distributed concurrency, latency reduction, quantization strategies, and service throughput improvements to achieve industry‑leading speed and memory efficiency.


1. Demand and Challenges of Large‑Model Inference

Large language models such as GPT‑3, ChatGPT, and Baidu's Wenxin Yiyan (ERNIE Bot), along with multimodal diffusion models, have attracted massive attention, but deploying them online faces three major difficulties: massive parameter counts (a 175 B‑parameter model needs ~350 GB just to store its weights in FP16, at 2 bytes per parameter), huge computational load, and consequently an inference cost high enough to limit user access.

2. High‑Performance Optimization for Generative Large Language Models

2.1 Inference Process Analysis

The inference pipeline consists of two stages: a Context (Encoder) stage that processes all input tokens in one pass and builds the KV cache (CacheKV), and a Generation (Decoder) stage that samples one token at a time in a while‑loop, reusing the cached keys and values instead of recomputing them.

During Context, activation shapes are large (e.g., [B, 1000, 12288] for a 175 B model), so that phase is compute‑intensive; Generation processes a single token per step, so its small activations make it light on compute but memory‑access‑intensive, dominated by reading the weights and the KV cache.
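A minimal sketch of this two‑phase loop, assuming a hypothetical `model` whose forward pass accepts an optional KV cache and returns logits plus the updated cache (the interface is illustrative, not any particular framework's API):

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens):
    # --- Context (Encoder) phase: one pass over all prompt tokens. ---
    # Activations here are large, e.g. [B, 1000, 12288] for a 175B model.
    logits, kv_cache = model(input_ids, kv_cache=None)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
    output = [next_token]

    # --- Generation (Decoder) phase: one token per step in a loop. ---
    # Each step feeds a single token ([B, 1]) and reuses the cached K/V,
    # so the step is bound by memory bandwidth rather than compute.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model(next_token, kv_cache=kv_cache)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        output.append(next_token)
    return torch.cat(output, dim=1)
```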

2.2 Latency Optimization Methods

Distributed parallelism (tensor and pipeline parallelism) enables multi‑GPU inference for models that cannot fit on a single card.
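As an illustration of the tensor‑parallel idea, here is a minimal sketch of a column‑parallel linear layer built on torch.distributed; it assumes a process group has already been initialized with one process per GPU, and the class name and setup are illustrative only:

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a column slice of the weight and computes its
    slice of the output; an all-gather reassembles the full activation."""

    def __init__(self, hidden, world_size):
        super().__init__()
        assert hidden % world_size == 0
        # This rank's shard: [hidden, hidden / world_size] of the full weight.
        self.shard = torch.nn.Linear(hidden, hidden // world_size, bias=False)
        self.world_size = world_size

    def forward(self, x):
        # Local partial result: [B, S, hidden / world_size].
        local_out = self.shard(x)
        # Gather every rank's slice and concatenate on the hidden dim,
        # reconstructing the full [B, S, hidden] activation on all ranks.
        gathered = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```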

Transformer‑fusion techniques (layer merging, attention‑FFN fusion, etc.) reduce kernel launches and intermediate memory traffic, bringing inference speed to a level comparable to NVIDIA FasterTransformer.
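One representative fusion, sketched below, merges the separate Q, K, and V projections of an attention layer into a single GEMM; FasterTransformer applies many fusions in this spirit, though its exact set differs, and this class is purely illustrative:

```python
import torch

class FusedQKV(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()
        # One [hidden, 3*hidden] weight replaces three separate [hidden, hidden]
        # projections, so one large matmul replaces three small kernel launches.
        self.qkv = torch.nn.Linear(hidden, 3 * hidden, bias=False)

    def forward(self, x):
        # Split the fused output back into Q, K, V along the last dimension.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v
```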

Advanced quantization methods further reduce latency and memory:

Dynamic INT8 quantization (LLM.int8()) keeps outlier channels in FP16 while converting the rest to INT8, suitable for compute‑intensive Context.

Weight‑Only quantization stores the weights in low‑precision formats while keeping activations in FP16, making it ideal for the memory‑intensive Generation stage (sketched below).

Post‑Training Quantization (PTQ) with SmoothQuant migrates quantization difficulty from the activations, which contain outliers, to the smoother weights, so both can be quantized to INT8 with higher accuracy.
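A minimal reference sketch of Weight‑Only INT8 quantization with per‑output‑channel scales: weights are stored in INT8, halving memory versus FP16, and dequantized at matmul time. Production kernels fuse the dequantization into the GEMM; this version keeps it explicit for clarity, and the function names are illustrative:

```python
import torch

def quantize_weight(w_fp16):
    # Per-channel absmax scale maps each output row of W into [-127, 127].
    scale = w_fp16.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.round(w_fp16 / scale).to(torch.int8)
    return w_int8, scale

def weight_only_matmul(x, w_int8, scale):
    # Dequantize just before the GEMM; activations stay in FP16 throughout.
    return x @ (w_int8.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096, dtype=torch.float16)
w_int8, scale = quantize_weight(w)   # INT8 storage: half the bytes of FP16
x = torch.randn(1, 4096, dtype=torch.float16)
y = weight_only_matmul(x, w_int8, scale)
```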

2.3 Service Throughput Optimization

Increasing batch size improves throughput but raises memory usage; the presented quantization and fusion methods also lower memory, enabling larger batches.

Dynamic insertion monitors finished samples within a batch and injects new inputs to keep the batch fully occupied, thereby improving concurrency during generation.
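A minimal scheduling sketch of dynamic insertion, assuming a hypothetical step(batch) that advances every active sequence by one token and returns the slot indices of sequences that just finished (all names here are illustrative):

```python
import collections

def serve(step, requests, batch_size):
    queue = collections.deque(requests)
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    finished = []
    while batch:
        done_slots = step(batch)              # one decoding step for all slots
        for slot in sorted(done_slots, reverse=True):
            finished.append(batch[slot])      # collect the completed sequence
            if queue:
                batch[slot] = queue.popleft() # inject a waiting request into the freed slot
            else:
                batch.pop(slot)               # shrink the batch once the queue drains
    return finished
```

Because finished slots are refilled immediately, the batch stays fully occupied instead of draining down to a single long straggler.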

3. High‑Performance Optimization for Multimodal Diffusion Models

Stable Diffusion’s UNet backbone shares the same attention‑heavy structure as LLMs, allowing reuse of the above optimizations. By integrating Flash Attention, supporting multiple Norm variants, and applying end‑to‑end layout and scheduler optimizations, inference speed and memory consumption reach industry‑leading levels.
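As one concrete piece of this, PyTorch 2.x exposes a fused scaled_dot_product_attention that dispatches to a FlashAttention kernel on supported CUDA GPUs. The sketch below shows the drop‑in call with [B, heads, seq, head_dim] tensors; it is a minimal illustration, not Stable Diffusion's actual integration:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Fused kernel: never materializes the [seq, seq] attention matrix,
    # which is what saves memory and bandwidth at high resolutions.
    return F.scaled_dot_product_attention(q, k, v)

# FlashAttention requires FP16/BF16 inputs on a CUDA device.
q = torch.randn(2, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 8, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 8, 4096, 64, dtype=torch.float16, device="cuda")
out = attention(q, k, v)
```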

Tags: Quantization · Distributed inference · Multimodal diffusion
Written by Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.