Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization
This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.
Background – In the era of large multimodal models, Vision‑Language Models (VLMs) based on Transformer architectures achieve superior visual understanding but suffer from high inference cost, especially in batch‑oriented scenarios such as 58.com’s furniture and safety‑detection pipelines.
Scenario Overview – The primary workload is batch processing of image‑plus‑prompt tasks, with >98% of requests producing short token outputs (≤5 tokens). Two minor cases involve long‑token inputs and outputs, which are not the focus of optimization.
Performance Metrics – Throughput (queries per minute) and latency (ms per token) are the key indicators; the batch‑oriented nature makes throughput the dominant metric.
4.1 Image Pre‑processing Optimization – Replaced Pillow‑based resize/patch partitioning with OpenCV, cutting average per‑image preprocessing time from 23.67 ms to 12.03 ms (≈49% faster). Minor pixel‑level output differences between the two libraries stem from their distinct interpolation and border‑handling implementations.
4.2 ViT Module TensorRT Support – Converted the ViT sub‑graph to ONNX, then to a TensorRT engine, applying layer fusion and precision calibration. Memory‑copy logic was streamlined by removing unnecessary CPU‑side copies in lmdeploy/vl/engine.py, lmdeploy/serve/vl_async_engine.py, and lmdeploy/pytorch/message.py. TensorRT acceleration reduced ViT feature‑extraction time by ~45% and overall inference latency by ~70 ms.
4.3 ViT Module CUDA‑Graph Support – Integrated torch.cuda.CUDAGraph to capture the ViT inference graph, achieving ~30 ms per‑batch latency reduction. A graph pool indexed by batch size handles variable batch dimensions while respecting CUDA‑Graph static‑shape constraints.
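A sketch of the batch-size-keyed graph pool, assuming the standard `torch.cuda.CUDAGraph` capture-and-replay pattern (class and method names here are illustrative, not the actual lmdeploy code). Each batch size gets its own captured graph, since captured graphs require static input shapes:

```python
import torch

class ViTGraphPool:
    """One captured CUDA graph per batch size (graphs require static shapes)."""

    def __init__(self, model, seq_len=196, dim=64):
        self.model, self.seq_len, self.dim = model, seq_len, dim
        self.graphs = {}  # batch_size -> (graph, static_input, static_output)

    def _capture(self, bs):
        static_in = torch.zeros(bs, self.seq_len, self.dim, device="cuda")
        # Warm up on a side stream before capture, per the CUDA Graphs docs,
        # so lazy initialization doesn't get recorded into the graph.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                self.model(static_in)
        torch.cuda.current_stream().wait_stream(s)
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            static_out = self.model(static_in)
        self.graphs[bs] = (g, static_in, static_out)

    def run(self, x):
        bs = x.shape[0]
        if bs not in self.graphs:
            self._capture(bs)  # first request at this batch size pays capture cost
        g, static_in, static_out = self.graphs[bs]
        static_in.copy_(x)     # copy new inputs into the captured buffers
        g.replay()             # replay the recorded kernel sequence
        return static_out.clone()
```

Replay eliminates per-kernel launch overhead, which is where the ~30 ms per-batch saving comes from; the trade-off is extra GPU memory held by each captured graph's static buffers.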
4.4 Image Tokenization Reduction – Adjusted patch‑splitting logic to lower the number of image tokens. For a 480×320 image the token count dropped from 3328 to 512, doubling batch throughput. Token count per patch is computed as image_tokens_per_patch = (force_image_size // patch_size)**2 * (downsample_ratio**2) (e.g., 256 tokens for a 448×448 patch).
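The arithmetic above can be checked directly. Assuming a patch size of 14 and a downsample ratio of 0.5 (typical InternVL2 values; not stated explicitly in the text), and assuming the 480×320 image was originally split into 13 patches (e.g. 12 tiles plus a thumbnail, which is a hypothesis consistent with the reported counts):

```python
def image_tokens_per_patch(force_image_size=448, patch_size=14,
                           downsample_ratio=0.5):
    # (448 // 14)**2 = 1024 raw patch tokens; pixel-shuffle downsampling by
    # 0.5 in each spatial dimension keeps one quarter of them -> 256.
    return int((force_image_size // patch_size) ** 2 * downsample_ratio ** 2)

tokens = image_tokens_per_patch()   # 256 tokens per 448x448 patch
before = 13 * tokens                # 13 patches -> 3328 tokens
after = 2 * tokens                  # 2 patches  -> 512 tokens
```

Since image tokens dominate the prompt length for short-text tasks, the 6.5× token reduction directly enlarges the feasible batch size, which is what doubles throughput.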
4.5 Prefix‑Cache in Multimodal Models – Modified KV‑cache handling to exclude image tokens from prefix caching, preventing incorrect reuse across requests. This preserves correct generation while still benefiting from cache reuse for textual prefixes.
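A minimal sketch of the exclusion rule, assuming image tokens are marked with a placeholder id in the prompt sequence (the function name and sentinel value are illustrative, not lmdeploy internals). Only the pure-text prefix before the first image token is eligible for cache matching:

```python
IMAGE_TOKEN_ID = -1  # placeholder; the real id depends on the model's tokenizer

def cacheable_prefix_len(token_ids, image_token_id=IMAGE_TOKEN_ID):
    """Length of the longest leading run of pure-text tokens.

    Only this prefix participates in prefix-cache matching. Image tokens are
    excluded because their KV entries encode per-request visual features:
    two requests with identical token ids but different images would
    otherwise wrongly share cached KV blocks.
    """
    for i, tok in enumerate(token_ids):
        if tok == image_token_id:
            return i
    return len(token_ids)
```

System prompts and shared textual instructions, which typically precede the image, still benefit fully from cache reuse under this rule.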
4.6 Model Quantization – Applied post‑training weight quantization (AWQ and GPTQ) to produce W4A16 models. AWQ uses symmetric quantization with activation‑aware scaling; GPTQ employs block‑wise Hessian‑guided optimization. Benchmarks on RTX 4090 show mixed effects: latency improvements for short outputs are modest, while long‑output decoding sees noticeable throughput gains.
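To make the W4A16 storage scheme concrete, here is a sketch of plain symmetric per-group 4-bit weight quantization. This is only the shared storage format, not AWQ or GPTQ themselves (AWQ additionally rescales salient channels using activation statistics; GPTQ additionally compensates rounding error block by block using Hessian information), and the group size of 128 is a common convention, not a value taken from the article:

```python
import numpy as np

def quantize_w4_sym(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group 4-bit weight quantization (W4A16-style storage).

    Each group of `group_size` weights shares one floating-point scale;
    values are rounded to integers in [-8, 7]. Activations stay in fp16,
    hence "W4A16"."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Decode back to float for the fp16 matmul at inference time.
    return q.astype(np.float32) * scale
```

This explains the benchmark pattern: 4-bit weights shrink the memory traffic per decode step, so memory-bound long-output decoding speeds up noticeably, while short-output requests (dominated by prefill compute) see little change.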
Evaluation – Using the InternVL2‑8B model on a dataset of 4524 housing‑inspection images, the optimized LMDeploy‑0.6.0 pipeline achieved a 3.05× increase in QPM without degrading recall.
Conclusion – A combination of preprocessing acceleration, TensorRT and CUDA‑Graph integration, token count reduction, careful prefix‑cache handling, and selective quantization can dramatically improve VLM inference throughput in batch‑centric production environments.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.