How Baidu’s AIAK‑Inference Supercharges AI Model Inference on GPUs

This article analyses AI inference bottlenecks end to end, reviews common industry acceleration techniques, and describes Baidu Intelligent Cloud’s AIAK-Inference suite, including its architecture and its optimisation strategies: model simplification, operator fusion, and single-operator tuning. It closes with a demo showing significant latency reductions on ResNet-50 and other models.

Baidu Geek Talk

Jan 5, 2023

1. AI Inference Pain Points

AI inference runs user input through a trained model deployed on hardware such as GPUs and exposes the result as a service over HTTP or RPC. Two groups of stakeholders face distinct challenges: algorithm engineers, who want fast and accurate serving, and infrastructure engineers, who want to keep expensive GPU resources fully utilised.

In practice, the end-to-end workflow runs from resource request and GPU allocation, through execution by an AI framework, to kernel launch on the GPU. Existing frameworks offer convenient model APIs but are not optimised for inference latency or GPU utilisation, and they rely on generic acceleration libraries that are not tailored to specific workloads.

As a result, no component in this stack is dedicated to keeping GPUs constantly busy with useful work, which leads to low GPU and SM utilisation and higher inference latency.

2. Industry Acceleration Solutions

To define the optimisation goals, first recall the GPU architecture: a GPU contains multiple streaming multiprocessors (SMs), each with ALUs and Tensor Cores. Fully utilising the GPU requires every SM to have active warps. NVIDIA defines two metrics: GPU utilisation (the fraction of time during which any task is running on the GPU) and SM utilisation (the fraction of time an SM has active warps, averaged over all SMs).

Two illustrative cases show why GPU utilisation alone is not enough: (1) frequent idle gaps between tasks cause both low GPU utilisation and low SM utilisation; (2) on a GPU with four SMs, a kernel that occupies only one SM while the other three stay idle yields 100 % GPU utilisation but only 25 % SM utilisation. SM utilisation therefore reflects how efficiently work is scheduled onto the hardware more precisely.
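The two cases can be checked with a few lines of arithmetic. The sketch below is illustrative only: the trace format (start, end, SMs used), the four-SM GPU, and the function names are assumptions made for this example, not NVIDIA's or AIAK-Inference's measurement code, and it assumes kernels run one at a time.

```python
from typing import List, Tuple

# Hypothetical trace entry: (start_ms, end_ms, number_of_SMs_the_kernel_occupies)
Kernel = Tuple[float, float, int]


def gpu_utilisation(kernels: List[Kernel], window_ms: float) -> float:
    """Fraction of the window during which at least one kernel is running."""
    busy = 0.0
    covered_until = 0.0
    for start, end, _ in sorted(kernels):
        start = max(start, covered_until)  # do not count overlapping time twice
        if end > start:
            busy += end - start
            covered_until = end
    return busy / window_ms


def sm_utilisation(kernels: List[Kernel], window_ms: float, num_sms: int) -> float:
    """Busy time per SM, averaged over all SMs (assumes non-overlapping kernels)."""
    sm_busy_ms = sum((end - start) * sms for start, end, sms in kernels)
    return sm_busy_ms / (window_ms * num_sms)


NUM_SMS = 4        # assumed toy GPU
WINDOW_MS = 100.0

# Case 1: short kernels that use every SM, separated by long idle gaps.
gappy = [(0.0, 10.0, 4), (50.0, 60.0, 4)]
# Case 2: one long kernel that occupies a single SM for the whole window.
narrow = [(0.0, 100.0, 1)]

for name, trace in [("gappy", gappy), ("narrow", narrow)]:
    gpu = gpu_utilisation(trace, WINDOW_MS)
    sm = sm_utilisation(trace, WINDOW_MS, NUM_SMS)
    print(f"{name}: GPU utilisation {gpu:.0%}, SM utilisation {sm:.0%}")
```

In the second trace the GPU counts as busy for the whole window, yet three of the four SMs do nothing, which is the gap that SM utilisation exposes.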

Based on these metrics, optimisation can be grouped into three categories:

Model-level simplification: quantisation, pruning, distillation, neural architecture search (NAS), etc., applied before deployment to reduce compute.

Operator‑level optimisation: keep the GPU busy by fusing many small operators into larger ones, reducing kernel launch overhead and memory traffic.

Single‑operator tuning: adapt kernel implementations to the hardware (e.g., GEMM, Conv) using scheduling, memory‑access patterns, and templated code generation.
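To make the operator-fusion category concrete, here is a toy model of why fusing small element-wise operators helps. It is a sketch under stated assumptions: each "kernel" is simulated in plain Python, and the launch and memory-access counters are a simplified cost model invented for this example, not a measurement of real GPU behaviour.

```python
from typing import Callable, List, Tuple

Op = Callable[[float], float]


def run_unfused(x: List[float], ops: List[Op]) -> Tuple[List[float], int, int]:
    """One simulated kernel per operator; every kernel reads and writes a full tensor."""
    launches = 0
    mem_accesses = 0
    for op in ops:
        x = [op(v) for v in x]        # the intermediate result goes back to memory
        launches += 1
        mem_accesses += 2 * len(x)    # one read and one write per element
    return x, launches, mem_accesses


def run_fused(x: List[float], ops: List[Op]) -> Tuple[List[float], int, int]:
    """One simulated kernel: each element is read once, passed through all ops, written once."""
    out = []
    for v in x:
        for op in ops:
            v = op(v)                 # intermediates stay in a local variable ("register")
        out.append(v)
    return out, 1, 2 * len(x)


# A typical small chain: scale, add bias, ReLU.
ops: List[Op] = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: max(v, 0.0)]
data = [-2.0, -0.5, 0.0, 3.0]

unfused_out, unfused_launches, unfused_mem = run_unfused(data, ops)
fused_out, fused_launches, fused_mem = run_fused(data, ops)

print("same result:", unfused_out == fused_out)
print("launches:", unfused_launches, "->", fused_launches)
print("memory accesses:", unfused_mem, "->", fused_mem)
```

Both paths compute the same values; the fused path simply pays the launch overhead and the memory round trip once instead of once per operator.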

3. AIAK‑Inference Acceleration Suite

AIAK‑Inference is Baidu Intelligent Cloud’s AI inference acceleration kit, part of the Baidu BaiGe solution. It targets heterogeneous GPU resources purchased on Baidu Cloud, aiming to lower latency and increase throughput without changing user inference code.

The architecture consists of four layers:

Graph ingestion: captures dynamic/static graphs from various frameworks and converts them to inference‑friendly static graphs.

Backend abstraction: unifies multiple optimisation back-ends behind a common interface and selects the fastest one by timing each candidate.

Accelerated back‑ends: integrates open‑source solutions such as FastDeploy and a proprietary back‑end that performs graph optimisation, conversion, and runtime acceleration.

Operator library: combines industry‑standard operators with custom, scene‑specific kernels.
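The timing-based selection in the backend abstraction layer can be pictured roughly as follows. This is a minimal sketch, not AIAK-Inference's implementation: `pick_fastest_backend`, the shape of the `backends` dictionary, and the stand-in backends are all hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical contract: a backend is a zero-argument function that returns
# a callable "compiled model", or raises if it cannot handle the model.
Backend = Callable[[], Callable[[Any], Any]]


def pick_fastest_backend(
    backends: Dict[str, Backend],
    sample_input: Any,
    warmup: int = 3,
    runs: int = 20,
) -> Tuple[str, Dict[str, float]]:
    """Time every backend that compiles successfully and return the fastest one."""
    timings: Dict[str, float] = {}
    for name, compile_fn in backends.items():
        try:
            model = compile_fn()
        except Exception:
            continue                      # unsupported backend: skip it
        for _ in range(warmup):           # warm-up runs are not timed
            model(sample_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(sample_input)
        timings[name] = (time.perf_counter() - start) / runs
    if not timings:
        raise RuntimeError("no backend could compile the model")
    best = min(timings, key=timings.get)
    return best, timings


# Stand-in backends for demonstration only.
def slow_backend():
    def model(x):
        time.sleep(0.003)                 # pretend this backend is slower
        return x
    return model


def fast_backend():
    return lambda x: x


def unsupported_backend():
    raise NotImplementedError("model contains an unsupported operator")


best, timings = pick_fastest_backend(
    {"slow": slow_backend, "fast": fast_backend, "unsupported": unsupported_backend},
    sample_input=1,
    warmup=1,
    runs=5,
)
print("selected backend:", best)
```

Skipping backends that fail to compile lets one model fall back to a slower but compatible path without any change to the caller.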

Key differentiators are “multi-backend seamless integration” and “scene-aware custom operators”. Optimisation follows the three categories described earlier: graph-level simplification (including quantisation, pruning, distillation, and dead-code elimination), operator fusion (fusing memory-intensive operators, fusing trailing operators into GEMM/Conv kernels, and fusing back-to-back GEMMs), and single-operator tuning (scheduling, memory layout, templated kernels).
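Of the graph-level techniques listed, dead-code elimination is the simplest to show in isolation. The sketch below uses a toy graph representation (a dictionary mapping each node name to the names of its inputs) chosen for this example; it does not reflect AIAK-Inference's internal graph format.

```python
from typing import Dict, List


def eliminate_dead_nodes(
    graph: Dict[str, List[str]], outputs: List[str]
) -> Dict[str, List[str]]:
    """Keep only the nodes that the requested outputs depend on."""
    live = set()
    stack = list(outputs)
    while stack:                          # walk backwards from the outputs
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(graph.get(node, []))
    return {name: inputs for name, inputs in graph.items() if name in live}


# Toy graph: node name -> list of input node names.
graph = {
    "input": [],
    "conv": ["input"],
    "relu": ["conv"],
    "debug_stats": ["conv"],              # computed, but nothing consumes it
    "logits": ["relu"],
}

pruned = eliminate_dead_nodes(graph, outputs=["logits"])
print(list(pruned))
```

Here `debug_stats` is dropped because no path leads from it to `logits`, so the work of computing it would be wasted at inference time.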

Examples include cutting wasted memory instructions in a Conv kernel by 20 % (which yields a 3 % end-to-end speed-up) and developing fused multi-head attention (FMHA) and YoloBox operators for NLP and CV workloads.

AIAK‑Inference also tracks the latest ecosystem, providing a Dynamo backend for PyTorch 2.0 and an automated template‑driven operator generation pipeline.

4. Using AIAK‑Inference

To use the suite, users prepare an environment via a Docker image or a Python wheel. The workflow adds a single optimisation script that calls aiak_inference.compile (or optimize) on a saved model (e.g., TorchScript or SavedModel) and outputs an optimised model.

Demo with ResNet‑50 on an NVIDIA T4 GPU:

Baseline inference (1000 runs) yields ~6.73 ms latency per request.

After optimisation with optimize.py, latency drops to ~3.54 ms, a 47 % reduction.
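A latency comparison of this kind can be reproduced with a small timing harness. The following is a sketch, not the demo's actual script: `mean_latency_ms` and `latency_reduction_pct` are hypothetical helper names, and the workload is a CPU stand-in. When timing a real GPU model, the device must be synchronised before the timer is read, because kernel launches are asynchronous.

```python
import time
from statistics import mean
from typing import Callable


def mean_latency_ms(fn: Callable[[], object], runs: int = 1000, warmup: int = 10) -> float:
    """Average wall-clock latency of fn() in milliseconds over `runs` calls."""
    for _ in range(warmup):               # warm-up calls are excluded from the average
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def latency_reduction_pct(baseline_ms: float, optimised_ms: float) -> float:
    """Relative latency reduction, in percent of the baseline."""
    return (baseline_ms - optimised_ms) / baseline_ms * 100.0


# Stand-in workload; replace with a call to the real model.
measured = mean_latency_ms(lambda: sum(range(1000)), runs=100)
print(f"stand-in workload: {measured:.4f} ms per call")

# Checking the figures quoted in the demo.
print(f"reduction: {latency_reduction_pct(6.73, 3.54):.1f} %")
```

With the quoted figures, (6.73 - 3.54) / 6.73 comes to roughly 47.4 %, consistent with the 47 % reduction reported.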

Further experiments on six typical CV models show latency reductions ranging from 40 % to 90 %.

Overall, AIAK‑Inference enables transparent, zero‑code‑intrusion optimisation of AI models, delivering substantial inference speed‑ups on Baidu Cloud GPUs.
