How Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle Boost AI Painting Performance
This article analyzes Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle‑optimized Stable Diffusion models, presenting benchmark comparisons, hardware‑specific speed and memory gains, and the underlying inference optimizations that enable low‑cost, high‑throughput AI‑generated image creation.
Background
AIGC (AI‑Generated Content) has become a major research direction in deep learning, with AI‑driven painting being a prominent application. Diffusion‑based text‑to‑image models such as Stable Diffusion have generated strong demand for efficient deployment.
Model Performance Highlights
Baidu’s knowledge‑enhanced multimodal model ERNIE‑ViLG 2.0 surpasses Stable Diffusion and DALL‑E 2 both on the MS‑COCO benchmark and in blind human evaluation, showing stronger semantic controllability, image clarity, and understanding of Chinese cultural concepts.
Benchmark Results
On a single NVIDIA A100 (80 GB) GPU, Stable Diffusion inference with PaddlePaddle reaches 68.2 iters/s, or 0.76 s per 512×512 image. That is 4 × faster than Diffusers (PyTorch) and 7.9 % faster than the best TensorRT configuration, while using only 43 % of TensorRT’s memory.
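As a rough sanity check, the A100 figures above can be related to one another with simple arithmetic. The sketch below uses only the quoted numbers. Two things are assumptions rather than reported facts: the 50-step sampling count (a common Stable Diffusion setting that the benchmark does not state), and reading "N × faster" as a ratio of iterations per second.

```python
# Back-of-envelope check of the reported A100 figures.
ITERS_PER_SEC = 68.2         # reported PaddlePaddle throughput
LATENCY_SEC = 0.76           # reported time per 512x512 image
SPEEDUP_VS_DIFFUSERS = 4.0   # "4 x faster", read as a throughput ratio (assumption)
SPEEDUP_VS_TENSORRT = 1.079  # "7.9 % faster", read the same way (assumption)
ASSUMED_STEPS = 50           # assumption: not stated in the benchmark

# Number of denoising iterations that would fill the whole latency budget.
implied_iters = ITERS_PER_SEC * LATENCY_SEC

# Time in the denoising loop if 50 steps are used, and what remains for
# everything else (text encoder, image decoder, scheduling overhead).
loop_sec = ASSUMED_STEPS / ITERS_PER_SEC
other_sec = LATENCY_SEC - loop_sec

# Baseline throughputs implied by the quoted speedups.
diffusers_iters = ITERS_PER_SEC / SPEEDUP_VS_DIFFUSERS
tensorrt_iters = ITERS_PER_SEC / SPEEDUP_VS_TENSORRT

print(f"implied iterations per image: {implied_iters:.1f}")
print(f"denoising loop at {ASSUMED_STEPS} steps: {loop_sec:.3f} s")
print(f"remaining time per image: {other_sec:.3f} s")
print(f"implied Diffusers throughput: {diffusers_iters:.2f} iters/s")
print(f"implied TensorRT throughput: {tensorrt_iters:.1f} iters/s")
```

Under these assumptions, the two headline numbers are consistent with each other: roughly 0.73 s of the 0.76 s would be spent in the denoising loop, leaving a few tens of milliseconds for the rest of the pipeline. The derived baseline throughputs are estimates, not measurements.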
On Baidu’s Kunlun R200 (32 GB) accelerator, ERNIE‑ViLG 2.0 inference is 20 % faster than comparable mainstream inference cards and consumes less memory, enabling higher‑resolution generation.
Key Inference Optimizations
Flash Attention
PaddlePaddle integrates a high‑performance Flash Attention kernel that computes the softmax block by block, so the full attention score matrix never has to be written to GPU memory. This reduces memory accesses for both self‑attention and cross‑attention, accelerating inference and lowering memory usage.
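The core idea behind block-wise softmax can be sketched in a few lines of plain Python. This is a minimal illustration of the general Flash Attention recurrence for a single query vector, not PaddlePaddle's kernel; the function names and the tiny block size are hypothetical. It keeps only a running maximum, a running denominator, and a running output while walking over the keys and values in blocks.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize every score, apply softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Block-wise softmax: only running statistics are kept between blocks."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding. The tiled version never holds more than `block_size` scores at once, which is what lets a GPU kernel keep its working set in fast on-chip memory.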
Norm Fusion
LayerNorm and GroupNorm operators in the U‑Net are fused with surrounding element‑wise and activation ops. PaddlePaddle merges 93 distinct norm patterns, yielding a 3 % inference speed improvement.
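To make the idea of norm fusion concrete, here is a minimal sketch of one such pattern, GroupNorm followed by a SiLU activation. The choice of pattern is an assumption for illustration, and the code is plain Python rather than PaddlePaddle's fused GPU kernels. The unfused version produces a full intermediate tensor between the two operators, while the fused version folds the normalization and affine parameters into one multiply-add and applies the activation in the same pass.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def _group_stats(x, num_groups, eps):
    """Per-group mean and 1/sqrt(var + eps); x is laid out as [channel][position]."""
    per_group = len(x) // num_groups
    stats = []
    for g in range(num_groups):
        vals = [v for ch in x[g * per_group:(g + 1) * per_group] for v in ch]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, 1.0 / math.sqrt(var + eps)))
    return stats

def group_norm_then_silu(x, num_groups, gamma, beta, eps=1e-5):
    """Unfused: GroupNorm writes an intermediate tensor, SiLU reads it back."""
    per_group = len(x) // num_groups
    stats = _group_stats(x, num_groups, eps)
    normed = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        normed.append([(v - mean) * rstd * gamma[c] + beta[c] for v in ch])
    return [[silu(v) for v in ch] for ch in normed]

def fused_group_norm_silu(x, num_groups, gamma, beta, eps=1e-5):
    """Fused: normalize, scale/shift, and activate each element in one pass."""
    per_group = len(x) // num_groups
    stats = _group_stats(x, num_groups, eps)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        # Fold mean, rstd, gamma, and beta into a single multiply-add.
        a = rstd * gamma[c]
        b = beta[c] - mean * a
        out.append([silu(v * a + b) for v in ch])
    return out
```

The two functions are numerically equivalent; on a GPU the gain comes from launching one kernel instead of several and from skipping the round trip of the intermediate tensor through memory.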
Mixed Layout Computation
Tensor layout matching eliminates redundant transposes in the U‑Net, removing 32 transpose operations and delivering a 3‑4 % speed boost while also cutting memory consumption.
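One way such redundant transposes can arise, and be removed, is sketched below. The setting is assumed for illustration: a linear list of operators in which some kernels prefer a channels-last (NHWC) layout and are each wrapped in a pair of transposes from and to NCHW. The op representation and the `remove_redundant_transposes` pass are hypothetical, not PaddlePaddle's graph optimizer. The pass merges back-to-back transposes and drops any pair that composes to the identity.

```python
IDENTITY = (0, 1, 2, 3)
NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)

def compose(first, second):
    """Permutation equivalent to transposing by `first`, then by `second`."""
    return tuple(first[axis] for axis in second)

def remove_redundant_transposes(ops):
    """Peephole pass over a linear list of (name, perm) operator tuples."""
    result = []
    for name, perm in ops:
        if name == "transpose" and result and result[-1][0] == "transpose":
            # Two transposes in a row collapse into one, or into nothing.
            merged = compose(result[-1][1], perm)
            result.pop()
            if merged != IDENTITY:
                result.append(("transpose", merged))
        else:
            result.append((name, perm))
    return result

# Two NHWC convolutions, each wrapped in its own layout conversions.
graph = [
    ("transpose", NCHW_TO_NHWC),
    ("conv2d_nhwc", None),
    ("transpose", NHWC_TO_NCHW),
    ("transpose", NCHW_TO_NHWC),
    ("conv2d_nhwc", None),
    ("transpose", NHWC_TO_NCHW),
]
optimized = remove_redundant_transposes(graph)
```

In this toy graph the two inner transposes cancel, so the data stays in NHWC between the convolutions and only the outermost conversions remain. Each removed transpose also removes the temporary tensor it would have produced, which is why layout matching can cut memory as well as time.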
Scheduler Optimization
The scheduler logic in the PPDiffusers library is streamlined: GPU kernel launches per scheduler.step drop from ~12 to 7, and pre‑computed parameters remove CPU work and GPU‑CPU synchronization during sampling loops.
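The effect of pre-computing scheduler parameters can be illustrated with a deterministic DDIM-style update. The article does not say which scheduler was optimized, so DDIM is used here only as a stand-in; the function names, the toy noise schedule, and the scalar inputs are all assumptions, and this is not the PPDiffusers implementation. The reference step evaluates several square roots and intermediate terms on every call, while the pre-computed variant reduces each step to a single multiply-add with coefficients prepared before the loop starts.

```python
import math

def make_alphas_cumprod(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Toy linear-beta schedule; returns the cumulative product of (1 - beta)."""
    alphas_cumprod, running = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod

def ddim_step_reference(x, eps, a_t, a_prev):
    """Deterministic DDIM update written out term by term."""
    pred_x0 = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
    return math.sqrt(a_prev) * pred_x0 + math.sqrt(1.0 - a_prev) * eps

def precompute_coefficients(alphas_cumprod, timesteps):
    """Fold each step into x_prev = c_x * x + c_eps * eps, computed once up front."""
    coeffs = []
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        c_x = math.sqrt(a_prev / a_t)
        c_eps = math.sqrt(1.0 - a_prev) - c_x * math.sqrt(1.0 - a_t)
        coeffs.append((c_x, c_eps))
    return coeffs

def ddim_step_precomputed(x, eps, coeff):
    """One multiply-add per step; no per-step scalar math."""
    c_x, c_eps = coeff
    return c_x * x + c_eps * eps
```

In a real pipeline `x` and `eps` are tensors and the update is applied element-wise. With the coefficients prepared ahead of time, the step needs fewer small GPU kernels and no scalar values have to be computed on, or fetched back to, the CPU inside the sampling loop.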
Inference Memory Optimization
Operator fusion reduces the number of independent U‑Net operators by 60 %, cutting memory usage by 27 %. Layout‑aware optimizations further lower overall memory by ~19 %. For ERNIE‑ViLG 2.0, workspace reuse reduces memory consumption by 37 %.
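Workspace reuse can be pictured with a small accounting sketch. It assumes operators that run one after another and each need a temporary scratch buffer; the buffer sizes are hypothetical and are not measurements from ERNIE‑ViLG 2.0. If every operator keeps its own buffer, the scratch memory is the sum of all requests; if they share one buffer, it is only as large as the biggest request.

```python
def peak_workspace_separate(workspace_mib):
    """Every operator keeps its own scratch buffer alive."""
    return sum(workspace_mib)

def peak_workspace_shared(workspace_mib):
    """Operators run sequentially and share one buffer sized for the largest request."""
    return max(workspace_mib, default=0)

# Hypothetical scratch requirements (MiB) of four consecutive operators.
requests = [256, 64, 512, 128]
separate = peak_workspace_separate(requests)
shared = peak_workspace_shared(requests)
saving = 1.0 - shared / separate

print(f"separate buffers: {separate} MiB, shared buffer: {shared} MiB")
print(f"scratch memory saved: {saving:.0%}")
```

The saving in this toy example depends entirely on the made-up sizes and should not be compared with the percentages reported above; it only shows why sharing scratch space lowers peak memory when operators do not run concurrently.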
Combined, these techniques allow Stable Diffusion to run on a single A100 (80 GB) with 0.76 s latency, 68.2 iters/s speed, and only 4.6 GB memory – a best‑in‑class result.
Deployment Tools and References
The open‑source PaddlePaddle diffusion toolbox (PPDiffusers) provides end‑to‑end training and inference pipelines. Repository: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers
FastDeploy offers ready‑to‑use deployment packages for Stable Diffusion on GPU and Kunlun R200. Repository: https://github.com/PaddlePaddle/FastDeploy/tree/develop/examples/multimodal/stable_diffusion
Future Work
PaddlePaddle will continue to optimize large‑scale generative models, expanding end‑to‑end training, compression, and inference pipelines to further reduce deployment costs and accelerate industry adoption of AIGC technologies.
