Rapid Diffusion: Fast, Domain‑Specific Text‑to‑Image Generation for Chinese

Rapid Diffusion introduces a knowledge‑enhanced, high‑speed Chinese text‑to‑image diffusion model with one‑click deployment, achieving superior image quality and up to 1.73× faster inference through FlashAttention and BladeDISC optimizations, and demonstrates strong performance across e‑commerce, traditional painting, and food datasets.

Alibaba Cloud Big Data AI Platform

Alibaba Cloud's Machine Learning Platform PAI, in collaboration with South China University of Technology, presented Rapid Diffusion at ACL 2023. This Chinese domain‑specific text‑to‑image diffusion model, built on the Stable Diffusion architecture, offers fast generation, one‑click deployment, and fine‑tuning on personal data.

Background

Text‑to‑Image Synthesis (TIS) generates images from textual prompts. Recent advances in large pre‑trained models and diffusion techniques have produced high‑quality images, but existing open‑source models lack domain‑specific entity knowledge and suffer from slow inference, limiting industrial applications.

Algorithm Overview

Rapid Diffusion improves Stable Diffusion by injecting rich entity knowledge into the CLIP text encoder using a knowledge graph, and integrates an ESRGAN‑based super‑resolution network after the diffusion module to boost image resolution while controlling model size and latency. For deployment, a FlashAttention‑optimized architecture and the BladeDISC AI compiler are used to accelerate inference.

Figure: the Rapid Diffusion framework.

Knowledge‑Enhanced Text Encoder

To better understand Chinese entities, the model is pre‑trained on 100 million image‑text pairs from the Wukong dataset and the OpenKG Chinese knowledge graph (16 million entities, 140 million triples). Entity tokens are enriched with knowledge‑graph embeddings derived via the TransE algorithm, producing a trainable Chinese CLIP encoder suitable for domain adaptation.
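A minimal sketch of how TransE entity embeddings can be fused into the token stream of a CLIP-style encoder. The class names, dimensions, and additive fusion rule below are illustrative assumptions, not the exact Rapid Diffusion implementation:

```python
import torch
import torch.nn as nn

class KnowledgeEnhancedEmbedding(nn.Module):
    """Fuse token embeddings with knowledge-graph entity embeddings.

    Hypothetical sketch: in practice the entity table would hold
    pre-trained TransE vectors from the OpenKG knowledge graph.
    """
    def __init__(self, vocab_size=49408, num_entities=1000, text_dim=768, kg_dim=100):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, text_dim)
        self.entity_emb = nn.Embedding(num_entities, kg_dim)
        # project the KG space into the text-encoder space
        self.kg_proj = nn.Linear(kg_dim, text_dim)

    def forward(self, token_ids, entity_ids, entity_mask):
        # entity_mask marks token positions aligned with a KG entity
        h = self.token_emb(token_ids)
        kg = self.kg_proj(self.entity_emb(entity_ids))
        return h + entity_mask.unsqueeze(-1) * kg

emb = KnowledgeEnhancedEmbedding()
tokens = torch.randint(0, 49408, (1, 8))
entities = torch.randint(0, 1000, (1, 8))
mask = torch.zeros(1, 8)
mask[0, 2] = 1.0  # only token position 2 mentions an entity
out = emb(tokens, entities, mask)
print(out.shape)  # torch.Size([1, 8, 768])
```

Only positions flagged by the mask receive knowledge injection; all other tokens pass through the ordinary embedding unchanged, so the encoder remains a drop-in replacement for standard CLIP.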

Latent Space Noise Predictor

The latent diffusion component uses a cross‑attention U‑Net to predict noise conditioned on the enriched text embeddings. Training employs classifier‑free guidance and the PNDM sampler to reduce sampling steps. The model is first pre‑trained on Wukong data, then fine‑tuned on domain‑specific corpora.
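Classifier-free guidance combines a conditional and an unconditional noise prediction at every sampling step; the standard combination rule is shown below (the guidance scale is an illustrative default, not necessarily the paper's setting):

```python
import torch

def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy predictions with latent-space shapes (batch, channels, h, w)
e_u = torch.zeros(2, 4, 32, 32)
e_c = torch.ones(2, 4, 32, 32)
out = cfg_noise(e_u, e_c, guidance_scale=7.5)
print(out.mean().item())  # 7.5
```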

Figure: training loss for latent diffusion.
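For reference, the standard latent diffusion objective from Rombach et al. trains the noise predictor $\epsilon_\theta$ to recover the noise added to the latent; Rapid Diffusion's objective follows this form, with the text conditioning coming from the knowledge-enhanced encoder:

```latex
\mathcal{L}_{\mathrm{LDM}}
  = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
    \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y)\right) \right\rVert_2^2 \right]
```

Here $z_t$ is the noised latent at timestep $t$, $\mathcal{E}$ is the VAE encoder, and $\tau_\theta(y)$ is the text encoding of prompt $y$.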

Super‑Resolution Network

Generated images (256×256) are upscaled using a pre‑trained ESRGAN model, achieving higher resolution without the latency penalties of a second diffusion pass.
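A toy sketch of the sub-pixel upsampling that maps 256×256 outputs to 1024×1024. The real ESRGAN uses deep pre-trained RRDB blocks; this only illustrates the 4× upscaling head, so treat the layer sizes as assumptions:

```python
import torch
import torch.nn as nn

class TinyUpscaler(nn.Module):
    """Minimal ESRGAN-style 4x upscaler head (illustrative only)."""
    def __init__(self, channels=3, features=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, features, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(features, channels * 16, 3, padding=1),
            nn.PixelShuffle(4),  # folds 16x channels into a 4x larger grid
        )

    def forward(self, x):
        return self.body(x)

sr = TinyUpscaler()
img = torch.randn(1, 3, 256, 256)  # a generated 256x256 image
out = sr(img)
print(out.shape)  # torch.Size([1, 3, 1024, 1024])
```

Because this is a single feed-forward pass rather than another iterative diffusion process, the resolution boost adds very little latency.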

Inference Acceleration Design

Profiling revealed the U‑Net cross‑attention as the main bottleneck. The solution combines automatic operator slicing, kernel fusion, and I/O‑aware attention (FlashAttention) to reduce memory traffic. These optimizations, together with BladeDISC compilation, yield a 1.9× speed‑up for the U‑Net and an overall 1.73× inference acceleration.
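The same I/O-aware attention idea is available off the shelf in PyTorch 2.x as the fused `scaled_dot_product_attention` kernel, which avoids materializing the full attention matrix in GPU memory. A quick equivalence check against naive attention (tensor shapes here are illustrative, chosen to resemble U-Net cross-attention over a 77-token prompt):

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # materializes the full (Lq x Lk) score matrix -- exactly the
    # memory traffic that I/O-aware attention avoids
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v

q = torch.randn(1, 8, 64, 40)   # (batch, heads, query_len, head_dim)
k = torch.randn(1, 8, 77, 40)   # 77 = CLIP prompt length
v = torch.randn(1, 8, 77, 40)

fused = F.scaled_dot_product_attention(q, k, v)
ref = naive_attention(q, k, v)
print(torch.allclose(fused, ref, atol=1e-4))  # True
```

The fused kernel is numerically equivalent but restructures the computation into tiles that stay in fast on-chip memory, which is where the reported U-Net speed-up comes from.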

Figure: CUDA time breakdown of model inference.

Algorithm Accuracy Evaluation

Rapid Diffusion was evaluated on three Chinese image‑text datasets (e‑commerce, traditional painting, food). It achieved lower FID scores than baseline models, with an average FID of 21.90, demonstrating superior realism and diversity.
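FID measures the Fréchet distance between Gaussians fitted to Inception features of real and generated images; lower is better. A minimal implementation of the closed-form distance (a generic sketch, not the paper's evaluation code):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))

# identical feature statistics give a distance of zero
mu, sigma = np.zeros(4), np.eye(4)
print(fid(mu, sigma, mu, sigma))  # ~0.0
```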

Table: FID comparison across the three evaluation datasets.

Text‑image retrieval experiments show that the knowledge‑enhanced Chinese CLIP (CKCLIP) significantly outperforms the standard CLIP, especially on the R@1 metric.
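R@1 measures how often the correct image is the single top-ranked result for a text query. A minimal sketch of the metric, assuming the matching image for query i sits at column i of the similarity matrix:

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Text-to-image R@k: fraction of queries whose ground-truth image
    (the diagonal entry) appears in the top-k retrievals."""
    ranks = np.argsort(-sim, axis=1)  # best match first
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return sum(hits) / len(hits)

# toy similarity matrix: rows = text queries, columns = images
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.7, 0.4]])
print(recall_at_k(sim, k=1))  # query 2 ranks the wrong image first, so 2/3
```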

Table: text-image retrieval performance (CKCLIP vs. standard CLIP).

Inference speed tests confirm a 1.73× acceleration using BladeDISC and FlashAttention, with the approach applicable to other diffusion models such as Stable Diffusion and Taiyi Diffusion.

Table: inference acceleration results.

The model and code will be contributed to the EasyNLP framework, inviting NLP researchers to adopt and extend the system.

EasyNLP repository: https://github.com/alibaba/EasyNLP

References

Chengyu Wang, Minghui Qiu, Taolin Zhang, et al. EasyNLP: A Comprehensive and Easy‑to‑use Toolkit for Natural Language Processing. EMNLP 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, et al. High‑resolution image synthesis with latent diffusion models. CVPR 2022.

Jonathan Ho, Ajay Jain, Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS 2020.

Jiaming Song, Chenlin Meng, Stefano Ermon. Denoising diffusion implicit models. ICLR 2021.

Kai Zhu, Wenyi Zhao, Zhen Zheng, et al. DISC: A dynamic shape compiler for machine learning workloads. MLSys 2021.

Tri Dao, Daniel Y. Fu, Stefano Ermon, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. NeurIPS 2022.

Tags: diffusion model, text-to-image, Chinese NLP, knowledge enhancement, fast inference
Written by Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
