Unlocking Chinese Text-to-Image Generation with Alibaba’s PAI‑Diffusion Models
This article introduces Alibaba Cloud’s open‑source PAI‑Diffusion series. It covers the Latent Diffusion Model foundation, Chinese CLIP alignment, and the super‑resolution component, showcases artistic and real‑world text‑to‑image generation scenarios, and explains how to access the models via the Alibaba Cloud AI Capability Center, PAI‑DSW, and HuggingFace Space.
Overview
With the explosive growth of multimodal data and of the compute available for training large deep‑learning models, AI‑generated content (AIGC) has surged, especially text‑to‑image generation. Popular models such as DALL‑E and Stable Diffusion are primarily English‑centric and large in scale, which makes them hard for Chinese‑language users to adopt. To address this, Alibaba Cloud’s Machine Learning (PAI) team has open‑sourced the PAI‑Diffusion series, which includes general‑purpose and domain‑specific Chinese text‑to‑image models.
1. Latent Diffusion Model Principle
Diffusion models consist of a forward diffusion process that gradually adds noise to an image and a reverse diffusion process that denoises from random noise back to the image. By operating in a low‑dimensional latent space via an auto‑encoder, Latent Diffusion Models (LDM) dramatically reduce memory and time consumption. Text‑guided diffusion uses a text encoder and a U‑Net; the text encoder converts Chinese text into embeddings that condition the denoising process.
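The forward noising step and the savings from working in latent space can be illustrated with a minimal NumPy sketch. It assumes a standard DDPM-style linear noise schedule and the Stable Diffusion v1 latent layout (8× spatial downsampling, 4 latent channels); neither is confirmed here as the PAI‑Diffusion configuration, and all function names are illustrative.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed DDPM-style linear noise schedule (not PAI-Diffusion specific)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise
    return x_t, noise


def latent_compression_ratio(height, width, downsample=8,
                             latent_channels=4, image_channels=3):
    """Ratio of pixel-space values to latent-space values for one image."""
    pixels = height * width * image_channels
    latents = (height // downsample) * (width // downsample) * latent_channels
    return pixels / latents


betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))  # stand-in for an encoded 512x512 image

x_early, _ = forward_diffuse(latent, 10, alphas_cumprod, rng)   # still close to x0
x_late, _ = forward_diffuse(latent, 999, alphas_cumprod, rng)   # almost pure noise

ratio = latent_compression_ratio(512, 512)
```

Under these assumptions a 512×512 RGB image holds 48 times as many values as its latent, which is where the memory and time savings come from. The reverse process trains a U‑Net to predict the `noise` term returned by `forward_diffuse`, conditioned on the text embeddings.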
2. Stable Diffusion
Stable Diffusion is an LDM trained on a subset of the LAION‑5B dataset, capable of running on consumer‑grade GPUs. Version 1 supports text‑to‑image and sketch‑to‑image generation, while version 2 improves the text encoder, raises default resolution to 768×768, and adds depth‑to‑image and text‑guided inpainting capabilities.
3. Model Pipeline Architecture
The PAI‑Diffusion pipeline consists of four components:
Text Encoder: Chinese CLIP (EasyNLP) transforms input text into embedding vectors.
Latent Diffusion Model: Generates latent‑space representations conditioned on the text.
Auto Encoder: Decodes latent tensors back to pixel images.
Super‑Resolution Model: Enhances image resolution using ESRGAN.
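How the four components chain together can be sketched as below. This is a structural illustration only: every stage is a hypothetical stub that produces arrays of plausible shapes, not the real Chinese CLIP, U‑Net, auto‑encoder, or ESRGAN weights, and the 8× decode and 4× upscale factors are assumptions rather than documented PAI‑Diffusion settings.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class TextToImagePipeline:
    """Chains the four stages; each stage is injected as a plain callable."""
    encode_text: Callable[[str], np.ndarray]
    denoise_latents: Callable[[np.ndarray], np.ndarray]
    decode_latents: Callable[[np.ndarray], np.ndarray]
    upscale: Callable[[np.ndarray], np.ndarray]

    def __call__(self, prompt: str) -> np.ndarray:
        text_emb = self.encode_text(prompt)       # 1. text encoder
        latents = self.denoise_latents(text_emb)  # 2. latent diffusion model
        image = self.decode_latents(latents)      # 3. auto-encoder decoder
        return self.upscale(image)                # 4. super-resolution model


# Hypothetical stand-ins so the sketch runs without any model weights.
def stub_text_encoder(prompt: str) -> np.ndarray:
    seed = sum(ord(ch) for ch in prompt)
    return np.random.default_rng(seed).standard_normal(8).astype(np.float32)


def stub_denoiser(text_emb: np.ndarray) -> np.ndarray:
    # A real model would iteratively denoise random latents conditioned on text_emb.
    return np.full((4, 32, 32), float(text_emb.mean()), dtype=np.float32)


def stub_decoder(latents: np.ndarray) -> np.ndarray:
    # Assumed 8x spatial upsampling from latent space to pixels, 3 output channels.
    return np.repeat(np.repeat(latents[:3], 8, axis=1), 8, axis=2)


def stub_upscaler(image: np.ndarray) -> np.ndarray:
    # Assumed 4x super-resolution factor.
    return np.repeat(np.repeat(image, 4, axis=1), 4, axis=2)


pipeline = TextToImagePipeline(
    encode_text=stub_text_encoder,
    denoise_latents=stub_denoiser,
    decode_latents=stub_decoder,
    upscale=stub_upscaler,
)
image = pipeline("五彩斑斓的鱼在海洋中游动")  # illustrative Chinese prompt
```

Injecting the stages as callables keeps the data flow explicit: swapping a stub for a real model changes only that stage, not the pipeline itself.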
4. Multi‑Scene Artistic Generation
General Scene
Examples of colorful fish swimming in an ocean.
Poetry Illustration
Chinese classical poems paired with vivid images.
Anime Style
Pink cherry‑blossom girl generated from a prompt.
Artistic Painting
Warm scenes such as “Tenderness under smoke”.
Fantasy Realism
Monstrous beasts devouring everything.
5. Real‑World Business Scenarios
E‑commerce Products
Generating high‑quality product images such as floral dresses.
World Cuisine
Creating realistic food images, e.g., Korean fried chicken.
6. Easy Experience with PAI‑Diffusion
The models can be tried through three main channels:
Alibaba Cloud AI Capability Center – an online portal showcasing diverse AIGC cases, including text‑to‑image generation.
PAI‑DSW (Data Science Workshop) – a cloud IDE with interactive notebooks and sample notebooks for Chinese text‑to‑image generation.
HuggingFace Space – public demos where users input a prompt (e.g., a dish name) to obtain high‑resolution images.
7. Future Outlook
Future work will integrate PAI‑Diffusion checkpoints into the EasyNLP framework, provide lightweight fine‑tuning interfaces for limited resources, and continue optimizing inference speed, image quality, and advanced editing capabilities. The Alibaba Cloud PAI team invites the community to contribute to Chinese multimodal research.
References
Chengyu Wang, Minghui Qiu, Taolin Zhang, et al. EasyNLP: A Comprehensive and Easy‑to‑use Toolkit for Natural Language Processing. EMNLP 2022.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, et al. High‑Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
Jiaxi Gu, Xiaojun Meng, Guansong Lu, et al. Wukong: 100 Million Large‑scale Chinese Cross‑modal Pre‑training Dataset and A Foundation Framework. arXiv.
Ling Yang, Zhilong Zhang, Yang Song, et al. Diffusion models: A comprehensive survey of methods and applications. arXiv.
