Unlocking Chinese Text-to-Image Generation with Alibaba’s PAI‑Diffusion Models

This article introduces Alibaba Cloud’s open‑source PAI‑Diffusion series. It covers the Latent Diffusion Model foundation, the Chinese CLIP text encoder, and the super‑resolution component, showcases artistic and real‑world text‑to‑image scenarios, and explains how to try the models through the Alibaba Cloud AI Capability Center, PAI‑DSW, and HuggingFace Space.

Alibaba Cloud Big Data AI Platform

Dec 12, 2022

Overview

With the explosive growth of multimodal data and of the compute available for training large deep‑learning models, AI‑generated content (AIGC) has surged, especially text‑to‑image generation. Popular models such as DALL‑E and Stable Diffusion are large and trained primarily on English data, which makes them hard for Chinese‑language users to apply directly. To address this, Alibaba Cloud’s Machine Learning (PAI) team has open‑sourced the PAI‑Diffusion series, which includes general‑purpose and domain‑specific Chinese text‑to‑image models.

1. Latent Diffusion Model Principle

Diffusion models consist of a forward diffusion process that gradually adds noise to an image and a reverse diffusion process that starts from random noise and removes it step by step to recover an image. Latent Diffusion Models (LDM) run both processes in a low‑dimensional latent space produced by an autoencoder rather than in pixel space, which dramatically reduces memory and time consumption. Text‑guided diffusion combines a text encoder with a U‑Net: the text encoder converts Chinese text into embeddings, and the U‑Net uses those embeddings to condition each denoising step.
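The forward and reverse processes described above can be illustrated with a minimal sketch. It assumes the linear noise schedule and 1,000 steps that are common DDPM defaults; the article does not state which schedule PAI‑Diffusion uses, and the function names and the four‑element "latent" are hypothetical stand‑ins. The noisy sample is computed in closed form as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, and the last helper shows why predicting the noise ε (the U‑Net's job in a real model) is enough to recover the clean latent.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all steps s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process in closed form: jump from x_0 directly to x_t."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps, t, alpha_bars):
    """Invert the forward step given the noise; a trained U-Net predicts eps."""
    ab = alpha_bars[t]
    return [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for x, e in zip(xt, eps)]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
latent = [0.5, -1.0, 0.25, 2.0]  # hypothetical stand-in for a flattened latent
noisy, noise = add_noise(latent, 500, alpha_bars, rng)
restored = recover_x0(noisy, noise, 500, alpha_bars)
```

Here the true noise is known, so the inversion is exact up to floating‑point error. A real sampler only has the U‑Net's estimate of the noise and therefore refines the latent over many steps instead of inverting in one.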

2. Stable Diffusion

Stable Diffusion is an LDM trained on a subset of the LAION‑5B dataset, and it can run on consumer‑grade GPUs. Version 1 supports text‑to‑image and sketch‑to‑image generation, while version 2 improves the text encoder, raises the default resolution to 768×768, and adds depth‑to‑image and text‑guided inpainting capabilities.
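A short calculation shows why working in latent space makes consumer‑grade GPUs viable. The sketch below assumes Stable Diffusion's autoencoder settings, which downsample each spatial side by a factor of 8 and use 4 latent channels; whether PAI‑Diffusion uses the same factors is not stated in this article, and the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Shape (C, H, W) of the latent for an image of the given pixel size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsample factor")
    return (channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, channels=4):
    """How many times fewer values the U-Net processes than an RGB image holds."""
    c, h, w = latent_shape(height, width, downsample, channels)
    return (3 * height * width) / (c * h * w)


shape_768 = latent_shape(768, 768)       # (4, 96, 96)
ratio_768 = compression_ratio(768, 768)  # 48.0
```

Under these assumptions, each denoising step at 768×768 touches 48 times fewer values than it would in pixel space.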

3. Model Pipeline Architecture

The PAI‑Diffusion pipeline consists of four components:

Text Encoder: Chinese CLIP (EasyNLP) transforms input text into embedding vectors.

Latent Diffusion Model: Generates latent‑space representations conditioned on the text.

Auto Encoder: Decodes latent tensors back to pixel images.

Super‑Resolution Model: Enhances image resolution using ESRGAN.
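The way the four components hand data to one another can be sketched by tracing tensor shapes through the stages. This is only an illustration of the data flow: every function name and every number in `PipelineConfig` is a hypothetical placeholder, not PAI‑Diffusion's published configuration, and no model is actually run.

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # All values are illustrative assumptions, not published settings.
    max_tokens: int = 32       # padded prompt length
    embed_dim: int = 768       # text embedding width
    latent_channels: int = 4
    latent_size: int = 32      # latent height and width
    vae_scale: int = 8         # autoencoder upsampling factor
    sr_scale: int = 4          # super-resolution upsampling factor


def encode_text(prompt, cfg):
    """Text encoder stage: prompt -> embedding of shape (tokens, dim)."""
    if not prompt:
        raise ValueError("prompt must be non-empty")
    return (cfg.max_tokens, cfg.embed_dim)


def denoise_latent(text_shape, cfg):
    """Latent diffusion stage: text-conditioned latent of shape (C, h, w)."""
    if text_shape != (cfg.max_tokens, cfg.embed_dim):
        raise ValueError("unexpected text embedding shape")
    return (cfg.latent_channels, cfg.latent_size, cfg.latent_size)


def decode_latent(latent, cfg):
    """Autoencoder decoder stage: latent -> RGB image shape."""
    _, h, w = latent
    return (3, h * cfg.vae_scale, w * cfg.vae_scale)


def upscale(image, cfg):
    """Super-resolution stage: enlarge the decoded image."""
    c, h, w = image
    return (c, h * cfg.sr_scale, w * cfg.sr_scale)


def trace_pipeline(prompt, cfg=None):
    """Return the shape produced by each of the four stages, in order."""
    cfg = cfg or PipelineConfig()
    text = encode_text(prompt, cfg)
    latent = denoise_latent(text, cfg)
    image = decode_latent(latent, cfg)
    return {"text": text, "latent": latent, "image": image,
            "upscaled": upscale(image, cfg)}


stages = trace_pipeline("五彩斑斓的鱼在海洋中游动")
```

With these placeholder values, a 32×32 latent decodes to a 256×256 image and the super‑resolution stage enlarges it to 1024×1024. The expensive diffusion loop only ever runs at the smallest resolution in the chain.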

4. Multi‑Scene Artistic Generation

General Scene

Examples of colorful fish swimming in an ocean.

Poetry Illustration

Chinese classical poems paired with vivid images.

Anime Style

Pink cherry‑blossom girl generated from a prompt.

Artistic Painting

Warm scenes such as “Tenderness under smoke”.

Fantasy Realism

Monstrous beasts devouring everything.

5. Real‑World Business Scenarios

E‑commerce Products

Generating high‑quality product images such as floral dresses.

World Cuisine

Creating realistic food images, e.g., Korean fried chicken.

6. Trying PAI‑Diffusion

The models can be tried through three main channels:

Alibaba Cloud AI Capability Center – an online portal showcasing diverse AIGC cases, including text‑to‑image generation.

PAI‑DSW (Data Science Workshop) – a cloud IDE with interactive notebooks, including sample notebooks for Chinese text‑to‑image generation.

HuggingFace Space – public demos where users input a prompt (e.g., a dish name) to obtain high‑resolution images.

7. Future Outlook

Future work will integrate PAI‑Diffusion checkpoints into the EasyNLP framework, provide lightweight fine‑tuning interfaces for limited resources, and continue optimizing inference speed, image quality, and advanced editing capabilities. The Alibaba Cloud PAI team invites the community to contribute to Chinese multimodal research.

References

Chengyu Wang, Minghui Qiu, Taolin Zhang, et al. EasyNLP: A Comprehensive and Easy‑to‑use Toolkit for Natural Language Processing. EMNLP 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, et al. High‑Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.

Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.

Jiaxi Gu, Xiaojun Meng, Guansong Lu, et al. Wukong: 100 Million Large‑scale Chinese Cross‑modal Pre‑training Dataset and A Foundation Framework. arXiv.

Ling Yang, Zhilong Zhang, Yang Song, et al. Diffusion models: A comprehensive survey of methods and applications. arXiv.
