Unlock Chinese Text-to-Image Generation with EasyNLP’s Open‑Source Models

This article introduces EasyNLP’s newly integrated Chinese text‑to‑image generation framework, explains the underlying Transformer‑VQGAN architecture, and provides model specifications, code walkthroughs, performance benchmarks on multiple datasets, and step‑by‑step tutorials for fine‑tuning and inference with the open‑source checkpoints.

Introduction

Multimodal data (text, image, audio) has exploded in recent years, driving demand for cross‑modal understanding and generation. Text‑to‑image generation, a popular cross‑modal task, enables AI to create images from textual prompts, with notable models such as OpenAI’s DALL‑E, DALL‑E 2, Google’s Parti and Imagen.

These models are large, English‑centric, and difficult for the community to fine‑tune. EasyNLP, Alibaba Cloud’s PyTorch‑based NLP framework, now integrates a Transformer + VQGAN text‑to‑image architecture and releases Chinese checkpoints for free, allowing users to fine‑tune with modest resources.

Text‑to‑Image Model Overview

Classic models include:

DALL‑E: two‑stage pipeline with a discrete VAE (dVAE) that compresses 256×256 RGB images into 32×32 image tokens, followed by an autoregressive Transformer that predicts the token sequence from text.

CogView: improves the two‑stage pipeline with SentencePiece tokenization and techniques such as super‑resolution and style transfer during fine‑tuning.

ERNIE‑ViLG: jointly learns text‑to‑image and image‑to‑text tasks, sharing a Transformer backbone.

OFA: unifies many cross‑modal generation tasks in a single architecture.

Diffusion‑based models (e.g., Google’s Imagen) generate high‑resolution images by iterative denoising rather than autoregressive token prediction.

Architecture diagrams for these models are shown in the original article.

EasyNLP Text‑to‑Image Model

The EasyNLP model follows a two‑stage design: images are first vector‑quantized with VQGAN, and a GPT‑style Transformer is then trained autoregressively over the resulting tokens. The VQGAN is pretrained on ImageNet (the f16_16384 configuration) and provides a codebook of 16,384 entries, compressing each 256×256 image into a 16×16 grid of codebook indices. The Transformer processes the concatenation of text tokens and previously generated image tokens, predicting the next image token at each step.

Two Chinese checkpoints are provided:

pai‑painter‑base‑zh: 202M parameters, 12 layers, 12 attention heads, hidden size 768.

pai‑painter‑large‑zh: 433M parameters, 24 layers, 16 attention heads, hidden size 1024.

Both models use a 32‑token text sequence and a 16×16 grid of image tokens (256 tokens), and generate 256×256 images.
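
The token budget follows directly from these specs; a quick sanity check (the variable names below are illustrative, but the total matches the --sequence_length=288 flag used in the tutorial):

text_len = 32                         # text tokens per prompt
img_len = 16 * 16                     # 256 image tokens from the 16x16 VQGAN grid
sequence_length = text_len + img_len  # 288 tokens fed to the Transformer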

Model Implementation

# in easynlp/appzoo/text2image_generation/model.py
self.first_stage_model = VQModel(ckpt_path=vqgan_ckpt_path).eval()  # frozen, pretrained VQGAN
self.transformer = GPT(self.config)  # autoregressive Transformer over text + image tokens

Encoding (VQModel):

# in easynlp/appzoo/text2image_generation/model.py
@torch.no_grad()
def encode_to_z(self, x):
    # Quantize the image; info[2] holds the flat codebook indices
    quant_z, _, info = self.first_stage_model.encode(x)
    indices = info[2].view(quant_z.shape[0], -1)
    return quant_z, indices

x = inputs['image']
x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
_, z_indices = self.encode_to_z(x)  # shape [batch_size, 256]

Decoding (VQModel):

# in easynlp/appzoo/text2image_generation/model.py
@torch.no_grad()
def decode_to_img(self, index, zshape):
    # Look up codebook embeddings for the indices, then decode to pixel space
    bhwc = (zshape[0], zshape[2], zshape[3], zshape[1])
    quant_z = self.first_stage_model.quantize.get_codebook_entry(
        index.reshape(-1), shape=bhwc)
    x = self.first_stage_model.decode(quant_z)
    return x
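
A hedged usage sketch (the zshape below assumes the f16_16384 VQGAN’s latent layout: a 256‑dimensional embedding on a 16×16 grid):

# img_idx: [batch_size, 256] codebook indices, e.g. from generate() below
zshape = (img_idx.shape[0], 256, 16, 16)    # (batch, embed_dim, height, width)
imgs = self.decode_to_img(img_idx, zshape)  # [batch_size, 3, 256, 256]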

Transformer forward (GPT):

# in easynlp/appzoo/text2image_generation/model.py
def forward(self, inputs):
    x = inputs['image']
    c = inputs['text']
    x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
    _, z_indices = self.encode_to_z(x)
    c_indices = c
    if self.training and self.pkeep < 1.0:
        # During training, randomly corrupt a fraction (1 - pkeep) of image tokens
        mask = torch.bernoulli(self.pkeep * torch.ones(z_indices.shape, device=z_indices.device)).round().long()
        r_indices = torch.randint_like(z_indices, self.transformer.config.vocab_size)
        a_indices = mask * z_indices + (1 - mask) * r_indices
    else:
        a_indices = z_indices
    # Concatenate text and image tokens; each position predicts the next image token
    cz_indices = torch.cat((c_indices, a_indices), dim=1)
    target = z_indices
    logits, _ = self.transformer(cz_indices[:, :-1])
    # Keep only the positions whose next token is an image token
    logits = logits[:, c_indices.shape[1]-1:]
    return logits, target
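
Training then reduces to next‑token prediction over the image positions. A minimal loss sketch (assuming standard cross‑entropy, the usual objective for this two‑stage family; the model variable is illustrative, not copied from the EasyNLP source):

import torch.nn.functional as F

logits, target = model(inputs)  # logits: [batch, 256, vocab], target: [batch, 256]
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))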

Generation loop (autoregressive):

def generate(self, inputs, top_k=100, temperature=1.0):
    cidx = inputs  # [batch_size, 32] text token indices
    steps = 256    # one sampling step per image token (16 x 16 grid)
    for k in range(steps):
        logits, _ = self.transformer(cidx)
        logits = logits[:, -1, :] / temperature  # distribution over the next token
        if top_k is not None:
            logits = self.top_k_logits(logits, top_k)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        ix = torch.multinomial(probs, num_samples=1)  # sample one image token
        cidx = torch.cat((cidx, ix), dim=1)
    img_idx = cidx[:, 32:]  # strip the 32 text tokens, keep the 256 image tokens
    return img_idx
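
The top_k_logits helper referenced above is not shown in the article; a minimal sketch, assuming it follows the standard minGPT‑style filtering:

def top_k_logits(self, logits, k):
    # Mask everything below the k-th largest logit in each row
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('inf')
    return out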

Model Performance

The models were evaluated on four Chinese datasets (COCO‑CN, MUGE, Flickr8k‑CN, Flickr30k‑CN) and compared against CogView and DALL‑E. The results show competitive visual quality, especially for the large checkpoint.

Examples

Sample generations include natural scenery, e‑commerce product images, and artistic Chinese paintings. Images are displayed in the original article.

Tutorial

Installation: follow the setup instructions on the EasyNLP GitHub page.

Data preparation: create TSV files with idx\ttext\tbase64_image columns for training/validation and idx\ttext columns for testing.
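
A minimal sketch of writing one training row (the file names and helper are illustrative; urlsafe base64 is used because the pipeline demo below decodes with base64.urlsafe_b64decode):

import base64
from io import BytesIO
from PIL import Image

def to_tsv_row(idx, text, image_path):
    # Re-encode the image as JPEG bytes, then as a urlsafe base64 string
    buf = BytesIO()
    Image.open(image_path).convert("RGB").save(buf, format="JPEG")
    img_b64 = base64.urlsafe_b64encode(buf.getvalue()).decode("utf-8")
    return f"{idx}\t{text}\t{img_b64}"

with open("MUGE_train_text_imgbase64.tsv", "w", encoding="utf-8") as f:
    f.write(to_tsv_row("0", "宽松T恤", "tshirt.jpg") + "\n")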

Training command (fine‑tune base or large model):

easynlp \
    --mode=train \
    --worker_gpu=1 \
    --tables=MUGE_train_text_imgbase64.tsv,MUGE_val_text_imgbase64.tsv \
    --input_schema=idx:str:1,text:str:1,imgbase64:str:1 \
    --first_sequence=text \
    --second_sequence=imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --learning_rate=4e-5 \
    --epoch_num=1 \
    --random_seed=42 \
    --logging_steps=100 \
    --save_checkpoint_steps=1000 \
    --sequence_length=288 \
    --micro_batch_size=16 \
    --app_name=text2image_generation \
    --user_defined_parameters='\
        pretrain_model_name_or_path=alibaba-pai/pai-painter-large-zh\
        size=256\
        text_len=32\
        img_len=256\
        img_vocab_size=16384\
    '

Inference command (batch generation):

easynlp \
    --mode=predict \
    --worker_gpu=1 \
    --tables=MUGE_test.text.tsv \
    --input_schema=idx:str:1,text:str:1 \
    --first_sequence=text \
    --outputs=./T2I_outputs.tsv \
    --output_schema=idx,text,gen_imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --sequence_length=288 \
    --micro_batch_size=8 \
    --app_name=text2image_generation \
    --user_defined_parameters='\
        size=256\
        text_len=32\
        img_len=256\
        img_vocab_size=16384\
    '
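
Each output row contains the columns declared in --output_schema; a short sketch for decoding the generated images from T2I_outputs.tsv (assuming the same urlsafe base64 encoding used by the pipeline demo below):

import base64
from io import BytesIO
from PIL import Image

with open("./T2I_outputs.tsv", encoding="utf-8") as f:
    for line in f:
        idx, text, gen_imgbase64 = line.rstrip("\n").split("\t")
        Image.open(BytesIO(base64.urlsafe_b64decode(gen_imgbase64))).save(f"{idx}.png")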

Pipeline usage (quick demo):

# Build the pipeline (checkpoint fetched from the EasyNLP model zoo)
from easynlp.pipelines import pipeline
from PIL import Image
from io import BytesIO
import base64

painter = pipeline("pai-painter-commercial-base-zh")
# Predict from a Chinese prompt ("loose-fit T-shirt")
result = painter(["宽松T恤"])
# Convert the returned base64 string back to an image and save it
img = Image.open(BytesIO(base64.urlsafe_b64decode(result[0]["gen_imgbase64"])))
img.save("宽松T恤.png")

Other pretrained pipelines are available for scenery and Chinese painting scenes.

Future Work

EasyNLP will continue to add more state‑of‑the‑art Chinese multimodal models and support a broader range of NLP and cross‑modal tasks. The community is invited to contribute and follow the ongoing research from Alibaba Cloud’s PAI team.

Written by Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
