Unlock Chinese Text‑to‑Image Generation with EasyNLP: Models, Code & Tutorials

This article introduces EasyNLP's Chinese text‑to‑image generation framework, explains the underlying Transformer‑VQGAN architecture, provides model specifications, showcases sample outputs, and offers step‑by‑step code and command‑line instructions for fine‑tuning and inference.

Alibaba Cloud Developer

Multimodal data (text, image, audio) drives the rapid growth of content‑centric AI, and text‑to‑image generation, exemplified by OpenAI's DALL‑E and DALL‑E 2 and by Google's Parti and Imagen, has become a flagship cross‑modal task. However, existing large‑scale models rarely handle Chinese prompts well, and they are too heavy for most open‑source users to fine‑tune or deploy.

Text‑to‑Image Generation Model Overview

DALL‑E uses a two‑stage pipeline: a discrete VAE (dVAE) compresses each 256×256 RGB image into a 32×32 grid of image tokens, and an autoregressive Transformer then predicts those tokens from the text. CogView improves on this design, employing SentencePiece tokenization and fine‑tuning techniques such as super‑resolution and style transfer.

ERNIE‑ViLG extends the Transformer to jointly learn text‑to‑image and image‑to‑text tasks.

Recent advances such as OFA unify multiple cross‑modal generation tasks, while diffusion models from Google enable high‑resolution image synthesis.

EasyNLP Text‑to‑Image Model

EasyNLP integrates a Transformer + VQGAN architecture for Chinese text‑to‑image generation, offering checkpoints of two sizes (202 M and 433 M parameters) that can be fine‑tuned with modest resources.

Model Architecture

The training follows a two‑stage process: first, a VQGAN encoder compresses each 256×256 image into a 16×16 grid of discrete tokens drawn from a codebook of size 16,384; second, a GPT‑style Transformer autoregressively generates this image‑token sequence conditioned on the text tokens.
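To make the second stage concrete, here is a minimal sampling sketch (a simplification under assumed shapes, not the EasyNLP implementation) in which a GPT‑style model extends the 32 text tokens with 16×16 = 256 image tokens, one position at a time:

import torch

def sample_image_tokens(transformer, text_ids, img_len=16 * 16):
    # text_ids: (B, 32) tokenized prompt; transformer returns (logits, ...) as in the snippet below
    seq = text_ids
    for _ in range(img_len):
        logits, _ = transformer(seq)                  # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1], dim=-1)  # distribution over the next token
        next_tok = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)       # append the sampled image token
    return seq[:, text_ids.shape[1]:]                 # the 256 generated image tokens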

Open‑Source Model Parameters

pai‑painter‑base‑zh: 202 M parameters, 12 layers, 12 attention heads, hidden size 768

pai‑painter‑large‑zh: 433 M parameters, 24 layers, 16 attention heads, hidden size 1024

Both models use the VQGAN f16_16384 checkpoint pretrained on ImageNet, with image size 256×256, text length 32, and image token length 16×16 (256 tokens)

Implementation Code

# Stage 1: VQGAN image tokenizer (kept frozen); stage 2: GPT-style Transformer
self.first_stage_model = VQModel(ckpt_path=vqgan_ckpt_path).eval()
self.transformer = GPT(self.config)

# encode_to_z: encode an image batch x into discrete codebook indices
quant_z, _, info = self.first_stage_model.encode(x)
indices = info[2].view(quant_z.shape[0], -1)  # flatten the 16x16 token grid per image
return quant_z, indices

# decode_to_img: look up codebook entries for predicted indices and decode to pixels
bhwc = (zshape[0], zshape[2], zshape[3], zshape[1])  # (B, C, H, W) -> (B, H, W, C)
quant_z = self.first_stage_model.quantize.get_codebook_entry(index.reshape(-1), shape=bhwc)
x = self.first_stage_model.decode(quant_z)
return x

# forward (training): next-token prediction over the concatenated text+image sequence
logits, _ = self.transformer(cz_indices[:, :-1])  # teacher forcing: feed all but the last token
logits = logits[:, c_indices.shape[1] - 1:]       # keep only logits at image-token positions
return logits, target                             # target: ground-truth image indices
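For intuition, the logits/target pair returned by forward would typically feed a standard next-token cross-entropy loss over the image positions; a minimal sketch (an assumption on our part, the exact EasyNLP training loop is not shown here):

import torch.nn.functional as F

# Flatten (B, T, vocab) logits and (B, T) targets into one cross-entropy call
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))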

Evaluation Results

The models were evaluated on four public Chinese datasets (COCO‑CN, MUGE, Flickr8k‑CN, Flickr30k‑CN) and compared with CogView and DALL‑E.

Sample Outputs

Examples on natural scenery (from COCO‑CN) and e‑commerce items illustrate the output quality of both the base and large models.

Usage Tutorial

Install EasyNLP

Follow the official installation guide in the EasyNLP GitHub repository (see reference [15]).

Data Preparation

Prepare tab‑separated (TSV) files with three columns: an index, the text, and the base64‑encoded image. For prediction, only the index and text columns are needed.

import base64
from io import BytesIO
from PIL import Image

# Convert one image file (path fn) into a base64 string for the TSV's third column
img = Image.open(fn)
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data).decode('utf-8')  # decode bytes to str for the TSV
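Putting the pieces together, here is a minimal sketch (hypothetical file name and sample row) that writes a complete three‑column training TSV:

import base64
from io import BytesIO
from PIL import Image

rows = [("0", "一只可爱的猫", "cat.jpg")]  # (index, text "a cute cat", image path); hypothetical sample
with open("train_text_imgbase64.tsv", "w", encoding="utf-8") as f:
    for idx, text, path in rows:
        img = Image.open(path)
        buf = BytesIO()
        img.save(buf, format=img.format)
        b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
        f.write(f"{idx}\t{text}\t{b64}\n")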

Model Training

The command below fine‑tunes pai‑painter‑large‑zh. The --tables argument takes a training file and a validation file (the MUGE validation split is reused for both here as a quick demonstration), and sequence_length=288 is text_len (32) plus img_len (256).

easynlp \
    --mode=train \
    --worker_gpu=1 \
    --tables=MUGE_val_text_imgbase64.tsv,MUGE_val_text_imgbase64.tsv \
    --input_schema=idx:str:1,text:str:1,imgbase64:str:1 \
    --first_sequence=text \
    --second_sequence=imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --learning_rate=4e-5 \
    --epoch_num=1 \
    --random_seed=42 \
    --logging_steps=100 \
    --save_checkpoint_steps=1000 \
    --sequence_length=288 \
    --micro_batch_size=16 \
    --app_name=text2image_generation \
    --user_defined_parameters='\
        pretrain_model_name_or_path=alibaba-pai/pai-painter-large-zh\
        size=256\
        text_len=32\
        img_len=256\
        img_vocab_size=16384\
    '

Batch Inference

easynlp \
    --mode=predict \
    --worker_gpu=1 \
    --tables=MUGE_test.text.tsv \
    --input_schema=idx:str:1,text:str:1 \
    --first_sequence=text \
    --outputs=./T2I_outputs.tsv \
    --output_schema=idx,text,gen_imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --sequence_length=288 \
    --micro_batch_size=8 \
    --app_name=text2image_generation \
    --user_defined_parameters='\
        size=256\
        text_len=32\
        img_len=256\
        img_vocab_size=16384\
    '
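Each row of T2I_outputs.tsv carries the generated image as base64 in the gen_imgbase64 column. A minimal sketch for decoding those rows back into PNG files (assuming the urlsafe base64 variant used in the pipeline demo below):

import base64
import csv
from io import BytesIO
from PIL import Image

with open("T2I_outputs.tsv", encoding="utf-8") as f:
    for idx, text, gen_imgbase64 in csv.reader(f, delimiter="\t"):
        img = Image.open(BytesIO(base64.urlsafe_b64decode(gen_imgbase64)))
        img.save(f"{idx}.png")  # one PNG per generated row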

Pipeline Quick Demo

# Build pipeline (model name from the EasyNLP model zoo)
from easynlp.pipelines import pipeline

default_ecommercial_pipeline = pipeline("pai-painter-commercial-base-zh")
# Predict
data = ["宽松T恤"]  # "loose-fit T-shirt"
results = default_ecommercial_pipeline(data)
# Convert base64 to image
from PIL import Image
from io import BytesIO
import base64

def base64_to_image(imgbase64_str):
    return Image.open(BytesIO(base64.urlsafe_b64decode(imgbase64_str)))

for text, result in zip(data, results):
    img = base64_to_image(result['gen_imgbase64'])
    img.save(f"{text}.png")
    print(f"text: {text}, saved image: {text}.png")

Future Outlook

EasyNLP will continue to release more Chinese multimodal models and integrate state‑of‑the‑art architectures for various NLP and vision‑language tasks. The community is invited to contribute and co‑build the next generation of open‑source Chinese AI tools.

Reference

[1] Chengyu Wang et al., "EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing", arXiv.
[2] Aditya Ramesh et al., "Zero‑Shot Text‑to‑Image Generation", ICML 2021.
[3] Ming Ding et al., "CogView: Mastering Text‑to‑Image Generation via Transformers", NeurIPS 2021.
[4] Han Zhang et al., "ERNIE‑ViLG: Unified Generative Pre‑training for Bidirectional Vision‑Language Generation", arXiv.
[5] Peng Wang et al., "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence‑to‑Sequence Learning Framework", ICML 2022.
[6] Aditya Ramesh et al., "Hierarchical Text‑Conditional Image Generation with CLIP Latents", arXiv.
[7] Aaron van den Oord, Oriol Vinyals et al., "Neural Discrete Representation Learning", NIPS 2017.
[8] Patrick Esser et al., "Taming Transformers for High‑Resolution Image Synthesis", CVPR 2021.
[9] Chitwan Saharia et al., "Photorealistic Text‑to‑Image Diffusion Models with Deep Language Understanding", arXiv.
[10] Jiahui Yu et al., "Scaling Autoregressive Models for Content‑Rich Text‑to‑Image Generation", arXiv.
[11] https://zhuanlan.zhihu.com/p/528476134
[12] http://tianchi.aliyun.com/muge
[13] https://github.com/THUDM/CogView
[14] https://github.com/lucidrains/DALLE-pytorch
[15] https://github.com/alibaba/EasyNLP

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.