Unlock Chinese Text‑to‑Image Generation with EasyNLP: Models, Code & Tutorials
This article introduces EasyNLP's Chinese text‑to‑image generation framework, explains the underlying Transformer‑VQGAN architecture, provides model specifications, showcases sample outputs, and offers step‑by‑step code and command‑line instructions for fine‑tuning and inference.
Multimodal data (text, image, audio) drives the rapid growth of content‑centric AI, and text‑to‑image generation—exemplified by OpenAI's DALL‑E, DALL‑E 2, Google Parti and Imagen—has become a flagship cross‑modal task. Existing large‑scale models are rarely usable for Chinese prompts and are too heavy for most open‑source users.
Text‑to‑Image Generation Model Overview
DALL‑E uses a two‑stage pipeline: a discrete VAE (dVAE) compresses 256×256 RGB images into 32×32 image tokens, and an autoregressive Transformer predicts those tokens from text. CogView improves on this by employing SentencePiece tokenization and techniques such as super‑resolution and style transfer during fine‑tuning.
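A quick back‑of‑the‑envelope check makes the compression step concrete (illustrative arithmetic from the numbers above, not DALL‑E code):

```python
# dVAE maps a 256x256 RGB image onto a 32x32 grid of discrete tokens
image_pixels = 256 * 256       # 65536 pixels
image_tokens = 32 * 32         # 1024 tokens
pixels_per_token = image_pixels // image_tokens
print(pixels_per_token)        # 64: each token summarizes an 8x8 patch
```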
ERNIE‑ViLG extends the Transformer to jointly learn text‑to‑image and image‑to‑text tasks.
Recent advances such as OFA unify multiple cross‑modal generation tasks, while diffusion models from Google enable high‑resolution image synthesis.
EasyNLP Text‑to‑Image Model
EasyNLP integrates a Transformer + VQGAN architecture for Chinese text‑to‑image generation, offering checkpoints of two sizes (202 M and 433 M parameters) that can be fine‑tuned with modest resources.
Model Architecture
The training follows a two‑stage process: first, VQGAN encodes images into discrete tokens (16×16 sequence, codebook size 16384); second, a GPT‑style Transformer autoregressively generates the image token sequence conditioned on the text tokens.
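The inference side of this two‑stage design can be sketched as a plain autoregressive loop. In the toy sketch below, the real VQGAN and GPT‑style Transformer are replaced by stand‑in stubs (`encode_text`, `next_token_logits` are my own names); only the control flow, text tokens first and then one image token at a time, mirrors the actual pipeline:

```python
import random

TEXT_LEN, IMG_LEN, VOCAB = 32, 16 * 16, 16384  # sizes quoted in the article

def encode_text(prompt):
    # stand-in tokenizer: map characters to ids and pad to TEXT_LEN
    ids = [ord(c) % VOCAB for c in prompt][:TEXT_LEN]
    return ids + [0] * (TEXT_LEN - len(ids))

def next_token_logits(context):
    # stand-in for the Transformer: one random score per codebook entry
    return [random.random() for _ in range(VOCAB)]

def generate_image_tokens(prompt, seed=0):
    random.seed(seed)
    context = encode_text(prompt)
    for _ in range(IMG_LEN):  # autoregressive loop over 16x16 positions
        logits = next_token_logits(context)
        context.append(max(range(VOCAB), key=logits.__getitem__))
    return context[TEXT_LEN:]  # tokens a VQGAN decoder would render as pixels

tokens = generate_image_tokens("一只猫")
print(len(tokens))  # 256
```

In the real model, `next_token_logits` is the GPT forward pass and the 256 generated indices are looked up in the VQGAN codebook and decoded into a 256×256 image.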
Open‑Source Model Parameters
pai‑painter‑base‑zh: 202 M parameters, 12 layers, 12 attention heads, hidden size 768
pai‑painter‑large‑zh: 433 M parameters, 24 layers, 16 attention heads, hidden size 1024
Both models use VQGAN f16_16384 pretrained on ImageNet, image size 256×256, text length 32, image token length 16×16
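These numbers also explain the `--sequence_length=288` that appears in the training and inference commands: the model's input sequence is simply the text tokens concatenated with the image tokens. A quick check:

```python
# Sequence layout: 32 text tokens followed by a 16x16 grid of image tokens
text_len = 32
img_len = 16 * 16          # 256 image tokens from the VQGAN encoder
sequence_length = text_len + img_len
print(sequence_length)     # 288, matching --sequence_length in the commands
```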
Implementation Code
# Two-stage components: pretrained VQGAN and a GPT-style Transformer
self.first_stage_model = VQModel(ckpt_path=vqgan_ckpt_path).eval()
self.transformer = GPT(self.config)

# encode_to_z: encode an image into discrete codebook indices
quant_z, _, info = self.first_stage_model.encode(x)
indices = info[2].view(quant_z.shape[0], -1)
return quant_z, indices

# decode_to_img: map codebook indices back to an image
bhwc = (zshape[0], zshape[2], zshape[3], zshape[1])
quant_z = self.first_stage_model.quantize.get_codebook_entry(index.reshape(-1), shape=bhwc)
x = self.first_stage_model.decode(quant_z)
return x

# forward (training): predict image tokens conditioned on text tokens
logits, _ = self.transformer(cz_indices[:, :-1])
logits = logits[:, c_indices.shape[1]-1:]
return logits, target

Model Effect
The models were evaluated on four public Chinese datasets (COCO‑CN, MUGE, Flickr8k‑CN, Flickr30k‑CN) and compared with CogView and DALL‑E.
Classic Cases
Examples on natural scenery (COCO‑CN) and e‑commerce items demonstrate the quality of both base and large models.
Usage Tutorial
Install EasyNLP
Follow the official installation guide (see reference [15]).
Data Preparation
Prepare TSV files with three columns: index, text, and base64‑encoded image. For testing, only index and text are needed.
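As a minimal sketch of assembling one such row (the helper names and file paths below are my own, not part of EasyNLP):

```python
import base64

def file_to_base64(path):
    # read an image file and return its base64-encoded contents as text
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def make_tsv_row(idx, text, image_path):
    # one training row: index, caption, base64 image, joined by tabs
    return "\t".join([str(idx), text, file_to_base64(image_path)])

# usage: make_tsv_row(0, "宽松T恤", "tshirt.jpg")
```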
import base64
from io import BytesIO
from PIL import Image

# encode one image file (path in fn) into a base64 string
img = Image.open(fn)
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)

Model Training
easynlp \
--mode=train \
--worker_gpu=1 \
--tables=MUGE_train_text_imgbase64.tsv,MUGE_val_text_imgbase64.tsv \
--input_schema=idx:str:1,text:str:1,imgbase64:str:1 \
--first_sequence=text \
--second_sequence=imgbase64 \
--checkpoint_dir=./finetuned_model/ \
--learning_rate=4e-5 \
--epoch_num=1 \
--random_seed=42 \
--logging_steps=100 \
--save_checkpoint_steps=1000 \
--sequence_length=288 \
--micro_batch_size=16 \
--app_name=text2image_generation \
--user_defined_parameters='\
pretrain_model_name_or_path=alibaba-pai/pai-painter-large-zh\
size=256\
text_len=32\
img_len=256\
img_vocab_size=16384\
'

Batch Inference
easynlp \
--mode=predict \
--worker_gpu=1 \
--tables=MUGE_test.text.tsv \
--input_schema=idx:str:1,text:str:1 \
--first_sequence=text \
--outputs=./T2I_outputs.tsv \
--output_schema=idx,text,gen_imgbase64 \
--checkpoint_dir=./finetuned_model/ \
--sequence_length=288 \
--micro_batch_size=8 \
--app_name=text2image_generation \
--user_defined_parameters='\
size=256\
text_len=32\
img_len=256\
img_vocab_size=16384\
'

Pipeline Quick Demo
# Build pipeline
from easynlp.pipelines import pipeline
default_ecommercial_pipeline = pipeline("pai-painter-commercial-base-zh")

# Predict
data = ["宽松T恤"]
results = default_ecommercial_pipeline(data)

# Convert base64 output back to an image
import base64
from io import BytesIO
from PIL import Image

def base64_to_image(imgbase64_str):
    return Image.open(BytesIO(base64.urlsafe_b64decode(imgbase64_str)))

for text, result in zip(data, results):
    img = base64_to_image(result['gen_imgbase64'])
    img.save(f"{text}.png")
    print(f"text: {text}, saved image: {text}.png")

Future Outlook
EasyNLP will continue to release more Chinese multimodal models and integrate state‑of‑the‑art architectures for various NLP and vision‑language tasks. The community is invited to contribute and co‑build the next generation of open‑source Chinese AI tools.
Reference
[1] Chengyu Wang et al., "EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing", arXiv.
[2] Aditya Ramesh et al., "Zero-Shot Text-to-Image Generation", ICML 2021.
[3] Ming Ding et al., "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS 2021.
[4] Han Zhang et al., "ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation", arXiv.
[5] Peng Wang et al., "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML 2022.
[6] Aditya Ramesh et al., "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv.
[7] Van den Oord et al., "Neural Discrete Representation Learning", NIPS 2017.
[8] Esser et al., "Taming Transformers for High-Resolution Image Synthesis", CVPR 2021.
[9] Chitwan Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", arXiv.
[10] Jiahui Yu et al., "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv.
[11] https://zhuanlan.zhihu.com/p/528476134
[12] http://tianchi.aliyun.com/muge
[13] https://github.com/THUDM/CogView
[14] https://github.com/lucidrains/DALLE-pytorch
[15] https://github.com/alibaba/EasyNLP
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
