
QQGC: A Two-Stage Text-to-Image Model with Prior and Decoder Architectures for Efficient AI Painting

QQGC, Tencent's two-stage text-to-image model, separates a CLIP-based Prior mapping stage from a Stable Diffusion Decoder, and combines T5-enhanced text embeddings with a suite of efficiency tricks (FP16, flash attention, ZeRO, and GPU RDMA) to train models with over 2 billion parameters on 64 GPUs. It achieves FID and CLIP scores competitive with the state of the art, supports image variation, semantic img2img, precise CLIP-vector edits, and unsafe-content filtering, and now powers the company's Magic Painting Room.

Tencent Cloud Developer

The article discusses the rapid growth of AIGC (AI‑Generated Content) driven by breakthroughs such as DALL·E 2 and Stable Diffusion, and introduces QQGC, a self‑researched text‑to‑image model developed by Tencent’s QQ Imaging Center.

QQGC adopts a two‑stage architecture: a Prior model that maps CLIP text embeddings (enhanced with a T5‑style language model) to CLIP image embeddings, and a Decoder model that generates images from those image embeddings by reusing the Stable Diffusion pipeline. This decoupling reduces training difficulty and improves generation quality.
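The two-stage flow can be sketched in a few lines. This is a minimal illustration of the data flow only, assuming 512-dimensional CLIP embeddings; the function bodies, shapes, and names are hypothetical stand-ins, not Tencent's actual implementation.

```python
import numpy as np

EMB_DIM = 512  # assumed CLIP embedding width for this sketch

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Stage 1: map a CLIP text embedding to a CLIP image embedding.
    Stood in for here by a fixed random linear map (the real Prior is a
    learned network conditioned on T5-enhanced text features)."""
    rng = np.random.default_rng(0)  # frozen "weights" for the sketch
    W = rng.standard_normal((EMB_DIM, EMB_DIM)) / np.sqrt(EMB_DIM)
    img_emb = W @ text_emb
    return img_emb / np.linalg.norm(img_emb)  # CLIP embeddings are unit-norm

def decoder(img_emb: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stage 2: generate pixels conditioned on the image embedding.
    A real decoder runs Stable-Diffusion-style iterative denoising; here
    we just iterate a deterministic update to show the conditioning flow."""
    x = np.zeros((8, 8, 3))  # tiny stand-in canvas
    for _ in range(steps):
        x = 0.5 * x + 0.5 * img_emb[:192].reshape(8, 8, 3)
    return x

text_emb = np.ones(EMB_DIM) / np.sqrt(EMB_DIM)  # unit-norm dummy text embedding
image = decoder(prior(text_emb))
print(image.shape)  # (8, 8, 3)
```

The decoupling means each stage can be trained (and debugged) against a well-defined intermediate target, the CLIP image embedding, rather than learning the full text-to-pixels mapping end to end.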

To train the model efficiently under limited resources, the authors employ several acceleration techniques: a tar-packaged dataloader for data efficiency, FP16 half-precision training, activation checkpointing, the ZeRO optimizer, flash attention (an 8× larger single-card batch size and 4× faster training), GPU-RDMA inter-node communication, gradient accumulation, and optimizer-level tuning, enabling training of models with more than 2 billion parameters on a 64-GPU cluster.
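Gradient accumulation, one of the tricks listed above, deserves a concrete illustration: summing per-micro-batch gradients before each optimizer step reproduces the gradient of a large effective batch on limited GPU memory. The toy linear-regression loss below is illustrative only, not QQGC's training code.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((32, 4))  # full batch of 32 samples
y = rng.standard_normal(32)
w = np.zeros(4)

def grad(w, Xb, yb):
    """Mean-squared-error gradient for the micro-batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient computed in one shot.
g_full = grad(w, X, y)

# The same gradient accumulated over 4 micro-batches of 8 samples.
accum = np.zeros(4)
for i in range(0, 32, 8):
    # Weight each micro-batch gradient by its share of the full batch.
    accum += grad(w, X[i:i + 8], y[i:i + 8]) * (8 / 32)

print(np.allclose(g_full, accum))  # True
```

The same equivalence is what lets a 64-GPU cluster emulate a much larger batch than fits in memory at once; FP16 and activation checkpointing then shrink the per-micro-batch footprint further.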

Experimental evaluation on COCO-30k shows that QQGC achieves FID and CLIP scores comparable to state-of-the-art models. The model supports image variation, img2img with semantic fusion (combining prompt and reference-image embeddings), and CLIP-vector editing for precise operations such as watermark removal or insertion, while also filtering unsafe inputs for content control.
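Because the Decoder is conditioned on a CLIP image embedding, these capabilities reduce to vector arithmetic in embedding space. The sketch below shows the idea under stated assumptions: the blend weight, the edit strength, and the "watermark direction" are hypothetical, since the article does not publish QQGC's actual edit vectors.

```python
import numpy as np

def normalize(v):
    """Project back onto the unit sphere where CLIP embeddings live."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
prompt_emb = normalize(rng.standard_normal(512))     # from the text Prior
ref_img_emb = normalize(rng.standard_normal(512))    # CLIP embedding of a reference image
watermark_dir = normalize(rng.standard_normal(512))  # hypothetical learned attribute direction

# Semantic img2img: fuse prompt and reference embeddings, then renormalize.
alpha = 0.6  # illustrative blend weight
fused = normalize(alpha * prompt_emb + (1 - alpha) * ref_img_emb)

# CLIP-vector editing: remove (or insert) an attribute by subtracting
# (or adding) its direction, scaled by a strength factor.
edited = normalize(ref_img_emb - 0.3 * watermark_dir)

print(round(float(np.linalg.norm(fused)), 6))   # 1.0
print(round(float(np.linalg.norm(edited)), 6))  # 1.0
```

Feeding `fused` or `edited` to the Decoder in place of the Prior's output yields the variation, fusion, and editing behaviors described above.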

QQGC has been deployed as the foundation model for Tencent’s “Magic Painting Room” feature. Future work aims to strengthen identity preservation and style control in generated images, inviting community feedback.

Tags: diffusion model, text-to-image, AI painting, CLIP embedding, prior-decoder architecture, training acceleration
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
