Advances in AIGC: AliceMind Text Generation Models and Multimodal mPLUG from Alibaba DAMO Academy
This article reviews recent AIGC progress, introducing the AliceMind series of text generation models—including PALM, PLUG, and a Chinese GPT‑3—alongside the multimodal mPLUG architecture, and discusses their training strategies, performance results, and practical deployment insights.
With the rise of ChatGPT, AIGC (AI‑generated content) has attracted widespread attention, driven by larger datasets, cheaper hardware, and the pre‑training paradigm. Alibaba DAMO Academy presents recent achievements in both text and multimodal generation.
AIGC Background: The breakthrough began with OpenAI's GPT‑3, which demonstrated strong few‑shot capabilities, and was further propelled by image‑generation models such as DALL‑E and DALL‑E 2.
AliceMind Text Generation Models: The series evolved through three stages. First, early encoder‑decoder models (e.g., BART, T5) motivated PALM, a hybrid model combining auto‑encoding and auto‑regressive pre‑training. Second, large‑scale language models such as GPT‑3, M6, and Google PaLM shifted the field toward prompt‑based generation. Third, InstructGPT introduced supervised instruction data to improve directive following, with reinforcement learning further refining output quality.
The AliceMind lineup includes:
PALM – combines auto‑encoding and auto‑regressive pre‑training.
PLUG – a Chinese large‑scale model extending PALM with both NLU (StructBERT) and NLG capabilities.
Chinese GPT‑3 – a decoder‑only model trained on massive Chinese unsupervised corpora, offering fast inference (13 B parameters generate 128 tokens in ~1 s on allSpark).
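Decoder‑only generation, as in the Chinese GPT‑3 above, produces tokens one at a time by feeding the growing sequence back into the model. The following is a minimal greedy‑decoding sketch, not DAMO's implementation; `logits_fn` is a hypothetical stand‑in for a real model's forward pass, and production systems (such as AllSpark) add a KV cache so each step costs O(sequence length) rather than recomputing the full prefix.

```python
def greedy_generate(logits_fn, prompt_ids, max_new_tokens=128, eos_id=None):
    """Decoder-only autoregressive generation with greedy (argmax) decoding.

    logits_fn: maps a list of token ids to a list of next-token scores.
    Returns the prompt ids extended with up to max_new_tokens generated ids.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = logits_fn(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:  # stop early on end-of-sequence
            break
        ids.append(next_id)
    return ids
```

The per‑step argmax is what makes this "greedy"; sampling or beam search would replace only that one line.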
Experiments show PALM 2.0’s curriculum learning (mask‑LM → text‑infilling & shuffle → auto‑regressive) improves accuracy across Chinese benchmarks, outperforming SOTA models on most datasets.
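The staged curriculum above can be pictured as an objective scheduler that switches the pre‑training loss as training progresses. This is an illustrative sketch only; the stage names follow the article, but the scheduler function and the fraction of steps assigned to each stage are hypothetical.

```python
# PALM 2.0-style curriculum: train on progressively harder objectives,
# from masked LM to text infilling + shuffling to full auto-regression.
# The per-stage step fractions below are illustrative, not from the paper.
CURRICULUM = [
    ("mask_lm", 0.4),          # stage 1: masked language modeling
    ("infill_shuffle", 0.3),   # stage 2: text infilling & sentence shuffle
    ("auto_regressive", 0.3),  # stage 3: left-to-right generation
]

def objective_for_step(step, total_steps):
    """Return the name of the training objective active at a global step."""
    boundary = 0.0
    for name, fraction in CURRICULUM:
        boundary += fraction * total_steps
        if step < boundary:
            return name
    return CURRICULUM[-1][0]
```

A trainer would call `objective_for_step` each step to pick the loss function, moving the model from denoising objectives toward pure generation.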
Multimodal Unified Generation Model mPLUG: Designed for image‑plus‑text inputs, mPLUG addresses the inefficiency of long visual token sequences by using asymmetric cross‑attention that first projects visual features into the text space, then merges them via a skip‑connection network. This architecture enables unified understanding and generation, supporting tasks such as VQA, COCO captioning, and image‑text retrieval.
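The asymmetry can be sketched in a few lines: only the (short) text sequence issues attention queries over the (long) visual sequence, so the attention cost is T×V instead of the (T+V)² of concatenated self‑attention, and a skip connection preserves the original text path. This is a simplified single‑head sketch with hypothetical shapes, not the actual mPLUG module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(text, visual, w_proj):
    """text: (T, d) text tokens; visual: (V, dv) visual tokens;
    w_proj: (dv, d) projection of visual features into the text space.

    Text queries attend over projected visual keys/values, and the
    result is merged back via a skip connection."""
    v_in_text = visual @ w_proj                       # (V, d)
    scale = np.sqrt(text.shape[-1])
    attn = softmax(text @ v_in_text.T / scale)        # (T, V) weights
    fused = attn @ v_in_text                          # (T, d) visual summary
    return text + fused                               # skip connection
```

Because V (hundreds of visual patches) typically dwarfs T, restricting queries to the text side is where the claimed training‑time savings over co‑attention designs come from.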
Empirical results demonstrate that mPLUG achieves strong performance on VQA with only 14 M training images, and competitive scores on captioning and retrieval, while reducing training time compared to previous co‑attention designs.
Practical Deployment: All models (PALM 2.0, Chinese GPT‑3, PLUG, mPLUG) are released on ModelScope with model cards and checkpoints. Users can fine‑tune via provided pipelines, configure hyper‑parameters, and even run inference on free online notebooks. Training resource estimates range from 4‑5 days on 8 × A100 for base/large models to weeks on 32 × A100 for 1.3 B/2.7 B models.
Q&A Highlights: The Chinese GPT‑3 incorporates code data from Common Crawl and additional curated prompts; PLUG's 20 B‑parameter version uses a mixture‑of‑experts (MoE) design rather than dense scaling; both Chinese GPT‑3 and ChatGPT share similar architectures but differ in instruction‑tuned data; training acceleration leverages NVIDIA's Megatron optimizations.
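The MoE point is worth unpacking: instead of making every layer denser, an MoE layer routes each token to a small subset of expert sub‑networks, so parameter count grows without a proportional compute increase. The sketch below is a generic top‑k router for illustration, not PLUG's actual design; all names and shapes are hypothetical.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=1):
    """x: (d,) one token's representation; expert_weights: list of (d, d)
    expert matrices; router_weights: (num_experts, d) gating matrix.

    Routes the token to its top_k experts and mixes their outputs by
    normalized gate scores; unselected experts do no work."""
    logits = router_weights @ x                 # (num_experts,) gate logits
    top = np.argsort(logits)[-top_k:]           # indices of chosen experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                        # renormalize over chosen experts
    out = np.zeros_like(x)
    for g, i in zip(gates, top):
        out += g * (expert_weights[i] @ x)      # only selected experts compute
    return out
```

With `top_k=1`, a 20 B‑parameter MoE model activates only one expert's weights per token per layer, which is how total parameters can scale far beyond the per‑token compute budget.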
The session concludes with thanks to the audience and information on how to access live demos, recordings, and further resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.