Advances in AIGC: AliceMind Text Generation Models and Multimodal mPLUG from Alibaba DAMO Academy
This article reviews recent AIGC progress, introducing the AliceMind series of text generation models—including PALM, PLUG, and a Chinese GPT‑3—alongside the multimodal mPLUG architecture, and discusses their training strategies, performance results, and practical deployment insights.
With the rise of ChatGPT, AIGC (AI‑generated content) has attracted widespread attention, driven by larger datasets, cheaper hardware, and the pre‑training paradigm. Alibaba DAMO Academy presents recent achievements in both text and multimodal generation.
AIGC Background: The breakthrough began with OpenAI's GPT‑3, which demonstrated strong few‑shot capabilities, and was further propelled by image‑generation models such as DALL‑E and DALL‑E 2.
AliceMind Text Generation Models: The series evolved through three stages. First, early encoder‑decoder models (e.g., BART, T5) motivated PALM, a hybrid model combining auto‑encoding and auto‑regressive pre‑training. Second, large‑scale language models such as GPT‑3, M6, and Google PaLM shifted the field toward prompt‑based generation. Third, InstructGPT introduced supervised instruction data to improve directive following, with reinforcement learning further refining output quality.
The AliceMind lineup includes:
PALM – combines auto‑encoding and auto‑regressive pre‑training.
PLUG – a Chinese large‑scale model extending PALM with both NLU (StructBERT) and NLG capabilities.
Chinese GPT‑3 – a decoder‑only model trained on massive Chinese unsupervised corpora, offering fast inference (13 B parameters generate 128 tokens in ~1 s on allSpark).
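Decoder‑only generation, as in the Chinese GPT‑3 above, produces tokens one at a time by feeding the growing sequence back into the model. The following is a minimal greedy‑decoding sketch, not DAMO's implementation; `logits_fn` is a hypothetical stand‑in for a real model's forward pass, and production systems (such as AllSpark) add a KV cache so each step costs O(sequence length) rather than recomputing the full prefix.

```python
def greedy_generate(logits_fn, prompt_ids, max_new_tokens=128, eos_id=None):
    """Decoder-only autoregressive generation with greedy (argmax) decoding.

    logits_fn: maps a list of token ids to a list of next-token scores.
    Returns the prompt ids extended with up to max_new_tokens generated ids.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = logits_fn(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:  # stop early on end-of-sequence
            break
        ids.append(next_id)
    return ids
```

The per‑step argmax is what makes this "greedy"; sampling or beam search would replace only that one line.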
Experiments show PALM 2.0’s curriculum learning (mask‑LM → text‑infilling & shuffle → auto‑regressive) improves accuracy across Chinese benchmarks, outperforming SOTA models on most datasets.
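The staged curriculum above can be pictured as an objective scheduler that switches the pre‑training loss as training progresses. This is an illustrative sketch only; the stage names follow the article, but the scheduler function and the fraction of steps assigned to each stage are hypothetical.

```python
# PALM 2.0-style curriculum: train on progressively harder objectives,
# from masked LM to text infilling + shuffling to full auto-regression.
# The per-stage step fractions below are illustrative, not from the paper.
CURRICULUM = [
    ("mask_lm", 0.4),          # stage 1: masked language modeling
    ("infill_shuffle", 0.3),   # stage 2: text infilling & sentence shuffle
    ("auto_regressive", 0.3),  # stage 3: left-to-right generation
]

def objective_for_step(step, total_steps):
    """Return the name of the training objective active at a global step."""
    boundary = 0.0
    for name, fraction in CURRICULUM:
        boundary += fraction * total_steps
        if step < boundary:
            return name
    return CURRICULUM[-1][0]
```

A trainer would call `objective_for_step` each step to pick the loss function, moving the model from denoising objectives toward pure generation.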
Multimodal Unified Generation Model mPLUG: Designed for image‑plus‑text inputs, mPLUG addresses the inefficiency of long visual token sequences by using asymmetric cross‑attention that first projects visual features into the text space, then merges them via a skip‑connection network. This architecture enables unified understanding and generation, supporting tasks such as VQA, COCO captioning, and image‑text retrieval.
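The asymmetry can be sketched in a few lines: only the (short) text sequence issues attention queries over the (long) visual sequence, so the attention cost is T×V instead of the (T+V)² of concatenated self‑attention, and a skip connection preserves the original text path. This is a simplified single‑head sketch with hypothetical shapes, not the actual mPLUG module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(text, visual, w_proj):
    """text: (T, d) text tokens; visual: (V, dv) visual tokens;
    w_proj: (dv, d) projection of visual features into the text space.

    Text queries attend over projected visual keys/values, and the
    result is merged back via a skip connection."""
    v_in_text = visual @ w_proj                       # (V, d)
    scale = np.sqrt(text.shape[-1])
    attn = softmax(text @ v_in_text.T / scale)        # (T, V) weights
    fused = attn @ v_in_text                          # (T, d) visual summary
    return text + fused                               # skip connection
```

Because V (hundreds of visual patches) typically dwarfs T, restricting queries to the text side is where the claimed training‑time savings over co‑attention designs come from.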
Empirical results demonstrate that mPLUG achieves strong performance on VQA with only 14 M training images, and competitive scores on captioning and retrieval, while reducing training time compared to previous co‑attention designs.
Practical Deployment: All models (PALM 2.0, Chinese GPT‑3, PLUG, mPLUG) are released on ModelScope with model cards and checkpoints. Users can fine‑tune via provided pipelines, configure hyper‑parameters, and even run inference on free online notebooks. Training resource estimates range from 4‑5 days on 8 × A100 for base/large models to weeks on 32 × A100 for 1.3 B/2.7 B models.
Q&A Highlights: The Chinese GPT‑3 incorporates code data from Common Crawl and additional curated prompts; PLUG's 20 B‑parameter version uses a mixture‑of‑experts (MoE) design rather than dense scaling; both Chinese GPT‑3 and ChatGPT share similar architectures but differ in instruction‑tuned data; training acceleration leverages NVIDIA's Megatron optimizations.
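The MoE point is worth unpacking: instead of making every layer denser, an MoE layer routes each token to a small subset of expert sub‑networks, so parameter count grows without a proportional compute increase. The sketch below is a generic top‑k router for illustration, not PLUG's actual design; all names and shapes are hypothetical.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=1):
    """x: (d,) one token's representation; expert_weights: list of (d, d)
    expert matrices; router_weights: (num_experts, d) gating matrix.

    Routes the token to its top_k experts and mixes their outputs by
    normalized gate scores; unselected experts do no work."""
    logits = router_weights @ x                 # (num_experts,) gate logits
    top = np.argsort(logits)[-top_k:]           # indices of chosen experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                        # renormalize over chosen experts
    out = np.zeros_like(x)
    for g, i in zip(gates, top):
        out += g * (expert_weights[i] @ x)      # only selected experts compute
    return out
```

With `top_k=1`, a 20 B‑parameter MoE model activates only one expert's weights per token per layer, which is how total parameters can scale far beyond the per‑token compute budget.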
The session concludes with thanks to the audience and information on how to access live demos, recordings, and further resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.