Mengzi Lightweight Model Technology System and Advances in Small‑Scale and Retrieval‑Augmented Pretraining
This presentation introduces the Mengzi lightweight model technology stack, covering large‑scale pre‑training, motivations for lightweight models, detailed techniques such as knowledge and sequence‑relation enhancement, training optimization, model compression, retrieval‑augmented pre‑training, multimodal extensions, open‑source releases, and real‑world applications.
The talk, delivered by Wang Yulong of Langboat Technology, outlines a four‑part agenda: large‑scale pre‑training models, reasons for training lightweight models, the Mengzi lightweight technology system, and the open‑source release of the Mengzi lightweight pre‑training models.
Large‑scale pre‑training has become a core NLP technique since the introduction of Transformers in 2017, with models like BERT, GPT, and T5 leveraging self‑supervised learning to learn language representations from massive unlabeled corpora and then fine‑tune on downstream tasks.
Lightweight models are pursued because scaling model parameters incurs prohibitive training and deployment costs, and hardware progress cannot keep up with model growth; thus, model size reduction is essential for latency‑sensitive online services.
The Mengzi lightweight system consists of three main modules: (1) Knowledge enhancement (e.g., linguistic signals such as POS tags and NER to guide the model), (2) Sequence‑relation enhancement (e.g., SOP task from ALBERT), and (3) Training optimization (e.g., denoising‑based mask construction, loss weighting based on mask damage and semantic distance).
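Of these modules, the sequence‑relation objective is the most self‑contained to illustrate. The sketch below shows how an ALBERT‑style SOP (sentence order prediction) training pair can be constructed: unlike BERT's NSP, both sentences always come from the same document, and the negative class is simply the swapped order. The function and its wording are illustrative, not Mengzi's actual data pipeline.

```python
import random

def make_sop_example(sent_a, sent_b, rng=None):
    """Build one SOP training pair from two consecutive sentences
    of the same document. Returns ((first, second), label) where
    label 1 means the original order and 0 means swapped.
    Sketch of the ALBERT-style SOP objective; names are invented."""
    rng = rng or random.Random(0)
    if rng.random() < 0.5:
        return (sent_a, sent_b), 1   # kept in original order
    return (sent_b, sent_a), 0       # swapped -> negative class

pair, label = make_sop_example("The cat sat.", "Then it slept.")
```

Because both classes draw from the same document, the model cannot solve the task by topic cues alone (as it often can with NSP) and is pushed to learn inter‑sentence coherence.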
Model compression techniques include distillation (using MiniLM‑style attention KL‑divergence), structured pruning (head pruning inspired by Early‑Bird and ROSITA), and quantization, forming an end‑to‑end pipeline that significantly reduces parameter count while preserving performance.
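The MiniLM‑style distillation term mentioned above can be sketched numerically: the student is trained to match the teacher's self‑attention distributions via KL divergence, averaged over heads and query positions. This is a minimal NumPy illustration of that loss, not Mengzi's training code; shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_kl_loss(teacher_scores, student_scores):
    """MiniLM-style distillation term: KL(teacher || student) between
    self-attention distributions, averaged over heads and query
    positions. Both inputs are raw attention scores of shape
    (heads, seq_len, seq_len)."""
    p = softmax(teacher_scores)   # teacher attention distribution
    q = softmax(student_scores)   # student attention distribution
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)  # per query position
    return kl.mean()

rng = np.random.default_rng(0)
t = rng.normal(size=(2, 4, 4))       # toy teacher scores: 2 heads, seq 4
loss = attention_kl_loss(t, t)       # identical distributions -> ~0
```

Matching attention distributions rather than hidden states lets the student use a different hidden size than the teacher, which is what makes this style of distillation attractive for aggressive size reduction.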
Retrieval‑augmented pre‑training unifies various approaches (REALM, RAG, RETRO, KNN‑LM) by integrating external knowledge bases via a retrieval service that supplies context during inference, enabling smaller models to achieve performance comparable to much larger ones.
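One simple way these approaches inject retrieved knowledge is the kNN‑LM‑style interpolation: the base model's next‑token distribution is mixed with a distribution induced by retrieved neighbours. The sketch below shows that mixing step only; the weight 0.25 is an illustrative value, not a tuned hyperparameter from the talk.

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """Mix the base LM's next-token distribution with a retrieval-
    induced one, as in kNN-LM. lam is the weight on the retrieval
    side; both inputs are probability vectors over the vocabulary."""
    p_lm = np.asarray(p_lm, dtype=float)
    p_knn = np.asarray(p_knn, dtype=float)
    return lam * p_knn + (1.0 - lam) * p_lm

# Retrieval boosts token 1, which the base LM considered unlikely.
mixed = knn_lm_interpolate([0.7, 0.2, 0.1], [0.1, 0.8, 0.1])
```

Because the convex combination of two probability vectors is itself a probability vector, the mixed output needs no renormalization, and the retrieval index can be updated without retraining the model.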
Multi‑task learning is achieved by converting diverse downstream tasks into a unified text‑to‑text format using prompts, allowing a single T5‑based model (Mengzi‑T5‑base‑MT) trained on 27 datasets (≈300 prompts) to handle tasks such as sentiment classification, entity extraction, and financial relation extraction.
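The conversion to a unified text‑to‑text format can be sketched as a template lookup: each task maps a labelled example to a (source, target) string pair that a single T5 model consumes. The prompt wordings below are invented for illustration; the roughly 300 actual Mengzi‑T5‑base‑MT prompts are not reproduced in this summary.

```python
def to_text_to_text(task, example):
    """Render a labelled example as a (source, target) text pair.
    Hypothetical templates standing in for the real prompt set."""
    templates = {
        "sentiment": ("Sentiment of: {text} Options: positive, negative.",
                      "{label}"),
        "entity":    ("List the entities in: {text}",
                      "{label}"),
    }
    src_tpl, tgt_tpl = templates[task]
    return src_tpl.format(**example), tgt_tpl.format(**example)

src, tgt = to_text_to_text("sentiment",
                           {"text": "Great phone.", "label": "positive"})
```

Once every task is expressed this way, adding a new downstream task amounts to writing new templates rather than adding a new task‑specific head, which is what allows one checkpoint to serve classification, extraction, and generation alike.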
Multimodal experiments extend the framework to Stable Diffusion‑based image generation, including Chinese‑style painting transfer (Guohua Diffusion) and prompt‑expansion models that simplify user input.
All models (various sizes and task‑specific variants) have been open‑sourced on HuggingFace and ModelScope, with SDKs supporting eight downstream tasks; the Mengzi‑T5‑base‑MT achieved top ranking on the ZeroCLUE leaderboard despite having only 0.2 B parameters.
Real‑world deployments span machine translation, text generation, search engines, and financial information extraction, demonstrating reduced R&D costs, easier domain transfer, and competitive performance in production settings.
The session concludes with a Q&A covering topics such as POS‑guided pre‑training, pseudo‑labeling, knowledge‑graph integration, and future plans for dialogue tasks.
DataFunSummit