OBERT: A Billion‑Parameter Pretrained Language Model for Large‑Scale NLP Applications
The OPPO XiaoBu team introduced OBERT, a series of 100M-, 300M-, and 1B-parameter pretrained language models that leverage terabyte-scale corpora, multi-granular masking, retrieval-augmented training, and distributed acceleration to achieve state-of-the-art results on the CLUE and KgCLUE benchmarks while enabling efficient industrial deployment.
Large-scale pretrained models have reshaped natural language processing, and since 2020 the OPPO XiaoBu Assistant team has been developing its own model series, OBERT, at 100M, 300M, and 1B parameters to meet industrial scalability requirements.
The latest 1B-parameter OBERT was trained with five mask mechanisms on terabytes of data, delivering more than a 4% boost in business metrics, ranking 5th on the CLUE 1.1 leaderboard and 1st on the KgCLUE 1.0 knowledge-graph QA list, and matching 10B-parameter models with roughly a tenth of their parameters.
Background: the “pretrain‑plus‑fine‑tune” paradigm has already been applied to intent classification, multi‑turn dialogue, text matching and other XiaoBu scenarios; scaling to a 1 B model was deemed necessary to further improve performance.
Development pipeline: the team follows a four-stage process of pretraining, further pretraining, fine-tuning, and deployment, characterized by decoupled representations that serve both understanding and generation tasks, retrieval-augmented learning, multi-stage curriculum training, and model sizes focused on the 100M/300M/1B range.
Pretraining data: 1.6 TB of cleaned corpus covering encyclopedic articles, community Q&A, and news was collected; a dedicated preprocessing workflow (see Fig. 4) was applied.
Pretraining task: following prior work, the masked language modeling (MLM) objective was chosen for its superior performance on downstream NLU tasks.
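The MLM objective corrupts the input by hiding tokens and trains the model to recover them. A minimal sketch of BERT-style corruption (the `mlm_mask` helper, mask rate, and vocabulary size here are illustrative assumptions, not the team's actual implementation):

```python
import random

MASK_TOKEN = "[MASK]"
VOCAB_SIZE = 21128  # illustrative; roughly the size of a Chinese BERT vocab

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: each token is selected with
    probability `mask_prob`; of the selected positions, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Returns the
    corrupted sequence and per-position labels (the original token
    at selected positions, None elsewhere)."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK_TOKEN
            elif r < 0.9:
                out[i] = rng.randrange(VOCAB_SIZE)  # random replacement
            # else: keep the original token (but still predict it)
    return out, labels
```

Predicting the original token only at selected positions is what makes MLM a strong objective for downstream NLU: the encoder sees (mostly) intact context on both sides of each blank.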
Mask strategies: (1) coarse‑grained masks (full‑word, entity, keyword, phrase, short‑sentence) to learn representations at multiple granularities; (2) knowledge‑enhanced masks that inject encyclopedia triples (Entity, Description, Content) as external context, similar to REALM, memory‑augmented LM and ERNIE 3.0 approaches.
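Coarse-grained masking differs from token-level MLM in that whole annotated spans are hidden together, forcing the model to predict larger semantic units. A minimal sketch, assuming span boundaries (words, entities, keywords, phrases) are supplied by an upstream annotator; the `span_mask` helper and its budget parameter are illustrative:

```python
import random

def span_mask(tokens, spans, budget=0.15, mask_token="[MASK]", seed=0):
    """Coarse-grained masking sketch: mask whole annotated spans,
    given as half-open [start, end) index pairs, in random order
    until roughly `budget` of all tokens are covered."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    order = list(spans)
    rng.shuffle(order)  # randomize which spans get masked first
    covered, limit = 0, int(budget * len(tokens))
    for start, end in order:
        if covered >= limit:
            break
        for i in range(start, end):
            labels[i] = tokens[i]
            out[i] = mask_token
        covered += end - start
    return out, labels
```

The knowledge-enhanced variant follows the same mechanics but appends retrieved triple text as extra context, so the model can only fill the masked entity by attending to the injected knowledge.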
Effectiveness: experiments on the 100 M model showed significant zero‑shot gains over open‑source baselines, and the same strategies were successfully transferred to the 1 B model.
Training acceleration: the team tackled GPU memory limits and compute efficiency by employing mixed parallelism (data + model + ZeRO optimizer), topology‑aware communication optimizations, and gradient accumulation, achieving over a 29 % increase in throughput compared with a baseline.
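Of these techniques, gradient accumulation is the simplest to illustrate: gradients from several micro-batches are averaged before a single optimizer step, so the effective batch size grows without holding more activations in GPU memory at once. A scalar-SGD sketch (the function name and toy `grad_fn` are illustrative, not the team's training code):

```python
def train_with_accumulation(batches, grad_fn, w=0.0, lr=0.1, accum_steps=4):
    """Gradient-accumulation sketch on a scalar weight: average the
    gradients of `accum_steps` micro-batches, then take a single
    SGD step, matching what one large batch would have done."""
    acc, n = 0.0, 0
    for batch in batches:
        acc += grad_fn(w, batch) / accum_steps  # scale each micro-batch grad
        n += 1
        if n == accum_steps:
            w -= lr * acc  # one optimizer step per accum_steps micro-batches
            acc, n = 0.0, 0
    return w
```

In a real distributed setup this composes with data/model parallelism and ZeRO, since the accumulated gradient is what gets all-reduced and sharded.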
Fine‑tuning strategies: for CLUE 1.1 tasks, a fine‑tuning framework incorporating adversarial training (FGM, PGD), regularized dropout (R‑Drop), noise injection (Noise‑tune), multi‑sample dropout, and lexical auxiliary tasks was built; for KgCLUE 1.0, the voice‑assistant knowledge‑QA pipeline was reused, combining NER + similarity and generation methods, attaining first place with a single model.
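FGM, one of the adversarial methods listed above, perturbs the input embeddings by a fixed step along the L2-normalized gradient of the loss. A minimal sketch of the perturbation step on plain lists (the `fgm_perturb` name and toy vectors are illustrative; in practice the gradient comes from backpropagation, and the embedding is restored after the adversarial backward pass):

```python
import math

def fgm_perturb(embedding, grad, epsilon=1.0):
    """FGM step: move the embedding by `epsilon` along grad/||grad||.
    Real training loop: backward on the clean loss, perturb the
    embedding, backward again on the adversarial loss (accumulating
    gradients), restore the embedding, then take the optimizer step."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)  # no gradient signal, nothing to perturb
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]
```

Because the perturbation has a fixed L2 norm of `epsilon`, it attacks the loss surface without drifting arbitrarily far from the clean input.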
Deployment scheme: a unified representation approach was designed for multiple NLU services, consisting of (1) pretraining, (2) multi-task fine-tuning in which the bottom N transformer layers are frozen and shared while only the top M task-specific layers are updated, and (3) multi-task merging at inference, where the shared bottom layers are computed once for all tasks, reducing overall compute to roughly 27% of running a fully fine-tuned model per task.
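The compute saving comes from running the shared frozen bottom once and only the small task-specific tops per task. A back-of-the-envelope sketch; the 12-layer / 2-task-layer / 8-task figures below are hypothetical, chosen only to show how such a scheme can land near a ~27% figure, and are not the team's actual configuration:

```python
def relative_cost(total_layers, task_layers, num_tasks):
    """Inference cost of the merged multi-task model relative to
    running a separately fine-tuned full model per task: the frozen
    shared bottom runs once, only the task-specific top layers run
    once per task."""
    shared = total_layers - task_layers
    merged = shared + num_tasks * task_layers      # one shared pass + K tops
    separate = num_tasks * total_layers            # K full forward passes
    return merged / separate
```

For example, `relative_cost(12, 2, 8)` gives 26/96, about 0.27 — the more tasks share the frozen bottom, the larger the saving.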
Future directions: the team plans to further explore retrieval‑augmented pretraining for short texts, construct unsupervised tasks from user feedback, and investigate model lightweighting techniques to accelerate large‑model deployment.
Team introduction: OPPO XiaoBu Assistant team focuses on AI‑driven user experiences across smartphones and IoT devices, covering speech recognition, semantic understanding, dialogue generation, knowledge QA, recommendation, digital humans and multimodal interaction; the OPPO ML platform provides end‑to‑end AI development services for a wide range of applications.
References: [1] Cloze‑driven Pretraining of Self‑attention Networks; [2] Pre‑trained Models for Natural Language Processing: A Survey; [3] YUAN 1.0; [4] REALM; [5] Training Language Models with Memory Augmentation; [6] ERNIE 3.0; [7] Curriculum Learning for Billion‑Scale GPT; [8] ZeRO; [9] FGM; [10] Adversarially Robust Deep Models; [11] R‑Drop; [12] NoisyTune; [13] Multi‑Sample Dropout; [14] Mengzi.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.