GLM: General Language Model Pretraining with Autoregressive Blank Infilling
GLM introduces a unified pretraining framework built on autoregressive blank infilling with 2D positional encoding and span shuffling. With the same model size and training data, it outperforms BERT, T5, and GPT across NLU and generation tasks, including SuperGLUE, text infilling, and language modeling.
Abstract
Existing pre‑training architectures (auto‑encoding, e.g., BERT; autoregressive, e.g., GPT; encoder‑decoder, e.g., T5) each focus on a distinct class of tasks (NLU, unconditional generation, conditional generation), and none excels across all three. GLM proposes a general language model based on autoregressive blank infilling.
2D positional encoding and arbitrary‑order span prediction improve blank‑filling pre‑training.
Varying the number and length of blanks allows GLM to be pre‑trained for different downstream tasks.
Result: With the same model size and data, GLM outperforms BERT, T5, and GPT, achieving state‑of‑the‑art performance on diverse downstream tasks.
1. Introduction
Current pre‑training frameworks fall into three categories: autoregressive, auto‑encoding, and encoder‑decoder models. Autoregressive models (e.g., GPT) generate text left to right but struggle on NLU tasks because of their unidirectional attention. Auto‑encoding models (e.g., BERT) excel at NLU but cannot directly generate text. Encoder‑decoder models (e.g., T5) handle conditional generation but need more parameters to match BERT‑style models on NLU. GLM unifies all three settings through autoregressive blank infilling.
2. Algorithmic Foundations
2.1 Autoregressive Blank Filling
GLM optimises an autoregressive blank‑infilling objective. Given an input sequence x = [x₁, …, xₙ], multiple spans s₁, …, sₘ are sampled and each span is replaced by a single [MASK] token, producing a corrupted sequence x_corrupt. The model then reconstructs the spans autoregressively in a randomly permuted order (tokens within a span are generated left to right), which lets it capture dependencies between spans.
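The corruption step can be sketched as follows. This is a minimal illustration under assumed token‑level spans, not the paper's implementation; the function name and the [S]/[E] sentinel conventions are taken from the paper's input format but simplified here:

```python
import random

def corrupt_for_blank_infilling(tokens, spans, mask_token="[MASK]",
                                start_token="[S]", end_token="[E]"):
    """Build a GLM-style input: Part A (corrupted text) followed by the
    sampled spans in shuffled order. `spans` are (start, end) half-open."""
    # Part A: replace each span with a single [MASK] token.
    part_a, prev_end = [], 0
    for start, end in spans:
        part_a.extend(tokens[prev_end:start])
        part_a.append(mask_token)
        prev_end = end
    part_a.extend(tokens[prev_end:])

    # Part B: spans in random order; the model predicts each one
    # autoregressively, conditioned on Part A and previous spans.
    shuffled = spans[:]
    random.shuffle(shuffled)
    part_b, targets = [], []
    for start, end in shuffled:
        span_tokens = tokens[start:end]
        part_b.extend([start_token] + span_tokens)   # input: [S] + span
        targets.extend(span_tokens + [end_token])    # target: span + [E]
    return part_a, part_b, targets
```

For example, masking "quick brown" in "the quick brown fox" yields Part A `["the", "[MASK]", "fox"]`, Part B `["[S]", "quick", "brown"]`, and targets `["quick", "brown", "[E]"]`.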
2.2 Multi‑Task Training
GLM is trained jointly on two objectives: (1) document‑level blank filling, where a single span is sampled with length drawn uniformly from 50–100 % of the input, and (2) sentence‑level blank filling, where whole sentences are masked until they cover ~15 % of the tokens. Both share the same loss formulation as the original objective.
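The two sampling strategies can be sketched like this. It is a simplified illustration: the function names and the sentence‑boundary representation are assumptions, not the paper's code:

```python
import random

def sample_document_span(seq_len):
    """Document-level objective: one span covering 50-100% of the sequence."""
    span_len = random.randint(seq_len // 2, seq_len)
    start = random.randint(0, seq_len - span_len)
    return [(start, start + span_len)]

def sample_sentence_spans(sentence_bounds, seq_len, ratio=0.15):
    """Sentence-level objective: mask whole sentences (given as (start, end)
    token offsets) until ~15% of the tokens are covered."""
    spans, masked = [], 0
    for start, end in random.sample(sentence_bounds, len(sentence_bounds)):
        if masked >= ratio * seq_len:
            break
        spans.append((start, end))
        masked += end - start
    return sorted(spans)
```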
2.3 Model Architecture
GLM uses a single Transformer with three modifications: (1) the order of layer normalization and the residual connection is rearranged (as in Megatron‑LM), which is critical for avoiding numerical errors in large models, (2) a single linear layer for the output token prediction, and (3) GeLU activations in place of ReLU.
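The layer‑norm reordering amounts to moving from a post‑LN to a pre‑LN sublayer. A minimal NumPy sketch of the two orderings (illustrative only; `f` stands in for attention or the feed‑forward network):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (no learned scale/shift, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_sublayer(x, f):
    """Original post-LN order: LayerNorm(x + f(x))."""
    return layer_norm(x + f(x))

def pre_ln_sublayer(x, f):
    """Reordered pre-LN form: x + f(LayerNorm(x)). The residual path
    stays unnormalized, which stabilizes training of deep models."""
    return x + f(layer_norm(x))
```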
2.4 2D Positional Encoding
Each token receives two position IDs: the first marks its position in the corrupted sequence (tokens inside a masked span share the position of the corresponding [MASK] token), and the second marks its position within the span (0 for tokens of the original context). Two learnable embedding tables project the IDs to vectors that are added to the token embedding, so the model never needs to know the length of a masked span in advance.
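Constructing the two ID sequences is mechanical. A sketch under the assumptions above (the function name is illustrative; `span_lens` counts the leading [S] token as part of each span):

```python
def two_d_position_ids(part_a_len, mask_positions, span_lens):
    """Build the two position-id sequences for a GLM input laid out as
    [Part A tokens] + [span tokens in Part B]."""
    # First dimension: position in the corrupted sequence.
    pos1 = list(range(part_a_len))
    # Second dimension: 0 for Part A, 1..len for tokens inside a span.
    pos2 = [0] * part_a_len
    for mask_pos, span_len in zip(mask_positions, span_lens):
        # Every token of a span shares the position of its [MASK] token.
        pos1.extend([mask_pos] * span_len)
        pos2.extend(range(1, span_len + 1))
    return pos1, pos2
```

With a 3‑token Part A whose [MASK] sits at position 1 and one 3‑token span, this yields `pos1 = [0, 1, 2, 1, 1, 1]` and `pos2 = [0, 0, 0, 1, 2, 3]`.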
2.5 Fine‑Tuning
For NLU tasks, GLM reformulates classification as blank filling: the input is rewritten as a cloze question (e.g., appending "It was really [MASK]."), and each label is mapped to a verbalizer word (e.g., "positive" → "good"). The model predicts the word in the blank and is fine‑tuned with cross‑entropy loss. For generation tasks, a [MASK] token is appended to the end of the context and the model autoregressively generates the target text.
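At inference time the cloze reformulation reduces to comparing verbalizer words under the model. A hedged sketch: `logprob_fn` is a hypothetical stand‑in for the pretrained model's conditional log‑likelihood of a word filling the blank, and the template/verbalizer values are illustrative:

```python
def classify_by_cloze(logprob_fn, text, template, verbalizer):
    """Pick the label whose verbalizer word the model scores highest
    when filling the blank in `template` appended to `text`."""
    prompt = f"{text} {template}"
    scores = {label: logprob_fn(prompt, word)
              for label, word in verbalizer.items()}
    return max(scores, key=scores.get)
```

For sentiment, a verbalizer like `{"positive": "good", "negative": "bad"}` turns a two‑way classifier into a single blank‑filling query.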
2.6 Comparison with Other Models
Compared with BERT, XLNet, T5, and UniLM, GLM's autoregressive blank infilling captures dependencies between the tokens of a span (which BERT's independent mask predictions miss), replaces each span with a single [MASK] instead of multiple mask or sentinel tokens, and relies on 2D positional encoding so that span lengths need not be known in advance.
3. Experiments
3.1 Pre‑training Setup
GLM-Base (110M) and GLM-Large (340M) are trained on BookCorpus and English Wikipedia using the same tokenizer as BERT (30k vocabulary). Multi‑task variants (GLM-Doc, GLM-Sent) combine the original objective with the document‑level or sentence‑level objective, respectively. Larger models (GLM-410M, GLM-515M) increase depth and hidden size.
3.2 SuperGLUE Evaluation
GLM is fine‑tuned on the eight SuperGLUE tasks (ReCoRD, COPA, WSC, RTE, BoolQ, WiC, CB, MultiRC). Across most tasks GLM outperforms BERT-Base/BERT-Large, with WiC as the only exception; GLM-Large surpasses BERT-Large by roughly 5% on average.
3.3 Multi‑Task Pre‑training
GLM-Doc and GLM-Sent are evaluated on NLU, seq2seq, blank infilling, and zero‑shot language modelling. They perform slightly below GLM-Large on NLU but still beat BERT-Large and UniLM-Large. Scaling up to 410M and 515M parameters further improves results.
3.4 Ablation Studies
Ablations show that (1) span shuffling is crucial: removing it degrades SuperGLUE performance dramatically; (2) using T5‑style sentinel tokens instead of a single [MASK] harms performance; and (3) removing the second dimension of the 2D positional encoding reduces long‑text generation quality.
3.5 Related Work
Discusses prior pre‑training paradigms (auto‑encoding, autoregressive, encoder‑decoder) and recent efforts to treat NLU as generation (e.g., PET, GPT‑3 prompting). Highlights GLM’s contribution of a unified autoregressive blank‑filling objective.
4. Conclusion
GLM provides a general pre‑training framework that unifies NLU and generation via autoregressive blank infilling, 2D positional encoding, and span shuffling, achieving strong performance across a wide range of benchmarks.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.