How GLM’s Autoregressive Blank‑Filling Beats BERT, T5, and GPT
GLM is a general language model that combines autoregressive blank-filling with 2D positional encoding and shuffled-span prediction. With comparable model size and training data, it outperforms BERT, T5, and GPT across NLU, conditional generation, and unconditional generation, as demonstrated on SuperGLUE and other benchmarks.
Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., & Tang, J. (2022). GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Long Papers), 320‑335.
Abstract
Existing pre‑training architectures—auto‑encoding (e.g., BERT), autoregressive (e.g., GPT) and encoder‑decoder (e.g., T5)—are each specialized for NLU, unconditional generation, or conditional generation and none dominates all three.
The paper proposes GLM, a General Language Model pre-trained with an autoregressive blank-filling objective.
2D positional encoding and span‑shuffle prediction improve blank‑filling pre‑training.
Varying the number and length of blanks allows GLM to be pre‑trained for different task types.
With comparable model size and data, GLM outperforms BERT, T5 and GPT, achieving state‑of‑the‑art results on a range of downstream tasks.
Remarks
Natural Language Understanding: text classification, tokenization, syntactic parsing, information extraction, etc.
Conditional Generation: generate new text given context or templates (e.g., translation, QA).
Unconditional Generation: generate free-form text with no conditioning input, i.e., open-ended language modeling of the training corpus distribution.
Autoregressive models cannot see future tokens during training.
1. Introduction
Existing pre‑training frameworks can be divided into three categories: autoregressive models, auto‑encoding models, and encoder‑decoder models.
Autoregressive models such as GPT learn a left‑to‑right language model; they excel at long‑text generation but their single‑direction attention limits NLU performance.
Auto‑encoding models such as BERT use a denoising objective (MLM) to learn bidirectional contextual encoders, which are effective for NLU but not directly applicable to text generation.
Encoder‑decoder models such as T5 combine bidirectional encoding with unidirectional decoding and are typically used for conditional generation tasks.
None of these frameworks is flexible enough to achieve competitive results on all NLP tasks. To address this, we propose GLM, a unified pre‑training framework based on autoregressive blank‑filling.
2. Algorithmic Foundations
2.1 Autoregressive Blank‑Filling
GLM is trained by optimizing an autoregressive blank‑filling objective. Given an input sequence x = [x1,…,xn], a set of spans s1,…,sm is sampled; each span si is replaced by a single [MASK] token, producing a corrupted sequence x_corrupt. The model predicts the missing tokens of each span in an autoregressive manner, allowing it to attend to previously predicted spans. Span order is randomly shuffled to capture inter‑span dependencies, similar to the permutation language model.
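The corruption step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: token strings, the `[START]` marker, and the `(start, length)` span format are assumptions for clarity.

```python
import random

MASK = "[MASK]"

def blank_fill_example(tokens, spans):
    """Build one autoregressive blank-filling example (sketch).

    tokens: list of input tokens, e.g. ["x1", ..., "xn"]
    spans:  non-overlapping (start, length) pairs to mask
    Returns Part A (the corrupted sequence) and Part B (shuffled target spans).
    """
    spans = sorted(spans)
    part_a, targets, cursor = [], [], 0
    for start, length in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append(MASK)                     # each span -> one [MASK] token
        targets.append(tokens[start:start + length])
        cursor = start + length
    part_a.extend(tokens[cursor:])
    random.shuffle(targets)                     # random span order, as in the paper
    # each target span is prefixed with a start marker and predicted left-to-right
    part_b = [["[START]"] + t for t in targets]
    return part_a, part_b
```

Note that each masked span collapses to a single `[MASK]` in Part A, so the model cannot infer span lengths from the corrupted input.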
Formally, let Zm be the set of all permutations of {1,…,m}. For a permutation z∈Zm, the training objective factorizes the probability of each span si according to the shuffled order.
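Written out, this is the objective the summary later calls Equation 1:

```latex
\max_\theta \; \mathbb{E}_{z \sim Z_m}
\left[ \sum_{i=1}^{m} \log p_\theta\!\left(s_{z_i} \mid x_{\text{corrupt}},\, s_{z_{<i}}\right) \right]
```

where each span is itself generated left-to-right:

```latex
p_\theta\!\left(s_i \mid x_{\text{corrupt}},\, s_{z_{<i}}\right)
= \prod_{j=1}^{l_i} p_\theta\!\left(s_{i,j} \mid x_{\text{corrupt}},\, s_{z_{<i}},\, s_{i,<j}\right)
```

Here $l_i$ is the length of span $s_i$ and $s_{z_{<i}}$ denotes the spans already generated under permutation $z$.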
2.2 Multi‑Task Training
GLM masks short spans for NLU tasks, but we also train a single model that can handle both NLU and text generation. We introduce a multi‑task pre‑training setup that jointly optimizes the blank‑filling objective with a second objective that generates longer text. Two variants are considered:
Document‑level: a single span is masked, with its length sampled uniformly from 50%–100% of the original sequence length; this targets long unconditional generation.
Sentence‑level: mask spans correspond to whole sentences, covering 15% of the original tokens, targeting seq2seq tasks.
The two new objectives share the same loss formulation as Equation 1, differing only in span number and length.
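The three span-sampling regimes can be sketched as follows. The Poisson(λ=3) span lengths and 15% mask budget for the base objective come from the paper; the function interface and fixed seed are illustrative, and the sentence-level variant would additionally snap spans to sentence boundaries (omitted here).

```python
import numpy as np

def sample_span_lengths(n_tokens, mode="nlu", rng=None):
    """Sample mask-span lengths for GLM-style pre-training objectives (sketch)."""
    rng = rng or np.random.default_rng(0)
    if mode == "document":
        # one long span covering 50%-100% of the input
        return [int(rng.integers(n_tokens // 2, n_tokens + 1))]
    budget = max(1, int(0.15 * n_tokens))       # mask ~15% of tokens
    lengths, total = [], 0
    while total < budget:
        l = max(1, int(rng.poisson(3)))          # Poisson(lambda=3) span length
        l = min(l, budget - total)               # clip the last span to the budget
        lengths.append(l)
        total += l
    return lengths
```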
2.3 Model Architecture
GLM uses a single Transformer with several modifications: (1) the order of layer normalization and residual connections is rearranged to mitigate numerical errors in large models; (2) a single linear layer predicts output tokens; (3) GeLU replaces ReLU as the activation function.
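The rearranged normalization/residual ordering amounts to a pre-LN block: normalize first, apply the sublayer, then add the residual. A minimal numpy sketch of that ordering, with a tanh-approximate GeLU (the exact shapes and sublayers are illustrative, not the paper's code):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (replaces ReLU in GLM)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    # pre-LN ordering: normalize -> sublayer -> residual add
    # (vs. post-LN: sublayer -> residual add -> normalize)
    return x + sublayer(layer_norm(x))
```

In a full block, `sublayer` would be self-attention and then a GeLU feed-forward network; stacking many such blocks is where the pre-LN ordering helps numerical stability.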
2.4 2D Positional Encoding
To encode positional information for the autoregressive blank‑filling task, each token receives two position IDs. The first ID denotes the token’s position in the corrupted sequence x_corrupt (or the position of the [MASK] token for a masked span). The second ID indicates the token’s position within its span (0 for tokens in part A, 1…L for tokens in part B). Both IDs are projected via learnable embedding tables and added to the token embeddings. This design prevents the model from knowing the length of a masked span during reconstruction, unlike XLNet or SpanBERT.
Our encoding is well‑suited for downstream tasks where the length of generated text is unknown.
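Constructing the two position-ID sequences is mechanical. A sketch, assuming Part B spans are laid out after Part A in generation order (function name and argument format are illustrative):

```python
def two_d_positions(part_a_len, mask_positions, span_lengths):
    """Build GLM-style 2D position IDs for Part A followed by Part B (sketch).

    part_a_len:     length of the corrupted sequence (Part A)
    mask_positions: index of each span's [MASK] token in Part A,
                    in the order the spans are generated
    span_lengths:   tokens generated per span (including its start token)
    """
    # first dimension: position within the corrupted sequence
    pos1 = list(range(part_a_len))
    # second dimension: 0 for every Part A token
    pos2 = [0] * part_a_len
    for m, l in zip(mask_positions, span_lengths):
        pos1 += [m] * l                  # Part B tokens reuse the [MASK]'s position
        pos2 += list(range(1, l + 1))    # intra-span positions 1..L
    return pos1, pos2
```

Because `pos2` simply counts up within each span, nothing in the encoding reveals how long the span will be before it is fully generated.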
2.5 GLM Fine‑Tuning
For downstream NLU tasks, a linear classifier consumes the sequence or token representations produced by the pre‑trained model. To avoid the mismatch between pre‑training and fine‑tuning, we reformulate NLU classification as a blank‑filling generation task following PET. For example, sentiment classification is expressed as “{SENTENCE} … [MASK]”, where the mask is filled with “good” or “bad”. The conditional probability of label y given input x is proportional to the probability of generating the corresponding word. Cross‑entropy loss is used to fine‑tune GLM.
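The cloze-style reformulation can be sketched as below. The `log_prob_fn` interface, the template wording, and the verbalizer words are assumptions for illustration; the source only specifies that the label probability is proportional to the probability of generating the corresponding filler word.

```python
import math

def classify_with_blank_filling(log_prob_fn, sentence, verbalizers):
    """PET-style classification as blank filling (sketch).

    log_prob_fn(prompt, word): assumed model interface returning the
        log-probability of generating `word` at the [MASK] position
    verbalizers: label -> filler word, e.g. {"positive": "good",
                                             "negative": "bad"}
    """
    prompt = f"{sentence} It's really [MASK]"   # illustrative cloze template
    scores = {label: log_prob_fn(prompt, word)
              for label, word in verbalizers.items()}
    # normalizing over the verbalizer words gives p(y | x)
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {label: math.exp(s - z) for label, s in scores.items()}
```

Fine-tuning then applies cross-entropy loss to these normalized label probabilities, so the same blank-filling head serves both pre-training and classification.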
2.6 Comparison with Other Models
We compare GLM with BERT, XLNet, T5, and UniLM. BERT cannot capture dependencies between masked tokens and requires enumerating possible answer lengths. XLNet uses original positional encodings and a bi‑stream attention mechanism, doubling training cost. T5’s encoder‑decoder design wastes capacity by using multiple sentinel tokens for masked spans. UniLM mixes attention masks but still relies on [MASK] replacement, limiting its ability to model span‑context dependencies. GLM’s unified autoregressive blank‑filling and 2D positional encoding address these limitations.
3. Experiments
3.1 Pre‑Training Setup
We pre‑train GLM-Base (110M parameters) and GLM-Large (340M) on BooksCorpus and English Wikipedia using the same tokenizer and architecture as BERT. Multi‑task pre‑training combines blank‑filling with document‑level or sentence‑level objectives, yielding GLM-Doc and GLM-Sent. Larger models (GLM-410M, GLM-515M) and a RoBERTa‑matched GLM-RoBERTa are also trained.
3.2 Evaluation on SuperGLUE
We evaluate GLM on the SuperGLUE benchmark (8 challenging NLU tasks). Classification tasks are reformulated as blank‑filling problems following PET. GLM consistently outperforms BERT-Base/BERT-Large and, in most cases, RoBERTa-Large, with notable gains on ReCoRD and WSC.
3.3 Multi‑Task Pre‑Training Results
GLM-Doc and GLM-Sent achieve comparable performance to GLM-Large while surpassing BERT-Large and UniLM-Large. Scaling GLM-Doc to 410M parameters further improves results, and the 515M model exceeds BERT-Large by a larger margin.
3.4 Ablation Studies
We conduct ablations to assess the impact of span‑shuffle, sentinel tokens, and 2D positional encoding. Removing span‑shuffle leads to a severe drop in SuperGLUE performance. Replacing the single [MASK] with multiple sentinel tokens also degrades results. Omitting the second dimension of the 2D positional encoding harms long‑text generation.
3.5 Related Work
We review prior pre‑training paradigms: auto‑encoding (BERT, RoBERTa), autoregressive (GPT, GPT‑2/3), and encoder‑decoder (T5, BART, UniLM). Recent work demonstrates that NLU can be cast as a generation problem, enabling unified models like GLM to handle both understanding and generation tasks.
4. Conclusion
GLM is a universal pre‑training framework that unifies NLU and generation via autoregressive blank‑filling, 2D positional encoding, and span‑shuffle training. Empirically, GLM outperforms previous methods on a wide range of benchmarks while sharing parameters across tasks.
JD Cloud Developers
JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product technical information, industry content, and tech-event news.