GLM: General Language Model Pretraining with Autoregressive Blank Infilling
GLM introduces a unified pretraining framework built on autoregressive blank infilling with 2D positional encoding and span shuffling. With the same model size and training data, it outperforms BERT, T5, and GPT across NLU and generation tasks, including SuperGLUE, text infilling, and language modeling.
Abstract
Existing pre‑training architectures (auto‑encoding, e.g., BERT; autoregressive, e.g., GPT; encoder‑decoder, e.g., T5) each focus on a distinct class of tasks (NLU, unconditional generation, conditional generation), and none excels across all three. GLM proposes a general language model based on autoregressive blank infilling.
2D positional encoding and arbitrary‑order span prediction improve blank‑filling pre‑training.
Varying the number and length of blanks allows GLM to be pre‑trained for different downstream tasks.
Result: With the same model size and data, GLM outperforms BERT, T5, and GPT, achieving state‑of‑the‑art performance on diverse downstream tasks.
1. Introduction
Current pre‑training frameworks fall into three categories: autoregressive, auto‑encoding, and encoder‑decoder models. Autoregressive models (e.g., GPT) generate text left to right but struggle on NLU tasks because of their unidirectional attention. Auto‑encoding models (e.g., BERT) excel at NLU but cannot directly generate text. Encoder‑decoder models (e.g., T5) handle conditional generation but need more parameters to match BERT‑style models on NLU. GLM unifies all three settings through autoregressive blank infilling.
2. Algorithmic Foundations
2.1 Autoregressive Blank Filling
GLM optimises an autoregressive blank‑infilling objective. Given an input sequence x = [x₁, …, xₙ], multiple spans s₁, …, sₘ are sampled and each span is replaced by a single [MASK] token, producing a corrupted sequence x_corrupt. The model then reconstructs the spans autoregressively in a randomly permuted order (tokens within a span are generated left to right), which lets it capture dependencies between spans.
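The corruption step can be sketched as follows. This is a minimal illustration under assumed token‑level spans, not the paper's implementation; the function name and the [S]/[E] sentinel conventions are taken from the paper's input format but simplified here:

```python
import random

def corrupt_for_blank_infilling(tokens, spans, mask_token="[MASK]",
                                start_token="[S]", end_token="[E]"):
    """Build a GLM-style input: Part A (corrupted text) followed by the
    sampled spans in shuffled order. `spans` are (start, end) half-open."""
    # Part A: replace each span with a single [MASK] token.
    part_a, prev_end = [], 0
    for start, end in spans:
        part_a.extend(tokens[prev_end:start])
        part_a.append(mask_token)
        prev_end = end
    part_a.extend(tokens[prev_end:])

    # Part B: spans in random order; the model predicts each one
    # autoregressively, conditioned on Part A and previous spans.
    shuffled = spans[:]
    random.shuffle(shuffled)
    part_b, targets = [], []
    for start, end in shuffled:
        span_tokens = tokens[start:end]
        part_b.extend([start_token] + span_tokens)   # input: [S] + span
        targets.extend(span_tokens + [end_token])    # target: span + [E]
    return part_a, part_b, targets
```

For example, masking "quick brown" in "the quick brown fox" yields Part A `["the", "[MASK]", "fox"]`, Part B `["[S]", "quick", "brown"]`, and targets `["quick", "brown", "[E]"]`.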
2.2 Multi‑Task Training
GLM is trained jointly on two objectives: (1) document‑level blank filling, where a single span is sampled with length drawn uniformly from 50–100 % of the input, and (2) sentence‑level blank filling, where whole sentences are masked until they cover ~15 % of the tokens. Both share the same loss formulation as the original objective.
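The two sampling strategies can be sketched like this. It is a simplified illustration: the function names and the sentence‑boundary representation are assumptions, not the paper's code:

```python
import random

def sample_document_span(seq_len):
    """Document-level objective: one span covering 50-100% of the sequence."""
    span_len = random.randint(seq_len // 2, seq_len)
    start = random.randint(0, seq_len - span_len)
    return [(start, start + span_len)]

def sample_sentence_spans(sentence_bounds, seq_len, ratio=0.15):
    """Sentence-level objective: mask whole sentences (given as (start, end)
    token offsets) until ~15% of the tokens are covered."""
    spans, masked = [], 0
    for start, end in random.sample(sentence_bounds, len(sentence_bounds)):
        if masked >= ratio * seq_len:
            break
        spans.append((start, end))
        masked += end - start
    return sorted(spans)
```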
2.3 Model Architecture
GLM uses a single Transformer with three modifications: (1) the order of layer normalization and the residual connection is rearranged (as in Megatron‑LM), which is critical for avoiding numerical errors in large models, (2) a single linear layer for the output token prediction, and (3) GeLU activations in place of ReLU.
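The layer‑norm reordering amounts to moving from a post‑LN to a pre‑LN sublayer. A minimal NumPy sketch of the two orderings (illustrative only; `f` stands in for attention or the feed‑forward network):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (no learned scale/shift, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_sublayer(x, f):
    """Original post-LN order: LayerNorm(x + f(x))."""
    return layer_norm(x + f(x))

def pre_ln_sublayer(x, f):
    """Reordered pre-LN form: x + f(LayerNorm(x)). The residual path
    stays unnormalized, which stabilizes training of deep models."""
    return x + f(layer_norm(x))
```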
2.4 2D Positional Encoding
Each token receives two position IDs: the first marks its position in the corrupted sequence (tokens inside a masked span share the position of the corresponding [MASK] token), and the second marks its position within the span (0 for tokens of the original context). Two learnable embedding tables project the IDs to vectors that are added to the token embedding, so the model never needs to know the length of a masked span in advance.
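Constructing the two ID sequences is mechanical. A sketch under the assumptions above (the function name is illustrative; `span_lens` counts the leading [S] token as part of each span):

```python
def two_d_position_ids(part_a_len, mask_positions, span_lens):
    """Build the two position-id sequences for a GLM input laid out as
    [Part A tokens] + [span tokens in Part B]."""
    # First dimension: position in the corrupted sequence.
    pos1 = list(range(part_a_len))
    # Second dimension: 0 for Part A, 1..len for tokens inside a span.
    pos2 = [0] * part_a_len
    for mask_pos, span_len in zip(mask_positions, span_lens):
        # Every token of a span shares the position of its [MASK] token.
        pos1.extend([mask_pos] * span_len)
        pos2.extend(range(1, span_len + 1))
    return pos1, pos2
```

With a 3‑token Part A whose [MASK] sits at position 1 and one 3‑token span, this yields `pos1 = [0, 1, 2, 1, 1, 1]` and `pos2 = [0, 0, 0, 1, 2, 3]`.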
2.5 Fine‑Tuning
For NLU tasks, GLM reformulates classification as blank filling: the input is rewritten as a cloze question (e.g., appending "It was really [MASK]."), and each label is mapped to a verbalizer word (e.g., "positive" → "good"). The model predicts the word in the blank and is fine‑tuned with cross‑entropy loss. For generation tasks, a [MASK] token is appended to the end of the context and the model autoregressively generates the target text.
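At inference time the cloze reformulation reduces to comparing verbalizer words under the model. A hedged sketch: `logprob_fn` is a hypothetical stand‑in for the pretrained model's conditional log‑likelihood of a word filling the blank, and the template/verbalizer values are illustrative:

```python
def classify_by_cloze(logprob_fn, text, template, verbalizer):
    """Pick the label whose verbalizer word the model scores highest
    when filling the blank in `template` appended to `text`."""
    prompt = f"{text} {template}"
    scores = {label: logprob_fn(prompt, word)
              for label, word in verbalizer.items()}
    return max(scores, key=scores.get)
```

For sentiment, a verbalizer like `{"positive": "good", "negative": "bad"}` turns a two‑way classifier into a single blank‑filling query.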
2.6 Comparison with Other Models
Compared with BERT, XLNet, T5, and UniLM, GLM's autoregressive blank infilling captures dependencies between the tokens of a span (which BERT's independent mask predictions miss), replaces each span with a single [MASK] instead of multiple mask or sentinel tokens, and relies on 2D positional encoding so that span lengths need not be known in advance.
3. Experiments
3.1 Pre‑training Setup
GLM-Base (110M) and GLM-Large (340M) are trained on BookCorpus and English Wikipedia using the same tokenizer as BERT (30k vocabulary). Multi‑task variants (GLM-Doc, GLM-Sent) combine the original objective with the document‑level or sentence‑level objective, respectively. Larger models (GLM-410M, GLM-515M) increase depth and hidden size.
3.2 SuperGLUE Evaluation
GLM is fine‑tuned on the eight SuperGLUE tasks (ReCoRD, COPA, WSC, RTE, BoolQ, WiC, CB, MultiRC). Across most tasks GLM outperforms BERT-Base/BERT-Large, with WiC as the only exception; GLM-Large surpasses BERT-Large by roughly 5% on average.
3.3 Multi‑Task Pre‑training
GLM-Doc and GLM-Sent are evaluated on NLU, seq2seq, blank infilling, and zero‑shot language modelling. They perform slightly below GLM-Large on NLU but still beat BERT-Large and UniLM-Large. Scaling up to 410M and 515M parameters further improves results.
3.4 Ablation Studies
Ablations show that (1) span shuffling is crucial: removing it degrades SuperGLUE performance dramatically; (2) using T5‑style sentinel tokens instead of a single [MASK] harms performance; and (3) removing the second dimension of the 2D positional encoding reduces long‑text generation quality.
3.5 Related Work
Discusses prior pre‑training paradigms (auto‑encoding, autoregressive, encoder‑decoder) and recent efforts to treat NLU as generation (e.g., PET, GPT‑3 prompting). Highlights GLM’s contribution of a unified autoregressive blank‑filling objective.
4. Conclusion
GLM provides a general pre‑training framework that unifies NLU and generation via autoregressive blank infilling, 2D positional encoding, and span shuffling, achieving strong performance across a wide range of benchmarks.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.