ChatGLM Evolution: Deep Dive into GLM Architecture, Pretraining, and ChatGLM‑4

This article provides a comprehensive technical overview of the ChatGLM series—from the original ChatGLM‑6B model and its GLM‑based pre‑training framework to the enhancements in ChatGLM‑2, the architectural parity of ChatGLM‑3, and the advanced capabilities of the latest ChatGLM‑4, covering model structure, position encoding, attention mechanisms, multi‑task pretraining, and tool integration.


Introduction

ChatGLM‑6B is an open‑source bilingual (Chinese‑English) dialogue language model based on the General Language Model (GLM) architecture with 6.2 billion parameters. It adopts techniques similar to ChatGPT, including supervised fine‑tuning, reinforcement learning from human feedback, and parameter‑efficient fine‑tuning, enabling it to generate responses aligned with human preferences.

For downstream developers, ChatGLM‑6B supports parameter‑efficient fine‑tuning via P‑Tuning v2; combined with INT4 quantization, it can be fine‑tuned with as little as 7 GB of GPU memory.
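
Below is a minimal sketch of loading ChatGLM‑6B under INT4 quantization for inference, assuming the Hugging Face transformers library and the THUDM/chatglm-6b checkpoint, whose remote code exposes quantize() and chat() helpers; exact method names can vary between releases, so verify against the official repository.

```python
# Minimal sketch: load ChatGLM-6B and quantize weights to INT4 to cut GPU memory.
# Assumes the Hugging Face `transformers` library and the THUDM/chatglm-6b
# checkpoint, whose remote modeling code provides quantize() and chat();
# check the official repository for the current interface.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)      # INT4 weight quantization
    .half()
    .cuda()
    .eval()
)

# The checkpoint's custom modeling code offers a chat() convenience method.
response, history = model.chat(tokenizer, "Hello, introduce yourself briefly.", history=[])
print(response)
```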

1. ChatGLM Foundations

1.1 Background

Three mainstream pre‑training paradigms exist:

Autoregressive (AR) models such as GPT, which predict the next token in a left‑to‑right fashion and excel at generative tasks.

Autoencoding (AE) models such as BERT, which mask tokens and reconstruct them, providing strong contextual representations for understanding tasks.

Encoder‑decoder (Seq2seq) models such as T5, which combine bidirectional encoding with conditional generation.

Each paradigm has strengths and weaknesses; none dominates across natural language understanding (NLU), unconditional generation, and conditional generation.

Autoregressive Model

AR models learn a factorized probability distribution by predicting the next token given previous tokens. They are powerful for long‑text generation but cannot capture bidirectional context.
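Concretely, an AR model with parameters θ maximizes the left‑to‑right factorization of the sequence likelihood:

```latex
% Autoregressive factorization: each token is predicted from its prefix.
\max_{\theta} \; \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid x_{1}, \ldots, x_{t-1}\right)
```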

Autoencoding Model

AE models like BERT reconstruct masked tokens, enabling bidirectional context but lacking direct generative capability.
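By contrast, an AE model such as BERT maximizes the likelihood of the masked positions given the corrupted input, conditioning on context from both directions:

```latex
% Masked-language-modeling objective: \hat{x} is the corrupted sequence,
% \mathcal{M} is the set of masked positions.
\max_{\theta} \; \sum_{m \in \mathcal{M}} \log p_{\theta}\!\left(x_m \mid \hat{x}\right)
```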

Encoder‑Decoder Model

Encoder‑decoder models treat tasks as sequence‑to‑sequence transformations, supporting both understanding and generation (e.g., machine translation).

2. GLM Pre‑training Framework

GLM combines the three paradigms via an autoregressive blank‑infilling objective. Input sequences are split into Part A (corrupted text) and Part B (masked spans). Part A tokens can see each other but not Part B; Part B tokens can see Part A and previously generated tokens within Part B. Span order is shuffled to encourage inter‑span dependencies.
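
A minimal sketch of this attention pattern in PyTorch, assuming the sequence is laid out as [Part A | Part B]; real GLM implementations construct the same mask from span boundaries inside the model code:

```python
import torch

def glm_attention_mask(len_a: int, len_b: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for a sequence [Part A | Part B].

    Part A attends bidirectionally to Part A only; Part B attends to all of
    Part A and causally to itself. Simplified sketch of GLM's mask.
    """
    total = len_a + len_b
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :len_a] = True                               # everyone sees Part A
    mask[:len_a, len_a:] = False                         # Part A never sees Part B
    causal = torch.tril(torch.ones(len_b, len_b, dtype=torch.bool))
    mask[len_a:, len_a:] = causal                        # Part B is causal over itself
    return mask

print(glm_attention_mask(4, 3).int())
```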

Key techniques:

Two‑dimensional position encoding: one dimension encodes the token's absolute position in the corrupted text, the other encodes its relative position within a masked span.

Custom attention mask: Part A uses full bidirectional attention, Part B uses causal attention, and Part B can attend to all of Part A.

Span sampling: span lengths are drawn from a Poisson distribution (λ = 3), and spans are sampled until ~15% of the tokens are masked.

During training, each masked span is wrapped with special tokens [S] (start) and [E] (end). The model thus learns a unified encoder (for Part A) and decoder (for Part B).
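
The following sketch assembles one such training example for a single masked span, showing the [S] wrapping and the two position‑id sequences; it uses string tokens for readability, whereas real implementations operate on token IDs and shuffle multiple spans:

```python
def build_glm_example(part_a, span):
    """Assemble input tokens and 2D position ids for one masked span.

    part_a: corrupted text containing a "[MASK]" placeholder for the span.
    span:   the removed tokens to be generated autoregressively in Part B.
    Simplified sketch; real GLM code handles multiple, shuffled spans.
    """
    mask_pos = part_a.index("[MASK]")
    part_b = ["[S]"] + span              # decoder input; targets are span + ["[E]"]
    tokens = part_a + part_b

    # Position dimension 1: index in the corrupted text; every Part B token
    # shares the position of the [MASK] it fills in.
    pos_1 = list(range(len(part_a))) + [mask_pos] * len(part_b)
    # Position dimension 2: 0 for Part A, 1..len within the span for Part B.
    pos_2 = [0] * len(part_a) + list(range(1, len(part_b) + 1))
    return tokens, pos_1, pos_2

tokens, pos_1, pos_2 = build_glm_example(["x1", "x2", "[MASK]", "x4"], ["x3a", "x3b"])
print(tokens, pos_1, pos_2, sep="\n")
```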

Multi‑Task Pre‑training

GLM is trained simultaneously on NLU, unconditional generation, and conditional generation tasks, enabling a single model to handle classification, question answering, summarization, and dialogue generation.

3. Model Variants

3.1 ChatGLM‑2

Longer context: context length extended from 2K to 32K tokens using FlashAttention; an 8K context is used during dialogue training.

Performance boost: mixed‑objective training on 1.4T tokens improves MMLU, C‑Eval, GSM8K, and BBH scores.

Efficient inference: Multi‑Query Attention reduces computation and memory (see the sketch after this list); INT4 quantization supports an 8K context on a 6 GB GPU.

Open licensing: weights are fully open for academic research and free for commercial use after registration.
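
As a reference point, here is a minimal multi‑query attention sketch in PyTorch, in which all query heads share a single key/value head; it is illustrative only and not ChatGLM‑2's actual fused implementation:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_kv, n_heads):
    """Attention with n_heads query heads sharing one key/value head.

    x:    (batch, seq, d_model)
    w_q:  (d_model, n_heads * head_dim) query projection
    w_kv: (d_model, 2 * head_dim)       shared key/value projection
    Illustrative sketch; production code fuses this with FlashAttention kernels.
    """
    b, s, _ = x.shape
    head_dim = w_kv.shape[1] // 2
    q = (x @ w_q).view(b, s, n_heads, head_dim).transpose(1, 2)   # (b, h, s, d)
    k, v = (x @ w_kv).split(head_dim, dim=-1)                     # (b, s, d) each
    k = k.unsqueeze(1)                                            # broadcast over heads
    v = v.unsqueeze(1)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5          # (b, h, s, s)
    return F.softmax(scores, dim=-1) @ v                          # (b, h, s, d)

x = torch.randn(1, 8, 64)
out = multi_query_attention(x, torch.randn(64, 4 * 16), torch.randn(64, 2 * 16), n_heads=4)
print(out.shape)  # torch.Size([1, 4, 8, 16])
```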

3.2 ChatGLM‑3

ChatGLM‑3 shares the same architecture as ChatGLM‑2; the main differences from the original ChatGLM are a reduced vocabulary (65 024 tokens), global position encoding, and a Swish‑1 activation in the feed‑forward network.

3.3 ChatGLM‑4

Released in January 2024, GLM‑4 achieves performance close to GPT‑4 on benchmarks such as MMLU, GSM8K, MATH, and BBH. It supports much longer contexts (up to 128 K tokens), multimodal generation (via CogView 3), and an “All Tools” suite that can automatically invoke web browsing, code interpretation, image generation, and function calls to solve complex tasks.

Key capabilities include:

Instruction following: near‑GPT‑4‑level results on IFEval and related instruction benchmarks.

Alignment: superior Chinese alignment compared to GPT‑4.

Long‑text retrieval: 100% recall on 128K “needle‑in‑a‑haystack” tests.

Multimodal generation: CogView 3 produces images comparable to DALL·E 3.

Tool integration: automatic planning and execution of web browsing, code execution, and function calls (sketched below).
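
To illustrate the function‑calling interface, here is a hedged sketch using the zhipuai Python SDK's OpenAI‑style chat API; the package name, client class, model identifier, and tool schema shown are assumptions to be checked against the current GLM‑4 documentation:

```python
# Hypothetical sketch of GLM-4 function calling via the zhipuai SDK.
# Package name, client class, and parameters are assumptions; consult the
# official GLM-4 API documentation for the authoritative interface.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "What's the weather in Beijing today?"}],
    tools=tools,
)
# If the model decides a tool is needed, the reply carries a tool call
# (name + JSON arguments) for the application to execute and feed back.
print(response.choices[0].message)
```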

4. Finetuning

Finetuning adapts the pretrained GLM to specific downstream tasks by loading the base model, optionally freezing layers, adding task‑specific heads, and training on task data with appropriate hyper‑parameters.
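
A minimal sketch of that recipe, assuming a Hugging Face‑style backbone: freeze the pretrained GLM, attach a small classification head, and train only the head. Parameter‑efficient methods such as P‑Tuning v2 follow the same pattern, with learnable prompts instead of a head. The checkpoint name, output layout, and hidden size below are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: ChatGLM's remote modeling code may order output dimensions
# differently and require specific padding settings; verify shapes before use.
model_name = "THUDM/chatglm-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
backbone = AutoModel.from_pretrained(model_name, trust_remote_code=True)

for p in backbone.parameters():          # freeze the pretrained weights
    p.requires_grad = False

num_labels = 2                           # hypothetical binary classification task
head = nn.Linear(backbone.config.hidden_size, num_labels)   # task-specific head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

def training_step(texts, labels):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = backbone(**batch).last_hidden_state   # assumed (batch, seq, hidden)
    logits = head(hidden[:, -1, :])                # last-token representation
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```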

5. Model Architecture Comparisons

Figures (omitted) illustrate the structural differences among ChatGLM‑6B, ChatGLM‑2, and ChatGLM‑4, highlighting changes in attention masks, position encoding, and feed‑forward network design.
