ChatGLM2 vs ChatGLM3: MQA, FlashAttention, and New Prompt Features

During the Saturday session, we reviewed ChatGLM2’s upgrades (Multi-Query Attention and FlashAttention), demonstrated deployment on Ascend + ModelArts + MindSpore, and introduced ChatGLM3’s revamped prompt design with native tool-calling and code-interpreter capabilities, closing with a preview of the next lecture on text-generation decoding.


Course Review

In the third class of the Saturday series, we explained the technical improvements from ChatGLM to ChatGLM2, demonstrated deployment of the ChatGLM2 chatbot on the OpenI Zhizhi Community Cloud Brain using Ascend + ModelArts + MindSpore, and introduced the newly open‑sourced ChatGLM3 features.

ChatGLM2 Technical Improvements

Multi-Query Attention (MQA)

Motivation: Speed up the slow incremental (token-by-token) inference of the Transformer decoder, which is bottlenecked by repeatedly loading the key-value cache from memory.

Method: Slightly modify multi-head attention so that all query heads share a single key and value projection, shrinking the key-value cache (see the sketch after this list).

Result: Decoder inference becomes substantially faster, at the cost of model quality slightly below standard multi-head attention (MHA).

Development: Later evolved into Grouped-Query Attention (GQA), in which query heads are divided into groups and each group shares one key-value head, achieving higher speed than MHA and better quality than MQA.
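To make the shared key-value idea concrete, here is a minimal NumPy sketch of multi-query attention; the function name, shapes, and random weights are illustrative assumptions, not the course or ChatGLM2 code. Every head projects its own queries, but one key projection and one value projection are shared by all heads, so the cached K/V is n_heads times smaller than in MHA.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    # x: (seq, d_model); w_q: (d_model, n_heads * d_head);
    # w_k, w_v: (d_model, d_head) -- one K and one V projection shared by
    # every head, which is what shrinks the KV cache versus MHA.
    seq, _ = x.shape
    d_head = w_q.shape[1] // n_heads
    q = (x @ w_q).reshape(seq, n_heads, d_head)   # per-head queries
    k = x @ w_k                                   # (seq, d_head), shared
    v = x @ w_v                                   # (seq, d_head), shared
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)               # each head attends with
    out = np.einsum("hqk,kd->qhd", attn, v)       # its own Q, shared K/V
    return out.reshape(seq, n_heads * d_head)

# Tiny smoke test with random weights (4 heads of width 16).
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))
out = multi_query_attention(
    x,
    rng.normal(size=(64, 64)),   # W_Q for all heads
    rng.normal(size=(64, 16)),   # shared W_K
    rng.normal(size=(64, 16)),   # shared W_V
    n_heads=4,
)
print(out.shape)  # (10, 64)
```

Swapping the single shared w_k and w_v for one projection per group of heads turns this sketch into GQA.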

FlashAttention

Motivation: Optimize at the hardware level to address the quadratic time and space complexity of self‑attention as sequence length grows.

Previous work: sparse-attention and low-rank Transformer variants reduce the cost but only approximate attention rather than computing exact values.

Method: Cut reads and writes to slow GPU high-bandwidth memory (HBM) by tiling the computation into blocks that fit in fast on-chip SRAM, computing the softmax incrementally over those blocks, and fusing the attention operations into a single kernel (see the sketch below).

Result: Attention stays exact, and training speed increases by more than three times.
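The tiling trick is easiest to see for a single query row. The sketch below (plain NumPy, illustrative names, no actual SRAM management or kernel fusion) computes exact softmax attention one key/value block at a time, keeping only a running max, a running denominator, and a running output, so the full seq × seq score matrix is never materialized.

```python
import numpy as np

def tiled_attention_row(q, K, V, block=128):
    # q: (d,); K, V: (seq, d). Exact softmax(q @ K.T / sqrt(d)) @ V, computed
    # block by block with a running max/denominator (the online softmax).
    d = q.shape[0]
    m = -np.inf            # running max of scores seen so far
    l = 0.0                # running softmax denominator
    acc = np.zeros(d)      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)        # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Check against the naive full-matrix computation.
rng = np.random.default_rng(1)
q, K, V = rng.normal(size=64), rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
s = K @ q / np.sqrt(64)
naive = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(tiled_attention_row(q, K, V), naive))  # True
```

FlashAttention applies this same rescaling blockwise inside one fused GPU kernel, with block sizes chosen so each tile fits in SRAM.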

Deployment on Ascend + ModelArts + MindSpore

Environment: MindSpore 2.0.0 with the development version of MindSpore-Transformers (mindformers). The inference procedure follows the same steps as the original ChatGLM deployment; detailed commands are available in the first lecture’s review material.
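For orientation, a minimal invocation might look like the sketch below, assuming the mindformers pipeline API; the model identifier, task name, and generation argument are assumptions, and the verified step-by-step commands are in the first lecture’s review material.

```python
# Hypothetical sketch, not the verified course commands: load ChatGLM2 for
# inference through the mindformers text-generation pipeline. The model
# identifier "glm2_6b" and the max_length argument are assumptions.
from mindformers import pipeline

chat = pipeline(task="text_generation", model="glm2_6b")
result = chat("What does ChatGLM2 improve over ChatGLM?", max_length=256)
print(result)
```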

ChatGLM3 Features

New Prompt Design

ChatGLM3 introduces a completely new prompt format and supports native tool calls, code interpretation, and agent tasks. Four special tokens are added: <|system|> (system role), <|user|> (user input), <|assistant|> (model output), and <|observation|> (external tool result).
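Laid out in sequence, a multi-turn exchange using these four tokens might look as follows; only the role tokens come from the ChatGLM3 format, and the filler text is illustrative.

```
<|system|>
You are ChatGLM3. You may call the registered tools when needed.
<|user|>
What is the weather in Shenzhen right now?
<|assistant|>
(model output: an answer, or a structured tool call)
<|observation|>
(result returned by the external tool)
<|assistant|>
(final answer conditioned on the observation)
```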

Tool Call and Code Interpreter Modes

Tool mode: The model can invoke tools registered in the <|system|> segment of the prompt (a registration sketch follows this list).

Code interpreter mode: Provides an execution environment for tasks such as drawing or mathematical computation.
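As a sketch of how a tool might be registered in the <|system|> segment: the JSON-style schema and the system preamble below are modeled on the public ChatGLM3 demo and should be treated as assumptions rather than the exact required format.

```python
# Hypothetical tool registration for ChatGLM3's tool mode; the schema fields
# and preamble wording are assumptions modeled on the public demo.
import json

tools = [{
    "name": "get_weather",
    "description": "Query the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"],
    },
}]

system_prompt = (
    "Answer the following questions as best as you can. "
    "You have access to the following tools:\n"
    + json.dumps(tools, indent=2, ensure_ascii=False)
)
# system_prompt goes after the <|system|> token; when the model emits a tool
# call, the tool's result is fed back to it after <|observation|>.
```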

Post‑Class Exercises

1. Run the ChatGLM2 inference deployment code and interact with the model.

2. Experiment with ChatGLM3’s various dialogue modes.

Upcoming Lecture

The fourth lecture of the second season of the MindSpore public course will be held on November 25 (Saturday) from 16:00 to 17:30, covering text‑generation decoding principles (sampling, beam search, etc.) with code demonstrations.

Tags: prompt engineering, FlashAttention, ChatGLM3, ChatGLM2, MindSpore, Multi-Query Attention
Written by Huawei Cloud Developer Alliance

The Huawei Cloud Developer Alliance creates a tech-sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.
