ChatGLM2 vs ChatGLM3: MQA, FlashAttention, and New Prompt Features

During the Saturday session, we reviewed ChatGLM2’s upgrades (Multi-Query Attention and FlashAttention), demonstrated deployment on Ascend + ModelArts + MindSpore, and introduced ChatGLM3’s revamped prompt design with native tool-calling and code-interpreter capabilities, closing with a preview of the next lecture on text-generation decoding.


Course Review

In the third class of the Saturday series, we explained the technical improvements from ChatGLM to ChatGLM2, demonstrated deployment of the ChatGLM2 chatbot on the OpenI Zhizhi Community Cloud Brain using Ascend + ModelArts + MindSpore, and introduced the newly open‑sourced ChatGLM3 features.

ChatGLM2 Technical Improvements

Multi-Query Attention (MQA)

Motivation: Speed up the slow incremental (token-by-token) inference of the Transformer decoder, which is bottlenecked by repeatedly loading the key-value cache from memory.

Method: Slightly modify multi-head attention so that all query heads share a single key and value projection, shrinking the key-value cache (see the sketch after this list).

Result: Decoder inference becomes substantially faster, at the cost of model quality slightly below standard multi-head attention (MHA).

Development: Later evolved into Grouped-Query Attention (GQA), in which query heads are divided into groups and each group shares one key-value head, achieving higher speed than MHA and better quality than MQA.
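To make the shared key-value idea concrete, here is a minimal NumPy sketch of multi-query attention; the function name, shapes, and random weights are illustrative assumptions, not the course or ChatGLM2 code. Every head projects its own queries, but one key projection and one value projection are shared by all heads, so the cached K/V is n_heads times smaller than in MHA.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    # x: (seq, d_model); w_q: (d_model, n_heads * d_head);
    # w_k, w_v: (d_model, d_head) -- one K and one V projection shared by
    # every head, which is what shrinks the KV cache versus MHA.
    seq, _ = x.shape
    d_head = w_q.shape[1] // n_heads
    q = (x @ w_q).reshape(seq, n_heads, d_head)   # per-head queries
    k = x @ w_k                                   # (seq, d_head), shared
    v = x @ w_v                                   # (seq, d_head), shared
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)               # each head attends with
    out = np.einsum("hqk,kd->qhd", attn, v)       # its own Q, shared K/V
    return out.reshape(seq, n_heads * d_head)

# Tiny smoke test with random weights (4 heads of width 16).
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))
out = multi_query_attention(
    x,
    rng.normal(size=(64, 64)),   # W_Q for all heads
    rng.normal(size=(64, 16)),   # shared W_K
    rng.normal(size=(64, 16)),   # shared W_V
    n_heads=4,
)
print(out.shape)  # (10, 64)
```

Swapping the single shared w_k and w_v for one projection per group of heads turns this sketch into GQA.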

FlashAttention

Motivation: Optimize at the hardware level to address the quadratic time and space complexity of self‑attention as sequence length grows.

Previous work: sparse-attention and low-rank Transformer variants reduce the cost but only approximate attention rather than computing exact values.

Method: Cut reads and writes to slow GPU high-bandwidth memory (HBM) by tiling the computation into blocks that fit in fast on-chip SRAM, computing the softmax incrementally over those blocks, and fusing the attention operations into a single kernel (see the sketch below).

Result: Attention stays exact, and training speed increases by more than three times.
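The tiling trick is easiest to see for a single query row. The sketch below (plain NumPy, illustrative names, no actual SRAM management or kernel fusion) computes exact softmax attention one key/value block at a time, keeping only a running max, a running denominator, and a running output, so the full seq × seq score matrix is never materialized.

```python
import numpy as np

def tiled_attention_row(q, K, V, block=128):
    # q: (d,); K, V: (seq, d). Exact softmax(q @ K.T / sqrt(d)) @ V, computed
    # block by block with a running max/denominator (the online softmax).
    d = q.shape[0]
    m = -np.inf            # running max of scores seen so far
    l = 0.0                # running softmax denominator
    acc = np.zeros(d)      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)        # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Check against the naive full-matrix computation.
rng = np.random.default_rng(1)
q, K, V = rng.normal(size=64), rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
s = K @ q / np.sqrt(64)
naive = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(tiled_attention_row(q, K, V), naive))  # True
```

FlashAttention applies this same rescaling blockwise inside one fused GPU kernel, with block sizes chosen so each tile fits in SRAM.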

Deployment on Ascend + ModelArts + MindSpore

Environment: MindSpore 2.0.0 with the development version of MindSpore-Transformers (mindformers). The inference procedure follows the same steps as the original ChatGLM deployment; detailed commands are available in the first lecture’s review material.
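For orientation, a minimal invocation might look like the sketch below, assuming the mindformers pipeline API; the model identifier, task name, and generation argument are assumptions, and the verified step-by-step commands are in the first lecture’s review material.

```python
# Hypothetical sketch, not the verified course commands: load ChatGLM2 for
# inference through the mindformers text-generation pipeline. The model
# identifier "glm2_6b" and the max_length argument are assumptions.
from mindformers import pipeline

chat = pipeline(task="text_generation", model="glm2_6b")
result = chat("What does ChatGLM2 improve over ChatGLM?", max_length=256)
print(result)
```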

ChatGLM3 Features

New Prompt Design

ChatGLM3 introduces a completely new prompt format and supports native tool calls, code interpretation, and agent tasks. Four special tokens are added: <|system|> (system role), <|user|> (user input), <|assistant|> (model output), and <|observation|> (external tool result).
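Laid out in sequence, a multi-turn exchange using these four tokens might look as follows; only the role tokens come from the ChatGLM3 format, and the filler text is illustrative.

```
<|system|>
You are ChatGLM3. You may call the registered tools when needed.
<|user|>
What is the weather in Shenzhen right now?
<|assistant|>
(model output: an answer, or a structured tool call)
<|observation|>
(result returned by the external tool)
<|assistant|>
(final answer conditioned on the observation)
```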

Tool Call and Code Interpreter Modes

Tool mode: The model can invoke tools registered in the <|system|> segment of the prompt (a registration sketch follows this list).

Code interpreter mode: Provides an execution environment for tasks such as drawing or mathematical computation.
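As a sketch of how a tool might be registered in the <|system|> segment: the JSON-style schema and the system preamble below are modeled on the public ChatGLM3 demo and should be treated as assumptions rather than the exact required format.

```python
# Hypothetical tool registration for ChatGLM3's tool mode; the schema fields
# and preamble wording are assumptions modeled on the public demo.
import json

tools = [{
    "name": "get_weather",
    "description": "Query the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"],
    },
}]

system_prompt = (
    "Answer the following questions as best as you can. "
    "You have access to the following tools:\n"
    + json.dumps(tools, indent=2, ensure_ascii=False)
)
# system_prompt goes after the <|system|> token; when the model emits a tool
# call, the tool's result is fed back to it after <|observation|>.
```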

Post‑Class Exercises

1. Run the ChatGLM2 inference deployment code and interact with the model.

2. Experiment with ChatGLM3’s various dialogue modes.

Upcoming Lecture

The fourth lecture of the second season of the MindSpore public course will be held on November 25 (Saturday) from 16:00 to 17:30, covering text‑generation decoding principles (sampling, beam search, etc.) with code demonstrations.

Tags: prompt engineering, FlashAttention, ChatGLM3, ChatGLM2, MindSpore, Multi-Query Attention
Written by Huawei Cloud Developer Alliance

The Huawei Cloud Developer Alliance creates a tech-sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.
