How RoPE and FlashAttention Empower GLM-4-Plus for Long-Text Mastery
This article explains the core mechanisms of Transformer models, details the Rotary Position Embedding (RoPE) and FlashAttention techniques for handling long sequences, introduces the GLM-4-Plus series, and presents an empirical evaluation on the THUCNews dataset in which GLM-4-Plus outperforms the other GLM-4 variants on long-text inputs.
Transformer Model Overview
The Transformer architecture is built around the self‑attention mechanism, which allows every token in a sequence to attend to all other tokens simultaneously. This contrasts with recurrent networks, which process tokens one after another, and with convolutional networks, where each layer sees only a local window of tokens.
Positional Encoding
Because self‑attention has no intrinsic notion of order, Transformers add positional encodings to the token embeddings. The classic sinusoidal encoding uses fixed sine and cosine functions of the token position pos:
For even dimension 2i: PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
For odd dimension 2i+1: PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
The dimensionality of the encoding matches the model dimension.
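The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any GLM release; the function name and the choice of seq_len=8, d_model=16 are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

# Hypothetical sizes chosen only for illustration.
pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)
```

Because the result has the same width as the token embeddings, the two matrices can simply be summed before the first layer.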
Parallelism and Residual Connections
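Because self‑attention at one position does not wait for the outputs of earlier positions, all positions in a sequence can be processed in parallel within a layer (the layers themselves still run one after another). This greatly accelerates training compared with recurrent models. Residual connections and layer normalization are added around each sub‑layer to stabilize deep networks and mitigate vanishing or exploding gradients.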
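As a rough sketch of how a residual connection and layer normalization wrap a sub‑layer, the snippet below uses the post‑norm arrangement of the original Transformer, LayerNorm(x + Sublayer(x)). Many later models normalize before the sub‑layer instead, and this sketch omits the learned gain and bias. The names layer_norm and residual_block are illustrative, and the lambda stands in for an attention or feed‑forward sub‑layer.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x: np.ndarray, sublayer) -> np.ndarray:
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.arange(12, dtype=float).reshape(3, 4)   # 3 tokens, model dimension 4
y = residual_block(x, lambda h: 0.5 * h)       # placeholder sub-layer
print(y.shape)
```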
Self‑Attention Mechanism
For each token the model computes three vectors via linear projections: query (Q), key (K) and value (V). Attention scores are obtained by the scaled dot‑product of Q and K, followed by a softmax to produce attention weights that are applied to V:
Attention(Q, K, V) = softmax((Q·K^T) / sqrt(d_k)) · V
Long‑Text Processing Techniques
Rotary Position Embedding (RoPE)
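RoPE encodes position by rotating pairs of dimensions in the query and key vectors by an angle proportional to the token's position. The rotation preserves the magnitude of each vector, and the dot product between a rotated query and a rotated key depends only on the relative offset between their positions. Because no learned position table is involved, RoPE can be evaluated at any sequence length, which helps models handle contexts longer than those seen during training, although quality at unseen lengths is not guaranteed.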
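The relative‑position property can be checked numerically. The NumPy sketch below rotates consecutive dimension pairs of a single vector and compares attention scores for two position pairs with the same offset. The function rope_rotate, the base of 10000 and the vector size are assumptions for illustration, and this is not the GLM implementation.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... of a 1-D vector by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin      # standard 2-D rotation
    out[1::2] = x_odd * cos + x_even * sin
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)       # positions 5 and 2
score_b = rope_rotate(q, 103) @ rope_rotate(k, 100)   # same offset of 3, shifted by 98
print(np.isclose(score_a, score_b))
```

The two scores agree up to floating‑point error, which is why shifting a whole passage to a different absolute position leaves its internal attention pattern unchanged.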
https://huggingface.co/THUDM/glm-4-9b/blob/main/modeling_chatglm.py
import torch

@torch.jit.script
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [batch, num_heads, seq_len, head_dim]
    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    rot_dim = rope_cache.shape[-2] * 2
    # Only the first rot_dim channels are rotated; the rest pass through unchanged.
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # Truncate the cache to the current sequence length.
    rope_cache = rope_cache[:, :sq]
    # Group channels into pairs; the cache holds (cos, sin) for each pair.
    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
    # 2-D rotation of each pair by its position-dependent angle.
    x_out2 = torch.stack([
        xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
        xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1]
    ], -1)
    x_out2 = x_out2.flatten(3)
    return torch.cat((x_out2, x_pass), dim=-1)
FlashAttention
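FlashAttention computes exact attention without ever materializing the full N×N attention matrix. It splits Q, K and V into small blocks that fit in fast on‑chip memory, processes one block at a time, and maintains running softmax statistics (the row maximum and the normalizer) so that partial results can be combined correctly. This reduces the extra memory needed by self‑attention from O(N²) to O(N) in the sequence length N, which makes it possible to process very long sequences with limited GPU memory.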
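The block‑wise softmax idea can be illustrated in plain NumPy. The sketch below is only a conceptual model: real FlashAttention is a fused GPU kernel that also tiles the queries and keeps blocks in on‑chip memory, none of which is reproduced here. The function names and the block size of 4 are assumptions for the example.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version: materializes the full (N, N) score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block_size=4):
    """Visit K/V block by block, keeping only running softmax statistics per query."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running maximum of the scores seen so far
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = Q @ k_blk.T / np.sqrt(d)                  # only (n, block_size) at a time
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)                  # rescales the old accumulators
        p = np.exp(scores - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```

The tiled version produces the same output as the reference while never holding more than one block of scores, which is the property that lets the memory cost grow linearly with sequence length.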
https://huggingface.co/THUDM/glm-4-9b/blob/main/modeling_chatglm.py
GLM‑4‑Plus Evaluation on Long‑Text Tasks
The GLM‑4 family was evaluated on the THUCNews news‑classification dataset with varying input lengths to study the impact of context window size on accuracy.
Model Variants
GLM‑4‑Plus – flagship model, 128K context
GLM‑4‑0520 – high‑intelligence model, 128K context
GLM‑4‑Long Beta – ultra‑long input, 1M context
GLM‑4‑AirX – fast inference, 8K context
GLM‑4‑Air – cost‑effective, 128K context
GLM‑4‑Flash – free access, 128K context
Evaluation Method
Each news article is truncated to a specified number of characters and sent to the model via a chat‑completion API. A prediction counts as correct when the ground‑truth category name appears in the model’s response.
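import requests

# url, headers and class_names are assumed to be defined elsewhere
# (API endpoint, auth headers, and the list of candidate category names).
def news_classify(news_path, texlen, model):
    # Read the article and truncate it to texlen characters (UTF-8 is assumed).
    with open(news_path, encoding='utf-8') as f:
        text = f.read()[:texlen]
    data = {
        "model": model,
        "messages": [{"role": "user",
                      # Prompt: "Please classify the news below; candidate categories: ..."
                      "content": f'''请对下面的新闻进行分类,待选类别有: {class_names}
{text}''' }]
    }
    try:
        response = requests.post(url, headers=headers, json=data, timeout=200)
        answer = response.json()['choices'][0]['message']['content']
        # The parent directory name of the file is the ground-truth label.
        return news_path.split('/')[-2] in answer
    except Exception:
        # Network, JSON or schema errors: leave the sample unscored.
        return None
Results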
GLM‑4‑Plus consistently achieved classification accuracy above 0.88 on long‑text inputs, outperforming the other GLM‑4 variants.
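For reference, an accuracy figure like the one above could be aggregated from the True/False/None values that news_classify returns. The article does not say how failed calls were counted, so excluding None results here is an assumption, and the helper name accuracy and the sample list are hypothetical.

```python
from typing import Iterable, Optional

def accuracy(results: Iterable[Optional[bool]]) -> Optional[float]:
    """Fraction of True among scored results; None entries (failed calls) are skipped."""
    scored = [r for r in results if r is not None]
    if not scored:
        return None   # nothing could be scored
    return sum(scored) / len(scored)

# Hypothetical outcomes: three correct, one wrong, one failed request.
sample = [True, True, False, None, True]
print(accuracy(sample))
```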
GLM‑4‑Plus Capabilities
Stronger language understanding, obtained through large‑scale synthetic training data, with improved reasoning on mathematics and code.
Long‑text reasoning, enabled by a carefully tuned mix of short‑ and long‑text data and by advanced pre‑training techniques.
API Usage Examples
Synchronous Call
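from zhipuai import ZhipuAI

client = ZhipuAI(api_key="")  # your API key
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
        # "As a marketing expert, please create a catchy slogan for the Zhipu open platform"
        {"role": "user", "content": "作为一名营销专家,请为智谱开放平台创作一个吸引人的slogan"},
        # "Of course. To create a catchy slogan, please tell me something about your product"
        {"role": "assistant", "content": "当然,为了创作一个吸引人的slogan,请告诉我一些关于您产品的信息"},
        # "Zhipu AI Open Platform"
        {"role": "user", "content": "智谱AI开放平台"},
        # "Wisdom opens the future, charting the infinite: Zhipu AI, putting innovation within reach!"
        {"role": "assistant", "content": "智启未来,谱绘无限一智谱AI,让创新触手可及!"},
        # "Create a more precise, more appealing slogan"
        {"role": "user", "content": "创造一个更精准、吸引人的slogan"}
    ]
)
print(response.choices[0].message)
Streaming Call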
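from zhipuai import ZhipuAI

client = ZhipuAI(api_key="")  # your API key
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
        # System: "You are an assistant who enjoys answering all kinds of questions; your task is
        # to give users professional, accurate and insightful advice."
        {"role": "system", "content": "你是一个乐于解答各种问题的助手,你的任务是为用户提供专业、准确、有见地的建议。"},
        # User: "I am very interested in the planets of the solar system, especially Saturn. Please give
        # basic information about Saturn: its size, composition, ring system and any unique phenomena."
        {"role": "user", "content": "我对太阳系的行星非常感兴趣,特别是土星。请提供关于土星的基本信息,包括其大小、组成、环系统和任何独特的天文现象。"}
    ],
    stream=True,
)
# With stream=True the response is consumed chunk by chunk.
for chunk in response:
    print(chunk.choices[0].delta)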
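When the full reply is needed rather than individual deltas, the streamed pieces can be joined. The helper below is a sketch that assumes each chunk exposes choices[0].delta.content, as in OpenAI‑style streaming responses; that attribute layout is not confirmed by this article. The name collect_stream_text is hypothetical, and the demo uses stand‑in objects instead of a real API call.

```python
from types import SimpleNamespace

def collect_stream_text(chunks) -> str:
    """Concatenate the text pieces carried by streamed chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        # Assumption: the text lives in delta.content and may be None or missing.
        piece = getattr(chunk.choices[0].delta, "content", None)
        if piece:
            parts.append(piece)
    return "".join(parts)

def _fake_chunk(text):
    """Stand-in object with the assumed chunk shape, for demonstration only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

demo = [_fake_chunk("Saturn "), _fake_chunk(None), _fake_chunk("has rings.")]
print(collect_stream_text(demo))
```

With a real streaming response, the same function would be called as collect_stream_text(response) in place of the print loop.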
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
