How RoPE and FlashAttention Empower GLM-4-Plus for Long-Text Mastery

This article explains the core mechanisms of Transformer models, details the Rotary Position Embedding (RoPE) and FlashAttention techniques for handling long sequences, introduces the GLM-4-Plus series, and presents an empirical evaluation on the THUCNews dataset in which GLM-4-Plus outperforms the other GLM-4 variants on long-text inputs.

Sohu Tech Products

Sep 11, 2024

Transformer Model Overview

The Transformer architecture is built around the self‑attention mechanism, which allows every token in a sequence to attend to all other tokens simultaneously. This contrasts with recurrent networks, which process tokens one step at a time, and with convolutional networks, which only see a local window of neighboring tokens in each layer.

Positional Encoding

Because self‑attention has no intrinsic notion of order, Transformers add positional encodings to the token embeddings. The classic sinusoidal encoding uses fixed sine and cosine functions:

For even dimension 2i: PE(pos, 2i) = sin(pos / 10000^{2i/d_model})

For odd dimension 2i+1: PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

The dimensionality of the encoding matches the model dimension.
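
The two formulas can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic: the function name sinusoidal_encoding is hypothetical, and real implementations compute the whole table at once with tensor operations.

```python
import math

def sinusoidal_encoding(pos: int, d_model: int) -> list[float]:
    """Return the d_model-dimensional sinusoidal encoding for position `pos`."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # `i` is the even index 2i from the formula, so the exponent is i / d_model.
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)          # even dimension
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)  # odd dimension
    return pe
```

Each sine/cosine pair shares one frequency, and the frequency drops as the dimension index grows, so low dimensions change quickly with position while high dimensions change slowly.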

Parallelism and Residual Connections

Within a layer, the self‑attention output for one position does not depend on the outputs of earlier positions, so all positions can be computed in parallel, which greatly accelerates training. The layers themselves are still applied one after another. Residual connections and layer normalization are added around each sub‑layer to stabilize deep networks and mitigate vanishing or exploding gradients.
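
As a rough sketch of how a residual connection and layer normalization wrap a sub‑layer, the snippet below uses the post‑norm arrangement LayerNorm(x + Sublayer(x)). The helper names are hypothetical, the learnable gain and bias of layer normalization are omitted, and many later models place the normalization before the sub‑layer instead.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Post-norm arrangement: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])
```

Because the input x is added back to the sub‑layer output, the gradient always has a direct path through the addition, which is what makes very deep stacks trainable.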

Self‑Attention Mechanism

For each token the model computes three vectors via linear projections: query (Q), key (K) and value (V). Attention scores are obtained by the scaled dot‑product of Q and K, followed by a softmax to produce attention weights that are applied to V:

Attention(Q, K, V) = softmax((Q·K^T) / sqrt(d_k)) · V
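
The formula can be traced step by step in plain Python. This is a minimal, unoptimized sketch for a single head with no masking; the function names are illustrative, and Q, K and V are lists of row vectors.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query that scores all keys equally returns the average of the value vectors, while a query that strongly matches one key returns almost exactly that key's value.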

Long‑Text Processing Techniques

Rotary Position Embedding (RoPE)

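RoPE encodes position by rotating each pair of dimensions in the query and key vectors by an angle proportional to the token's position. The rotation preserves the magnitude of each vector, and the dot product between a rotated query and a rotated key depends only on their relative offset, so relative position information enters the attention scores directly. Because no learned position table is involved, RoPE can be applied at any sequence length, which helps the model handle contexts longer than those seen during training.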
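
The relative‑position property can be checked numerically with a small plain‑Python sketch. The helper rope_rotate is hypothetical, and the frequency base of 10000 is the commonly used default, assumed here rather than taken from the GLM‑4 configuration.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # `i` is the even index, i.e. 2i in the usual notation
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x1 * c + x0 * s]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Same offset (3) at two different absolute positions.
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 2))
s2 = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores s1 and s2 agree up to floating‑point error, because only the difference between the query and key positions survives in the dot product.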

https://huggingface.co/THUDM/glm-4-9b/blob/main/modeling_chatglm.py

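import torch

@torch.jit.script
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [batch, num_heads, seq_len, head_dim]
    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    # Only the first rot_dim channels are rotated; the rest pass through unchanged.
    rot_dim = rope_cache.shape[-2] * 2
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # Truncate the cache to the current sequence length.
    rope_cache = rope_cache[:, :sq]
    # Group channels into (even, odd) pairs; the cache holds (cos, sin) per pair.
    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
    # 2D rotation of each pair: (x0*cos - x1*sin, x1*cos + x0*sin)
    x_out2 = torch.stack([
        xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
        xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1]
    ], -1)
    x_out2 = x_out2.flatten(3)
    return torch.cat((x_out2, x_pass), dim=-1)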

FlashAttention

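FlashAttention computes exact attention without ever materializing the full N×N attention matrix in GPU memory. It splits Q, K and V into blocks that fit in fast on‑chip SRAM, updates the softmax incrementally using a running maximum and a running normalization sum, and recomputes attention blocks during the backward pass instead of storing them. This reduces the memory footprint of self‑attention from O(N²) to O(N) in the sequence length, while the amount of computation stays O(N²), and makes it possible to process very long sequences with limited GPU memory.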
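
The incremental softmax at the heart of this idea can be sketched in plain Python for a single query. This is not the fused GPU kernel, which also tiles over queries and manages on‑chip memory explicitly; it only shows that processing keys and values block by block gives the same result as the full computation. Both function names are hypothetical.

```python
import math

def naive_attention_row(q, K, V):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def tiled_attention_row(q, K, V, block=2):
    """Process K/V in blocks, keeping only a running max, normalizer and output."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")        # running max of the scores seen so far
    z = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescale earlier partial sums to the new max
        w = [math.exp(s - m_new) for s in scores]
        z = z * correction + sum(w)
        acc = [a * correction + sum(wi * v[j] for wi, v in zip(w, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]
```

The tiled version never holds more than one block of scores at a time, which is why the memory needed per query is independent of the sequence length.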

https://huggingface.co/THUDM/glm-4-9b/blob/main/modeling_chatglm.py

GLM‑4‑Plus Evaluation on Long‑Text Tasks

The GLM‑4 family was evaluated on the THUCNews news‑classification dataset with varying input lengths to study the impact of context window size on accuracy.

Model Variants

GLM‑4‑Plus – flagship model, 128K context

GLM‑4‑0520 – high‑intelligence model, 128K context

GLM‑4‑Long Beta – ultra‑long input, 1M context

GLM‑4‑AirX – fast inference, 8K context

GLM‑4‑Air – cost‑effective, 128K context

GLM‑4‑Flash – free access, 128K context

Evaluation Method

Each news article is truncated to a specified number of characters and sent to the model via a chat‑completion API. A response counts as correct if it contains the ground‑truth category name, which is taken from the name of the article's parent directory.

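import requests

# `url`, `headers` and `class_names` are module-level settings defined elsewhere
# (API endpoint, auth headers, and the list of candidate categories).

def news_classify(news_path, texlen, model):
    # Read the article and keep only the first `texlen` characters.
    with open(news_path) as f:
        text = f.read()[:texlen]
    data = {
        "model": model,
        "messages": [{"role": "user",
                      # Prompt: "Please classify the news below; the candidate categories are: ..."
                      "content": f'''请对下面的新闻进行分类，待选类别有: {class_names}
 {text}''' }]
    }
    try:
        response = requests.post(url, headers=headers, json=data, timeout=200)
        # The ground-truth label is the name of the article's parent directory.
        label = news_path.split('/')[-2]
        return label in response.json()['choices'][0]['message']['content']
    except (requests.RequestException, KeyError, IndexError, TypeError, ValueError):
        # Request failed or the response was malformed: no verdict for this article.
        return None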
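
Turning the per‑article outcomes into an accuracy figure could look like the sketch below. The helper accuracy is hypothetical, and excluding failed requests (None) from the denominator is an assumption; the article does not say how such cases were counted.

```python
def accuracy(results):
    """Accuracy over True/False outcomes; None (failed requests) is excluded."""
    scored = [r for r in results if r is not None]
    if not scored:
        return None  # nothing could be scored
    return sum(scored) / len(scored)
```

Running this once per model and per truncation length gives one accuracy value for each point in the comparison.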

Results

GLM‑4‑Plus consistently achieved classification accuracy above 0.88 on long‑text inputs, outperforming the other GLM‑4 variants.

GLM‑4‑Plus Capabilities

Enhanced language understanding through massive synthetic data, improving reasoning on mathematics and code.

Long‑text inference enabled by precise short‑long data mixing and advanced pre‑training techniques.

API Usage Examples

Synchronous Call

from zhipuai import ZhipuAI
client = ZhipuAI(api_key="")  # your API key
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
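        {"role": "user", "content": "As a marketing expert, please create an attention-grabbing slogan for the Zhipu Open Platform"},
        {"role": "assistant", "content": "Of course. To create an attention-grabbing slogan, please tell me a bit about your product"},
        {"role": "user", "content": "Zhipu AI Open Platform"},
        {"role": "assistant", "content": "Enlightening the future, charting the infinite: Zhipu AI, putting innovation within reach!"},
        {"role": "user", "content": "Create a more precise, more attention-grabbing slogan"}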
    ]
)
print(response.choices[0].message)

Streaming Call

from zhipuai import ZhipuAI
client = ZhipuAI(api_key="")
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=[
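        {"role": "system", "content": "You are an assistant who enjoys answering all kinds of questions. Your task is to provide users with professional, accurate, and insightful advice."},
        {"role": "user", "content": "I am very interested in the planets of the solar system, especially Saturn. Please provide basic information about Saturn, including its size, composition, ring system, and any unique astronomical phenomena."}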
    ],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta)
