
Why 1M Context Length Matters: Inside GLM 5.2’s New Techniques

The article examines how 1‑million‑token context has become a standard feature in modern LLMs, explains the compute and memory challenges it brings, reviews the sparse‑attention and token‑selection tricks (including GLM 5.2’s IndexShare and LayerSplit), and outlines practical evaluation methods for measuring long‑context effectiveness.

IT Services Circle

Jun 24, 2026


1M context adoption

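A 1M-token context window used to be a novelty and is now close to standard. According to the original post, the latest models from the "big three" vendors support it, as do DeepSeek, MiniMax, MiMo, and the newly released GLM 5.2 in China.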

Technical challenges

Two fundamental obstacles arise when extending context length to 1M tokens:

Attention computation grows quadratically with sequence length (O(N²)) because each token attends to every other token.

The KV cache must store key and value vectors for every token, so GPU memory consumption grows with context length and becomes prohibitive at 1M tokens.
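To put rough numbers on both obstacles, here is a back-of-the-envelope sketch. The layer count, head count, head dimension, and 2-byte element size are hypothetical values chosen for illustration, not figures for any model named in this article, and the sketch assumes plain multi-head attention with no KV compression.

```python
def attention_pairs(n_tokens: int, causal: bool = True) -> int:
    """Query-key score computations per head per layer for full attention."""
    if causal:
        # Each token attends to itself and to every earlier token.
        return n_tokens * (n_tokens + 1) // 2
    return n_tokens * n_tokens


def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """Total KV cache size: one K and one V vector per token, head, and layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Hypothetical configuration (assumption, not from the article).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 60, 64, 128

short_ctx, long_ctx = 8_000, 1_000_000
growth = attention_pairs(long_ctx) / attention_pairs(short_ctx)
cache = kv_cache_bytes(long_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"context grows {long_ctx / short_ctx:.0f}x, attention work grows {growth:,.0f}x")
print(f"KV cache at 1M tokens: {cache / 2**30:,.0f} GiB")
```

Under these assumed values the cache alone comes to about 1.97 × 10^12 bytes, close to 2 TB, which is why both sparse attention and KV compression are needed before a 1M window is practical.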

Efficient attention methods

Sparse attention reduces the O(N²) cost. Several token‑selection schemes are used:

Sliding-window attention (SWA): computes attention only over a fixed local window of tokens.

DeepSeek DSA: selects a small subset of tokens and computes attention only over those.

MiniMax MSA: a similar token-selection approach.

GLM 5.2 IndexShare: a refinement of DSA that reuses the same Indexer result across every four layers instead of recomputing it in each layer.
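The schemes in this list differ mainly in how they choose which keys a query may attend to. The sketch below is a toy illustration in plain Python, not the actual DSA, MSA, or IndexShare implementation: the article says only that IndexShare reuses one Indexer result per four layers, so the scoring values, the function names, and the `run_indexer` callback are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def sliding_window_keys(query_pos: int, window: int) -> List[int]:
    """Causal sliding window: the query itself plus the window-1 positions before it."""
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def top_k_keys(index_scores: List[float], query_pos: int, k: int) -> List[int]:
    """Keep the k best-scoring causal positions for one query (ties go to the earlier position)."""
    ranked = sorted(range(query_pos + 1), key=lambda pos: (-index_scores[pos], pos))
    return sorted(ranked[:k])


def select_per_layer(
    n_layers: int, share_every: int, run_indexer: Callable[[int], List[int]]
) -> Tuple[List[List[int]], int]:
    """Call the indexer on the first layer of each group and reuse its result in the rest."""
    selections: List[List[int]] = []
    indexer_calls = 0
    cached: List[int] = []
    for layer in range(n_layers):
        if layer % share_every == 0:
            cached = run_indexer(layer)
            indexer_calls += 1
        selections.append(cached)
    return selections, indexer_calls


# Hypothetical relevance scores of positions 0..5 for the query at position 5.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]

window_keys = sliding_window_keys(query_pos=5, window=3)
selected_keys = top_k_keys(scores, query_pos=5, k=3)
selections, calls = select_per_layer(8, 4, lambda layer: top_k_keys(scores, 5, 3))
print(window_keys, selected_keys, calls)
```

In this toy run the window keeps positions 3 to 5, the top-k selection keeps positions 1, 3, and 5, and the indexer runs twice for eight layers instead of eight times. The open question the article raises later is whether reusing a selection across layers costs any answer quality.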

Memory‑compression techniques

To shrink the per‑token KV footprint, models employ:

MQA (Multi‑Query Attention)

GQA (Grouped Query Attention)

MLA (Multi-Head Latent Attention)
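The head-sharing idea behind MQA and GQA can be shown with simple arithmetic: the fewer distinct K/V heads a layer keeps, the smaller the per-token cache. The configuration below (60 layers, 64 query heads, head dimension 128, 8 GQA groups, 2-byte elements) is hypothetical and chosen only for illustration. MLA compresses the cache in a different way and is not modeled here.

```python
def kv_heads(variant: str, n_query_heads: int, n_groups: int = 1) -> int:
    """Number of distinct K/V heads that must be cached per layer."""
    if variant == "MHA":  # standard multi-head attention: one K/V head per query head
        return n_query_heads
    if variant == "GQA":  # each group of query heads shares one K/V head
        if n_query_heads % n_groups != 0:
            raise ValueError("n_query_heads must be divisible by n_groups")
        return n_groups
    if variant == "MQA":  # all query heads share a single K/V head
        return 1
    raise ValueError(f"unknown variant: {variant}")


def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes cached per token: K and V for every cached head in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration (assumption, not from the article).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, N_GROUPS = 60, 64, 128, 8

footprint = {
    name: kv_bytes_per_token(N_LAYERS, kv_heads(name, N_QUERY_HEADS, N_GROUPS), HEAD_DIM)
    for name in ("MHA", "GQA", "MQA")
}
for name, size in footprint.items():
    print(f"{name}: {size:,} bytes per token")
```

With these assumed values GQA caches 8 times less than the multi-head baseline and MQA 64 times less. The trade-off is that shared K/V heads carry less information, which is why vendors tune the compression ratio against quality.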

GLM 5.2 adds a LayerSplit mechanism that lowers per-GPU memory usage in multi-GPU deployments. The original post does not describe how it works.

Effectiveness concerns

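Supporting a 1M-token window does not guarantee answer quality, because sparse attention and KV compression both discard part of the computation over the context. The author notes that readers often complain of severe hallucination in DeepSeek despite its 1M window, which suggests that raw context length alone is insufficient.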

Evaluation benchmarks

Long‑context effectiveness is measured by two families of tasks:

Precise information retrieval, e.g., the MRCR benchmark, where one task has the model write several thousand poems and then find the one on a specific topic.

Sustained reasoning over long tasks, e.g., FrontierSWE, where problems need several hours of continuous execution. One task asks the agent under test to post-train a Qwen model until it can solve a math problem called FrogsGame.

Additional long-task benchmarks include SWE marathons, with tasks such as implementing a compiler from scratch or refactoring a large project, each comparable to months of work for a human team. The benchmark pages also expose the full reasoning trace of each tested agent.
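The first family, precise retrieval, can be sketched as a small test harness. This is a simplified needle-in-a-haystack check written for illustration, not MRCR's actual task format or scoring. The two "models" are stand-in functions (one with perfect recall, one that only sees the tail of the context), and every name here is hypothetical.

```python
import random
from typing import Tuple


def make_task(n_items: int, needle_index: int, seed: int = 0) -> Tuple[str, str, str]:
    """Build a long context of similar lines, one per topic, and pick one to ask about."""
    rng = random.Random(seed)
    codes = [f"{rng.randrange(10**6):06d}" for _ in range(n_items)]
    context = "\n".join(
        f"Poem {i} about topic-{i}: secret {codes[i]}" for i in range(n_items)
    )
    return context, f"topic-{needle_index}", codes[needle_index]


def exact_match(prediction: str, expected: str) -> bool:
    """Score a single answer: all or nothing."""
    return prediction.strip() == expected


def lookup_model(context: str, topic: str) -> str:
    """Stand-in for a model with perfect recall over the whole context."""
    for line in context.splitlines():
        if f"about {topic}:" in line:
            return line.rsplit(" ", 1)[-1]
    return ""


def truncating_model(context: str, topic: str, keep_last: int = 100) -> str:
    """Stand-in for a model that effectively uses only the last keep_last lines."""
    tail = "\n".join(context.splitlines()[-keep_last:])
    return lookup_model(tail, topic)


context, topic, answer = make_task(n_items=2000, needle_index=5)
full_ok = exact_match(lookup_model(context, topic), answer)
tail_ok = exact_match(truncating_model(context, topic), answer)
print(f"full-context model correct: {full_ok}, tail-only model correct: {tail_ok}")
```

Both stand-ins accept the same 2000-line input, but only the one that actually uses the early part of the context finds the needle. That gap between accepting a long context and using it is what retrieval benchmarks are meant to expose.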

Training implications

Earlier training runs rarely covered contexts this long, and reasoning traces for long tasks were hard to obtain, so further progress depends on adding long-task chains to training data. Zhipu's slime framework provides infrastructure for reinforcement-learning post-training on such tasks.

Original article

Source: reposted with permission from the 闪客 official account
Author: 飞天闪客
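A 1M context window used to be a novelty; now more and more models ship it as standard.
The latest models from the big three all support 1M context, and in China DeepSeek, MiniMax, MiMo, and the just-released GLM support it too.
Oh, and a small tip: on OpenRouter you can drag the context-length filter to 1M to see every model that supports it. Readers often ask in the comments how much a model costs or how long its context is, and you can get those answers right there.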
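The technical difficulty of 1M context used to come down to two things: too much compute and too much GPU memory.
The compute is heavy because every token has to attend to every other token, so the overall complexity grows as N² in the context length N. Sparse attention was invented to deal with this.
Some variants compute attention only over a fixed window of tokens, which is called SWA, sliding-window attention. Others try to select a small number of tokens to attend to, such as DeepSeek's DSA, MiniMax's MSA, and IndexShare, GLM 5.2's refinement of DSA, which in short reuses the same DSA Indexer result across every four layers.
The memory is heavy because each token's K and V vectors have to be cached, in what is called the KV cache, so a longer context naturally means more cached data. MQA, GQA, MLA, and similar compression techniques were invented to lighten the load, and the idea is again the unglamorous business of compressing vectors and reusing vectors.
On top of that, Zhipu's GLM 5.2 adds a technique called LayerSplit, which lowers per-GPU memory usage in multi-GPU setups. In short, everyone is pulling out their own tricks to cut the load wherever they can.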
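Compute and memory are purely technical problems, and they are gradually being cracked. But as you can probably tell, these are tricks ordinary people could think of too; the open question is whether they hurt quality. After all, a lot of the attention computation over the context is simply cut away, so what guarantees there is no impact?
So people have wised up: merely supporting 1M context no longer impresses anyone, and what counts is effectiveness. For example, I often hear readers complain that DeepSeek supports 1M but hallucinates badly. It is like someone claiming they can read a whole book in an hour: they finished it but remember nothing, which makes it pointless.
Of course, when a model vendor ships a scheme, it has surely run comparison and ablation experiments, tried many schemes and compression ratios, and picked one that costs little in quality while cutting compute and memory as much as possible. That is why evaluating how effective a model's long context really is becomes so important.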
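There are many specific test methods, but they fall roughly into two categories.
One is the ability to pinpoint something in a sea of information. MRCR is an example: one of its tasks has the AI write several thousand poems and then asks it to find the one on a specific topic.
The other is long-horizon task ability: give the model a brutally hard problem that needs several hours of continuous running to produce a result, and check that it does not lose the thread after a long run has piled up a very long context.
For example, one FrontierSWE task is to post-train a Qwen model so that it can solve a math problem called FrogsGame.
Even more brutal is the SWE marathon, which sounds scary from the name alone. It is full of tasks like implementing a compiler from scratch or refactoring a large project, work that would take a human team several months.
Click into a task and you can see the full reasoning trace of the agent being tested. Looking at this test set, I get the feeling that humans really cannot compete with AI here; just imagining these problems gives me a headache.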
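So how do you improve effectiveness? There is no shortcut: it comes down to training. Earlier training never practiced on contexts this long, and data such as reasoning traces for long tasks was hard to come by, but with the shift toward agents, every model's training data now has to include these long-task chains.
You can also invest in the infrastructure that generates these complex tasks and environments. Zhipu's slime framework, for instance, lays the groundwork for running reinforcement-learning post-training on such tasks.
But there is a long way to go. Humans already struggle to give AI harder tasks, and the traditional approach of scraping the whole web for data no longer gets you further, which is one reason for the current AI bottleneck.
Come on, humanity! I hope we fully crack this hard problem in my lifetime. All I can do is explain the technology to a wider audience so that more capable people join in and knock down the obstacles one by one.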


