Testing 1M‑Token LLMs with a Novel Medal‑Insertion Benchmark
The article presents a practical method for evaluating 1‑million‑token LLMs by inserting structured medal data into a classic Chinese novel, provides a full Python script for the test, shares results on GLM‑4‑long, and discusses training techniques and open‑source resources for long‑context models.
In response to growing business needs for handling very long texts in prompts, the author explores how to evaluate large language models (LLMs) that support up to 1 million tokens, such as the GLM‑4‑long model from the BigModel Model Center.
Limitations of Existing "Needle in a Haystack" Tests
The traditional "needle in a haystack" benchmark inserts short phrases into a long text and asks the model to retrieve or reason over them. It suffers from two major weaknesses: many test items can be answered by attending to a single sentence, and most models have already seen similar patterns during training, so near-perfect scores carry little discriminative power.
Proposed Medal‑Insertion Benchmark
The author designs a more challenging test by embedding information about the Chinese team's Olympic medal counts into the classic novel Dream of the Red Chamber (also known in English as The Story of the Stone). The inserted sentences are distributed evenly throughout the text, so the model must process the entire context to answer correctly.
from zhipuai import ZhipuAI


def read_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()


def write_txt(file_path, content):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)


def generate_insert_positions(content_length, num_inserts):
    """Generate evenly spaced insertion positions."""
    step = content_length // (num_inserts + 1)
    return [(i + 1) * step for i in range(num_inserts)]


def main(file_path, gold_list, silver_list, bronze_list):
    # Truncate the novel so text plus insertions fits the 1M-token window
    original_content = read_txt(file_path)[:1024 * (1024 - 2)]

    # Build the sentences to hide in the text, interleaving gold/silver/bronze.
    # Each reads "The Chinese team won N gold/silver/bronze medals."
    sentences = []
    max_len = max(len(gold_list), len(silver_list), len(bronze_list))
    for i in range(max_len):
        if i < len(gold_list):
            sentences.append(f"中国队获得了 金牌 {gold_list[i]} 枚。")
        if i < len(silver_list):
            sentences.append(f"中国队获得了 银牌 {silver_list[i]} 枚。")
        if i < len(bronze_list):
            sentences.append(f"中国队获得了 铜牌 {bronze_list[i]} 枚。")

    # Splice each sentence into the novel at its evenly spaced position
    num_inserts = len(sentences)
    positions = generate_insert_positions(len(original_content), num_inserts)
    updated_parts = []
    last = 0
    for i, pos in enumerate(positions):
        updated_parts.append(original_content[last:pos])
        updated_parts.append("\n" + sentences[i] + "\n")
        last = pos
    updated_parts.append(original_content[last:])
    return "".join(updated_parts)


if __name__ == "__main__":
    file_path = "The_Story_of_the_Stone.txt"
    gold_list = [3, 2, 1]
    silver_list = [2, 8, 9]
    bronze_list = [7, 10, 13]
    # Prompt: "Based on the following text, compile the gold, silver and bronze
    # medal counts won by the Chinese team. Format: {...}"
    prompt = "请你根据如下文本,整理中国队获得的金银铜牌数。格式: {\"金牌数\":[x,x,x,...],\"银牌数\":[x,x,x,...],\"铜牌数\":[x,x,x,...]}"
    counting = main(file_path, gold_list, silver_list, bronze_list)
    print(len(counting))

    client = ZhipuAI(api_key="XXX")
    response = client.chat.completions.create(
        model="glm-4-long",
        messages=[{"role": "user", "content": prompt + counting}]
    )
    print(response.choices[0].message.content)

Running this script against GLM-4-long yields the exact medal counts:
{"金牌数":[3,2,1],"银牌数":[2,8,9],"铜牌数":[7,10,13]}

The model answers correctly, demonstrating that the benchmark can differentiate long-context capabilities.
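Since the prompt requests a fixed JSON format, the reply can be scored mechanically rather than eyeballed. A minimal sketch (not part of the original script; `check_answer` is a hypothetical helper) compares the parsed reply against the lists that were inserted into the novel:

```python
import json

def check_answer(model_output, gold_list, silver_list, bronze_list):
    """Parse the model's JSON reply and compare each medal series
    against the ground-truth lists inserted into the novel."""
    result = json.loads(model_output)
    return (result["金牌数"] == gold_list
            and result["银牌数"] == silver_list
            and result["铜牌数"] == bronze_list)

# Example with the reply shown above:
reply = '{"金牌数":[3,2,1],"银牌数":[2,8,9],"铜牌数":[7,10,13]}'
print(check_answer(reply, [3, 2, 1], [2, 8, 9], [7, 10, 13]))  # True
```

In practice the reply may need light cleanup first (e.g. stripping a Markdown code fence) before `json.loads` will accept it.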
Extended Financial‑Report QA Example
The author also shows a second scenario: feeding Apple’s Q3 financial report (converted from PDF via Mathpix) into the model and asking specific questions. Sample prompts and model answers include:
Q: "Apple 2024 Q3 R&D cost?" A: "US$8.0 billion"
Q: "Revenue growth vs 2023 Q3?" A: "US$3.98 billion"
Q: "Gross margin increase percentage?" A: "8.97%"
How Long‑Context Models Are Trained
The article explains that building a 1M-token model typically follows a progressive curriculum: training starts at a 4K context, moves to 8K, and gradually scales through 128K to 1M. Techniques such as batch-sort (grouping sequences of similar length into the same batch) reduce padding waste, and a small auxiliary model can generate synthetic instruction data for longer fragments.
Infrastructure support is crucial, but fine‑tuning can be done with modest resources once the base model is ready.
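The batch-sort idea can be illustrated in a few lines. This is a toy sketch, not the author's training code: Python lists stand in for token sequences, and `padding_waste` counts how many pad tokens each batching scheme would spend when every sample is padded to its batch's longest member.

```python
def batch_sort(samples, batch_size):
    """Group samples of similar length into the same batch so that
    padding to the batch maximum wastes as few tokens as possible."""
    ordered = sorted(samples, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_waste(batches):
    """Total pad tokens: each sample is padded to its batch's max length."""
    waste = 0
    for batch in batches:
        longest = max(len(s) for s in batch)
        waste += sum(longest - len(s) for s in batch)
    return waste

# Mixed short and long sequences, as in a long-context training corpus
samples = [[0] * n for n in (10, 1000, 12, 980, 8, 1024)]
naive = [samples[i:i + 2] for i in range(0, len(samples), 2)]
print(padding_waste(naive), padding_waste(batch_sort(samples, 2)))  # 2974 994
```

Sorting by length cuts the padding waste by roughly two thirds in this toy case; real trainers combine this with sequence packing and shuffling among similar-length buckets to avoid biasing gradient order.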
Open‑Source Resources
An open-source 9B-parameter version of the model with a 1M context window is available at https://huggingface.co/THUDM/glm-4-9b-chat-1m. This release illustrates the broader trend of Chinese LLM open-source efforts lowering entry barriers and fostering community-wide advancement.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.