Testing 1M‑Token LLMs with a Novel Medal‑Insertion Benchmark

The article presents a practical method for evaluating 1‑million‑token LLMs by inserting structured medal data into a classic Chinese novel, provides a full Python script for the test, shares results on GLM‑4‑long, and discusses training techniques and open‑source resources for long‑context models.

Baobao Algorithm Notes

In response to growing business needs for handling very long texts in prompts, the author explores how to evaluate large language models (LLMs) that support up to 1 million tokens, such as the GLM‑4‑long model from the BigModel Model Center.

Limitations of Existing Needle-in-a-Haystack Tests

The traditional "needle in a haystack" benchmark inserts short phrases into a long text and asks the model to retrieve or reason over them. It suffers from two major issues: many test items can be answered by attending to a single sentence, and most models have already seen similar patterns during training, so near-perfect scores are common and the test loses discriminative power.
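For reference, the classic setup amounts to a single-sentence insertion. A minimal sketch (the `insert_needle` helper, the filler text, and the needle sentence are all illustrative, not from any specific benchmark implementation):

```python
def insert_needle(haystack: str, needle: str, depth: float = 0.5) -> str:
    """Insert one 'needle' sentence at a relative depth in the text."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]

# The answer lives in one sentence, so the model can succeed by
# attending to a single location instead of integrating the full context.
haystack = "Filler prose. " * 10000
needle = "The secret passphrase is blue-falcon."
context = insert_needle(haystack, needle, depth=0.5)
```

Because one lookup suffices, this setup cannot distinguish models that truly aggregate information across the whole window, which motivates the medal-insertion design below.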

Proposed Medal‑Insertion Benchmark

The author designs a more challenging test by embedding information about the Chinese team's Olympic medal counts into the classic novel Dream of the Red Chamber. The inserted sentences are distributed uniformly throughout the text, ensuring the model must process the entire context.

def read_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

def write_txt(file_path, content):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)

def generate_insert_positions(content_length, num_inserts):
    """Generate evenly spaced insertion positions."""
    step = content_length // (num_inserts + 1)
    positions = [(i + 1) * step for i in range(num_inserts)]
    return positions

def main(file_path, gold_list, silver_list, bronze_list):
    # Keep roughly 1M characters, leaving headroom for the prompt.
    original_content = read_txt(file_path)[:1024*(1024-2)]
    sentences = []
    max_len = max(len(gold_list), len(silver_list), len(bronze_list))
    for i in range(max_len):
        # Each sentence reads "Team China won N gold/silver/bronze medals."
        if i < len(gold_list):
            sentences.append(f"中国队获得了 金牌 {gold_list[i]} 枚。")
        if i < len(silver_list):
            sentences.append(f"中国队获得了 银牌 {silver_list[i]} 枚。")
        if i < len(bronze_list):
            sentences.append(f"中国队获得了 铜牌 {bronze_list[i]} 枚。")
    num_inserts = len(sentences)
    positions = generate_insert_positions(len(original_content), num_inserts)
    updated_parts = []
    last = 0
    for i, pos in enumerate(positions):
        updated_parts.append(original_content[last:pos])
        # Set each inserted sentence off with newlines on both sides.
        updated_parts.append("\n" + sentences[i] + "\n")
        last = pos
    updated_parts.append(original_content[last:])
    return "".join(updated_parts)

if __name__ == "__main__":
    file_path = "The_Story_of_the_Stone.txt"
    gold_list = [3, 2, 1]
    silver_list = [2, 8, 9]
    bronze_list = [7, 10, 13]
    # Prompt (in Chinese): "Based on the text below, tally the gold, silver,
    # and bronze medals won by Team China. Format: {...}"
    prompt = "请你根据如下文本,整理中国队获得的金银铜牌数。格式: {\"金牌数\":[x,x,x,...],\"银牌数\":[x,x,x,...],\"铜牌数\":[x,x,x,...]}"
    counting = main(file_path, gold_list, silver_list, bronze_list)
    print(len(counting))
    from zhipuai import ZhipuAI
    client = ZhipuAI(api_key="XXX")  # replace "XXX" with your API key
    response = client.chat.completions.create(
        model="glm-4-long",
        messages=[{"role": "user", "content": prompt + counting}]
    )
    print(response.choices[0].message.content)

Running this script against GLM‑4‑long yields the exact medal counts:

{"金牌数":[3,2,1],"银牌数":[2,8,9],"铜牌数":[7,10,13]}

The model answers correctly, demonstrating that the benchmark can differentiate long‑context capabilities.
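To score the test automatically rather than by eye, the model's JSON reply can be checked against the ground-truth lists. A small sketch (the `verify_answer` helper is an assumption, not part of the original script; the keys match the prompt's output format):

```python
import json

def verify_answer(reply: str, gold, silver, bronze) -> bool:
    """Parse the model's JSON reply and require every medal list to match exactly."""
    data = json.loads(reply)
    return (data["金牌数"] == gold
            and data["银牌数"] == silver
            and data["铜牌数"] == bronze)

# The reply GLM-4-long produced for the lists in the script above:
reply = '{"金牌数":[3,2,1],"银牌数":[2,8,9],"铜牌数":[7,10,13]}'
ok = verify_answer(reply, [3, 2, 1], [2, 8, 9], [7, 10, 13])  # True
```

Exact list comparison makes the metric strict: a single missed or hallucinated insertion anywhere in the 1M-token context fails the whole item.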

Extended Financial‑Report QA Example

The author also shows a second scenario: feeding Apple’s Q3 financial report (converted from PDF via Mathpix) into the model and asking specific questions. Sample prompts and model answers include:

Q: "Apple 2024 Q3 R&D cost?" A: "US$8.0 billion"

Q: "Revenue growth vs 2023 Q3?" A: "US$3.98 billion"

Q: "Gross margin increase percentage?" A: "8.97%"
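This scenario reuses the same single-turn pattern as the medal script: the full Mathpix-converted report text is prepended to each question. A minimal sketch (the `build_qa_messages` helper and the placeholder report text are illustrative assumptions):

```python
def build_qa_messages(report_text: str, question: str) -> list:
    """One user turn holding the whole report followed by the question,
    mirroring the `prompt + counting` concatenation in the script above."""
    return [{"role": "user", "content": report_text + "\n\n" + question}]

questions = [
    "Apple 2024 Q3 R&D cost?",
    "Revenue growth vs 2023 Q3?",
    "Gross margin increase percentage?",
]

# Each message list would be passed to client.chat.completions.create(...)
# with model="glm-4-long", exactly as in the medal-insertion script.
messages = build_qa_messages("<report text from Mathpix>", questions[0])
```

Keeping the report and question in one user turn avoids relying on multi-turn context handling and keeps the evaluation comparable across questions.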

How Long‑Context Models Are Trained

The article explains that building a 1M-token model typically follows a progressive curriculum: training starts with a 4k context, then 8k, scaling gradually to 128k and finally 1M. Techniques such as batch-sort (grouping sequences of similar length within a batch) reduce padding waste, and a small auxiliary model can generate synthetic instruction data for longer fragments.
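The batch-sort idea can be illustrated with a toy sketch (assumed for illustration, not the training code of any particular model): sorting sequences by length before batching keeps similar lengths together, so padding each batch to its longest member wastes far fewer tokens.

```python
def batch_by_length(seqs, batch_size):
    """Group sequence indices so each batch holds similar-length sequences."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_waste(seqs, batches):
    """Pad tokens needed when each batch is padded to its longest member."""
    return sum(
        max(len(seqs[i]) for i in batch) - len(seqs[i])
        for batch in batches
        for i in batch
    )

seqs = [[0] * n for n in (5, 100, 7, 96, 6, 98)]  # toy token sequences
naive = [[0, 1], [2, 3], [4, 5]]                  # batches in arrival order
sorted_batches = batch_by_length(seqs, batch_size=2)

print(padding_waste(seqs, naive), padding_waste(seqs, sorted_batches))  # 276 92
```

Real training pipelines add a shuffle over the resulting batches so the model does not see lengths in strictly increasing order, but the padding saving shown here is the core of the technique.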

Infrastructure support is crucial, but fine‑tuning can be done with modest resources once the base model is ready.

Open‑Source Resources

An open-source 9B-parameter version of the model with a 1M context window is available at https://huggingface.co/THUDM/glm-4-9b-chat-1m. This release illustrates the broader trend of Chinese open-source LLM efforts lowering entry barriers and fostering community-wide advancement.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, AI, LLM, prompt-engineering, long-context
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.