Why Today's LLMs Still Struggle with “Learn‑and‑Apply” Tasks: Insights from the CL‑Bench Study
The CL‑Bench benchmark reveals that current large language models largely fail to learn and apply new knowledge from long contexts. This post covers the benchmark's design principles, its all‑or‑nothing scoring, and the error patterns observed across ten cutting‑edge models.
1. The Value of CL‑Bench
CL‑Bench is a benchmark created by Tencent Hunyuan and Fudan University to evaluate a model's ability to learn from a completely new, complex context (up to 65 k tokens) and then answer 1–12 questions that require that newly acquired knowledge. Models that rely solely on pre‑training perform poorly: in an ablation study, answering without genuinely using the provided context yields a task success rate below 1 %.
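The setup above can be pictured as a simple data structure: one shared context paired with a handful of rubric‑graded questions. This is an illustrative sketch only; the field names are hypothetical and not the official CL‑Bench schema.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One question graded against automatically checkable rubric items."""
    question: str
    rubric_items: list[str]  # 10-20 criteria (format, facts, logic, ...)

@dataclass
class CLBenchContext:
    """One benchmark unit: a novel context plus 1-12 dependent tasks."""
    context: str       # up to ~65k tokens of entirely new material
    tasks: list[Task]  # every task requires knowledge from `context`

# Toy instance showing the shape of a benchmark unit.
unit = CLBenchContext(
    context="(fabricated legal code, ~10k tokens)",
    tasks=[Task(question="Which statute applies?",
                rubric_items=["cites article 7", "JSON output"])],
)
```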
2. Design Principles
Each task is built around a strict principle: the model must extract and use new information from the provided context, which is self‑contained and requires no external retrieval or hidden assumptions.
All necessary information is explicitly included in the context, ensuring a fair test of true “in‑context learning”.
3. Task Types and Scale
CL‑Bench comprises four major question categories and 18 sub‑categories, covering a total of 500 contexts, 1 899 tasks, and 31 607 rubric items. The average input length is 10.4 k tokens, with the longest context reaching 65 k tokens.
4. Scoring Mechanism – All‑Or‑Nothing
Each question is paired with 10–20 automatically‑gradable rubric items (format, facts, calculation, logic, etc.). A model receives 1 point only if it satisfies every rubric item; otherwise it gets 0, eliminating “partial credit”.
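The all‑or‑nothing rule is easy to state in code. A minimal sketch (function name and input shape are my own, not from the benchmark's released tooling):

```python
def score_answer(rubric_results: list[bool]) -> int:
    """All-or-nothing scoring: 1 only if every rubric item passes.

    `rubric_results` holds one boolean per rubric item (10-20 per
    question in CL-Bench). A single failure zeroes the whole answer.
    """
    return 1 if rubric_results and all(rubric_results) else 0

# A near-perfect answer still scores 0 under this rule:
score_answer([True] * 14 + [False])  # one missed item -> 0
```

The design choice here is deliberate: without partial credit, a model cannot accumulate points by pattern‑matching its way through the easy rubric items.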
Score = 1: The answer must perfectly satisfy every rubric criterion
Score = 0: Any single criterion not satisfied results in zero
5. Front‑Line Model Performance
Ten state‑of‑the‑art models were evaluated. Key findings include:
Inductive vs. deductive: Tasks requiring empirical discovery (inductive) achieved only 11.8 % success, six percentage points lower than other categories.
Length is a killer: When input exceeds 32 k tokens, all models see their scores drop dramatically.
Higher reasoning level ≠ better performance: GPT‑5.2’s “high” reasoning mode actually reduced accuracy by 5.6 % compared to “low”.
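The length‑degradation finding can be checked with a simple bucketing analysis over per‑task results. A hedged sketch; the record format and bucket edges are assumptions, not CL‑Bench's published analysis code:

```python
from collections import defaultdict

def success_by_length(records, buckets=(8_000, 16_000, 32_000, 65_000)):
    """Group binary task scores (0/1) into context-length buckets.

    `records` is an iterable of (input_tokens, score) pairs. Returns a
    mapping from bucket upper bound to the success rate in that bucket.
    """
    grouped = defaultdict(list)
    for tokens, score in records:
        for cap in buckets:
            if tokens <= cap:
                grouped[cap].append(score)
                break
    return {cap: sum(s) / len(s) for cap, s in grouped.items()}

# Toy data: short contexts fare better than long ones.
rates = success_by_length([(5_000, 1), (5_000, 0), (30_000, 0), (60_000, 0)])
# -> {8000: 0.5, 32000: 0.0, 65000: 0.0}
```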
6. Error Analysis – How Models “Cheat”
Three dominant error types were identified (a single answer can exhibit more than one, so the shares overlap):
Context Ignored (30 %): The model disregards the provided context, e.g. falling back on real‑world law instead of the fabricated legal rules defined in the context.
Context Misused (60 %): The model references the context but applies the wrong rules or parameters.
Format Error (35 %): Output format mistakes such as missing fields or incorrect ordering in JSON.
7. One‑Sentence Takeaway
CL‑Bench acts like a “closed‑book rapid‑read + on‑site‑practice” exam, demonstrating that learning and applying new information on the fly remains the key universal capability still missing from today's large language models.
8. Resources
https://www.clbench.com/
https://github.com/Tencent-Hunyuan/CL-bench
https://huggingface.co/datasets/tencent/CL-bench
