Why Today's LLMs Still Struggle with “Learn‑and‑Apply” Tasks: Insights from the CL‑Bench Study

The CL-Bench benchmark shows that current large language models struggle to learn new knowledge from long contexts and apply it on the spot. This post walks through the benchmark's design, its all-or-nothing scoring, and the error patterns observed across ten cutting-edge models.


1. The Value of CL‑Bench

CL-Bench is a benchmark created by Tencent Hunyuan and Fudan University to evaluate a model's ability to learn from a completely new, complex context (up to 65k tokens) and then answer 1–12 questions that require the newly acquired knowledge. Pre-trained knowledge alone is not enough: an ablation study shows that models succeed on less than 1 % of tasks when they try to answer without genuinely learning from the provided context.

2. Design Principles

Each task is built around a strict principle: the model must extract and use new information from the provided context, which is self‑contained and requires no external retrieval or hidden assumptions.

All necessary information is explicitly included in the context, ensuring a fair test of true “in‑context learning”.
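
To make the "everything is in the context" principle concrete, here is a minimal sketch of how one such task could be turned into a single self-contained prompt. The field names (`context`, `question`) and the `query_model` callable are assumptions for illustration, not CL-Bench's actual interface.

```python
# Minimal sketch: turn one learn-and-apply task into a self-contained prompt.
# The field names ("context", "question") and the query_model callable are
# illustrative assumptions, not the benchmark's actual interface.

def build_prompt(task: dict) -> str:
    """Concatenate the full provided context with one question; no retrieval step."""
    return (
        "Read the following material carefully. Everything needed to answer "
        "is contained in it; do not rely on outside knowledge.\n\n"
        f"--- CONTEXT ---\n{task['context']}\n\n"
        f"--- QUESTION ---\n{task['question']}\n"
    )

def answer_task(task: dict, query_model) -> str:
    """query_model is any callable mapping a prompt string to a model response."""
    return query_model(build_prompt(task))
```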

Illustration of learn‑and‑apply vs traditional prompt inference

3. Task Types and Scale

CL-Bench comprises four major question categories and 18 sub-categories, covering a total of 500 contexts, 1,899 tasks, and 31,607 rubric items. The average input length is 10.4k tokens, with the longest context reaching 65k tokens.
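
For readers who want to poke at the data themselves, here is a rough sketch of loading the released dataset and checking context lengths. The split name and the `context` field are assumptions about the Hugging Face layout, and whitespace word counts are only a crude stand-in for the token counts reported in the paper.

```python
# Rough sketch of inspecting dataset scale. The split name and the "context"
# field are assumptions about the Hugging Face layout; word counts are only a
# crude stand-in for the token counts reported in the paper.
from datasets import load_dataset

ds = load_dataset("tencent/CL-bench", split="test")  # split name assumed

lengths = [len(example["context"].split()) for example in ds]
print(f"contexts loaded: {len(lengths)}")
print(f"mean length (words): {sum(lengths) / len(lengths):,.0f}")
print(f"max length (words):  {max(lengths):,}")
```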

CL‑Bench example tasks

4. Scoring Mechanism – All‑Or‑Nothing

Each question is paired with 10–20 automatically‑gradable rubric items (format, facts, calculation, logic, etc.). A model receives 1 point only if it satisfies every rubric item; otherwise it gets 0, eliminating “partial credit”.

Score = 1: The answer must perfectly satisfy every rubric criterion
Score = 0: Any single criterion not satisfied results in zero
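
In code, the all-or-nothing rule is a one-liner. The sketch below assumes each rubric item has already been graded to a boolean by an automatic checker; the data shape is an illustrative assumption, not the benchmark's actual grading harness.

```python
# All-or-nothing scoring sketch: a task earns 1 only if every rubric item passes.
# rubric_results maps rubric-item names to booleans produced by automatic checkers
# (format, facts, calculation, logic, ...); this data shape is assumed.

def score_task(rubric_results: dict[str, bool]) -> int:
    """1 if every rubric criterion is satisfied, otherwise 0 (no partial credit)."""
    return int(all(rubric_results.values()))

def benchmark_success_rate(per_task_results: list[dict[str, bool]]) -> float:
    """Fraction of tasks that satisfy all of their rubric items."""
    return sum(score_task(r) for r in per_task_results) / len(per_task_results)

# A single failed criterion out of many still yields 0.
print(score_task({"json_format": True, "fact_1": True, "calc_1": False}))  # 0
```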
Scoring illustration

5. Front‑Line Model Performance

Ten state‑of‑the‑art models were evaluated. Key findings include:

Inductive vs. deductive: Tasks requiring empirical discovery (inductive) achieved only 11.8 % success, six percentage points lower than other categories.

Length is a killer: when input exceeds 32k tokens, every model's score drops sharply (a rough bucketing sketch follows this list).

Higher reasoning level ≠ better performance: GPT‑5.2’s “high” reasoning mode actually reduced accuracy by 5.6 % compared to “low”.
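
The bucketing referenced above might look roughly like this; the record fields (`input_tokens`, `score`) are illustrative assumptions rather than the paper's analysis code.

```python
# Sketch: group 0/1 task scores by input length to examine the drop beyond 32k tokens.
# Each record is assumed to carry an input token count and the task's 0/1 score.
from collections import defaultdict

def success_by_length(records: list[dict]) -> dict[str, float]:
    buckets = defaultdict(list)
    for r in records:
        if r["input_tokens"] <= 8_000:
            key = "<=8k"
        elif r["input_tokens"] <= 32_000:
            key = "8k-32k"
        else:
            key = ">32k"
        buckets[key].append(r["score"])
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}
```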

Model results

6. Error Analysis – How Models “Cheat”

Three dominant error types were identified:

Context Ignored (30 %): The model disregards the provided context, for example answering as if real-world law applied instead of the fabricated legal rules defined in the context.

Context Misused (60 %): The model references the context but applies wrong rules or parameters.

Format Error (35 %): Output format mistakes such as missing fields or incorrect ordering in JSON (a sketch of such a rubric check follows this list).
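
To illustrate the kind of format rubric item involved, here is a sketch of a checker that requires specific JSON fields in a fixed order. The required field list is a made-up example, not an actual CL-Bench rubric.

```python
# Sketch of a format rubric item: the answer must be valid JSON containing the
# required fields, in order. The required field list here is a made-up example.
import json

def check_json_format(answer: str, required_fields: list[str]) -> bool:
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    # Python dicts preserve key order, so a prefix comparison also checks ordering.
    return list(parsed.keys())[: len(required_fields)] == required_fields

print(check_json_format('{"name": "A", "score": 3}', ["name", "score"]))  # True
print(check_json_format('{"score": 3, "name": "A"}', ["name", "score"]))  # False: wrong order
```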

Error distribution

7. One‑Sentence Takeaway

CL-Bench works like a "closed-book rapid-read plus on-site-practice" exam, and its results show that learning new information on the fly and applying it immediately remains the universal capability that even the newest large language models most conspicuously lack.

8. Resources

https://www.clbench.com/
https://github.com/Tencent-Hunyuan/CL-bench
https://huggingface.co/datasets/tencent/CL-bench
Tags: large language models, benchmark, AI research, LLM evaluation, context learning
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.