Why Today's LLMs Still Struggle with “Learn‑and‑Apply” Tasks: Insights from the CL‑Bench Study

The CL-Bench benchmark shows that current large language models struggle to learn new knowledge from long contexts and apply it on the spot. This post walks through the benchmark's design, its all-or-nothing scoring, and the error patterns observed across ten cutting-edge models.


1. The Value of CL‑Bench

CL-Bench is a benchmark created by Tencent Hunyuan and Fudan University to evaluate a model's ability to learn from a completely new, complex context (up to 65k tokens) and then answer 1–12 questions that require the newly acquired knowledge. Pre-trained knowledge alone is not enough: an ablation study shows that models succeed on less than 1 % of tasks when they try to answer without genuinely learning from the provided context.

2. Design Principles

Each task is built around a strict principle: the model must extract and use new information from the provided context, which is self‑contained and requires no external retrieval or hidden assumptions.

All necessary information is explicitly included in the context, ensuring a fair test of true “in‑context learning”.
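
To make the "everything is in the context" principle concrete, here is a minimal sketch of how one such task could be turned into a single self-contained prompt. The field names (`context`, `question`) and the `query_model` callable are assumptions for illustration, not CL-Bench's actual interface.

```python
# Minimal sketch: turn one learn-and-apply task into a self-contained prompt.
# The field names ("context", "question") and the query_model callable are
# illustrative assumptions, not the benchmark's actual interface.

def build_prompt(task: dict) -> str:
    """Concatenate the full provided context with one question; no retrieval step."""
    return (
        "Read the following material carefully. Everything needed to answer "
        "is contained in it; do not rely on outside knowledge.\n\n"
        f"--- CONTEXT ---\n{task['context']}\n\n"
        f"--- QUESTION ---\n{task['question']}\n"
    )

def answer_task(task: dict, query_model) -> str:
    """query_model is any callable mapping a prompt string to a model response."""
    return query_model(build_prompt(task))
```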

Illustration of learn‑and‑apply vs traditional prompt inference

3. Task Types and Scale

CL-Bench comprises four major question categories and 18 sub-categories, covering a total of 500 contexts, 1,899 tasks, and 31,607 rubric items. The average input length is 10.4k tokens, with the longest context reaching 65k tokens.
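
For readers who want to poke at the data themselves, here is a rough sketch of loading the released dataset and checking context lengths. The split name and the `context` field are assumptions about the Hugging Face layout, and whitespace word counts are only a crude stand-in for the token counts reported in the paper.

```python
# Rough sketch of inspecting dataset scale. The split name and the "context"
# field are assumptions about the Hugging Face layout; word counts are only a
# crude stand-in for the token counts reported in the paper.
from datasets import load_dataset

ds = load_dataset("tencent/CL-bench", split="test")  # split name assumed

lengths = [len(example["context"].split()) for example in ds]
print(f"contexts loaded: {len(lengths)}")
print(f"mean length (words): {sum(lengths) / len(lengths):,.0f}")
print(f"max length (words):  {max(lengths):,}")
```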

CL‑Bench example tasks

4. Scoring Mechanism – All‑Or‑Nothing

Each question is paired with 10–20 automatically‑gradable rubric items (format, facts, calculation, logic, etc.). A model receives 1 point only if it satisfies every rubric item; otherwise it gets 0, eliminating “partial credit”.

Score = 1: The answer must perfectly satisfy every rubric criterion
Score = 0: Any single criterion not satisfied results in zero
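
In code, the all-or-nothing rule is a one-liner. The sketch below assumes each rubric item has already been graded to a boolean by an automatic checker; the data shape is an illustrative assumption, not the benchmark's actual grading harness.

```python
# All-or-nothing scoring sketch: a task earns 1 only if every rubric item passes.
# rubric_results maps rubric-item names to booleans produced by automatic checkers
# (format, facts, calculation, logic, ...); this data shape is assumed.

def score_task(rubric_results: dict[str, bool]) -> int:
    """1 if every rubric criterion is satisfied, otherwise 0 (no partial credit)."""
    return int(all(rubric_results.values()))

def benchmark_success_rate(per_task_results: list[dict[str, bool]]) -> float:
    """Fraction of tasks that satisfy all of their rubric items."""
    return sum(score_task(r) for r in per_task_results) / len(per_task_results)

# A single failed criterion out of many still yields 0.
print(score_task({"json_format": True, "fact_1": True, "calc_1": False}))  # 0
```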
Scoring illustration

5. Front‑Line Model Performance

Ten state‑of‑the‑art models were evaluated. Key findings include:

Inductive vs. deductive: Tasks requiring empirical discovery (inductive) achieved only 11.8 % success, six percentage points lower than other categories.

Length is a killer: when input exceeds 32k tokens, every model's score drops sharply (a rough bucketing sketch follows this list).

Higher reasoning level ≠ better performance: GPT‑5.2’s “high” reasoning mode actually reduced accuracy by 5.6 % compared to “low”.
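
The bucketing referenced above might look roughly like this; the record fields (`input_tokens`, `score`) are illustrative assumptions rather than the paper's analysis code.

```python
# Sketch: group 0/1 task scores by input length to examine the drop beyond 32k tokens.
# Each record is assumed to carry an input token count and the task's 0/1 score.
from collections import defaultdict

def success_by_length(records: list[dict]) -> dict[str, float]:
    buckets = defaultdict(list)
    for r in records:
        if r["input_tokens"] <= 8_000:
            key = "<=8k"
        elif r["input_tokens"] <= 32_000:
            key = "8k-32k"
        else:
            key = ">32k"
        buckets[key].append(r["score"])
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}
```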

Model results

6. Error Analysis – How Models “Cheat”

Three dominant error types were identified:

Context Ignored (30 %): The model disregards the provided context, for example answering as if real-world law applied instead of the fabricated legal rules defined in the context.

Context Misused (60 %): The model references the context but applies wrong rules or parameters.

Format Error (35 %): Output format mistakes such as missing fields or incorrect ordering in JSON (a sketch of such a rubric check follows this list).
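
To illustrate the kind of format rubric item involved, here is a sketch of a checker that requires specific JSON fields in a fixed order. The required field list is a made-up example, not an actual CL-Bench rubric.

```python
# Sketch of a format rubric item: the answer must be valid JSON containing the
# required fields, in order. The required field list here is a made-up example.
import json

def check_json_format(answer: str, required_fields: list[str]) -> bool:
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    # Python dicts preserve key order, so a prefix comparison also checks ordering.
    return list(parsed.keys())[: len(required_fields)] == required_fields

print(check_json_format('{"name": "A", "score": 3}', ["name", "score"]))  # True
print(check_json_format('{"score": 3, "name": "A"}', ["name", "score"]))  # False: wrong order
```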

Error distribution

7. One‑Sentence Takeaway

CL-Bench works like a "closed-book rapid-read plus on-site-practice" exam, and its results show that learning new information on the fly and applying it immediately remains the universal capability that even the newest large language models most conspicuously lack.

8. Resources

https://www.clbench.com/
https://github.com/Tencent-Hunyuan/CL-bench
https://huggingface.co/datasets/tencent/CL-bench
Tags: large language models, benchmark, AI research, LLM evaluation, context learning
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.