
Focused Large Language Models are Stable Many-Shot Learners

FocusICL mitigates the reverse‑scaling of in‑context learning by masking irrelevant demonstration tokens and applying hierarchical batched attention, cutting attention complexity and maintaining stable query focus, with average accuracy gains of about 5% across multiple LLMs and benchmarks.

Xiaohongshu Tech REDtech

Recent studies have shown that traditional in‑context learning (ICL) methods for large language models (LLMs) fail to improve performance as the number of demonstration examples increases, and may even exhibit a reverse‑scaling phenomenon. The root cause is identified as attention dispersion: with many examples, the model’s attention is distracted by irrelevant content and cannot stay focused on the query.
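A toy calculation makes the dispersion effect concrete: if attention scores across tokens are roughly comparable, the softmax mass landing on the query shrinks as demonstration tokens accumulate. This is an illustrative uniform-score model, not the paper's measurement; the function name and token counts below are invented for the sketch.

```python
def query_attention_share(n_demo_tokens: int, n_query_tokens: int = 32) -> float:
    """Fraction of attention mass landing on query tokens under a
    uniform-score toy model: with comparable scores, softmax spreads
    mass ~evenly, so the query's share shrinks as demos are added."""
    return n_query_tokens / (n_demo_tokens + n_query_tokens)

# assuming ~128 tokens per demonstration (made-up figure)
for shots in (4, 64, 256):
    share = query_attention_share(shots * 128)
    print(f"{shots:>3} shots -> query attention share = {share:.3f}")
```

Under this toy model the query's share of attention falls by more than an order of magnitude between 4 and 256 shots, matching the qualitative trend the paper attributes to attention dispersion.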

To address this bottleneck, the Xiaohongshu search team introduced FocusICL, a many‑shot ICL framework presented at EMNLP 2024. FocusICL incorporates two key innovations: (1) a token‑level irrelevant‑content filtering strategy that masks unimportant tokens in demonstrations, and (2) a hierarchical (demonstration‑level) attention mechanism that processes demonstrations in batches, limiting intra‑batch attention competition while aggregating information across batches.
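The batching idea can be sketched as an attention mask in which demonstration tokens attend only within their own batch, while query tokens attend across all batches. This is a hypothetical illustration, not the paper's implementation; `hierarchical_mask` and its parameters are invented for the sketch, and causal ordering is omitted for clarity.

```python
import numpy as np

def hierarchical_mask(demo_lens, query_len, batch_size):
    """Boolean attention mask (True = may attend) for hierarchical
    batched attention: demonstrations are grouped into batches of
    `batch_size`; demo tokens see only their own batch, while query
    tokens see all demonstrations and themselves."""
    # assign each demonstration token a batch id
    batch_ids = []
    for i, length in enumerate(demo_lens):
        batch_ids += [i // batch_size] * length
    n_demo = len(batch_ids)
    n = n_demo + query_len
    ids = np.array(batch_ids)

    mask = np.zeros((n, n), dtype=bool)
    # demonstration tokens: attend only within their own batch
    mask[:n_demo, :n_demo] = ids[:, None] == ids[None, :]
    # query tokens: attend to every batch and to the query itself
    mask[n_demo:, :] = True
    return mask

# 4 demonstrations of 3 tokens each, batches of 2 demos, 2 query tokens
m = hierarchical_mask([3, 3, 3, 3], query_len=2, batch_size=2)
```

Because each batch competes for attention only internally, adding more batches does not dilute the attention any single demonstration receives, while the query still aggregates across all of them.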

The paper (arXiv:2408.13987) reports extensive experiments on the MATH benchmark (12,500 math reasoning questions across 7 sub‑datasets and 5 difficulty levels) using GEMINI‑1.5‑PRO with a 1M‑token context window. Example counts ranging from 1 to 256 were evaluated with greedy decoding, and five runs were averaged for stability. Five out of seven sub‑datasets showed clear reverse‑scaling under standard ICL, confirming the scalability issue.

FocusICL was further evaluated on multiple LLMs (LONGCHAT‑7B‑V1.5‑32K, VICUNA‑7B‑V1.5‑16K, LLAMA‑3‑8B‑INSTRUCT) across diverse tasks (CSQA, PIQA, GSM8K). Compared with baseline ICL, EARLYSTOP, and STRUCTICL, FocusICL achieved an average gain of 5.2% (3.31 points) with statistically significant improvements (p < 0.05). Ablation studies revealed that irrelevant‑content filtering contributed +1.29 points, while hierarchical attention added +2.02 points.

Complexity analysis shows that standard ICL incurs O(N²) attention cost (N = total examples), whereas FocusICL reduces the cost to O(B·N), where B ≪ N is the per‑batch size, yielding much lower inference overhead.
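The saving can be sanity‑checked with a rough cost model (constants ignored, uniform tokens per demonstration assumed); the function names and the exact accounting below are assumptions for illustration, not the paper's analysis.

```python
def attention_cost_vanilla(n_tokens: int) -> int:
    # standard self-attention: every token attends to every token
    return n_tokens * n_tokens

def attention_cost_batched(n_demos: int, tokens_per_demo: int,
                           batch_size: int, query_tokens: int) -> int:
    # within each batch, attention is quadratic in that batch's tokens;
    # the query additionally attends over all demonstration tokens
    batch_tokens = batch_size * tokens_per_demo
    n_batches = -(-n_demos // batch_size)  # ceiling division
    intra = n_batches * batch_tokens ** 2
    query = query_tokens * (n_demos * tokens_per_demo + query_tokens)
    return intra + query

# 256 demos of 128 tokens, batches of 8 demos, a 32-token query
vanilla = attention_cost_vanilla(256 * 128 + 32)
batched = attention_cost_batched(256, 128, 8, 32)
print(f"vanilla: {vanilla:,}  batched: {batched:,}")
```

With these (made‑up) sizes the batched scheme is roughly 30× cheaper, consistent with the O(N²) → O(B·N) reduction: the quadratic term now applies only within each fixed‑size batch.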

Additional analyses of attention weight distributions and hidden‑state PCA demonstrate that FocusICL maintains stable query‑focused attention and consistent hidden representations as the number of examples grows, unlike vanilla ICL where attention to the query drops and hidden states shift.

Overall, FocusICL provides a stable many‑shot learning paradigm for LLMs, overcoming the attention‑dispersion bottleneck and delivering both higher accuracy and better scalability when increasing the number of demonstrations.

large language models, attention mechanisms, few-shot learning, FocusICL, In-Context Learning
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
