Understanding In-Context Learning in Large Language Models: Experiments, Analysis, and Theoretical Insights
This article explains the concept of in‑context learning in large language models, presents experimental evaluations such as copy‑output, date‑formatting, and label‑remapping tasks, and discusses a recent theoretical analysis that links attention layers to implicit gradient‑based fine‑tuning, highlighting why model scale and data volume matter.
What is In-Context Learning?
In‑context learning (ICL) enables a pretrained large language model (LLM) to perform a new task by simply providing a few input‑output examples together with a task description, without any parameter updates or explicit fine‑tuning.
Illustrative Example
For a translation task (English → French), the prompt consists of a task description line, several example pairs, and the query word to translate. The model then generates the correct French translation (e.g., cheese → fromage).
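A prompt of this shape can be assembled mechanically. The sketch below is illustrative only; the example pairs and the `=>` separator are assumptions, not taken from any specific benchmark:

```python
# Build a few-shot ICL prompt for English -> French translation.
# The demonstration pairs here are placeholders chosen for illustration.
examples = [("sea otter", "loutre de mer"),
            ("peppermint", "menthe poivrée"),
            ("plush giraffe", "girafe en peluche")]

def build_prompt(pairs, query):
    lines = ["Translate English to French:"]          # task description line
    lines += [f"{en} => {fr}" for en, fr in pairs]    # demonstration pairs
    lines.append(f"{query} =>")                       # the model completes this
    return "\n".join(lines)

print(build_prompt(examples, "cheese"))
```

Feeding the resulting string to the model as-is, with no parameter updates, is the entire "training" procedure of ICL.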
Empirical Studies
Simple Copy Output
Five examples each containing five random lowercase letters are provided, and the model must copy the final input list.
Input: g, c, b, h, d
Output: g, c, b, h, d
Input: b, g, d, h, a
Output: b, g, d, h, a
Input: f, c, d, e, h
Output: f, c, d, e, h
Input: c, f, g, h, d
Output: c, f, g, h, d
Input: e, f, b, g, d
Output: e, f, b, g, d
Input: a, b, c, d, e
Output:

The expected output is:

a, b, c, d, e

GPT‑3 achieved 100 % accuracy on all 6 720 possible input permutations, while the smallest model text-ada-001 reached 99.78 % (6 705/6 720), demonstrating the importance of model scale.
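The figure of 6 720 test cases is consistent with ordered draws of 5 distinct letters from an 8-letter alphabet (8·7·6·5·4), which matches the letters a–h seen in the examples; that alphabet is an inference from the examples, not stated explicitly:

```python
from itertools import permutations

# Ordered draws of 5 distinct letters from the 8 letters a-h:
# 8 * 7 * 6 * 5 * 4 = 6720, matching the stated number of test cases.
count = sum(1 for _ in permutations("abcdefgh", 5))
print(count)  # 6720
```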
Date Formatting
The task converts dates from YYYY‑MM‑DD to a custom !MM!DD!YYYY! format. Prompts use three demonstration pairs followed by a test date such as 2005-07-23. Across the GPT‑3 family (from text-ada-001 to text-davinci-003), accuracy improves with model size and with the number of in‑context examples, though it never reaches 100 %.
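The ground-truth transformation the model must infer from the examples is deterministic; a minimal reference implementation (the function name is ours):

```python
from datetime import date

def reformat_date(iso_date: str) -> str:
    """Convert a YYYY-MM-DD string into the custom !MM!DD!YYYY! format."""
    d = date.fromisoformat(iso_date)   # also validates the input date
    return f"!{d.month:02d}!{d.day:02d}!{d.year:04d}!"

print(reformat_date("2005-07-23"))  # !07!23!2005!
```

An LLM's in-context answers can be scored against such a reference to produce the accuracy curves described above.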
Label Remapping
Entities originally labeled as animal, plant/vegetable, or sport are remapped arbitrarily (e.g., duck → plant/vegetable, golf → animal, beans → sport). GPT‑3 correctly predicts the new mappings, even when the label symbols are nonsensical (e.g., [^*, #@#, !!~]).
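The remapping described here amounts to a cyclic permutation of the three labels (animal → plant/vegetable → sport → animal). A small sketch of how the remapped ground truth is generated; the original categories below are assumed for illustration:

```python
# Each entity's original category (assumed here) is pushed through a
# cyclic permutation of the three labels to produce the remapped target.
original = {"duck": "animal", "beans": "plant/vegetable", "golf": "sport"}
permute = {"animal": "plant/vegetable",
           "plant/vegetable": "sport",
           "sport": "animal"}

remapped = {entity: permute[label] for entity, label in original.items()}
print(remapped)  # duck -> plant/vegetable, beans -> sport, golf -> animal
```

The model's task is to recover this permutation from the demonstrations alone, as the results below show: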
llama: plant/vegetable ✓
cat: plant/vegetable ✓
elephant: plant/vegetable ✓
monkey: plant/vegetable ✓
panda: plant/vegetable ✓
cucumber: sport ✓
peas: sport ✓
tomato: sport ✓
spinach: sport ✓
carrots: sport ✓
rugby: animal ✓
cycling: animal ✓
baseball: animal ✓
tennis: animal ✓
judo: animal ✓

Theoretical Analysis of ICL
A recent Microsoft Research paper argues that the attention layers of LLMs perform an implicit parameter‑optimization process analogous to gradient‑descent fine‑tuning.
Gradient‑Descent View of Linear Attention
For a fully‑connected layer with initial weights W₀, gradient ΔW, input x, and output gradient e, a single gradient‑descent step yields W₁ = W₀ - η·ΔW. The paper shows that ΔW can be expressed as a sum of outer products of the previous inputs (treated as keys) and the corresponding output gradients (treated as values), which is exactly the form of update that linear attention computes.
Thus, the attention computation softmax(QKᵀ)V (with Q, K, and V derived from the current query, previous keys, and previous values) can be viewed, once the softmax is dropped (the linear‑attention approximation), as applying a learned update ΔW to an implicit weight matrix.
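This equivalence can be checked numerically. In the toy sketch below (dimensions and data are arbitrary), the learning rate and sign are folded into the value vectors e_i, matching how the paper's ΔW is defined:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))      # initial weights of the linear layer
keys = rng.standard_normal((3, d))    # previous inputs x_i (keys)
values = rng.standard_normal((3, d))  # previous output gradients e_i (values)
q = rng.standard_normal(d)            # current query

# Gradient-descent view: accumulate outer-product updates, then apply W1.
dW = sum(np.outer(v, k) for v, k in zip(values, keys))
gd_out = (W0 + dW) @ q

# Linear-attention view: W0 q plus the readout sum_i v_i * (k_i . q).
attn_out = W0 @ q + sum(v * (k @ q) for v, k in zip(values, keys))

print(np.allclose(gd_out, attn_out))  # True
```

Both views compute (W₀ + Σᵢ vᵢkᵢᵀ)q; attention simply evaluates the update lazily, one dot product at a time, instead of materializing ΔW.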
Implicit Fine‑Tuning via the Final Query Token
The last token of the prompt is designated as the query token q (dimension d ). After passing through the attention head, its output is:
output = W_v·X'·softmax((W_k·X')ᵀ·W_q·q)

When the softmax is omitted (as done in the analysis), the formula reduces to a linear transformation that directly incorporates the gradients accumulated from the demonstration examples: a zero‑shot ("ZSL") component W_{zsl}, computed from the query text alone, plus a small gradient update ΔW_{icl} derived from the in‑context examples.
Consequently, larger models possess higher‑dimensional key, query, and value matrices (d'×d) and are trained on vastly more data, providing a richer initial weight W_{zsl}. Only a few demonstration examples are then needed to generate a useful ΔW_{icl}, which explains why ICL emerges only when model scale and data volume cross a certain threshold.
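The decomposition into W_{zsl} and ΔW_{icl} is an exact algebraic split once the softmax is dropped, because attention over the concatenated tokens separates into a sum over the two token groups. A toy sketch (dimensions arbitrary; W_q is folded into the query vector q for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_demo, n_text = 4, 6, 3
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
X_demo = rng.standard_normal((d, n_demo))  # demonstration tokens
X_text = rng.standard_normal((d, n_text))  # query-text tokens
q = rng.standard_normal(d)                 # attention query of the final token

# Softmax-free attention over all tokens (demonstrations + query text).
X = np.concatenate([X_demo, X_text], axis=1)
full = (Wv @ X) @ (Wk @ X).T @ q

# Same result, split into a zero-shot weight from the query text alone
# plus an implicit update contributed by the demonstration tokens.
W_zsl = (Wv @ X_text) @ (Wk @ X_text).T
dW_icl = (Wv @ X_demo) @ (Wk @ X_demo).T
split = (W_zsl + dW_icl) @ q

print(np.allclose(full, split))  # True
```

Adding demonstrations thus changes only the ΔW_{icl} term; the model's pretrained behaviour, captured by W_{zsl}, is left untouched.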
References
[1] https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756
[2] http://ai.stanford.edu/blog/understanding-incontext/
[3] https://ai.stanford.edu/blog/in-context-learning/
[4] https://www.reddit.com/r/MachineLearning/comments/10ly7rw/r_why_can_gpt_learn_incontext_language_models/
[5] https://mp.weixin.qq.com/s/dPpO18g3V4xqHUsEBKrXJQ
[6] https://arxiv.org/abs/2005.14165
[7] https://arxiv.org/abs/2212.10559
[8] https://platform.openai.com/docs/models/gpt-3
[9] https://github.com/microsoft/LMOps
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.