Understanding In-Context Learning in Large Language Models: Experiments, Analysis, and Theoretical Insights
This article explains the concept of in‑context learning in large language models, presents experimental evaluations such as copy‑output, date‑formatting, and label‑remapping tasks, and discusses a recent theoretical analysis that links attention layers to implicit gradient‑based fine‑tuning, highlighting why model scale and data volume matter.
What is In-Context Learning?
In‑context learning (ICL) enables a pretrained large language model (LLM) to perform a new task by simply providing a few input‑output examples together with a task description, without any parameter updates or explicit fine‑tuning.
Illustrative Example
For a translation task (English → French), the prompt consists of a task description line, several example pairs, and the query word to translate. The model then generates the correct French translation (e.g., cheese → fromage).
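A prompt of this shape can be assembled mechanically. The sketch below is illustrative only; the example pairs and the `=>` separator are assumptions, not taken from any specific benchmark:

```python
# Build a few-shot ICL prompt for English -> French translation.
# The demonstration pairs here are placeholders chosen for illustration.
examples = [("sea otter", "loutre de mer"),
            ("peppermint", "menthe poivrée"),
            ("plush giraffe", "girafe en peluche")]

def build_prompt(pairs, query):
    lines = ["Translate English to French:"]          # task description line
    lines += [f"{en} => {fr}" for en, fr in pairs]    # demonstration pairs
    lines.append(f"{query} =>")                       # the model completes this
    return "\n".join(lines)

print(build_prompt(examples, "cheese"))
```

Feeding the resulting string to the model as-is, with no parameter updates, is the entire "training" procedure of ICL.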
Empirical Studies
Simple Copy Output
Five examples each containing five random lowercase letters are provided, and the model must copy the final input list.
Input: g, c, b, h, d
Output: g, c, b, h, d
Input: b, g, d, h, a
Output: b, g, d, h, a
Input: f, c, d, e, h
Output: f, c, d, e, h
Input: c, f, g, h, d
Output: c, f, g, h, d
Input: e, f, b, g, d
Output: e, f, b, g, d
Input: a, b, c, d, e
Output:

The expected output is:

a, b, c, d, e

GPT‑3 achieved 100 % accuracy on all 6 720 possible input permutations, while the smallest model text-ada-001 reached 99.78 % (6 705/6 720), demonstrating the importance of model scale.
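The figure of 6 720 test cases is consistent with ordered draws of 5 distinct letters from an 8-letter alphabet (8·7·6·5·4), which matches the letters a–h seen in the examples; that alphabet is an inference from the examples, not stated explicitly:

```python
from itertools import permutations

# Ordered draws of 5 distinct letters from the 8 letters a-h:
# 8 * 7 * 6 * 5 * 4 = 6720, matching the stated number of test cases.
count = sum(1 for _ in permutations("abcdefgh", 5))
print(count)  # 6720
```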
Date Formatting
The task converts dates from YYYY‑MM‑DD to a custom !MM!DD!YYYY! format. Prompts use three demonstration pairs followed by a test date such as 2005-07-23. Across the GPT‑3 family (from text-ada-001 to text-davinci-003), accuracy improves with model size and with the number of in‑context examples, though it never reaches 100 %.
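The ground-truth transformation the model must infer from the examples is deterministic; a minimal reference implementation (the function name is ours):

```python
from datetime import date

def reformat_date(iso_date: str) -> str:
    """Convert a YYYY-MM-DD string into the custom !MM!DD!YYYY! format."""
    d = date.fromisoformat(iso_date)   # also validates the input date
    return f"!{d.month:02d}!{d.day:02d}!{d.year:04d}!"

print(reformat_date("2005-07-23"))  # !07!23!2005!
```

An LLM's in-context answers can be scored against such a reference to produce the accuracy curves described above.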
Label Remapping
Entities originally labeled as animal, plant/vegetable, or sport are remapped arbitrarily (e.g., duck → plant/vegetable, golf → animal, beans → sport). GPT‑3 correctly predicts the new mappings, even when the label symbols are nonsensical (e.g., [^*, #@#, !!~]).
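The remapping described here amounts to a cyclic permutation of the three labels (animal → plant/vegetable → sport → animal). A small sketch of how the remapped ground truth is generated; the original categories below are assumed for illustration:

```python
# Each entity's original category (assumed here) is pushed through a
# cyclic permutation of the three labels to produce the remapped target.
original = {"duck": "animal", "beans": "plant/vegetable", "golf": "sport"}
permute = {"animal": "plant/vegetable",
           "plant/vegetable": "sport",
           "sport": "animal"}

remapped = {entity: permute[label] for entity, label in original.items()}
print(remapped)  # duck -> plant/vegetable, beans -> sport, golf -> animal
```

The model's task is to recover this permutation from the demonstrations alone, as the results below show: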
llama: plant/vegetable ✓
cat: plant/vegetable ✓
elephant: plant/vegetable ✓
monkey: plant/vegetable ✓
panda: plant/vegetable ✓
cucumber: sport ✓
peas: sport ✓
tomato: sport ✓
spinach: sport ✓
carrots: sport ✓
rugby: animal ✓
cycling: animal ✓
baseball: animal ✓
tennis: animal ✓
judo: animal ✓

Theoretical Analysis of ICL
A recent Microsoft Research paper argues that the attention layers of LLMs perform an implicit parameter‑optimization process analogous to gradient‑descent fine‑tuning.
Gradient‑Descent View of Linear Attention
For a fully‑connected layer with initial weights W₀, gradient ΔW, input x, and output gradient e, a single gradient‑descent step yields W₁ = W₀ - η·ΔW. The paper shows that ΔW can be expressed as a sum of outer products of the previous inputs (treated as keys) and the corresponding output gradients (treated as values), which is exactly the form of update that linear attention computes.
Thus, the attention computation softmax(QKᵀ)V (with Q, K, and V derived from the current query, previous keys, and previous values) can be viewed, once the softmax is dropped (the linear‑attention approximation), as applying a learned update ΔW to an implicit weight matrix.
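This equivalence can be checked numerically. In the toy sketch below (dimensions and data are arbitrary), the learning rate and sign are folded into the value vectors e_i, matching how the paper's ΔW is defined:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))      # initial weights of the linear layer
keys = rng.standard_normal((3, d))    # previous inputs x_i (keys)
values = rng.standard_normal((3, d))  # previous output gradients e_i (values)
q = rng.standard_normal(d)            # current query

# Gradient-descent view: accumulate outer-product updates, then apply W1.
dW = sum(np.outer(v, k) for v, k in zip(values, keys))
gd_out = (W0 + dW) @ q

# Linear-attention view: W0 q plus the readout sum_i v_i * (k_i . q).
attn_out = W0 @ q + sum(v * (k @ q) for v, k in zip(values, keys))

print(np.allclose(gd_out, attn_out))  # True
```

Both views compute (W₀ + Σᵢ vᵢkᵢᵀ)q; attention simply evaluates the update lazily, one dot product at a time, instead of materializing ΔW.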
Implicit Fine‑Tuning via the Final Query Token
The last token of the prompt is designated as the query token q (dimension d ). After passing through the attention head, its output is:
output = W_v·X'·softmax((W_k·X')ᵀ·W_q·q)

When the softmax is omitted (as done in the analysis), the formula reduces to a linear transformation that directly incorporates the gradients accumulated from the demonstration examples: a zero‑shot ("ZSL") component W_{zsl}, computed from the query text alone, plus a small gradient update ΔW_{icl} derived from the in‑context examples.
Consequently, larger models possess higher‑dimensional key, query, and value matrices (d'×d) and are trained on vastly more data, providing a richer initial weight W_{zsl}. Only a few demonstration examples are then needed to generate a useful ΔW_{icl}, which explains why ICL emerges only when model scale and data volume cross a certain threshold.
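The decomposition into W_{zsl} and ΔW_{icl} is an exact algebraic split once the softmax is dropped, because attention over the concatenated tokens separates into a sum over the two token groups. A toy sketch (dimensions arbitrary; W_q is folded into the query vector q for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_demo, n_text = 4, 6, 3
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
X_demo = rng.standard_normal((d, n_demo))  # demonstration tokens
X_text = rng.standard_normal((d, n_text))  # query-text tokens
q = rng.standard_normal(d)                 # attention query of the final token

# Softmax-free attention over all tokens (demonstrations + query text).
X = np.concatenate([X_demo, X_text], axis=1)
full = (Wv @ X) @ (Wk @ X).T @ q

# Same result, split into a zero-shot weight from the query text alone
# plus an implicit update contributed by the demonstration tokens.
W_zsl = (Wv @ X_text) @ (Wk @ X_text).T
dW_icl = (Wv @ X_demo) @ (Wk @ X_demo).T
split = (W_zsl + dW_icl) @ q

print(np.allclose(full, split))  # True
```

Adding demonstrations thus changes only the ΔW_{icl} term; the model's pretrained behaviour, captured by W_{zsl}, is left untouched.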
References
[1] https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756
[2] http://ai.stanford.edu/blog/understanding-incontext/
[3] https://ai.stanford.edu/blog/in-context-learning/
[4] https://www.reddit.com/r/MachineLearning/comments/10ly7rw/r_why_can_gpt_learn_incontext_language_models/
[5] https://mp.weixin.qq.com/s/dPpO18g3V4xqHUsEBKrXJQ
[6] https://arxiv.org/abs/2005.14165
[7] https://arxiv.org/abs/2212.10559
[8] https://platform.openai.com/docs/models/gpt-3
[9] https://github.com/microsoft/LMOps
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.