When Long Prompts Cause Forgetting: Understanding Generalization in In‑Context Continual Learning

The paper introduces a theoretical framework for In‑Context Continual Learning. It shows how shared attention in large language models produces bias, variance, and a novel interference term, which explains why longer prompts can lead to forgetting, and it offers concrete guidelines for prompt design based on task similarity, context length, and task order.

Data Party THU

Introduction

Large language models (LLMs) excel at in‑context learning (ICL), adapting to new tasks from a few demonstrations without any parameter updates. When a prompt contains a sequence of heterogeneous tasks, the shared attention mechanism inevitably mixes representations from earlier tasks into later predictions, leading to order‑sensitive performance degradation and forgetting. The authors ask whether this phenomenon can be understood as a form of continual learning that occurs purely during inference.

Contributions

Formalize In‑Context Continual Learning (ICCL) as the setting where a pretrained Transformer processes a single prompt that concatenates multiple tasks, each with its own context examples and query, using shared attention without any parameter updates.

Derive a bias‑variance‑interference decomposition of the expected mean‑squared error (MSE) for each task, quantifying how task similarity, context length, and task order jointly determine whether transfer is positive or provably negative.

Provide a theoretical foundation for empirical phenomena such as sequential forgetting and the non‑monotonic benefit of longer contexts.

Methodology

Formal Setting

For each task τ, the prompt contains context pairs (x_{τ,i}, y_{τ,i}) and a query x_{τ,q}. All vectors are concatenated into a matrix E. The paper studies linear self‑attention (softmax omitted) and masked linear self‑attention (causal mask). The attention output can be written as Attn(E) = E + (W_V E)·(E^T W_Q^T W_K E) and the prediction for the query is

\hat{y}_{\tau,q} = \sum_{t \le \tau}\sum_i \alpha_{t,i}\, y_{t,i} + \text{bias term}

where \alpha_{t,i} are the attention weights determined by similarity between the query and historical keys, subject to the causal mask.
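The linear self‑attention update above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name, the random weights, and the choice to apply the causal mask by zeroing entries of the n‑by‑n score matrix are assumptions. The formula is taken exactly as written, so column j of the output sums over rows i of the score matrix.

```python
import numpy as np

def linear_self_attention(E, W_Q, W_K, W_V, causal=False):
    """Attn(E) = E + (W_V E)(E^T W_Q^T W_K E), with an optional causal mask.

    E holds one token per column and has shape (d, n).
    """
    scores = E.T @ W_Q.T @ W_K @ E      # shape (n, n); entry (i, j) couples tokens i and j
    if causal:
        # Output column j sums over rows i of `scores`, so keeping only
        # i <= j lets each token see itself and earlier tokens.
        scores = np.triu(scores)
    return E + (W_V @ E) @ scores

# Illustrative usage with random (not pretrained) weights.
rng = np.random.default_rng(0)
d, n = 3, 5
E = rng.normal(size=(d, n))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out = linear_self_attention(E, W_Q, W_K, W_V, causal=True)
```

With the mask on, changing the last token leaves every earlier output column unchanged, which is the one‑directional influence the causal setting relies on.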

Bias‑Variance‑Interference Decomposition

Assuming each task follows a true function f_τ(x) with additive noise ε, the expected MSE for task τ decomposes as

\mathbb{E}[(\hat{y}_{\tau,q} - f_\tau(x_{\tau,q}))^2] = \text{Bias}^2 + \text{Variance} + \text{Interference}

The three terms are:

Bias: systematic shift caused by averaging heterogeneous task samples in attention.

Variance: error propagation from noisy labels in the context examples, weighted by the squared attention coefficients.

Interference: a novel term capturing the extra error introduced by unrelated historical tasks. It can be expressed as

\text{Interference}_\tau = \sum_{t \neq \tau} \beta_{\tau,t}\, \|f_t - P_t f_\tau\|^2

where \beta_{\tau,t} depends on attention weights and context length, and P_t projects the current task function onto the subspace of task t.

The analysis shows that when historical tasks are dissimilar, the interference term dominates, causing negative transfer; when they are similar, interference diminishes and the variance reduction yields positive transfer.
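A deliberately simplified toy can make the positive versus negative transfer trade‑off concrete. The sketch below is not the paper's decomposition. It assumes each task t has a constant target `means[t]`, that labels carry Gaussian noise, and that the predictor uniformly averages every label visible under a causal history. In this toy the systematic shift comes entirely from other tasks, so the bias and interference terms coincide, and the function names are hypothetical.

```python
import random

def averaging_mse(means, tau, M, sigma):
    """Closed-form (shift^2, variance) of a uniform-averaging predictor for task tau.

    Toy assumption: each of the M context labels of task t equals
    means[t] plus Gaussian noise with standard deviation sigma, and the
    predictor averages all labels from tasks 0..tau.
    """
    history = means[: tau + 1]
    shift = sum(history) / len(history) - means[tau]   # systematic offset caused by other tasks
    variance = sigma ** 2 / (len(history) * M)          # noise averaged over all visible labels
    return shift ** 2, variance

def simulate_mse(means, tau, M, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of the same MSE, used to check the closed form."""
    rng = random.Random(seed)
    history = means[: tau + 1]
    total = 0.0
    for _ in range(trials):
        labels = [m + rng.gauss(0.0, sigma) for m in history for _ in range(M)]
        pred = sum(labels) / len(labels)
        total += (pred - means[tau]) ** 2
    return total / trials

single = averaging_mse([1.0], tau=0, M=4, sigma=1.0)          # (0.0, 0.25)
similar = averaging_mse([1.0, 1.0], tau=1, M=4, sigma=1.0)    # (0.0, 0.125)
dissimilar = averaging_mse([0.0, 1.0], tau=1, M=4, sigma=1.0) # (0.25, 0.125)
```

An identical earlier task halves the variance at no cost, while a dissimilar one adds a squared shift of 0.25 that outweighs the variance saving. In this toy the shift does not shrink as M grows, only the variance does.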

Key Insights

Context Length: Adding more examples always reduces variance for the first task, but for later tasks the MSE first drops and then rises, revealing an optimal context length beyond which additional history harms performance.

Task Similarity: High cosine similarity between task functions leads to substantial MSE reduction (positive transfer); cosine similarity near ‑1, where the task functions oppose each other, leads to a severe MSE increase (negative transfer).

Task Order: Causal attention makes early tasks influence later ones, but not vice versa. Placing similar tasks early mitigates later interference, while placing dissimilar tasks early amplifies forgetting.

Relation to Standard Attention

The paper proves that both uniform (averaging) and causal attention inevitably cause cross‑task interference because they treat all historical positions uniformly or solely by position, lacking any mechanism to recognize task boundaries.
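To see what "no mechanism to recognize task boundaries" means in terms of masks, the sketch below counts how many visible (source, output) token pairs in a plain causal mask cross a task boundary. The helper names and the task‑aware variant are hypothetical illustrations and are not mechanisms proposed in the paper. The mask convention is that entry (i, j) is True when source token i is visible to output token j.

```python
import numpy as np

def causal_mask(n):
    """mask[i, j] is True when source token i is visible to output token j (i <= j)."""
    return np.triu(np.ones((n, n), dtype=bool))

def cross_task_fraction(mask, task_ids):
    """Fraction of visible pairs whose two tokens belong to different tasks."""
    task_ids = np.asarray(task_ids)
    different = task_ids[:, None] != task_ids[None, :]
    return (mask & different).sum() / mask.sum()

task_ids = [0, 0, 1, 1]                      # two tasks, two tokens each
plain = causal_mask(len(task_ids))
ids = np.asarray(task_ids)
# Hypothetical task-aware variant: also require both tokens to share a task.
task_aware = plain & (ids[:, None] == ids[None, :])

frac_plain = cross_task_fraction(plain, task_ids)        # 0.4
frac_aware = cross_task_fraction(task_aware, task_ids)   # 0.0
```

Position alone cannot drive the cross‑task fraction to zero; that requires information about which task each token belongs to.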

Experiments

Although the work is primarily theoretical, the authors validate the predictions with synthetic simulations and a real‑world LLM evaluation.

Synthetic Linear Regression

Tasks are linear functions y = w^T x + ε with task‑specific weight vectors w. Inputs x are drawn from a standard Gaussian. Two attention variants are tested: uniform linear attention and causal linear attention. Metrics are per‑task MSE and its decomposition.
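A data generator for this kind of setup might look like the sketch below. It assumes unit‑norm weight vectors, a two‑task prompt, and a dimension of at least 2, and it builds the second weight vector so that its cosine similarity to the first equals a chosen value. The function names and the `noise_std` parameter are illustrative and are not taken from the paper.

```python
import numpy as np

def make_task_pair(d, cos_sim, rng):
    """Return unit vectors w1, w2 in R^d (d >= 2) with cosine similarity cos_sim."""
    w1 = rng.normal(size=d)
    w1 /= np.linalg.norm(w1)
    u = rng.normal(size=d)
    u -= (u @ w1) * w1                 # remove the component along w1
    u /= np.linalg.norm(u)             # unit vector orthogonal to w1
    w2 = cos_sim * w1 + np.sqrt(1.0 - cos_sim ** 2) * u
    return w1, w2

def sample_context(w, M, noise_std, rng):
    """Draw M pairs with x ~ N(0, I) and y = w^T x + noise."""
    X = rng.normal(size=(M, w.shape[0]))
    y = X @ w + noise_std * rng.normal(size=M)
    return X, y

rng = np.random.default_rng(0)
w1, w2 = make_task_pair(d=8, cos_sim=-0.5, rng=rng)
X1, y1 = sample_context(w1, M=16, noise_std=0.1, rng=rng)
X2, y2 = sample_context(w2, M=16, noise_std=0.1, rng=rng)
```

Sweeping `cos_sim` from ‑1 to 1 gives the kind of controlled similarity axis that a similarity versus MSE plot needs.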

Figure 1 (context length vs. MSE): The first task’s MSE monotonically decreases with more context; later tasks exhibit a U‑shaped curve, confirming the existence of an optimal context length.

Figure 2 (task similarity vs. MSE): When cosine similarity ≈ 1, MSE drops dramatically (positive transfer); when ≈ ‑1, MSE spikes (negative transfer); similarity ≈ 0 yields intermediate errors.

Figure 3 (task order vs. forgetting): Reordering the same set of tasks changes the forgetting curves; placing the easiest task last yields the lowest final MSE, while placing the hardest last yields the highest.

Figure 4 (training loss curves for GPT‑2 Tiny and Small): Demonstrates that pretrained attention models can converge to the analytically tractable regime assumed in the theory.

Real‑World Model Evaluation

Using Qwen2.5‑1.5B‑Instruct, the authors construct an ICCL benchmark: Task A is SST‑2 sentiment classification, Task B is AG News topic classification. Both tasks’ in‑context examples are concatenated into a single prompt.

With a single example per task (M=1), Task B accuracy drops by more than 15% relative to a single‑task baseline, and Task A accuracy falls from 0.934 to 0.472, illustrating severe negative transfer and forgetting.

Increasing M mitigates Task B’s negative transfer but Task A’s forgetting persists, matching the theory’s claim that misaligned task means cause irreducible interference.
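One way to assemble such a two‑task prompt is sketched below. The template ("Review/Sentiment", "Article/Topic"), the function name, and the example strings are assumptions made for illustration. The summary does not state the paper's actual prompt format, and the strings are invented placeholders rather than entries from SST‑2 or AG News.

```python
def build_iccl_prompt(task_a_demos, task_b_demos, query, m):
    """Concatenate m demonstrations per task (Task A first), then the query.

    Each demo is a (text, label) pair. The template is hypothetical.
    """
    blocks = []
    for text, label in task_a_demos[:m]:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    for text, label in task_b_demos[:m]:
        blocks.append(f"Article: {text}\nTopic: {label}")
    blocks.append(query)
    return "\n\n".join(blocks)

# Invented placeholder examples, not dataset entries.
sentiment_demos = [("great film", "positive"), ("dull plot", "negative")]
topic_demos = [("stocks rally", "Business")]
prompt = build_iccl_prompt(
    sentiment_demos, topic_demos, "Review: fine acting\nSentiment:", m=1
)
```

Here the query targets Task A after Task B's demonstrations, which is the configuration in which forgetting of the earlier task would show up. Varying `m` corresponds to varying M in the experiment.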

Conclusion

The paper’s main contributions are the first formal definition of ICCL, the bias‑variance‑interference decomposition for linear self‑attention, and a theoretical explanation of long‑prompt phenomena such as order‑sensitive forgetting and non‑monotonic performance gains. Limitations include the focus on linear and masked linear attention, linear task functions, and the assumption of an optimally pretrained model, leaving extensions to softmax, multi‑head, deeper Transformers and realistic positional encodings as future work.

Practical takeaways for prompt engineering are: group similar tasks together, avoid interleaving unrelated tasks, and limit the amount of historical context visible to a query. Evaluation protocols should incorporate task similarity, context length, and order to faithfully assess LLM robustness in realistic multi‑task prompts.
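These takeaways could be turned into a small prompt‑assembly helper along the following lines. The specific heuristic (group demonstrations by task, place the query's own task last, keep only the most recent `max_history` demonstrations) is one possible reading of the guidelines and an assumption of this sketch, not a procedure from the paper.

```python
def arrange_demos(demos, query_task, max_history):
    """Group demos by task, put the query's task last, and cap the history.

    demos is a list of (task_name, text) pairs in arbitrary order.
    """
    groups = {}
    for task, text in demos:
        groups.setdefault(task, []).append((task, text))   # dicts keep insertion order
    ordered = []
    for task, items in groups.items():
        if task != query_task:
            ordered.extend(items)                          # unrelated tasks come first
    ordered.extend(groups.get(query_task, []))             # query's task sits next to the query
    return ordered[-max_history:] if max_history > 0 else []

mixed = [("sentiment", "s1"), ("topic", "t1"), ("sentiment", "s2"), ("topic", "t2")]
kept = arrange_demos(mixed, query_task="sentiment", max_history=3)
# kept == [("topic", "t2"), ("sentiment", "s1"), ("sentiment", "s2")]
```

Because truncation drops the oldest entries first, unrelated demonstrations are removed before any from the query's own task.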

Future directions include designing attention mechanisms that are aware of task boundaries, extending the analysis to nonlinear functions and multi‑head architectures, and exploring prompt formats (task tags, separators) that reduce interference.

Figure 1: Context length vs. MSE
Figure 2: Task similarity vs. MSE
Figure 3: Task order vs. forgetting
Figure 4: Training loss curves for GPT‑2 models
Written by Data Party THU, the official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.