Multi-Task Bayesian In-Context Learning: Transformers Adapt to New Priors
The ICML 2026 paper reframes in‑context learning as approximate Bayesian inference, introduces explicit prior datasets as a context prefix for Transformers, and demonstrates through synthetic and real‑world experiments that this multi‑task approach closely matches Bayesian oracles while offering fast, controllable inference.
Problem Motivation
In‑context learning (ICL) treats a new task as a sequence of examples fed to a Transformer, but existing ICL models embed the prior distribution implicitly in their weights. Consequently, they cannot change the prior at test time, limiting applicability to scenarios where the prior varies across users, domains, or time.
Core Idea: Prior as a Context Prefix
The paper frames each episode as a hierarchical Bayesian model. An episode‑level hyper‑parameter is sampled from a meta‑distribution and generates multiple prior datasets (the first K tasks) followed by a target dataset. The prior datasets are placed at the beginning of the Transformer’s input sequence, and the target data follow. During training, the model learns to infer the episode‑level prior from the prefix and then to produce posterior predictions for the target. At test time, swapping the prefix datasets changes the prior without any parameter fine‑tuning.
Comparison:
Standard ICL: context contains only target evidence; the prior is baked into model weights.
Multi‑task Bayesian ICL: context contains both prior datasets and target data; the prior can be controlled via the prefix.
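The difference between the two context layouts can be made concrete with a toy episode generator. The following is a minimal sketch, not the paper's data pipeline: the linear-Gaussian task family, the hyper-parameter scales, the one-row-per-data-point token layout, and the names `sample_episode` and `build_sequence` are all assumptions made for illustration.

```python
import numpy as np

def sample_episode(rng, num_prior_tasks=2, n_per_task=8, dim=3, noise_std=0.1):
    """Sample one episode from a toy hierarchical linear-regression model."""
    # Episode-level hyper-parameter (assumed here to be the mean of the task prior).
    prior_mean = rng.normal(0.0, 2.0, size=dim)
    datasets = []
    for _ in range(num_prior_tasks + 1):  # K prior tasks plus one target task
        # Each task draws its own weights from the shared episode-level prior.
        w = prior_mean + rng.normal(0.0, 0.5, size=dim)
        x = rng.normal(size=(n_per_task, dim))
        y = x @ w + noise_std * rng.normal(size=n_per_task)
        datasets.append((x, y))
    return datasets[:-1], datasets[-1]

def build_sequence(prior_datasets, target, use_prefix=True):
    """Stack (x, y) rows into one array: prior datasets first, target last."""
    parts = list(prior_datasets) if use_prefix else []
    parts.append(target)
    rows = [np.concatenate([x, y[:, None]], axis=1) for x, y in parts]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
prior_sets, target_set = sample_episode(rng)
with_prefix = build_sequence(prior_sets, target_set, use_prefix=True)      # multi-task Bayesian ICL
without_prefix = build_sequence(prior_sets, target_set, use_prefix=False)  # standard ICL
```

In this layout the target rows are identical in both sequences; only the prefix differs, which is what lets the prior be swapped at test time.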
Model Details
A small GPT‑2‑style Transformer is trained with negative log‑likelihood. For regression tasks the model outputs a Gaussian mean and variance; for classification/logistic regression it outputs class probabilities.
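The two output heads correspond to two standard negative log-likelihood losses. The sketch below writes them in NumPy under one assumption not stated in the summary: the regression head is taken to emit a log-variance rather than a raw variance, which keeps the variance positive. The function names are hypothetical.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Mean negative log-likelihood of y under N(mean, exp(log_var))."""
    var = np.exp(log_var)
    return 0.5 * np.mean(np.log(2.0 * np.pi) + log_var + (y - mean) ** 2 / var)

def bernoulli_nll(y, prob, eps=1e-12):
    """Mean negative log-likelihood of binary labels y under predicted probabilities."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(prob) + (1.0 - y) * np.log(1.0 - prob))
```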
Experimental Setup
Four families of experiments increase in complexity:
Gaussian priors for linear and logistic regression (sanity check against a Bayesian oracle).
Student‑t heavy‑tailed priors to test out‑of‑meta‑distribution generalization.
Flow‑based high‑dimensional priors to assess scalability.
ERA5 spatiotemporal temperature prediction as a real‑world task.
Baselines include an MCMC oracle, SVI, hierarchical MCMC/SVI, and ICL with and without the prior prefix.
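For the Gaussian-prior linear regression case, the exact posterior is available in closed form, so a reference answer does not strictly require sampling. The sketch below uses that conjugate update in place of the MCMC oracle named above; it assumes a known noise variance, and the function names are illustrative.

```python
import numpy as np

def gaussian_linear_posterior(x, y, prior_mean, prior_cov, noise_var):
    """Posterior over weights w for y = x @ w + noise, with prior w ~ N(prior_mean, prior_cov)."""
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = prior_prec + x.T @ x / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ prior_mean + x.T @ y / noise_var)
    return post_mean, post_cov

def predictive(x_new, post_mean, post_cov, noise_var):
    """Gaussian posterior predictive mean and variance for each row of x_new."""
    mean = x_new @ post_mean
    var = np.einsum("ij,jk,ik->i", x_new, post_cov, x_new) + noise_var
    return mean, var
```

With an empty target context the posterior reduces to the prior, which is the regime where a model's prior assumptions matter most.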
Result 1: Prior‑Prefix ICL Matches Bayesian Oracle
In linear regression, KL divergence between model posteriors and the oracle shows that multi‑task ICL with the prior prefix matches hierarchical MCMC across context lengths and test priors, while vanilla ICL degrades under prior shift. Logistic regression experiments show the same pattern: with few target samples the prior dominates, and the prefix‑enabled model narrows the gap to the oracle.
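When both the model and the oracle produce Gaussian predictive distributions, the divergence between them has a closed form. The sketch below computes it per test point; the direction of the KL (model relative to oracle) and the function name are assumptions, since the summary does not specify them.

```python
import numpy as np

def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL( N(mean_p, var_p) || N(mean_q, var_q) ) for univariate Gaussians, elementwise."""
    mean_p, var_p = np.asarray(mean_p, dtype=float), np.asarray(var_p, dtype=float)
    mean_q, var_q = np.asarray(mean_q, dtype=float), np.asarray(var_q, dtype=float)
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)
```

Averaging this quantity over test inputs gives a single score per context length and test prior.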
Result 2: Prefix Acts as Prior, Not Extra Target Data
A prior‑adaptability check fixes the target context and varies only the prior prefix. The model’s output logits change systematically with the prefix and align more closely with oracle MCMC than with a pooled‑data MCMC, demonstrating genuine conditioning on the prior.
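The distinction between treating the prefix as a prior and treating it as extra target data can be seen in a scalar toy model. The sketch below is not the paper's experiment: it assumes each task has a mean drawn from N(mu, tau2), observations with variance sigma2, and a plug-in estimate of mu from the prefix; all names and default values are hypothetical.

```python
import numpy as np

def posterior_prefix_as_prior(prefix_tasks, target_y, tau2, sigma2):
    """Use the prefix only to estimate the prior mean, then update on the target data."""
    mu_hat = np.mean([np.mean(y) for y in prefix_tasks])  # plug-in prior mean
    n = len(target_y)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu_hat / tau2 + np.sum(target_y) / sigma2)
    return post_mean, post_var

def posterior_pooled(prefix_tasks, target_y, sigma2, fixed_mean=0.0, fixed_var=100.0):
    """Treat prefix observations as if they were target observations under a fixed prior."""
    pooled = np.concatenate(list(prefix_tasks) + [target_y])
    post_var = 1.0 / (1.0 / fixed_var + len(pooled) / sigma2)
    post_mean = post_var * (fixed_mean / fixed_var + np.sum(pooled) / sigma2)
    return post_mean, post_var

prefix = [np.full(50, 3.0), np.full(50, 5.0)]
target = np.array([0.0, 0.0])
as_prior = posterior_prefix_as_prior(prefix, target, tau2=1.0, sigma2=1.0)
as_pooled = posterior_pooled(prefix, target, sigma2=1.0)
```

In this toy setting the pooled posterior is far narrower and sits close to the prefix average, whereas the prefix-as-prior posterior still lets the two target observations pull the estimate; comparing a model's outputs against both references indicates which behavior it has learned.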
Result 3: Robust Generalization under Heavy‑Tailed Priors
Using Student‑t priors with decreasing degrees of freedom, multi‑task ICL maintains low divergence as long as the training meta‑distribution covers sufficiently heavy tails. When test priors become more extreme than anything seen in training, performance drops sharply unless the training mixture already includes comparably heavy tails, indicating a clear generalization threshold.
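Lowering the degrees of freedom of a Student-t prior puts more mass on large task weights, which is what makes these test priors harder. The sketch below illustrates the effect with NumPy's `Generator.standard_t`; the specific degrees of freedom, sample sizes, and threshold are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def sample_student_t_weights(rng, df, num_tasks, dim, scale=1.0):
    """Draw task weight vectors with i.i.d. Student-t entries."""
    return scale * rng.standard_t(df, size=(num_tasks, dim))

def tail_fraction(weights, threshold=3.0):
    """Fraction of entries whose magnitude exceeds the threshold."""
    return float(np.mean(np.abs(weights) > threshold))

rng = np.random.default_rng(0)
# Smaller df means heavier tails; df = 30 is already close to Gaussian.
tail_by_df = {
    df: tail_fraction(sample_student_t_weights(rng, df, num_tasks=20000, dim=5))
    for df in (30.0, 5.0, 2.0)
}
```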
Result 4: Speed Advantage with High‑Dimensional Flow Priors
Flow‑based priors transform a Gaussian base into structured high‑dimensional task distributions. Multi‑task ICL achieves oracle‑level prediction quality with inference times orders of magnitude lower than MCMC, whose warm‑up and per‑sample costs dominate.
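The summary does not say which flow architecture is used, so the following is only a generic illustration of how an invertible map turns Gaussian base samples into a structured prior. It assumes a single RealNVP-style affine coupling layer with fixed linear conditioners; `coupling_forward`, `coupling_inverse`, and the weight matrices are hypothetical.

```python
import numpy as np

def coupling_forward(z, w_scale, w_shift):
    """Affine coupling: keep the first half, scale and shift the second half."""
    d = z.shape[1] // 2
    z1, z2 = z[:, :d], z[:, d:]
    log_s = np.tanh(z1 @ w_scale)  # bounded log-scale for numerical stability
    t = z1 @ w_shift
    x2 = z2 * np.exp(log_s) + t
    # The log-determinant of the Jacobian is the sum of the log-scales.
    return np.concatenate([z1, x2], axis=1), log_s.sum(axis=1)

def coupling_inverse(x, w_scale, w_shift):
    """Exact inverse of coupling_forward."""
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    log_s = np.tanh(x1 @ w_scale)
    t = x1 @ w_shift
    z2 = (x2 - t) * np.exp(-log_s)
    return np.concatenate([x1, z2], axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(16, 8))            # Gaussian base samples
w_scale = 0.5 * rng.normal(size=(4, 4))
w_shift = rng.normal(size=(4, 4))
task_weights, log_det = coupling_forward(base, w_scale, w_shift)
```

Because the transformed prior is no longer Gaussian, the conjugate shortcut is unavailable and a sampling-based oracle is needed, which is where the inference-time gap arises.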
Real‑World Evaluation: ERA5 Temperature Forecasting
Prior datasets are taken from one time window and the target from another. In the IID split, adding two prior datasets improves validation and test performance. In the OOD split, gains are mixed; when seasonal shifts misalign validation and test distributions, the prior prefix can hurt test performance, illustrating sensitivity to prior relevance.
Significance
The method decouples priors from model weights by representing them as explicit data‑driven prefixes. This enables hierarchical Bayesian prediction, eliminates the need for fine‑tuning when priors change, and aligns amortized neural inference with traditional Bayesian oracles.
Code repository: https://github.com/martianmartina/multi-task-bayesian-icl/
Paper: https://arxiv.org/abs/2606.20538
Limitations
Transformer attention cost grows quadratically with sequence length, making many prior tasks expensive.
The architecture does not guarantee permutation invariance of dataset ordering; empirical sensitivity is low but not theoretically ensured.
Performance depends on the quality of the prior prefix; mismatched priors can degrade predictions, as seen in the ERA5 OOD split.
Experiments are limited to synthetic tasks and a single climate dataset; broader validation on clinical, causal, multimodal, and cross‑domain scientific tasks is needed.
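The first two limitations can be quantified with small helpers. The sketch below is illustrative only: it assumes one token per data point when estimating attention cost, and `predict_fn` stands for any hypothetical callable that maps a list of prior datasets and a target context to predictions.

```python
import itertools
import numpy as np

def relative_attention_cost(num_prior_tasks, points_per_task, target_points):
    """Quadratic self-attention cost with a prefix, relative to a target-only context."""
    total = num_prior_tasks * points_per_task + target_points
    return (total / target_points) ** 2

def permutation_spread(predict_fn, prior_datasets, target):
    """Largest change in any prediction across all orderings of the prior datasets."""
    outputs = [
        predict_fn([prior_datasets[i] for i in order], target)
        for order in itertools.permutations(range(len(prior_datasets)))
    ]
    outputs = np.asarray(outputs, dtype=float)
    return float(np.max(outputs.max(axis=0) - outputs.min(axis=0)))
```

For example, under the one-token-per-point assumption, three prior tasks of the same size as the target context make the sequence four times longer and the attention cost sixteen times higher. A spread of zero from `permutation_spread` would indicate an order-invariant predictor; enumerating all orderings is only practical for a handful of prior datasets.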
Source: Zhuanzhi (专知)
