
Multi-Task Bayesian In-Context Learning: Transformers Adapt to New Priors

The ICML 2026 paper reframes in‑context learning as approximate Bayesian inference, introduces explicit prior datasets as a context prefix for Transformers, and demonstrates through synthetic and real‑world experiments that this multi‑task approach closely matches Bayesian oracles while offering fast, controllable inference.

Data Party THU

Jul 2, 2026


Problem Motivation

In‑context learning (ICL) treats a new task as a sequence of examples fed to a Transformer, but existing ICL models embed the prior distribution implicitly in their weights. Consequently, they cannot change the prior at test time, limiting applicability to scenarios where the prior varies across users, domains, or time.

Core Idea: Prior as a Context Prefix

The paper frames each episode as a hierarchical Bayesian model. An episode‑level hyper‑parameter is sampled from a meta‑distribution and generates multiple prior datasets (the first K tasks) followed by a target dataset. The prior datasets are placed at the beginning of the Transformer’s input sequence, and the target data follow. During training, the model learns to infer the episode‑level prior from the prefix and then produce posterior predictions for the target. At test time, swapping the prefix datasets changes the prior without any parameter fine‑tuning.
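To make the episode layout concrete, here is a minimal sketch of how one such sequence could be assembled. It assumes scalar linear‑regression tasks, a Gaussian meta‑distribution, and a per‑token dataset index; these choices, and the names `sample_dataset` and `build_episode`, are illustrative assumptions rather than the paper's actual data pipeline.

```python
import random

def sample_dataset(w, n, noise_std, rng):
    """Draw n (x, y) pairs from y = w * x + noise for a scalar weight w."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = w * x + rng.gauss(0.0, noise_std)
        data.append((x, y))
    return data

def build_episode(K, n_per_task, rng, noise_std=0.1):
    """Sample one episode: K prior datasets followed by one target dataset."""
    # Episode-level hyper-parameter: here, the mean of the task-weight prior.
    # The meta-distribution N(0, 2^2) is an assumption made for illustration.
    mu = rng.gauss(0.0, 2.0)
    datasets = []
    for _ in range(K + 1):
        w = rng.gauss(mu, 0.5)  # task-level weight drawn from the episode prior
        datasets.append(sample_dataset(w, n_per_task, noise_std, rng))
    prior_datasets, target = datasets[:K], datasets[K]
    # Flatten into one sequence: prior prefix first, then the target data.
    # Each token carries a dataset index so the tasks stay distinguishable.
    sequence = [(k, x, y) for k, d in enumerate(prior_datasets) for (x, y) in d]
    sequence += [(K, x, y) for (x, y) in target]
    return mu, sequence

rng = random.Random(0)
mu, seq = build_episode(K=3, n_per_task=4, rng=rng)
print(len(seq), "tokens; dataset index of last token:", seq[-1][0])
```

Swapping the prior at test time then amounts to rebuilding the first `K * n_per_task` tokens from different datasets while leaving the target tokens untouched.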

Comparison:

Standard ICL: context contains only target evidence; the prior is baked into model weights.

Multi‑task Bayesian ICL: context contains both prior datasets and target data; the prior can be controlled via the prefix.
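The contrast can also be written as two predictive distributions. The notation below is an assumption introduced for illustration, not taken from the paper: D_T is the target dataset, D_{1:K} the prior datasets, \theta the task parameters, and \phi the episode‑level hyper‑parameter.

```latex
% Standard ICL: the prior p(\theta) is fixed by the training distribution.
p(y_\ast \mid x_\ast, D_T)
  = \int p(y_\ast \mid x_\ast, \theta)\, p(\theta \mid D_T)\, d\theta,
\qquad
p(\theta \mid D_T) \propto p(D_T \mid \theta)\, p(\theta)

% Multi-task Bayesian ICL: the prefix D_{1:K} informs the hyper-parameter \phi.
p(y_\ast \mid x_\ast, D_T, D_{1:K})
  = \iint p(y_\ast \mid x_\ast, \theta)\, p(\theta \mid D_T, \phi)\,
          p(\phi \mid D_{1:K}, D_T)\, d\theta\, d\phi
```

In the second expression the prefix enters only through the posterior over \phi, which is what makes it behave as a prior rather than as additional target evidence.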

Model Details

A small GPT‑2‑style Transformer is trained with negative log‑likelihood. For regression tasks the model outputs a Gaussian mean and variance; for classification/logistic regression it outputs class probabilities.
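The two output heads correspond to two standard per‑example negative log‑likelihood terms. The sketch below writes them out in plain Python; the function names are hypothetical, and the paper's training code may parameterize the variance or logits differently.

```python
import math

def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under N(mean, var), for regression."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def bernoulli_nll(label, prob):
    """NLL of a binary label (0 or 1) given the predicted P(label = 1)."""
    return -math.log(prob if label == 1 else 1.0 - prob)

# A confident but wrong regression prediction is penalized more heavily
# than an equally wrong prediction that reports a larger variance.
print(gaussian_nll(y=2.0, mean=0.0, var=0.1))
print(gaussian_nll(y=2.0, mean=0.0, var=2.0))
```

Because the Gaussian term depends on the predicted variance, minimizing it pushes the model toward calibrated uncertainty, not only accurate means.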

Experimental Setup

Four families of experiments increase in complexity:

Gaussian priors for linear and logistic regression (sanity check against a Bayesian oracle).

Student‑t heavy‑tailed priors to test out‑of‑meta‑distribution generalization.

Flow‑based high‑dimensional priors to assess scalability.

ERA5 spatiotemporal temperature prediction as a real‑world task.

Baselines include an MCMC oracle, SVI, hierarchical MCMC/SVI, and ICL with and without the prior prefix.

Result 1: Prior‑Prefix ICL Matches Bayesian Oracle

In linear regression, KL divergence between model posteriors and the oracle shows that multi‑task ICL with the prior prefix matches hierarchical MCMC across context lengths and test priors, while vanilla ICL degrades under prior shift. Logistic regression experiments show the same pattern: with few target samples the prior dominates, and the prefix‑enabled model narrows the gap to the oracle.
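For intuition about this metric, consider the simplest case where the oracle is available in closed form: a scalar weight with a Gaussian prior and known noise variance. The sketch below computes the conjugate posterior and the KL divergence between two Gaussians. It is an illustrative stand‑in, assuming a one‑dimensional model and made‑up data, and it does not reproduce the paper's MCMC baselines.

```python
import math

def posterior(xs, ys, prior_mean, prior_var, noise_var):
    """Posterior N(mean, var) over w for y = w * x + noise, with prior N(prior_mean, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    weighted = prior_mean / prior_var + sum(x * y for x, y in zip(xs, ys)) / noise_var
    return weighted / precision, 1.0 / precision

def kl_gaussian(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

xs, ys = [1.0, 1.0], [2.0, 2.0]  # toy target context
# Oracle uses the true test-time prior; "shifted" keeps a stale training prior,
# mimicking a model whose prior is baked into its weights.
oracle = posterior(xs, ys, prior_mean=2.0, prior_var=1.0, noise_var=1.0)
shifted = posterior(xs, ys, prior_mean=0.0, prior_var=1.0, noise_var=1.0)
gap = kl_gaussian(*shifted, *oracle)
print(oracle, shifted, gap)
```

With only two target points the stale prior still pulls the posterior mean away from the oracle, which is the regime where a prior prefix is expected to matter most.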

Result 2: Prefix Acts as Prior, Not Extra Target Data

A prior‑adaptability check fixes the target context and varies only the prior prefix. The model’s output logits change systematically with the prefix and align more closely with oracle MCMC than with a pooled‑data MCMC, demonstrating genuine conditioning on the prior.
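The difference between "prefix as prior" and "prefix as extra target data" can be seen in a toy calculation. The sketch below fixes a target context and swaps the prefix, comparing a crude hierarchical estimate with a pooled fit. As an assumption made to keep the example short, per‑task least squares plus moment matching stands in for hierarchical MCMC; all data values are made up.

```python
def least_squares(xs, ys):
    """Least-squares slope through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def hierarchical_mean(prior_sets, target, noise_var):
    """Posterior mean for the target weight, using the prefix only to set the prior."""
    ws = [least_squares(xs, ys) for xs, ys in prior_sets]
    mu = sum(ws) / len(ws)                                  # estimated prior mean
    tau2 = sum((w - mu) ** 2 for w in ws) / (len(ws) - 1)   # estimated prior variance
    xs, ys = target
    precision = 1.0 / tau2 + sum(x * x for x in xs) / noise_var
    return (mu / tau2 + sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision

def pooled_mean(prior_sets, target):
    """Single fit that treats every prefix point as if it were target data."""
    everything = prior_sets + [target]
    xs = [x for d in everything for x in d[0]]
    ys = [y for d in everything for y in d[1]]
    return least_squares(xs, ys)

def make_prior_sets(weights):
    xs = [1.0, 2.0, 3.0]
    return [(xs, [w * x for x in xs]) for w in weights]

target = ([1.0] * 4, [3.0] * 4)              # fixed target context, slope 3
low = make_prior_sets([0.0, 1.0, 2.0])       # prefix centered at 1
high = make_prior_sets([4.0, 5.0, 6.0])      # prefix centered at 5
for name, prefix in (("low", low), ("high", high)):
    print(name, hierarchical_mean(prefix, target, 1.0), pooled_mean(prefix, target))
```

In this toy setting the hierarchical estimate shifts with the prefix but stays close to the target evidence (2.6 and 3.4), while the pooled fit is dragged toward the prefix weights (about 1.17 and 4.83).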

Result 3: Robust Generalization under Heavy‑Tailed Priors

Using Student‑t priors with decreasing degrees of freedom, multi‑task ICL maintains low divergence as long as the training meta‑distribution covers comparably heavy tails. When test priors are heavier‑tailed than anything in the training mixture, performance drops sharply, indicating a clear generalization threshold.
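To see what "decreasing degrees of freedom" means in practice, the sketch below samples Student‑t variates with the standard normal over chi‑square construction and measures how often draws land far from zero. The threshold, sample size, and df values are arbitrary choices for illustration, not the paper's settings.

```python
import math
import random

def sample_student_t(df, rng):
    """Draw one Student-t variate as z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

def tail_fraction(df, threshold, n, rng):
    """Empirical fraction of draws whose magnitude exceeds the threshold."""
    hits = sum(abs(sample_student_t(df, rng)) > threshold for _ in range(n))
    return hits / n

rng = random.Random(0)
fractions = {df: tail_fraction(df, threshold=4.0, n=20000, rng=rng) for df in (30, 5, 2)}
for df, frac in fractions.items():
    print(f"df={df}: fraction with |t| > 4 is {frac:.4f}")
```

Lower df puts far more mass in the tails, so a model trained only on near‑Gaussian priors rarely sees the extreme task weights that a low‑df test prior produces.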

Result 4: Speed Advantage with High‑Dimensional Flow Priors

Flow‑based priors transform a Gaussian base into structured high‑dimensional task distributions. Multi‑task ICL achieves oracle‑level prediction quality with inference times orders of magnitude lower than MCMC, whose warm‑up and per‑sample costs dominate.
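As a rough picture of what a flow‑based prior does, the sketch below pushes a Gaussian base sample through a small invertible autoregressive map, so that each weight depends on the one before it. The specific transform, its parameters, and the function names are assumptions chosen for brevity; the paper's flow architecture is not described in this summary.

```python
import math
import random

def flow_forward(z, scale=0.5, shift=1.0):
    """Map a Gaussian base sample z to a structured weight vector w.

    Each coordinate is an affine function of z[i] whose scale and shift
    depend on the previous output, which keeps the map invertible.
    """
    w, prev = [], 0.0
    for zi in z:
        wi = zi * math.exp(scale * math.tanh(prev)) + shift * math.sin(prev)
        w.append(wi)
        prev = wi
    return w

def flow_inverse(w, scale=0.5, shift=1.0):
    """Recover the base sample z from w by undoing each affine step in order."""
    z, prev = [], 0.0
    for wi in w:
        z.append((wi - shift * math.sin(prev)) * math.exp(-scale * math.tanh(prev)))
        prev = wi
    return z

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]
w = flow_forward(z)
z_back = flow_inverse(w)
print(max(abs(a - b) for a, b in zip(z, z_back)))
```

Sampling from such a prior is cheap, but posterior inference under it has no closed form, which is why MCMC cost becomes the bottleneck that an amortized model avoids.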

Real‑World Evaluation: ERA5 Temperature Forecasting

Prior datasets are taken from one time window and the target from another. In the IID split, adding two prior datasets improves validation and test performance. In the OOD split, gains are mixed; when seasonal shifts misalign validation and test distributions, the prior prefix can hurt test performance, illustrating sensitivity to prior relevance.

Significance

The method decouples priors from model weights by representing them as explicit data‑driven prefixes. This enables hierarchical Bayesian prediction, eliminates the need for fine‑tuning when priors change, and aligns amortized neural inference with traditional Bayesian oracles.

Code repository: https://github.com/martianmartina/multi-task-bayesian-icl/

Paper: https://arxiv.org/abs/2606.20538

Limitations

Transformer attention cost grows quadratically with sequence length, making many prior tasks expensive.

The architecture does not guarantee permutation invariance of dataset ordering; empirical sensitivity is low but not theoretically ensured.

Performance depends on the quality of the prior prefix; mismatched priors can degrade predictions, as seen in the ERA5 OOD split.

Experiments are limited to synthetic tasks and a single climate dataset; broader validation on clinical, causal, multimodal, and cross‑domain scientific tasks is needed.
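The first limitation is easy to quantify with back‑of‑the‑envelope arithmetic. The sketch below assumes one token per data point and counts only the quadratic self‑attention term; both are simplifying assumptions, and the function names are hypothetical.

```python
def sequence_length(num_prior_tasks, points_per_task, target_points):
    """Total context length with the prior prefix placed before the target data."""
    return num_prior_tasks * points_per_task + target_points

def relative_attention_cost(num_prior_tasks, points_per_task, target_points):
    """Self-attention cost relative to a target-only context (ratio of length squared)."""
    full = sequence_length(num_prior_tasks, points_per_task, target_points)
    return (full / target_points) ** 2

for k in (0, 2, 8):
    length = sequence_length(k, 50, 50)
    print(k, "prior tasks:", length, "tokens,", relative_attention_cost(k, 50, 50), "x cost")
```

Under these assumptions, two prior datasets of the same size as the target make attention 9 times more expensive, and eight make it 81 times more expensive.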

Code example

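Source: Zhuanzhi (专知). The original article runs to about 4,000 characters, with a suggested reading time of 5 minutes. It opens with the question: if we view in‑context learning as a form of approximate Bayesian inference, how does the model actually know what the "prior" is?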



Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
