Can a ‘Centaur’ AI Model Truly Predict Human Decisions? A Deep Dive

This article reviews Centaur, a foundation model fine‑tuned from Llama 3.1 70B on the Psych‑101 dataset, and assesses how well it predicts human choices, brain activity, and decision rationales across diverse psychological experiments. It also discusses generalization, over‑fitting, and the limitations that future research needs to address.

Data Party THU

Model and Dataset

The Helmholtz Munich team introduced a foundation model named Centaur. To train it, they assembled the Psych‑101 corpus, which aggregates trial‑level data from 160 published psychology experiments. The corpus contains decisions from 60,092 participants, totaling 10.68 million choices. Each experiment is transcribed into a standardized natural‑language format that preserves the participant’s choice, the experimental context, and any relevant stimulus information, with each prompt sized to fit a context window of roughly 32k tokens.
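To make the idea of a standardized text record concrete, here is a minimal sketch of how trial data might be rendered as text. The `Trial` fields, the function names, and the `<< >>` markers around the choice are illustrative assumptions for this example, not a description of the actual Psych‑101 format.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One recorded decision; all field names are illustrative assumptions."""
    context: str   # experimental instructions or cover story
    stimulus: str  # what the participant saw on this trial
    choice: str    # the response the participant gave


def render_trial(trial: Trial) -> str:
    """Turn one trial into a text line, marking the human choice with << >>."""
    return f"{trial.stimulus} You choose <<{trial.choice}>>."


def render_session(trials: list[Trial]) -> str:
    """Join a participant's trials under the context of the first trial."""
    if not trials:
        return ""
    lines = [trials[0].context] + [render_trial(t) for t in trials]
    return "\n".join(lines)


session = [
    Trial("You will pick one of two slot machines.", "Machines J and F are available.", "J"),
    Trial("You will pick one of two slot machines.", "Machines J and F are available.", "F"),
]
print(render_session(session))
```

Marking the choice explicitly makes it easy to locate the tokens a model is supposed to predict, which matters for the training objective and for evaluation.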

Figure 1: Psych‑101 collection and Centaur training pipeline

Fine‑tuning Procedure

Using the open‑source Llama 3.1 70B model as a base, the authors performed supervised fine‑tuning on the Psych‑101 corpus. The fine‑tuning objective was to predict the recorded human choice given the textual description of the trial. The resulting fine‑tuned model is referred to as Centaur.
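One common way to express such an objective is a negative log‑likelihood computed only over the tokens that encode the human choice. The sketch below assumes that masking scheme and a hypothetical `masked_nll` helper; the article does not specify how the loss was actually restricted.

```python
import math


def masked_nll(token_log_probs: list[float], is_choice: list[bool]) -> float:
    """Mean negative log-likelihood over the tokens flagged as human choices.

    token_log_probs[i] is the log-probability the model assigned to the token
    that actually occurred at position i; is_choice[i] marks response tokens.
    """
    if len(token_log_probs) != len(is_choice):
        raise ValueError("log-probs and mask must have the same length")
    picked = [lp for lp, m in zip(token_log_probs, is_choice) if m]
    if not picked:
        raise ValueError("mask selects no tokens")
    return -sum(picked) / len(picked)


# Toy values: only the last two positions are choice tokens.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.25)]
mask = [False, True, True]
print(masked_nll(log_probs, mask))
```

Under this assumption, the instructions and stimuli condition the prediction but contribute nothing to the loss, so the model is pushed to model behavior rather than to reproduce the task text.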

Benchmarking Across Decision‑Making Tasks

Centaur’s predictive performance was compared against the base Llama model (without fine‑tuning) on a battery of decision‑making tasks that span classic cognitive‑psychology paradigms:

Multi‑armed bandit (MAB) problems with dynamic reward probabilities.

Multi‑cue judgment tasks requiring integration of several informational cues.

Temporal response tasks (reaction‑time prediction).

Weather‑forecasting style prediction tasks.

Balloon‑risk simulations (risk‑taking under uncertainty).

Descriptive decision‑making scenarios (e.g., choice‑over‑options with verbal descriptions).

Across all categories, Centaur predicted participants’ choices significantly better than the baseline, with higher accuracy and higher log‑likelihood. The largest improvement was observed in the MAB task: a simple heuristic (repeat the last rewarded lever) fails to match human strategies, whereas Centaur captures the exploration‑exploitation balance that participants exhibit.
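The two metrics and the simple heuristic mentioned above can be sketched as follows. This is an illustrative baseline under stated assumptions: the `stay` probability of 0.9, the uniform fallback after an unrewarded trial, and the function names are choices made for this example, not details taken from the study.

```python
import math


def win_stay_probs(prev_choice, prev_rewarded, n_arms, stay=0.9):
    """Choice probabilities for a 'repeat the last rewarded arm' heuristic.

    After a rewarded trial the previous arm gets probability `stay` and the
    rest is split evenly among the other arms. Otherwise all arms are equal.
    """
    if prev_choice is None or not prev_rewarded:
        return [1.0 / n_arms] * n_arms
    other = (1.0 - stay) / (n_arms - 1)
    return [stay if a == prev_choice else other for a in range(n_arms)]


def evaluate(choices, rewards, n_arms, stay=0.9):
    """Return (accuracy, mean negative log-likelihood) of the heuristic."""
    if n_arms < 2 or not choices:
        raise ValueError("need at least two arms and one trial")
    hits, nll = 0, 0.0
    prev_choice, prev_rewarded = None, False
    for choice, reward in zip(choices, rewards):
        probs = win_stay_probs(prev_choice, prev_rewarded, n_arms, stay)
        # The predicted arm is the most probable one; ties go to the lowest index.
        hits += int(probs.index(max(probs)) == choice)
        nll -= math.log(probs[choice])
        prev_choice, prev_rewarded = choice, bool(reward)
    return hits / len(choices), nll / len(choices)


# Toy sequence of arm indices and rewards, made up for illustration.
accuracy, mean_nll = evaluate([0, 0, 1, 1], [1, 0, 1, 1], n_arms=2)
print(accuracy, mean_nll)
```

Any model that outputs choice probabilities can be scored the same way, which is what makes a side‑by‑side comparison between a heuristic, a base LLM, and a fine‑tuned model possible.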

Figure 2: Prediction performance of Centaur vs. baseline Llama across tasks

Generalization to Novel Scenarios

To assess over‑fitting, the authors generated systematic variations of the original experiment narratives (e.g., swapping a spaceship mission for a magical‑carpet quest). Centaur maintained high prediction accuracy on these story variants, indicating robustness to superficial contextual changes.

Furthermore, the model was evaluated on entirely new problem types that were not present in Psych‑101, such as conceptual reasoning tasks requiring causal inference. Even in these out‑of‑distribution settings, Centaur outperformed the baseline Llama, which suggests that its gains are not limited to the kinds of tasks it was trained on.

Figure 4: Centaur performance on story variants and novel problem types

Neuro‑imaging Validation

An fMRI study with 94 participants recorded neural activation while subjects performed decision‑making trials. The authors compared the trial‑by‑trial predictions of Centaur and the baseline Llama to the observed BOLD responses in decision‑related brain regions (e.g., dorsolateral prefrontal cortex, ventromedial prefrontal cortex). Centaur’s predicted choices showed a stronger correlation with the measured activation patterns, suggesting that the model captures aspects of the neural substrates underlying human decisions.
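As a rough illustration of what such a comparison involves, the sketch below computes a Pearson correlation between a model‑derived signal and a measured signal. The numbers are made up for the example and the variable names are hypothetical; the study's actual analysis pipeline is not described in this article and is likely more involved.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy, made-up numbers purely to show the call; not data from the study.
bold = [0.2, 0.5, 0.1, 0.9, 0.4]
model_a = [0.1, 0.6, 0.2, 0.8, 0.5]  # hypothetical fine-tuned model signal
model_b = [0.5, 0.4, 0.6, 0.5, 0.3]  # hypothetical baseline signal
print(pearson_r(model_a, bold), pearson_r(model_b, bold))
```

Comparing the two coefficients per brain region is one simple way to ask which model's trial‑by‑trial output tracks the measured activity more closely.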

Figure 5: Brain‑activity alignment of Centaur vs. baseline Llama

Extracting Interpretable Decision Heuristics

Centaur was used to simulate participants in a multi‑attribute decision‑making task where each option is evaluated by several expert estimates with differing confidence levels. The simulated choices were then given to the language model DeepSeek‑R1, which was prompted to infer the underlying decision rule. DeepSeek‑R1 identified a "minimum‑regret" heuristic: participants appear to choose the option that minimizes expected post‑decision regret.

When the minimum‑regret rule was implemented directly as a deterministic policy, its predictive accuracy matched that of Centaur, confirming that the heuristic faithfully captures the dominant strategy in the simulated data.
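A deterministic policy of this kind might look like the sketch below. The regret definition used here (the gap between each expert's best‑rated option and the chosen one, weighted by that expert's confidence) is an assumption made for illustration, as are the function names; the article does not give the exact formula.

```python
def expected_regret(option, ratings, confidence):
    """Confidence-weighted regret of picking `option`.

    ratings[e][o] is expert e's estimate for option o and confidence[e] is the
    weight given to expert e. Regret under expert e is the gap between that
    expert's best-rated option and the chosen one.
    """
    total = sum(confidence)
    return sum(
        w / total * (max(row) - row[option])
        for row, w in zip(ratings, confidence)
    )


def min_regret_choice(ratings, confidence):
    """Deterministic policy: pick the option with the lowest expected regret."""
    n_options = len(ratings[0])
    return min(range(n_options), key=lambda o: expected_regret(o, ratings, confidence))


def agreement(trials, observed_choices):
    """Fraction of trials where the rule matches the observed choice."""
    hits = sum(
        min_regret_choice(r, c) == choice
        for (r, c), choice in zip(trials, observed_choices)
    )
    return hits / len(observed_choices)


# Two experts rate two options; the first expert is more confident.
ratings = [[7, 4], [3, 6]]
print(min_regret_choice(ratings, [0.8, 0.2]))
```

Note that under this particular definition the rule selects the same option as picking the highest confidence‑weighted estimate, because each expert's best rating is the same for every candidate option. Other regret definitions, such as a worst‑case one, would not reduce this way.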

Figure 6: Minimum‑regret heuristic derived from Centaur‑generated data

Limitations and Caveats

Dataset bias: Psych‑101 is dominated by experiments conducted with educated Western participants, limiting cultural and demographic generality.

Lack of moral reasoning framework: The model predicts choices in moral dilemmas but does not generate an explicit moral justification or value system.

Restricted behavioral diversity: Fine‑tuned LLMs exhibit limited variance and struggle to emulate the extreme psychological states observed in classic experiments (e.g., the Stanford Prison and Milgram experiments).

Heuristic extraction dependence on training data: Models like DeepSeek‑R1 may retrieve known heuristics from their pre‑training corpora rather than discover novel principles.

Future work should expand the training corpus to include more diverse cultural and socioeconomic groups, improve interpretability of internal representations, and explore hybrid pipelines where LLMs assist but do not replace human participants in experimental designs.

References

[1] Binz, M. et al. (2025). A foundation model to predict and capture human cognition. Nature. https://doi.org/10.1038/s41586-025-09215-4

[2] Gandhi, K. et al. (2024). Human‑like affective cognition in foundation models. arXiv. https://arxiv.org/abs/2409.11733

[3] Zhang, Y. et al. (2025). The high‑dimensional psychological profile of ChatGPT. Science China Technological Sciences. https://doi.org/10.1007/s11431-025-2934-8

[4] Abdurahman, S. et al. (2024). Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3. https://doi.org/10.1093/pnasnexus/pgae245
