Can a ‘Centaur’ AI Model Truly Predict Human Decisions? A Deep Dive
This article reviews the Centaur foundation model, fine‑tuned from Llama 3‑70B on the Psych‑101 dataset, and assesses its ability to predict human choices, brain activity, and decision rationales across diverse psychological experiments. It also discusses generalization, over‑fitting, and the limitations that future research needs to address.
Model and Dataset
The Helmholtz Munich team introduced a foundation model named Centaur. To train it, they assembled the Psych‑101 corpus, which aggregates trial‑level data from 160 published psychology experiments. The corpus contains decisions from 60,092 participants, totaling 10.68 million choices (about 178 per participant on average). Each experiment is transcribed into standardized natural‑language text that preserves the participant’s choices, the experimental context, and any relevant stimulus information, with each transcript kept within a context window of roughly 32k tokens.
Fine‑tuning Procedure
Using the open‑source Llama 3‑70B model as a base, the authors performed supervised fine‑tuning on the Psych‑101 corpus. The fine‑tuning objective was to predict the recorded human choice given the textual description of the trial. The resulting fine‑tuned model is Centaur.
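The objective described above can be illustrated with a small sketch. It assumes the loss is the negative log‑likelihood of only those tokens that encode a participant’s choice, with the remaining prompt tokens masked out; the function name `masked_nll` and the toy probabilities are hypothetical and are not taken from the authors’ code.

```python
import math

def masked_nll(token_probs, is_choice_token):
    """Mean negative log-likelihood over tokens marked as human choices.

    token_probs: probability the model assigned to each observed token.
    is_choice_token: True where the token encodes a participant's choice.
    """
    losses = [-math.log(p) for p, m in zip(token_probs, is_choice_token) if m]
    if not losses:
        raise ValueError("no choice tokens in this sequence")
    return sum(losses) / len(losses)

# Toy sequence (assumed values): five tokens, the 3rd and 5th are choices.
probs = [0.9, 0.8, 0.5, 0.7, 0.25]
mask = [False, False, True, False, True]
loss = masked_nll(probs, mask)  # averages -log(0.5) and -log(0.25)
```

Masking the instruction and stimulus tokens keeps the gradient focused on behavior rather than on reproducing the experiment text, which is one plausible reading of "predict the recorded human choice".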
Benchmarking Across Decision‑Making Tasks
Centaur’s predictive performance was compared against the un‑fine‑tuned Llama 3 on a battery of decision‑making tasks that span classic cognitive‑psychology paradigms:
Multi‑armed bandit (MAB) problems with dynamic reward probabilities.
Multi‑cue judgment tasks requiring integration of several informational cues.
Temporal response tasks (reaction‑time prediction).
Weather‑forecasting style prediction tasks.
Balloon‑risk simulations (risk‑taking under uncertainty).
Descriptive decision‑making scenarios (e.g., choice‑over‑options with verbal descriptions).
Across all categories, Centaur achieved statistically significant gains in prediction accuracy and log‑likelihood relative to the baseline. The largest improvement was observed in the MAB task: a simple heuristic (repeat the last rewarded lever) fails to reproduce human behavior there, whereas Centaur captures the exploration‑exploitation balance that participants exhibit.
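To make the two metrics and the heuristic baseline concrete, here is a minimal sketch. The helper names, the `default` starting arm, and the toy choice and reward sequences are assumptions for illustration; they are not the evaluation code or data used in the study.

```python
import math

def repeat_last_rewarded(choices, rewards, default=0):
    """Predict each trial's choice as the most recently rewarded arm."""
    predictions = []
    last_rewarded = default
    for choice, reward in zip(choices, rewards):
        predictions.append(last_rewarded)  # predict before seeing the trial
        if reward > 0:
            last_rewarded = choice
    return predictions

def accuracy(predictions, choices):
    """Fraction of trials where the predicted arm equals the chosen arm."""
    hits = sum(p == c for p, c in zip(predictions, choices))
    return hits / len(choices)

def mean_log_likelihood(choice_probs):
    """Average log-probability a model assigned to the choices actually made."""
    return sum(math.log(p) for p in choice_probs) / len(choice_probs)

# Toy two-armed bandit session (made-up values).
choices = [0, 0, 1, 1, 0, 1]
rewards = [1, 0, 1, 1, 0, 0]
preds = repeat_last_rewarded(choices, rewards)
acc = accuracy(preds, choices)
```

Accuracy only scores the single most likely option, while log‑likelihood also rewards a model for assigning graded probabilities, which matters when participants explore.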
Generalization to Novel Scenarios
To assess over‑fitting, the authors generated systematic variations of the original experiment narratives (e.g., swapping a spaceship mission for a magical‑carpet quest). Centaur maintained high prediction accuracy on these story variants, indicating robustness to superficial contextual changes.
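A cover‑story variation of this kind can be sketched as a plain text substitution that leaves the task structure untouched. The prompt wording, the replacement table, and the function `swap_cover_story` below are hypothetical; the article does not describe how the authors generated their variants.

```python
import re

def swap_cover_story(prompt, replacements):
    """Replace cover-story phrases while leaving numbers and structure intact."""
    # Longest phrases first so a short key never pre-empts a longer one.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: replacements[m.group(0)], prompt)

# Assumed example prompt and mapping, for illustration only.
original = "You pilot a spaceship to planet X. You find 3 treasures."
mapping = {"pilot a spaceship": "ride a magic carpet", "planet": "castle"}
variant = swap_cover_story(original, mapping)
```

Because rewards, options, and trial order are unchanged, any drop in accuracy on the variant would point to reliance on surface wording rather than on task structure.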
Furthermore, the model was evaluated on entirely new problem types that were not present in Psych‑101, such as conceptual reasoning tasks requiring causal inference. Even in these out‑of‑distribution settings, Centaur outperformed the baseline Llama, which suggests that its gains are not confined to the tasks it was trained on.
Neuro‑imaging Validation
An fMRI study with 94 participants recorded neural activation while subjects performed decision‑making trials. The authors tested how well Centaur and the baseline Llama could account for the trial‑by‑trial BOLD responses observed in decision‑related brain regions (e.g., dorsolateral prefrontal cortex, ventromedial prefrontal cortex). Centaur’s internal representations aligned more closely with the measured activation patterns than those of the baseline, suggesting that the model captures aspects of the neural substrates underlying human decisions.
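The exact alignment measure is not given here. One common way to compare a model‑derived signal with a measured BOLD time course is the Pearson correlation, sketched below under that assumption; the function name and all numbers are made up for illustration and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: one model-derived signal and one BOLD series.
model_signal = [0.1, 0.4, 0.35, 0.8, 0.6]
bold_signal = [0.2, 0.5, 0.30, 0.9, 0.7]
r = pearson_r(model_signal, bold_signal)
```

Computing `r` separately for the fine‑tuned and baseline models in the same region would give the kind of side‑by‑side comparison the paragraph describes.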
Extracting Interpretable Decision Heuristics
Centaur was used to simulate participants in a multi‑attribute decision‑making task where each option is evaluated by several expert estimates with differing confidence levels. The simulated choice set was fed to the language model DeepSeek‑R1, which was prompted to infer the underlying decision rule. DeepSeek‑R1 identified a "minimum‑regret" heuristic: participants appear to choose the option that minimizes expected post‑decision regret.
When the minimum‑regret rule was implemented directly as a deterministic policy, its predictive accuracy matched that of Centaur, confirming that the heuristic faithfully captures the dominant strategy in the simulated data.
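As an illustration of what such a deterministic policy could look like, the sketch below takes one possible reading of "minimum regret": for each expert, the regret of an option is the gap to that expert’s best‑rated option, and regrets are averaged with the experts’ confidence as weights. This formalization, the function names, and the toy numbers are assumptions; the article does not give the rule’s exact definition.

```python
def expected_regret(estimates, confidences):
    """Confidence-weighted regret of each option.

    estimates[e][i]: expert e's value estimate for option i.
    confidences[e]: non-negative weight for expert e.
    """
    total = sum(confidences)
    regrets = [0.0] * len(estimates[0])
    for row, weight in zip(estimates, confidences):
        best = max(row)  # this expert's best-rated option
        for i, value in enumerate(row):
            regrets[i] += (weight / total) * (best - value)
    return regrets

def min_regret_choice(estimates, confidences):
    """Index of the option with the smallest expected regret (first on ties)."""
    regrets = expected_regret(estimates, confidences)
    return min(range(len(regrets)), key=regrets.__getitem__)

# Toy trial (assumed values): two experts rate two options.
estimates = [[8, 5], [4, 6]]
confidences = [3, 1]
regrets = expected_regret(estimates, confidences)
choice = min_regret_choice(estimates, confidences)
```

Here the more confident expert favors the first option, so choosing it carries the smaller weighted regret even though the second expert disagrees.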
Limitations and Caveats
Dataset bias: Psych‑101 is dominated by experiments conducted with educated Western participants, limiting cultural and demographic generality.
Lack of moral reasoning framework: The model predicts choices in moral dilemmas but does not generate an explicit moral justification or value system.
Restricted behavioral diversity: Fine‑tuned LLMs exhibit limited variance and struggle to emulate the extreme psychological states observed in classic experiments (e.g., the Stanford prison experiment, the Milgram obedience studies).
Heuristic extraction dependence on training data: Models like DeepSeek‑R1 may retrieve known heuristics from their pre‑training corpora rather than discover novel principles.
Future work should expand the training corpus to include more diverse cultural and socioeconomic groups, improve interpretability of internal representations, and explore hybrid pipelines where LLMs assist but do not replace human participants in experimental designs.
References
[1] Binz, M. et al. (2025). A foundation model to predict and capture human cognition. Nature. https://doi.org/10.1038/s41586-025-09215-4
[2] Gandhi, K. et al. (2024). Human‑like affective cognition in foundation models. arXiv. https://arxiv.org/abs/2409.11733
[3] Zhang, Y. et al. (2025). The high‑dimensional psychological profile of ChatGPT. Science China Technological Sciences. https://doi.org/10.1007/s11431-025-2934-8
[4] Abdurahman, S. et al. (2024). Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3. https://doi.org/10.1093/pnasnexus/pgae245
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
