Large Language Models GPT-4.5 and LLaMa-3.1-405B Pass Standard Turing Test in UCSD Study
A UC San Diego study found that GPT-4.5 was judged human 73% of the time and LLaMa-3.1-405B 56% of the time in a standard three-party Turing test. This summary covers the study's methodology, results, and analysis of judge behavior.
Researchers from UC San Diego reported that LLaMa-3.1-405B and GPT-4.5, when prompted to adopt a persona, passed a standard three-party Turing test: GPT-4.5 was judged human 73% of the time and LLaMa-3.1-405B 56% of the time.
The study evaluated four systems (GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA) using two prompt types (NO-PERSONA and PERSONA) across two participant pools (UCSD psychology undergraduates and Prolific users). In each round, a judge conversed simultaneously with two witnesses, one human and one AI, for five minutes and then decided which of the two was the human. Each participant completed eight rounds.
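Results showed that GPT-4.5-PERSONA achieved a 73% win rate (the share of rounds in which the AI was judged to be the human) and LLaMa-PERSONA 56%. The NO-PERSONA variants of GPT-4.5 and LLaMa scored lower, at 36% and 38% respectively. The baseline systems ELIZA and GPT-4o-NO-PERSONA had the lowest win rates, at 23% and 21% respectively.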
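Judges could reliably distinguish ELIZA and GPT-4o-NO-PERSONA from humans, but their accuracy was no better than chance against GPT-4.5-PERSONA and LLaMa-PERSONA. For LLaMa-PERSONA, the 56% win rate was not significantly different from chance. For GPT-4.5-PERSONA, judges picked the model as the human significantly more often than they picked the actual human.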
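The comparison against chance can be illustrated with an exact two-sided binomial test. The sketch below is only an illustration of that idea, not the study's actual analysis: the function name `binomial_two_sided_p` is made up for this example, and the trial count of 100 per condition is a hypothetical placeholder, since this summary does not report the number of trials.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is at most as likely
    as the observed one under the null hypothesis (win rate == p).
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance so that ties are not lost to float rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL counts: 100 trials per condition is an assumption,
# chosen only so the counts match the reported 73% and 56% win rates.
p_gpt45 = binomial_two_sided_p(73, 100)
p_llama = binomial_two_sided_p(56, 100)

print(f"73/100 vs chance: p = {p_gpt45:.2g}")
print(f"56/100 vs chance: p = {p_llama:.2g}")
```

Under these assumed counts, a 73% win rate is very unlikely under a 50% null, while a 56% win rate is not, which matches the qualitative pattern described above.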
Analysis of judge strategies revealed that casual small‑talk and probing personal details were the most common tactics; judges who employed unusual utterances or LLM “jailbreak” techniques achieved higher accuracy.
The paper's figures show win rates with 95% bootstrap confidence intervals and the distribution of judges' confidence. The full study, titled "Large Language Models Pass the Turing Test", is available on arXiv: https://arxiv.org/pdf/2503.23674
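For readers unfamiliar with bootstrap confidence intervals, the following is a minimal sketch of a percentile bootstrap for a win rate. It is not the authors' code: the function name `bootstrap_win_rate_ci`, the resample count, and the 100-trial sample are all assumptions made for illustration.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a win rate.

    outcomes: list of 0/1 values, where 1 means the AI was judged human.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample the trials with replacement and record each resampled win rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]


# HYPOTHETICAL data: 73 wins out of 100 trials (the trial count is assumed).
outcomes = [1] * 73 + [0] * 27
low, high = bootstrap_win_rate_ci(outcomes)
print(f"win rate 0.73, 95% bootstrap CI: [{low:.2f}, {high:.2f}]")
```

An interval that lies entirely above 0.5 indicates a win rate above chance, and one that straddles 0.5 is consistent with chance.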
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.