Large Language Models GPT-4.5 and LLaMa-3.1-405B Pass Standard Turing Test in UCSD Study

A UC San Diego study found that, when prompted to adopt a persona, GPT-4.5 was judged to be human 73% of the time and LLaMa-3.1-405B 56% of the time in a standard three-party Turing test. This summary covers the methodology, the results, and the analysis of judge behavior.

DataFunTalk

Apr 3, 2025

Researchers from UC San Diego reported that the AI systems GPT-4.5 and LLaMa-3.1-405B passed a standard three-party Turing test: with a persona prompt, GPT-4.5 was judged to be human 73% of the time and LLaMa-3.1-405B 56% of the time.

The study evaluated four systems (GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA) using two prompt types (NO-PERSONA and PERSONA) across two participant pools (UCSD psychology undergraduates and Prolific users). In each round, a judge held simultaneous five-minute text conversations with two witnesses, one human and one AI, and then decided which of the two was the human. Each participant completed eight rounds.

The win rate is the share of rounds in which the judge picked the AI witness as the human. GPT-4.5-PERSONA achieved a 73% win rate and LLaMa-PERSONA 56%, while the NO-PERSONA variants scored lower (36% for GPT-4.5 and 38% for LLaMa). The baseline systems had the lowest win rates: 23% for ELIZA and 21% for GPT-4o-NO-PERSONA.
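As a rough illustration of how per-condition win rates like these can be tallied, here is a minimal Python sketch. It assumes each round is recorded as a pair of a condition label and a flag saying whether the judge picked the AI as the human. The record layout and the function name `win_rates` are assumptions made for this example, not the study's actual data format or code.

```python
from collections import defaultdict


def win_rates(rounds):
    """Return {condition: win rate} from (condition, ai_judged_human) pairs.

    A "win" is a round in which the judge chose the AI witness as the human.
    The input layout is hypothetical, chosen only for this illustration.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for condition, ai_judged_human in rounds:
        totals[condition] += 1
        wins[condition] += bool(ai_judged_human)
    return {condition: wins[condition] / totals[condition] for condition in totals}


# Toy records, not data from the study.
example_rounds = [
    ("PERSONA", True),
    ("PERSONA", True),
    ("PERSONA", False),
    ("NO-PERSONA", False),
]
rates = win_rates(example_rounds)
```

Because each round pairs one AI with one human, a win rate of 50% corresponds to judges guessing at random.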

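Judges reliably distinguished ELIZA and GPT-4o-NO-PERSONA from humans; both baselines had win rates significantly below the 50% chance level. For the PERSONA models, judges did no better than chance. LLaMa-PERSONA's 56% win rate was not significantly different from chance, and GPT-4.5-PERSONA's 73% was significantly above it, meaning judges picked GPT-4.5 as the human more often than they picked the actual human.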
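To make the comparison against chance concrete, the sketch below runs an exact two-sided binomial test against a 50% chance level using only the Python standard library. This is one common way to test a proportion and is not necessarily the analysis the authors ran. The function names are invented for this example, and the trial count of 100 is a made-up round number, not the study's sample size.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_two_sided_p(wins, trials, chance=0.5):
    """Exact two-sided p-value for H0: true win rate == chance.

    Sums the probability of every outcome that is at most as likely
    as the observed one under H0.
    """
    observed = binomial_pmf(wins, trials, chance)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(
        pmf
        for k in range(trials + 1)
        if (pmf := binomial_pmf(k, trials, chance)) <= observed * tolerance
    )
    return min(1.0, total)


# Hypothetical counts: 100 rounds per condition is an assumption.
p_high = binomial_two_sided_p(73, 100)  # a 73% win rate
p_near = binomial_two_sided_p(56, 100)  # a 56% win rate
```

With these illustrative counts, a 73% win rate is very unlikely under pure guessing, while a 56% win rate is consistent with it. The real conclusion depends on the actual number of rounds per condition.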

Analysis of judge strategies revealed that casual small‑talk and probing personal details were the most common tactics; judges who employed unusual utterances or LLM “jailbreak” techniques achieved higher accuracy.

The paper's figures show win rates with 95% bootstrap confidence intervals, along with the distribution of judges' confidence in their verdicts. The full study, titled "Large Language Models Pass the Turing Test", is available on arXiv: https://arxiv.org/pdf/2503.23674
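For readers unfamiliar with bootstrap confidence intervals, here is a minimal sketch of the percentile bootstrap applied to a win rate. It assumes round outcomes are stored as a list of 0s and 1s. The function name, the resample count, and the example data are assumptions for illustration, and the paper's exact bootstrap procedure may differ.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=5000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a win rate.

    outcomes: list of 0/1 values, 1 meaning the AI was judged human.
    Resamples the rounds with replacement, recomputes the win rate each
    time, and returns the central `level` range of those resampled rates.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = rates[int(alpha * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical data: 73 wins in 100 rounds (the count of 100 is made up).
example_outcomes = [1] * 73 + [0] * 27
ci_low, ci_high = bootstrap_win_rate_ci(example_outcomes)
```

If the resulting interval excludes 0.5, the win rate differs from chance at roughly the chosen confidence level. An interval that straddles 0.5 leaves the question open.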
