OpenAI’s latest LLM, GPT-4.5, just became the first AI model ever to pass the Turing Test, a benchmark long held as a threshold for evaluating machine intelligence. But is it a fluke or the real deal?

AI representative image; Photo: SuPatMaN/Shutterstock

According to scientists from the University of California, San Diego, current LLMs may be able to substitute for humans in short conversations. This could potentially lead to “automation of jobs, improved social engineering attacks, and more general societal disruption.”

Participants in a recent test reportedly mistook GPT-4.5 for a human 73 percent of the time, well above the 50 percent rate expected from random chance. Though this is an impressive engineering feat, it doesn’t necessarily mean we’ve achieved artificial general intelligence (AGI).

“The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test,” the authors wrote. “The results have implications for debates about what kind of intelligence is exhibited by LLMs, and the social and economic impacts these systems are likely to have.”

The authors of the study instructed the LLM to adopt a “humanlike persona,” which essentially resulted in texts full of internet shorthand and socially awkward responses. This persona allowed the LLM to score high; without it, the model’s success rate fell to 36 percent.

This was a three-party test – meaning that participants simultaneously spoke with a real human and an AI and attempted to distinguish between them. In a post on X, study co-author Cameron Jones described this kind of test, which lasts around five minutes, as the “most widely accepted standard” version of the Turing test.

Did We Achieve AGI?

But does this mean that we’ve officially developed AGI? According to experts, the Turing test evaluates only one type of intelligence, whereas humans arguably possess as many as nine distinct intelligences under Howard Gardner’s theory of multiple intelligences (intrapersonal, interpersonal, existential, visual-spatial, and so on). Some researchers believe the results say more about humans than about the AI models.

“It’s no longer a test of machines, it’s a test of us. And increasingly, we’re failing. Because we no longer evaluate humanity based on cognitive substance. We evaluate it based on how it makes us feel. And that feeling—the ‘gut instinct,’ the ‘vibe’—is now the soft underbelly of our discernment. And LLMs, especially when persona-primed, can exploit it with uncanny accuracy,” John Nosta, founder of the think tank NostaLab, wrote in Psychology Today.