Artificial Intelligence & Machine Learning Articles



openai gpt :: Article Creator

OpenAI Set To Launch GPT-4.1 And Other Models As Early As Next Week

Key Takeaways
  • OpenAI is preparing to release several new AI models, potentially as early as next week.
  • The flagship release will likely be GPT-4.1 — an enhanced version of the GPT-4o multimodal model.
  • Share this article

    OpenAI plans to launch several new AI models, including GPT-4.1, a revamped version of its GPT-4o multimodal model, The Verge reported today, citing sources familiar with the company's plans.

    The company is expected to release GPT-4.1 alongside smaller GPT-4.1 mini and nano versions as early as next week. OpenAI is also preparing to launch the full version of its o3 reasoning model and an o4 mini version.

    The report comes after OpenAI's CEO Sam Altman said earlier this month that the company planned to release the o3 and o4-mini models "in a couple of weeks."

    The release will be part of OpenAI's strategy to incrementally improve its AI offerings before launching the GPT-5 model, which is expected later in 2025.

    AI engineer Tibo Blaho discovered references to o4 mini, o4 mini high, and o3 in a new ChatGPT web version, indicating these additions are imminent.

    The launch timeline could face delays due to capacity issues, according to sources. Last month, OpenAI had to temporarily limit requests due to high demand for its advanced image generation features, with Altman stating "our GPUs are melting" due to usage from ChatGPT's free tier users.

    Share this article


    OpenAI Helps Spammers Plaster 80,000 Sites With Messages That Bypassed Filters

    "AkiraBot's use of LLM-generated spam message content demonstrates the emerging challenges that AI poses to defending websites against spam attacks," SentinelLabs researchers Alex Delamotte and Jim Walter wrote. "The easiest indicators to block are the rotating set of domains used to sell the Akira and ServiceWrap SEO offerings, as there is no longer a consistent approach in the spam message contents as there were with previous campaigns selling the services of these firms."

    AkiraBot worked by assigning the following role to OpenAI's chat API using the model gpt-4o-mini: "You are a helpful assistant that generates marketing messages." A prompt instructed the LLM to replace the variables with the site name provided at runtime. As a result, the body of each message named the recipient website by name and included a brief description of the service provided by it.

    An AI Chat prompt used by AkiraBot Credit: SentinelLabs

    "The resulting message includes a brief description of the targeted website, making the message seem curated," the researchers wrote. "The benefit of generating each message using an LLM is that the message content is unique and filtering against spam becomes more difficult compared to using a consistent message template which can trivially be filtered."

    SentinelLabs obtained log files AkiraBot left on a server to measure success and failure rates. One file showed that unique messages had been successfully delivered to more than 80,000 websites from September 2024 to January of this year. By comparison, messages targeting roughly 11,000 domains failed. OpenAI thanked the researchers and reiterated that such use of its chatbots runs afoul of its terms of service.

    Story updated to modify headline.


    The Turing Test Has A Problem - And OpenAI's GPT-4.5 Just Exposed It

    Elyse Betters Picaro / ZDNET

    Most people know that the famous Turing Test, a thought experiment conceived by computer pioneer Alan Turing, is a popular measure of progress in artificial intelligence.

    Many mistakenly assume, however, that it is proof that machines are actually thinking.

    The latest research on the Turing Test from scholars at the University of California at San Diego shows that OpenAI's latest large language model, GPT-4.5, can fool humans into thinking that the AI model is a person in text chats, even more than a human can convince another person that they are human.

    Also: How to use ChatGPT: A beginner's guide to the most popular AI chatbot

    That's a breakthrough in the ability of gen AI to produce compelling output in response to a prompt.

    University of California at San Diego Proof of AGI?

    But even the researchers recognize that beating the Turing Test doesn't necessarily mean that "artificial general intelligence," or AGI, has been achieved -- a level of computer processing equivalent to human thought.

    The AI scholar Melanie Mitchell, a professor at the Santa Fe Institute in Santa Fe, New Mexico, has written in the scholarly journal Science that the Turing Test is less a test of intelligence per se and more a test of human assumptions. Despite high scores on the test, "the ability to sound fluent in natural language, like playing chess, is not conclusive proof of general intelligence," wrote Mitchell.

    The latest convincing-sounding achievement is described by Cameron Jones and Benjamin Bergen of UC San Diego in a paper published on the arXiv pre-print server this week, titled "Large Language Models Pass the Turing Test."

    Also: OpenAI expands GPT-4.5 rollout. Here's how to access (and what it can do for you)

    The paper is the latest installment in an experiment that Jones and Bergen have been running for years with participation from UC San Diego undergrads from the department.

    As the authors note, there have been decades of work on the problem. To date, there have been "more than 800 separate claims and counter-arguments having been made" about computers passing the test.

    How The Turing Test works

    The Turing Test was classically conceived by Turing as a round of passing text messages between a human "judge" and two "witnesses," one a human and one a computer.

    The computer and human witnesses were charged with convincing the human judge that they were human by the messages each sent. The judge knows only one of the two is human, but not which is which, and has to guess.

    That three-way form is essential. It means that if the judge mistakenly deems a computer human, then the judge was also missing the clues they should have gotten from the human about humanness.

    Also: With AI models clobbering every benchmark, it's time for human evaluation

    In other words, it's a test as much about how humans perceive and believe as it is about computer functioning.

    Jones and Bergen's updated design is posted online, and you can play the game yourself, as either judge or witness.

    As in prior versions, the human students at UC San Diego were asked to engage in a text chat app. Some of the students played the judge and some the human witness in randomized five-minute chat exchanges, while GPT-4.5 played the computer witness. Each judge had simultaneous conversations with both a human and a computer witness to fit the three-way form of Turing's formulation.

    To play the computer witness, multiple large language models were used, input with the judge's challenges in a prompt. GPT-4.5 was tested with two different prompts. One prompt was given minimal information. But a second, called PERSONA, was given a lot of extra details in the prompt, including "additional instructions on what kind of persona to adopt in responding to the interrogator: specifically a young person who is introverted, knowledgeable about internet culture, and uses slang," the authors write.

    University of California at San Diego The results

    Three other AI programs were tested: OpenAI's previous model, GPT-4o; Meta Platforms's Llama 3.1 405b; and a very ancient chat bot program from the 1960s, called ELIZA. ELIZA was included because, as a more primitive program, it could act as a check on whether the large language models fooled a human simply by random chance or because they were actually better programs.

    GPT-4.5, report Jones and Bergen, had a "win rate" of 73%, meaning it fooled the human judge into declaring it human 73% of the time. By contrast, the version of GPT-4.5 with the minimalist prompt and the three other AI programs did nowhere near as well, with scores ranging from 21% to 56%, even when the other language models were given a similarly rich PERSONA prompt.

    Also: Are ChatGPT Plus or Pro worth it? Here's how they compare to the free version

    Moreover, GPT-4.5 was way above a test of GPT-4 last year by Jones and Bergen. It had a win rate of only 54%, just above random chance.

    Jones and Bergen conclude that "interrogators were not only unable to identify the real human witness, but were in fact more likely to believe this model was human than that other human participants were."

    Is the test actually a measure of intelligence?

    So, you might ask what it means that humans do a poor job of telling a computer and a person apart based on chat messages.

    The "most controversial question" about Turing's problem over the decades is whether it is actually measuring intelligence, Jones and Bergen acknowledge.

    One way of looking at it is that machines have gotten so good that they can "adapt their behaviour to different scenarios that makes them so flexible: and apparently so capable of passing as human," they observe. The PERSONA prompt, created by humans, is something to which GPT-4.5 "adapted" itself to win.

    Again, it's a genuine technical breakthrough in the AI model's capabilities.

    Also: ChatGPT Plus is free for students now - how to grab this deal before finals

    However, a big complaint with the test is that humans might simply be bad at recognizing intelligence. The authors conclude their experiment is evidence of that, at least partially.

    They note that 23% of the time, the older ELIZA program fooled human judges. As they relate, that was not because it was somehow obviously smarter. "Many participants selected ELIZA because it did not meet their expectations of an AI system (e.G. 'they were sarcastic' or 'I don't think AI would be so rude'.)," they write.

    Those guesses, they write, "suggest that interrogators' decisions incorporate complex assumptions about how humans and AI systems might be likely to behave in these contexts, beyond simply selecting the most intelligent-seeming agent."

    In fact, the human judges didn't ask very much about knowledge in their challenges, even though Turing thought that would be the main criterion. "[O]ne of the reasons most predictive of accurate verdicts" by the human judge, they write, "was that a witness was human because they lacked knowledge."

    Sociability, not intelligence

    All this means humans were picking up on things such as sociability rather than intelligence, leading Jones and Bergen to conclude that "Fundamentally, the Turing test is not a direct test of intelligence, but a test of humanlikeness."

    For Turing, intelligence may have appeared to be the biggest barrier for appearing humanlike, and hence to passing the Turing test. But as machines become more similar to us, other contrasts have fallen into sharper relief, to the point where intelligence alone is not sufficient to appear convincingly human.

    Left unsaid by the authors is that humans have become so used to typing into a computer -- to a person or to a machine -- that the Test is no longer a novel test of human-computer interaction. It's a test of online human habits.

    One implication is that the test needs to be expanded. The authors write that "intelligence is complex and multifaceted," and "no single test of intelligence could be decisive."

    Also: Gemini Pro 2.5 is a stunningly capable coding assistant - and a big threat to ChatGPT

    In fact, they suggest the test could come out very different with different designs. Experts in AI, they note, could be tested as a judge cohort. They might judge differently than lay people because they have different expectations of a machine.

    If a financial incentive were added to raise the stakes, human judges might scrutinize more closely and more thoughtfully. Those are indications that attitude and expectations play a part.

    "To the extent that the Turing test does index intelligence, it ought to be considered among other kinds of evidence," they conclude.

    That suggestion seems to square with an increasing trend in the AI research field to involve humans "in the loop," assessing and evaluating what machines do.

    Is human judgement enough?

    Left open is the question of whether human judgment will ultimately be enough. In the movie Blade Runner, the "replicant" robots in their midst have gotten so good that humans rely on a machine, "Voight-Kampff," to detect who's human and who's robot.

    As the quest goes on to reach AGI, and humans realize just how difficult it is to say what AGI is or how they would recognize it if they stumbled upon it, perhaps humans will have to rely on machines to assess machine intelligence.

    Also: 10 key reasons AI went mainstream overnight - and what happens next

    Or, at the very least, they may have to ask machines what machines "think" about humans writing prompts to try to make a machine fool other humans.

    Get the morning's top stories in your inbox each day with our Tech Today newsletter.

    Artificial Intelligence The best AI for coding in 2025 (and what not to use - including DeepSeek R1) I tested DeepSeek's R1 and V3 coding skills - and we're not all doomed (yet) How to remove Copilot from your Microsoft 365 plan How to install an LLM on MacOS (and why you should)




    Comments

    Follow It

    Popular posts from this blog

    What is Generative AI? Everything You Need to Know

    Top Generative AI Tools 2024

    60 Growing AI Companies & Startups (2025)