If the tech industry’s top AI models had superlatives, Microsoft-backed OpenAI’s GPT-4 would be best at math, Meta’s Llama 2 would be most middle of the road, Anthropic’s Claude 2 would be best at knowing its limits and Cohere AI would receive the title of most hallucinations and most confident wrong answers.
That’s all according to a Thursday report from researchers at Arthur AI, a machine learning monitoring platform.
The research comes at a time when misinformation stemming from artificial intelligence systems is more hotly debated than ever, amid a boom in generative AI ahead of the 2024 U.S. presidential election.
It’s the first report “to take a comprehensive look at rates of hallucination, rather than just sort of … provide a single number that talks about where they are on an LLM leaderboard,” Adam Wenchel, co-founder and CEO of Arthur, told CNBC.
AI hallucinations occur when large language models, or LLMs, fabricate information entirely, behaving as if they are spouting facts. One example: In June, news broke that ChatGPT cited “bogus” cases in a New York federal court filing, and the New York attorneys involved may face sanctions.
In one experiment, the Arthur AI researchers tested the AI models in categories such as combinatorial mathematics, U.S. presidents and Moroccan political leaders, asking questions “designed to contain a key ingredient that gets LLMs to blunder: they demand multiple steps of reasoning about information,” the researchers wrote.
Overall, OpenAI’s GPT-4 performed the best of all models tested, and researchers found it hallucinated less than its prior version, GPT-3.5. On math questions, for example, it hallucinated between 33% and 50% less, depending on the category.
Meta’s Llama 2, on the other hand, hallucinates more overall than GPT-4 and Anthropic’s Claude 2, researchers found.
In the math category, GPT-4 came in first place, followed closely by Claude 2, but in U.S. presidents, Claude 2 took the first place spot for accuracy, bumping GPT-4 to second place. When asked about Moroccan politics, GPT-4 came in first again, and Claude 2 and Llama 2 almost entirely chose not to answer.
In a second experiment, the researchers tested how much the AI models would hedge their answers with warning phrases to avoid risk (think: “As an AI model, I cannot provide opinions”).
When it comes to hedging, GPT-4 had a 50% relative increase compared to GPT-3.5, which “quantifies anecdotal evidence from users that GPT-4 is more frustrating to use,” the researchers wrote. Cohere’s AI model, on the other hand, did not hedge at all in any of its responses, according to the report. Claude 2 was most reliable in terms of “self-awareness,” the research showed, meaning accurately gauging what it does and doesn’t know, and answering only questions it had training data to support.
The most important takeaway for users and businesses, Wenchel said, was to “test on your exact workload,” later adding, “It’s important to understand how it performs for what you’re trying to accomplish.”
“A lot of the benchmarks are just looking at some measure of the LLM by itself, but that’s not actually the way it’s getting used in the real world,” Wenchel said. “Making sure you really understand the way the LLM performs for the way it’s actually getting used is the key.”
LONDON — ElevenLabs, a London-based startup that specializes in generating synthetic voices through artificial intelligence, has revealed plans to be IPO-ready within five years.
The company told CNBC it is targeting major global expansion as it prepares for an initial public offering.
“We expect to build more hubs in Europe, Asia and South America, and just keep scaling,” Mati Staniszewski, ElevenLabs’ CEO and co-founder, told CNBC in an interview at the firm’s London office.
He identified Paris, Singapore, Brazil and Mexico as potential new locations. London is currently ElevenLabs’ biggest office, followed by New York, Warsaw, San Francisco, Japan, India and Bangalore.
Staniszewski said the eventual aim is to get the company ready for an IPO in the next five years.
“From a commercial standpoint, we would like to be ready for an IPO in that time,” he said. “If the market is right, we would like to create a public company … that’s going to be here for the next generation.”
Undecided on location
Founded in 2022 by Staniszewski and Piotr Dąbkowski, ElevenLabs is an AI voice generation startup that competes with the likes of Speechmatics and Hume AI.
The company divides its business into three main camps: consumer-facing voice assistants, integrations with corporates such as Cisco, and tailor-made applications for specific industries like health care.
Staniszewski said the firm hasn’t yet decided where it could list, but that this decision will largely rest on where most of its users are located at the time.
“If the U.K. is able to start accelerating,” ElevenLabs will consider London as a listing destination, Staniszewski said.
The city has faced criticisms from entrepreneurs and venture capitalists that its stock market is unfavorable toward high-growth tech firms.
Meanwhile, British money transfer firm Wise last month said it plans to move its primary listing location to the U.S.
Fundraising plans
ElevenLabs was valued at $3.3 billion following a recent $180 million funding round. The company is backed by the likes of Andreessen Horowitz, Sequoia Capital and ICONIQ Growth, as well as corporate names like Salesforce and Deutsche Telekom.
Staniszewski said his startup was open to raising more money from VCs, but it would depend on whether it sees a valid business need, like scaling further in other markets. “The way we try to raise is very much like, if there’s a bet we want to take, to accelerate that bet [we will] take the money,” he said.
The U.S. government has rescinded its export restrictions on chip design software to China, U.S.-based Synopsys announced Thursday.
“Synopsys is working to restore access to the recently restricted products in China,” it said in a statement.
The U.S. had reportedly told several chip design software companies, including Synopsys, in May that they were required to obtain licenses before exporting goods, such as software and chemicals for semiconductors, to China.
The U.S. Commerce Department did not immediately respond to a request for comment from CNBC.
The news comes after China signaled last week that it is making progress on a trade truce with the U.S. and confirmed conditional agreements to resume some exchanges of rare earths and advanced technology.
Datadog shares were up 10% in extended trading on Wednesday after S&P Global said the monitoring software provider will replace Juniper Networks in the S&P 500 U.S. stock index.
S&P Global is making the change effective before the beginning of trading on July 9, according to a statement.
Computer server maker Hewlett Packard Enterprise, also a constituent of the index, said earlier on Wednesday that it had completed its acquisition of Juniper, which makes data center networking hardware. HPE disclosed in a filing that it paid $13.4 billion to Juniper shareholders.
Over the weekend, the two companies reached a settlement with the U.S. Justice Department, which had sued in opposition to the deal. As part of the settlement, HPE agreed to divest its global Instant On campus and branch business.
While tech already makes up an outsized portion of the S&P 500, the index has been continuously lifting its exposure as the industry expands into more areas of society.
Stocks often rally when they’re added to a major index, as fund managers need to rebalance their portfolios to reflect the changes.
New York-based Datadog went public in 2019. The company generated $24.6 million in net income on $761.6 million in revenue in the first quarter of 2025, according to a statement. Competitors include Cisco, which bought Splunk last year, as well as Elastic and cloud infrastructure providers such as Amazon and Microsoft.
Datadog has underperformed the broader tech sector so far this year. The stock was down 5.5% as of Wednesday’s close, while the Nasdaq was up 5.6%. Still, with a market cap of $46.6 billion, Datadog’s valuation is significantly higher than the median for that index.