Patronus AI cofounders Anand Kannappan and Rebecca Qian

Patronus AI

Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.

Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.

Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.

“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”

The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.

The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.

In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.

But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers soon realized that the numbers in Microsoft’s example were off, and some numbers were entirely made up.

‘Vibe checks’

Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
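
The kind of repeat testing that non-determinism calls for can be sketched in a few lines. This is only an illustration, not Patronus AI's method: `flaky_model` is a hypothetical stand-in for a real LLM call, and its canned answers are invented for the example.

```python
import random
from collections import Counter


def flaky_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: the same prompt can yield different outputs."""
    return rng.choice(["$1.2 billion", "$1.2 billion", "$1.2 billion", "I cannot answer."])


def consistency(prompt: str, runs: int, seed: int = 0):
    """Ask the same question `runs` times; return the most common answer and its share."""
    rng = random.Random(seed)
    counts = Counter(flaky_model(prompt, rng) for _ in range(runs))
    answer, hits = counts.most_common(1)[0]
    return answer, hits / runs


if __name__ == "__main__":
    answer, share = consistency("What was FY2022 revenue?", runs=20)
    print(f"most common answer: {answer!r}, share of runs: {share:.0%}")
```

A share well below 100% on a question with one right answer is a sign that the model's output cannot be relied on without further checks.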

The founders met at Facebook parent-company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.

“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”

Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.

Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.

Here are some examples of questions in the dataset, provided by Patronus AI:

  • Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
  • Did AMD report customer concentration in FY22?
  • What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.

How the AI models did on the test

Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.

It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
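
The difference between these settings comes down to what text accompanies the question. A rough sketch follows, assuming a simple prompt template; the function name, mode labels, and template wording are illustrative and do not come from Patronus AI. The setting where models were told where documents were stored is left out because it depends on a retrieval system.

```python
def build_prompt(question: str, mode: str, evidence: str = "", filing: str = "") -> str:
    """Assemble a prompt for one of three test settings (names are assumptions)."""
    if mode == "closed_book":
        context = ""        # no source document at all
    elif mode == "oracle":
        context = evidence  # only the exact relevant passage
    elif mode == "long_context":
        context = filing    # nearly the entire filing
    else:
        raise ValueError(f"unknown mode: {mode}")
    if context:
        return f"Context:\n{context}\n\nQuestion: {question}"
    return f"Question: {question}"


if __name__ == "__main__":
    q = "Did AMD report customer concentration in FY22?"
    print(build_prompt(q, "closed_book"))
    print(build_prompt(q, "oracle", evidence="(relevant passage from the filing)"))
```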

GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.
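
To put the closed-book figures on one scale, the counts can be converted to percentages. In the sketch below, the 14 correct answers and the 88% refusal rate (132 of 150 questions) are from the results above; labeling the remaining 4 questions as incorrect is an assumption made only so the tally adds up to 150.

```python
from collections import Counter


def outcome_rates(labels):
    """Return each outcome's share of all questions, as a percentage rounded to one decimal."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# 14 correct and 132 refused (88% of 150) are reported; 4 incorrect is assumed.
labels = ["correct"] * 14 + ["refused"] * 132 + ["incorrect"] * 4
rates = outcome_rates(labels)
print(rates)
```

By this count, 14 correct answers out of 150 is a success rate of roughly 9%.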

It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.

But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.

Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.

Anthropic’s Claude 2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It answered 75% of the questions correctly, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.

After running the tests, the cofounders were surprised about how poorly the models did — even when they were pointed to where the answers were.

“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”

Even when the models performed well, though, they just weren’t good enough, Patronus AI found.

“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.

But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.

“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”

An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing them that AI is being used and its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.

Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.

Inside a Utah desert facility preparing humans for life on Mars

Hidden among the majestic canyons of the Utah desert, about 7 miles from the nearest town, is a small research facility meant to prepare humans for life on Mars.

The Mars Society, a nonprofit organization that runs the Mars Desert Research Station, or MDRS, invited CNBC to shadow one of its analog crews on a recent mission.

“MDRS is the best analog astronaut environment,” said Urban Koi, who served as health and safety officer for Crew 315. “The terrain is extremely similar to the Mars terrain and the protocols, research, science and engineering that occurs here is very similar to what we would do if we were to travel to Mars.”

SpaceX CEO and Mars advocate Elon Musk has said his company can get humans to Mars as early as 2029.

The 5-person Crew 315 spent two weeks living at the research station following the same procedures that they would on Mars.

David Laude, who served as the crew’s commander, described a typical day.

“So we all gather around by 7 a.m. around a common table in the upper deck and we have breakfast,” he said. “Around 8:00 we have our first meeting of the day where we plan out the day. And then in the morning, we usually have an EVA of two or three people and usually another one in the afternoon.”

An EVA refers to extravehicular activity. In NASA speak, EVAs refer to spacewalks, when astronauts leave the pressurized space station and must wear spacesuits to survive in space.

“I think the most challenging thing about these analog missions is just getting into a rhythm. … Although here the risk is lower, on Mars performing those daily tasks are what keeps us alive,” said Michael Andrews, the engineer for Crew 315.

Watch the video to find out more.

Apple scores big victory with ‘F1,’ but AI is still a major problem in Cupertino

Formula One F1 – United States Grand Prix – Circuit of the Americas, Austin, Texas, U.S. – October 23, 2022. Tim Cook waves the chequered flag to the race winner, Red Bull’s Max Verstappen.

Mike Segar | Reuters

Apple had two major launches last month. They couldn’t have been more different.

First, Apple revealed some of the artificial intelligence advancements it had been working on in the past year when it released developer versions of its operating systems to muted applause at its annual developer’s conference, WWDC. Then, at the end of the month, Apple hit the red carpet as its first true blockbuster movie, “F1,” debuted to over $155 million — and glowing reviews — in its first weekend.

While “F1” was a victory lap for Apple, highlighting the strength of its long-term outlook, the growth of its services business and its ability to tap into culture, Wall Street’s reaction to the company’s AI announcements at WWDC suggests there’s some trouble under the hood.

“F1” showed Apple at its best — in particular, its ability to invest in new, long-term projects. When Apple TV+ launched in 2019, it had only a handful of original shows and one movie, a film festival darling called “Hala” that didn’t even share its box office revenue.

Despite Apple TV+ being written off as a costly side-project, Apple stuck with its plan over the years, expanding its staff and operation in Culver City, California. That allowed the company to build up Hollywood connections, especially for TV shows, and build an entertainment track record. Now, an Apple Original can lead the box office on a summer weekend, the prime season for blockbuster films.

The success of “F1” also highlights Apple’s significant marketing machine and ability to get big-name talent to appear with its leadership. Apple pulled out all the stops to market the movie, including using its Wallet app to send a push notification with a discount for tickets to the film. To promote “F1,” Apple CEO Tim Cook appeared with movie star Brad Pitt at an Apple store in New York and posted a video with actual F1 racer Lewis Hamilton, who was one of the film’s producers.

(L-R) Brad Pitt, Lewis Hamilton, Tim Cook, and Damson Idris attend the World Premiere of “F1: The Movie” in Times Square on June 16, 2025 in New York City.

Jamie Mccarthy | Getty Images Entertainment | Getty Images

Although Apple services chief Eddy Cue said in a recent interview that Apple needs its film business to be profitable to “continue to do great things,” “F1” isn’t just about the bottom line for the company.

Apple’s Hollywood productions are perhaps the most prominent face of the company’s services business, a profit engine that has been an investor favorite since the iPhone maker started highlighting the division in 2016.

Films will only ever be a small fraction of the services unit, which also includes payments, iCloud subscriptions, magazine bundles, Apple Music, game bundles, warranties, fees related to digital payments and ad sales. Plus, even the biggest box office smashes would be small on Apple’s scale — the company does over $1 billion in sales on average every day.

But movies are the only services component that can get celebrities like Pitt or George Clooney to appear next to an Apple logo — and the success of “F1” means that Apple could do more big popcorn films in the future.

“Nothing breeds success or inspires future investment like a current success,” said Comscore senior media analyst Paul Dergarabedian.

But if “F1” is a sign that Apple’s services business is in full throttle, the company’s AI struggles are a “check engine” light that won’t turn off.

Replacing Siri’s engine

At WWDC last month, Wall Street was eager to hear about the company’s plans for Apple Intelligence, its suite of AI features that it first revealed in 2024. Apple Intelligence, which is a key tenet of the company’s hardware products, had a rollout marred by delays and underwhelming features.

Apple spent most of WWDC going over smaller machine learning features, but did not reveal what investors and consumers increasingly want: A sophisticated Siri that can converse fluidly and get stuff done, like making a restaurant reservation. In the age of OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Gemini, the expectation of AI assistants among consumers is growing beyond “Siri, how’s the weather?”

The company had previewed a significantly improved Siri in the summer of 2024, but earlier this year, those features were delayed to sometime in 2026. At WWDC, Apple didn’t offer any updates about the improved Siri beyond that the company was “continuing its work to deliver” the features in the “coming year.” Some observers reduced their expectations for Apple’s AI after the conference.

“Current expectations for Apple Intelligence to kickstart a super upgrade cycle are too high, in our view,” wrote Jefferies analysts this week.

Siri should be an example of how Apple’s ability to improve products and projects over the long-term makes it tough to compete with.

It beat nearly every other voice assistant to market when it first debuted on iPhones in 2011. Fourteen years later, Siri remains essentially the same one-off, rigid, question-and-answer system that struggles with open-ended questions and dates, even after the invention in recent years of sophisticated voice bots based on generative AI technology that can hold a conversation.

Apple’s strongest rivals, including Android parent Google, have done way more to integrate sophisticated AI assistants into their devices than Apple has. And Google doesn’t have the same reflex against collecting data and cloud processing as privacy-obsessed Apple.

Some analysts have said they believe Apple has a few years before the company’s lack of competitive AI features will start to show up in device sales, given the company’s large installed base and high customer loyalty. But Apple can’t get lapped before it re-enters the race, and its former design guru Jony Ive is now working on new hardware with OpenAI, ramping up the pressure in Cupertino.

“The three-year problem, which is within an investment time frame, is that Android is racing ahead,” Needham senior internet analyst Laura Martin said on CNBC this week.

Apple’s services success with projects like “F1” is an example of what the company can do when it sets clear goals in public and then executes them over extended time-frames.

Its AI strategy could use a similar long-term plan, as customers and investors wonder when Apple will fully embrace the technology that has captivated Silicon Valley.

Wall Street’s anxiety over Apple’s AI struggles was evident this week after Bloomberg reported that Apple was considering replacing Siri’s engine with Anthropic or OpenAI’s technology, as opposed to its own foundation models.

The move, if it were to happen, would contradict one of Apple’s most important strategies in the Cook era: Apple wants to own its core technologies, like the touchscreen, processor, modem and maps software, not buy them from suppliers.

Using external technology would be an admission that Apple Foundation Models aren’t good enough yet for what the company wants to do with Siri.

“They’ve fallen farther and farther behind, and they need to supercharge their generative AI efforts,” Martin said. “They can’t do that internally.”

Apple might even pay billions for the use of Anthropic’s AI software, according to the Bloomberg report. If Apple were to pay for AI, it would be a reversal from current services deals, like the search deal with Alphabet where the Cupertino company gets paid $20 billion per year to push iPhone traffic to Google Search.

The company didn’t confirm the report and declined comment, but Wall Street welcomed the report and Apple shares rose.

In the world of AI in Silicon Valley, signing bonuses for the kinds of engineers that can develop new models can range up to $100 million, according to OpenAI CEO Sam Altman.

“I can’t see Apple doing that,” Martin said.

Earlier this week, Meta CEO Mark Zuckerberg sent a memo bragging about hiring 11 AI experts from companies such as OpenAI, Anthropic, and Google’s DeepMind. That came after Zuckerberg hired Scale AI CEO Alexandr Wang to lead a new AI division as part of a $14.3 billion deal.

Meta’s not the only company to spend hundreds of millions on AI celebrities to get them in the building. Google spent big to hire away the founders of Character.AI, Microsoft got its AI leader by striking a deal with Inflection and Amazon hired the executive team of Adept to bulk up its AI roster.

Apple, on the other hand, hasn’t announced any big AI hires in recent years. While Cook rubs shoulders with Pitt, the actual race may be passing Apple by.

Musk backs Sen. Paul’s criticism of Trump’s megabill in first comment since it passed

Tesla CEO Elon Musk speaks alongside U.S. President Donald Trump to reporters in the Oval Office of the White House on May 30, 2025 in Washington, DC.

Kevin Dietsch | Getty Images

Tesla CEO Elon Musk, who spent weeks attacking President Donald Trump’s signature spending bill, on Friday made his first comments since the legislation passed.

Musk backed a post on X by Sen. Rand Paul, R-Ky., who said the bill’s budget “explodes the deficit” and continues a pattern of “short-term politicking over long-term sustainability.”

The House of Representatives narrowly passed the One Big Beautiful Bill Act on Thursday, sending it to Trump to sign into law.

Paul and Musk have been vocal opponents of Trump’s tax and spending bill, and repeatedly called out the potential for the spending package to increase the national debt.

On Monday, Musk called it the “DEBT SLAVERY bill.”

The independent Congressional Budget Office has said the bill could add $3.4 trillion to the $36.2 trillion of U.S. debt over the next decade. The White House has labeled the agency “partisan” and repeatedly disputed the CBO’s estimates.

The bill includes trillions of dollars in tax cuts, increased spending for immigration enforcement and large cuts to funding for Medicaid and other programs.

It also cuts tax credits and support for solar and wind energy and electric vehicles, a particularly sore spot for Musk, who has several companies that benefit from the programs.

“I took away his EV Mandate that forced everyone to buy Electric Cars that nobody else wanted (that he knew for months I was going to do!), and he just went CRAZY!” Trump wrote in a social media post in early June as the pair traded insults and threats.

Shares of Tesla plummeted as the feud intensified, with the company losing $152 billion in market cap on June 5, which put its value below $1 trillion. The stock has largely rebounded since, but is still below where it was trading before the ruckus with Trump.

Tesla one-month stock chart.

— CNBC’s Kevin Breuninger and Erin Doherty contributed to this article.
