Patronus AI cofounders Anand Kannappan and Rebecca Qian

Patronus AI

Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.

Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.

Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.

“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much, much higher for it to really work in an automated and production-ready way.”

The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.

The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.

In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.

But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers quickly realized that the numbers in Microsoft’s example were off, and some were entirely made up.

‘Vibe checks’

Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
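To make that non-determinism concrete, here is a minimal sketch of why one-off spot checks are not enough: the same prompt, sampled repeatedly, can come back with different answers. It assumes the OpenAI Python client; the model name, question, and expected answer are placeholders and do not come from Patronus AI’s actual harness.

```python
# A minimal sketch of why spot checks don't scale: with sampling enabled, the same
# prompt can return different answers on different runs, so evaluation needs to be
# automated and repeated. Assumes the OpenAI Python client (openai >= 1.0); the
# model name, question, and expected answer are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "According to the filing, what was total revenue in FY2022?"  # placeholder
EXPECTED = "placeholder reference answer"

answers = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1.0,      # sampling is what makes outputs non-deterministic
    )
    answers.append(resp.choices[0].message.content.strip())

# The five outputs may disagree with each other, not just with the reference answer,
# which is why a fixed benchmark and an automated scorer beat checking a few outputs by hand.
print(answers)
print("runs matching the reference:", sum(EXPECTED.lower() in a.lower() for a in answers))
```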

The founders met at Facebook parent company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.

“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”

Patronus AI wrote a set of more than 10,000 questions and answers drawn from the SEC filings of major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, as well as exactly where in a given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.

Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.
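As an illustration of what such a benchmark item might look like, here is a hypothetical sketch of a FinanceBench-style record with a naive scorer. The field names, sample values, and scoring rule are assumptions made for illustration, not the dataset’s actual schema or contents.

```python
# A hypothetical sketch of a FinanceBench-style record plus a naive scorer.
# Field names, sample values, and the scoring rule are assumptions made for
# illustration; they are not the dataset's actual schema or contents.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str       # a question about a specific SEC filing
    answer: str         # the reference answer
    doc_name: str       # which filing the answer comes from
    evidence_text: str  # the exact passage in the filing that supports the answer

def contains_answer(prediction: str, item: BenchmarkItem) -> bool:
    """Crude scoring: does the model's output contain the reference answer?"""
    return item.answer.lower() in prediction.lower()

item = BenchmarkItem(
    question="What was the company's FY2022 total revenue?",   # placeholder question
    answer="$10.0 billion",                                     # placeholder answer
    doc_name="EXAMPLECO_2022_10-K",                             # placeholder filing ID
    evidence_text="Total revenue for fiscal 2022 was $10.0 billion.",
)

print(contains_answer("Total revenue was $10.0 billion in FY2022.", item))  # True
```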

Here are some examples of questions in the dataset, provided by Patronus AI:

  • Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
  • Did AMD report customer concentration in FY22?
  • What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
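The last example above calls for light arithmetic rather than pure extraction. The calculation itself is simple: cost of goods sold divided by revenue. The sketch below works it through with placeholder figures, not Coca-Cola’s actual FY2021 numbers.

```python
# COGS % margin = cost of goods sold / revenue, both taken from the income statement.
# The figures below are placeholders for illustration, not Coca-Cola's reported FY2021 numbers.
def cogs_margin_pct(cost_of_goods_sold: float, revenue: float) -> float:
    """Return cost of goods sold as a percentage of revenue."""
    return cost_of_goods_sold / revenue * 100

print(f"{cogs_margin_pct(15_000, 38_000):.1f}%")  # 39.5% with these placeholder figures
```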

How the AI models did on the test

Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.

It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
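A rough sketch of how those setups might differ at the prompt level is below. The template wording is an assumption made for illustration, not Patronus AI’s actual test harness.

```python
# Rough prompt templates for the configurations described above: closed book,
# "Oracle" (the exact evidence supplied), and long context (nearly the whole filing supplied).
# The wording is an assumption made for illustration, not Patronus AI's actual harness.

def closed_book_prompt(question: str) -> str:
    # No source document at all: the model must rely on what it has memorized.
    return f"Answer the following question about an SEC filing.\n\nQuestion: {question}"

def oracle_prompt(question: str, evidence_text: str) -> str:
    # A human (or a retrieval step) has already found the exact relevant passage.
    return (
        "Use only the excerpt below to answer the question.\n\n"
        f"Excerpt:\n{evidence_text}\n\nQuestion: {question}"
    )

def long_context_prompt(question: str, filing_text: str) -> str:
    # Nearly the entire filing is packed into the prompt; no human pre-selection.
    return (
        "Use the filing below to answer the question.\n\n"
        f"Filing:\n{filing_text}\n\nQuestion: {question}"
    )
```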

GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.

It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.

But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope language models can address.

Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.

Anthropic’s Claude2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.

After running the tests, the cofounders were surprised by how poorly the models did — even when they were pointed to where the answers were.

“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”

Even when the models performed well, though, they just weren’t good enough, Patronus AI found.

“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.

But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.

“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”

An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer telling users that AI is being used and explaining its limitations. OpenAI’s usage policies also say that its models are not fine-tuned to provide financial advice.

Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.

Trump advisor Navarro rips Apple’s Tim Cook for not moving production out of China fast enough

White House trade advisor Peter Navarro chastised Apple CEO Tim Cook on Monday over the company’s response to pressure from the Trump administration to make more of its products outside of China.

“Going back to the first Trump term, Tim Cook has continually asked for more time in order to move his factories out of China,” Navarro said in an interview on CNBC’s “Squawk on the Street.” “I mean it’s the longest-running soap opera in Silicon Valley.”

CNBC has reached out to Apple for comment on Navarro’s criticism.

President Donald Trump has in recent months ramped up demands for Apple to move production of its iconic iPhone to the U.S. from overseas. Apple’s flagship phone is produced primarily in China, but the company has increasingly boosted production in India, partly to avoid the higher cost of Trump’s tariffs.

Trump in May warned Apple would have to pay a tariff of 25% or more for iPhones made outside the U.S. In separate remarks, Trump said he told Cook, “I don’t want you building in India.”

Analysts and supply chain experts have argued it would be impossible for Apple to completely move iPhone production to the U.S. By some estimates, a U.S.-made iPhone could cost as much as $3,500.

Navarro said Cook isn’t shifting production out of China quickly enough.

“With all these new advanced manufacturing techniques and the way things are moving with AI and things like that, it’s inconceivable to me that Tim Cook could not produce his iPhones elsewhere around the world and in this country,” Navarro said.

Apple currently makes very few products in the U.S. During Trump’s first term, Apple extended its commitment to assemble the $3,000 Mac Pro in Texas.

In February, Apple said it would spend $500 billion within the U.S., including on assembling some AI servers.

CoreWeave to acquire Core Scientific in $9 billion all-stock deal

CoreWeave founders Brian Venturo, at left in sweatshirt, and Mike Intrator slap five after ringing the opening bell at Nasdaq headquarters in New York on March 28, 2025.

Michael M. Santiago | Getty Images News | Getty Images

Artificial intelligence hyperscaler CoreWeave said Monday it will acquire Core Scientific, a leading data center infrastructure provider, in an all-stock deal valued at approximately $9 billion.

CoreWeave stock fell about 4% on Monday, while Core Scientific stock plummeted about 20%. Shares of both companies rallied at the end of June after the Wall Street Journal reported that talks were underway for an acquisition.

The deal strengthens CoreWeave’s position in the AI arms race by bringing critical infrastructure in-house.

CoreWeave CEO Michael Intrator said the move will eliminate $10 billion in future lease obligations and significantly enhance operating efficiency.

The transaction is expected to close in the fourth quarter of 2025, pending regulatory and shareholder approval.

The deal expands CoreWeave’s access to power and real estate, giving it ownership of 1.3 gigawatts of gross capacity across Core Scientific’s U.S. data center footprint, with another gigawatt available for future growth.

Core Scientific has increasingly focused on high-performance compute workloads since emerging from bankruptcy and relisting on the Nasdaq in 2024.

Core Scientific shareholders will receive 0.1235 CoreWeave shares for each share they hold — implying a $20.40 per-share valuation and a 66% premium to Core Scientific’s closing stock price before deal talks were reported.
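A quick back-of-the-envelope check on those terms, using only the figures reported above (the derived prices are approximations):

```python
# Back-of-the-envelope check on the deal terms reported above.
exchange_ratio = 0.1235   # CoreWeave shares per Core Scientific share
implied_value = 20.40     # implied value per Core Scientific share, in dollars

# The implied value is the exchange ratio times CoreWeave's reference share price,
# so that reference price works out to roughly:
print(f"~${implied_value / exchange_ratio:.2f} per CoreWeave share")   # about $165.18

# A 66% premium implies Core Scientific traded around this level before deal talks were reported:
print(f"~${implied_value / 1.66:.2f} per Core Scientific share")       # about $12.29
```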

After closing, Core Scientific shareholders will own less than 10% of the combined company.

Apple appeals 500 million euro EU fine over App Store policies

Two young men stand inside a shopping mall in front of a large illuminated Apple logo seen through a window in Chongqing, China, on June 4, 2025.

Cheng Xin | Getty Images

Apple on Monday appealed what it called an “unprecedented” 500 million euro ($586 million) fine issued by the European Union for violating the bloc’s Digital Markets Act.

“As our appeal will show, the EC [European Commission] is mandating how we run our store and forcing business terms which are confusing for developers and bad for users,” the company said in a statement. “We implemented this to avoid punitive daily fines and will share the facts with the Court.”

Apple recently made changes to its App Store’s European policies that the company said would bring it into compliance with the DMA and avoid the fines.

The Commission, which is the executive body of the EU, announced its fine in April, saying that Apple “breached its anti-steering obligation” under the DMA with restrictions on the App Store.

“Due to a number of restrictions imposed by Apple, app developers cannot fully benefit from the advantages of alternative distribution channels outside the App Store,” the commission wrote. “Similarly, consumers cannot fully benefit from alternative and cheaper offers as Apple prevents app developers from directly informing consumers of such offers.”

Under the DMA, tech giants like Apple and Google are required to allow businesses to inform end-users of offers outside their platform — including those at different prices or with different conditions.

Companies like Epic Games and Spotify have complained about restrictions within the App Store that make it harder for them to communicate alternative payment methods to iOS users.

Apple typically takes a 15%-30% cut on in-app purchases.
