Patronus AI cofounders Anand Kannappan and Rebecca Qian

Patronus AI

Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.

Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.

Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.

“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”

The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.

The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.

In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.

But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers soon realized that the numbers in Microsoft’s example were off, and some numbers were entirely made up.

‘Vibe checks’

Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.

The founders met at Facebook parent-company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.

“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”

Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.

Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.

Here are some examples of questions in the dataset, provided by Patronus AI:

  • Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
  • Did AMD report customer concentration in FY22?
  • What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.

How the AI models did on the test

Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.

It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.

GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.
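The closed-book figures can be cross-checked with simple arithmetic. The sketch below assumes that the 88% refusal share and the 14 correct answers both refer to the same 150-question subset, and that any question neither refused nor answered correctly was answered incorrectly. The variable names are illustrative and do not come from Patronus AI.

```python
# Figures reported for GPT-4-Turbo in the "closed book" setting.
TOTAL_QUESTIONS = 150
REFUSED_SHARE = 0.88   # share of questions the model failed to answer
CORRECT_COUNT = 14     # number of correct answers

# Convert the refusal share into a whole number of questions.
refused_count = round(TOTAL_QUESTIONS * REFUSED_SHARE)

# Assumption: whatever is left over was answered, but incorrectly.
incorrect_count = TOTAL_QUESTIONS - refused_count - CORRECT_COUNT

# Express the correct answers as a percentage, to one decimal place.
correct_pct = round(100 * CORRECT_COUNT / TOTAL_QUESTIONS, 1)

print(f"refused: {refused_count}")
print(f"correct: {CORRECT_COUNT} ({correct_pct}%)")
print(f"incorrect (inferred): {incorrect_count}")
```

Under those assumptions, the model declined to answer 132 questions, answered about 9.3% correctly, and gave a wrong answer on the remaining four.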

It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.

But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.

Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.

Anthropic’s Claude 2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
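Rates like these come from sorting each graded response into one of three outcomes and dividing by the number of questions asked. A minimal sketch of that tally follows, assuming hypothetical outcome labels ("correct", "incorrect", "refused") and a made-up five-item sample; it is not Patronus AI's actual grading code or data.

```python
from collections import Counter

# Hypothetical outcome labels; the real grading scheme may differ.
OUTCOMES = ("correct", "incorrect", "refused")


def outcome_rates(graded):
    """Return the share of responses falling into each outcome."""
    if not graded:
        raise ValueError("need at least one graded response")
    counts = Counter(graded)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    # Counter returns 0 for labels that never occurred.
    return {label: counts[label] / len(graded) for label in OUTCOMES}


# Illustrative sample only, not real benchmark data.
sample = ["correct", "correct", "correct", "incorrect", "refused"]
rates = outcome_rates(sample)
for label, share in rates.items():
    print(f"{label}: {share:.0%}")
```

Because the three shares are each rounded for reporting, published figures may not sum to exactly 100%.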

After running the tests, the cofounders were surprised about how poorly the models did — even when they were pointed to where the answers were.

“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”

Even when the models performed well, though, they just weren’t good enough, Patronus AI found.

“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.

But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.

“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”

An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing them that AI is being used and its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.

Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.

How quantum could supercharge Google’s AI ambitions

Inside a secretive set of buildings in Santa Barbara, California, scientists at Alphabet are working on one of the company’s most ambitious bets yet. They’re attempting to develop the world’s most advanced quantum computers.

“In the future, quantum and AI, they could really complement each other back and forth,” said Julian Kelly, director of hardware at Google Quantum AI.

Google has been viewed by many as late to the generative AI boom, because OpenAI broke into the mainstream first with ChatGPT in late 2022.

Late last year, Google made clear that it wouldn’t be caught on the back foot again. The company unveiled a breakthrough quantum computing chip called Willow, which it says can solve a benchmark problem unimaginably faster than what’s possible with a classical computer, and demonstrated that adding more quantum bits to the chip reduced errors exponentially.

“That’s a milestone for the field,” said John Preskill, director of the Caltech Institute for Quantum Information and Matter. “We’ve been wanting to see that for quite a while.”

Willow may now give Google a chance to take the lead in the next technological era. It also could be a way to turn research into a commercial opportunity, especially as AI hits a data wall. Leading AI models are running out of high-quality data to train on after already scraping much of the data on the internet.

“One of the potential applications that you can think of for a quantum computer is generating new and novel data,” said Kelly. 

He uses the example of AlphaFold, an AI model developed by Google DeepMind that helps scientists study protein structures. Its creators won the 2024 Nobel Prize in Chemistry. 

“[AlphaFold] trains on data that’s informed by quantum mechanics, but that’s actually not that common,” said Kelly. “So a thing that a quantum computer could do is generate data that AI could then be trained on in order to give it a little more information about how quantum mechanics works.” 

Kelly has said that he believes Google is only about five years away from a breakout, practical application that can only be solved on a quantum computer. But for Google to win the next big platform shift, it would have to turn a breakthrough into a business. 

Nintendo Switch 2 retail preorder to begin April 24 following tariff delays

An attendee wearing a Super Mario costume uses a Nintendo Switch 2 game console while playing a video game during the Nintendo Switch 2 Experience at the ExCeL London international exhibition and convention centre in London, Britain, April 11, 2025. 

Isabel Infantes | Reuters

Nintendo on Friday announced that retail preorder for its Nintendo Switch 2 gaming system will begin on April 24 starting at $449.99.

Preorders for the hotly anticipated console were initially slated for April 9, but Nintendo delayed the date to assess the impact of the far-reaching, aggressive “reciprocal” tariffs that President Donald Trump announced earlier this month.

Most electronics companies, including Nintendo, manufacture their products in Asia. Nintendo’s Switch 1 consoles were made in China and Vietnam, Reuters reported in 2019. Trump has imposed a 145% tariff rate on China and a 10% rate on Vietnam. The latter is down from 46%, after he instituted a 90-day pause to allow for negotiations.

Nintendo said Friday that the Switch 2 will cost $449.99 in the U.S., which is the same price the company first announced on April 2.

“We apologize for the retail pre-order delay, and hope this reduces some of the uncertainty our consumers may be experiencing,” Nintendo said in a statement. “We thank our customers for their patience, and we share their excitement to experience Nintendo Switch 2 starting June 5, 2025.”

The Nintendo Switch 2 and “Mario Kart World” bundle will cost $499.99, the digital version of “Mario Kart World” will cost $79.99 and the digital version of “Donkey Kong Bananza” will cost $69.99, Nintendo said. All of those prices remain unchanged from the company’s initial announcement.

However, accessories for the Nintendo Switch 2 will “experience price adjustments,” the company said, and other future changes in costs are possible for “any Nintendo product.”

It will cost gamers $10 more to buy the dock set, $1 more to buy the controller strap and $5 more to buy most other accessories, for instance.

Etsy touts ‘shopping domestically’ as Trump tariffs threaten price increases for imports

An employee walks past a quilt displaying Etsy Inc. signage at the company’s headquarters in Brooklyn.

Victor J. Blue/Bloomberg via Getty Images

Etsy is trying to make it easier for shoppers to purchase products from local merchants and avoid the extra cost of imports as President Donald Trump’s sweeping tariffs raise concerns about soaring prices.

In a post to Etsy’s website on Thursday, CEO Josh Silverman said the company is “surfacing new ways for buyers to discover businesses in their countries” via shopping pages and by featuring local sellers on its website and app.

“While we continue to nurture and enable cross-border trade on Etsy, we understand that people are increasingly interested in shopping domestically,” Silverman said.

Etsy operates an online marketplace that connects buyers and sellers with mostly artisanal and handcrafted goods. The site, which had 5.6 million active sellers as of the end of December, competes with e-commerce juggernaut Amazon, as well as newer entrants that have ties to China like Temu, Shein and TikTok Shop.

By highlighting local sellers, Etsy could relieve some shoppers from having to pay higher prices induced by President Trump’s widespread tariffs on trade partners. Trump has imposed tariffs on most foreign countries, with China facing a rate of 145%, and other nations facing 10% rates after he instituted a 90-day pause to allow for negotiations. Trump also signed an executive order that will end the de minimis provision, a loophole for low-value shipments often used by online businesses, on May 2.

Temu and Shein have already announced they plan to raise prices late next week in response to the tariffs. Sellers on Amazon’s third-party marketplace, many of whom source their products from China, have said they’re considering raising prices.

Silverman said Etsy has provided guidance for its sellers to help them “run their businesses with as little disruption as possible” in the wake of tariffs and changes to the de minimis exemption.

Before Trump’s “Liberation Day” tariffs took effect, Silverman said on the company’s fourth-quarter earnings call in late February that he expects Etsy to benefit from the tariffs and de minimis restrictions because it “has much less dependence on products coming in from China.”

“We’re doing whatever work we can do to anticipate and prepare for come what may,” Silverman said at the time. “In general, though, I think Etsy will be more resilient than many of our competitors in these situations.”

Still, American shoppers may face higher prices on Etsy as U.S. businesses that source their products or components from China pass some of those costs on to consumers.

Etsy shares are down 17% this year, slightly more than the Nasdaq.
