Patronus AI cofounders Anand Kannappan and Rebecca Qian

Patronus AI

Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.

Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.

Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.

“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”

The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.

The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.

In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.

But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers soon realized that the numbers in Microsoft’s example were off, and some numbers were entirely made up.

‘Vibe checks’

Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
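One way to picture the extra testing this implies is a repeat-and-compare check: send the same prompt several times and measure how often the answers agree. The sketch below is illustrative only. `make_fake_model` is a hypothetical stand-in that samples from placeholder answers; it is not a real LLM client and not a method the article attributes to Patronus AI.

```python
import random
from collections import Counter
from typing import Callable


def agreement_rate(model: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Fraction of runs that returned the most common answer (1.0 = fully consistent)."""
    answers = [model(prompt) for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def make_fake_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: samples one of several placeholder answers."""
    rng = random.Random(seed)
    placeholder_answers = ["answer A", "answer A", "answer B", "refused"]
    return lambda prompt: rng.choice(placeholder_answers)


def deterministic_model(prompt: str) -> str:
    """A model that always answers the same way, for comparison."""
    return "answer A"


if __name__ == "__main__":
    question = "placeholder question"
    print("deterministic:", agreement_rate(deterministic_model, question))
    print("sampled:", agreement_rate(make_fake_model(seed=0), question, runs=20))
```

A rate well below 1.0 on a factual question would suggest the output cannot be trusted from a single run.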

The founders met at Facebook parent company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.

“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”

Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.

Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.

Here are some examples of questions in the dataset, provided by Patronus AI:

  • Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
  • Did AMD report customer concentration in FY22?
  • What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
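The last question is an example of the “light math” involved. A minimal sketch follows, assuming “COGS % margin” means cost of goods sold as a percentage of revenue; that reading is an assumption, and the figures used are made up for illustration, not Coca-Cola’s reported numbers.

```python
def cogs_margin_pct(cogs: float, revenue: float) -> float:
    """Cost of goods sold as a percentage of revenue (assumed definition)."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return 100.0 * cogs / revenue


if __name__ == "__main__":
    # Hypothetical income-statement line items, in the same currency unit.
    print(cogs_margin_pct(cogs=400.0, revenue=1000.0))
```

A model answering this kind of question has to locate both line items in the filing and then do the division, so it can fail at either step.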

How the AI models did on the test

Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.

It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
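The difference between these configurations comes down to what accompanies the question in the prompt. The sketch below shows one way that could look. The function name, mode strings, and prompt wording are all assumptions made for illustration, not Patronus AI’s actual templates, and the document-store setting is left out.

```python
from typing import Optional


def build_prompt(
    question: str,
    mode: str,
    evidence: Optional[str] = None,
    filing: Optional[str] = None,
) -> str:
    """Assemble a prompt for one of three hypothetical test configurations."""
    if mode == "closed_book":
        # No source material at all: the model must rely on what it already knows.
        return f"Question: {question}\nAnswer:"
    if mode == "oracle":
        # Only the exact passage containing the answer is supplied.
        if evidence is None:
            raise ValueError("oracle mode needs the evidence passage")
        return f"Context:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    if mode == "long_context":
        # Nearly the whole filing is supplied alongside the question.
        if filing is None:
            raise ValueError("long_context mode needs the filing text")
        return f"Context:\n{filing}\n\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(build_prompt("Did the company pay a dividend?", "closed_book"))
```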

GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.

It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.

But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.

Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.

Anthropic’s Claude 2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It answered 75% of the questions correctly, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
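Percentages like these come from sorting each response into one of three buckets (correct, incorrect, or no answer) and dividing by the number of questions. The sketch below shows that bookkeeping. The label names and the 20 sample labels are made up for illustration and are not data from the study.

```python
from collections import Counter
from typing import Dict, List

OUTCOMES = ("correct", "incorrect", "refused")


def outcome_rates(labels: List[str]) -> Dict[str, int]:
    """Percentage of responses in each outcome bucket, rounded to whole numbers."""
    if not labels:
        raise ValueError("need at least one graded response")
    counts = Counter(labels)
    return {name: round(100 * counts[name] / len(labels)) for name in OUTCOMES}


if __name__ == "__main__":
    # Hypothetical grading results for 20 questions.
    graded = ["correct"] * 15 + ["incorrect"] * 4 + ["refused"] * 1
    print(outcome_rates(graded))
```

Because each bucket is rounded separately, the three percentages may not sum to exactly 100.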

After running the tests, the cofounders were surprised by how poorly the models did, even when they were pointed to where the answers were.

“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”

Even when the models performed well, though, they just weren’t good enough, Patronus AI found.

“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.

But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.

“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”

An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing users that AI is being used and of its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.

Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.

Palantir stock slumps 9%, falling for a fifth straight day from record

CEO of Palantir Technologies Alex Karp attends the Pennsylvania Energy and Innovation Summit on the campus of Carnegie Mellon University in Pittsburgh, Pennsylvania on July 15, 2025.

Andrew Caballero-Reynolds | AFP | Getty Images

Palantir’s stock slumped more than 9% on Tuesday, falling for a fifth straight day to continue its pullback from all-time highs.

The artificial intelligence software provider’s stock has slid more than 15% over the last five trading sessions, after a stellar earnings report earlier this month propelled shares to all-time highs. The report was Palantir’s first-ever $1 billion revenue quarter.

Tuesday’s dip coincided with a broader market pullback.

Palantir is the biggest gainer in the S&P 500 so far in 2025, up more than 100%.

Shares have more than doubled as the company benefits from ongoing AI enthusiasm, scooping up government contracts with President Donald Trump pushing to overhaul agencies.

Palantir’s ascent has pushed the company into the ranks of the top 10 U.S. tech firms and the 20 most valuable U.S. companies, while also making shares incredibly expensive to own. Its forward price-to-earnings ratio, which compares the share price with expected future earnings, has soared past 245 times.

By comparison, technology giants such as Microsoft and Apple carry a P/E of nearly 30 times and rake in significantly greater quarterly revenues. Meta’s and Alphabet’s P/E ratios hover in the 20s.

Databricks says it’s valued at over $100 billion in latest funding round

Ali Ghodsi, CEO of Databricks speaks on CNBC.

CNBC

Databricks has just entered an exclusive club.

The data analytics software vendor said Tuesday that it’s raising a funding round that values the company at over $100 billion. That would make Databricks just the fourth private company to eclipse the $100 billion mark, following SpaceX, ByteDance and OpenAI, according to data from CB Insights.

Databricks CEO Ali Ghodsi told CNBC’s Brian Sullivan that the total round will exceed $1 billion. The company was last valued by private investors at $62 billion in a $10 billion financing round late last year.

In June, Databricks executives told investors the company was forecasting $3.7 billion in annualized revenue by July, with 50% year-over-year growth.

Snowflake, one of Databricks’ top rivals, is expected to generate $4.5 billion in revenue for the fiscal year that ends in January, representing annual growth of 25%, according to LSEG. Snowflake currently has a market cap of about $65 billion. Other competitors include cloud providers such as Amazon and Microsoft, which are also Databricks partners.

Ghodsi said he heard from a lot of interested investors following Figma’s IPO late last month. Shares of the design software company more than tripled in their New York Stock Exchange debut, a sign that public investors are seeking out tech offerings after an extended lull in the IPO market.

“My phone was blowing up,” Ghodsi said on Tuesday. “So yes, there’s definitely been a big push from outside.”

Figma shares have since retreated from their initial $115.50 closing price. The stock is trading at about $70, still more than double the $33 IPO price.

Ghodsi said the round will help Databricks invest in products that clients can tap when using artificial intelligence models.

Founded in 2013 and based in San Francisco, Databricks ranked third on CNBC’s 2025 Disruptor 50 list. As of June, the company employed 8,000 people. Existing investors Andreessen Horowitz, Insight Partners, Thrive Capital and WCM Investment Management are buying shares, a spokesperson said.

Crypto stocks tumble on Tuesday as investors go into risk-off mode

The Coinbase logo is displayed on a mobile phone screen with stock market percentages in the background.

Idrees Abbas | SOPA Images | LightRocket | Getty Images

Crypto stocks suffered on Tuesday as investors fled tech stocks and riskier corners of the market.

Among crypto exchanges, Coinbase and eToro fell more than 5% each, while Robinhood and Bullish both dropped more than 6%. Crypto financial services firm Galaxy Digital dropped 11%. In the burgeoning sector of crypto treasury firms, Strategy lost 7%, SharpLink Gaming slid 8%, Bitmine Immersion slumped 12% and DeFi Development tumbled 15%. Stablecoin issuer Circle lost 5%.

Meanwhile, the price of bitcoin pulled back nearly 3% to just over $113,000. Ether was down more than 4% to the $4,100 level, according to Coin Metrics.

Investors appeared to rotate out of tech names on Tuesday. The sector had seen a boost last week as traders weighed the prospect of more interest rate cuts. Also, bitcoin touched an intraday all-time high near $125,000 last week.

On Tuesday, the Nasdaq Composite was down more than 1%, weighed down by declines in Nvidia and other tech heavyweights.

The crypto market tends to be vulnerable to moves in tech stocks due to their growth-oriented investor base, narrative-driven price action, speculative nature and tendency to thrive in low-interest rate environments.

This week, investors are watching the Federal Reserve’s annual economic symposium in Jackson Hole, Wyo. for clues around what could happen at the central bank’s remaining policy meetings this year. If Fed Chair Jerome Powell signals more dovish policy could be ahead, crypto may bounce.

“With Powell speaking at Jackson Hole, we typically see profit-taking ahead of his remarks,” said Satraj Bambra, CEO of hybrid exchange Rails. “Any time there’s communication uncertainty from the Fed, you can generally expect some profit-taking as traders de-risk their positions.”

Crypto stocks have had a solid run in recent months, thanks to the addition of Coinbase to the benchmark S&P 500 index, the successful IPO of Circle and the GENIUS Act stablecoin framework becoming law. However, investors expect a pullback in August and through the September Fed meeting, where they hope to see central bank policymakers implement rate cuts.
