Patronus AI cofounders Anand Kannappan and Rebecca Qian | Patronus AI
Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.
Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.
Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.
“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”
The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.
The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.
In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.
But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers quickly realized that the numbers in Microsoft’s example were off, and some were entirely made up.
‘Vibe checks’
Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
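The non-determinism the cofounders describe can be made concrete with a simple consistency check: run the same prompt several times and tally how many distinct answers come back. The sketch below is illustrative only; `call_model` stands in for a real API client and is stubbed here so the example runs offline.

```python
# Sketch of a consistency check for a non-deterministic LLM.
# `call_model` is a hypothetical stand-in for a real API client;
# it is stubbed below so the example runs without network access.
from collections import Counter

def consistency_report(call_model, prompt, runs=5):
    """Call the model several times with the same prompt and
    tally how many distinct answers come back."""
    answers = [call_model(prompt) for _ in range(runs)]
    counts = Counter(answers)
    majority_answer, freq = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "majority_answer": majority_answer,
        "agreement_rate": freq / runs,
    }

# Stub model: cycles through canned outputs to mimic the way
# sampling can phrase the same fact differently across runs.
_outputs = ["$12.4B", "$12.4B", "12.4 billion", "$12.4B", "$12.4B"]
def stub_model(prompt, _state={"i": 0}):
    out = _outputs[_state["i"] % len(_outputs)]
    _state["i"] += 1
    return out

report = consistency_report(stub_model, "What was FY22 revenue?", runs=5)
print(report["distinct_answers"], report["agreement_rate"])  # 2 0.8
```

An agreement rate well below 1.0, as in the stubbed run above, is exactly the kind of behavior that makes manual "testing by inspection" unreliable.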
The founders met at Facebook parent-company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.
“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.'”
Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.
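A benchmark item of the kind described, with a question, a gold answer, and a pointer to the supporting passage, might be represented as below. The field names are an assumption for illustration, not FinanceBench's actual schema.

```python
# Hypothetical sketch of a FinanceBench-style benchmark item;
# field names are illustrative, not the actual dataset schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str
    answer: str
    company: str
    filing: str          # e.g. "10-Q Q2 FY2022"
    evidence_page: int   # where in the filing the answer appears
    requires_math: bool  # some questions need light arithmetic

item = BenchmarkItem(
    question="Has CVS Health paid dividends to common shareholders in Q2 of FY2022?",
    answer="Yes",
    company="CVS Health",
    filing="10-Q Q2 FY2022",
    evidence_page=4,
    requires_math=False,
)
print(item.company)  # CVS Health
```

Storing the evidence location alongside each answer is what lets an evaluator distinguish retrieval failures from reasoning failures.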
Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.
Here are some examples of questions in the dataset, provided by Patronus AI:
Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
Did AMD report customer concentration in FY22?
What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
How the AI models did on the test
Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.
It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
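The three retrieval settings, closed book, "Oracle," and long context, differ only in how much source text accompanies the question. The sketch below shows one way to build the three prompt variants; the wording and function names are assumptions, not Patronus AI's code.

```python
# Illustrative sketch of the three retrieval settings described above;
# prompt wording and function names are assumed, not Patronus AI's code.
def build_prompt(question, mode, evidence=None, full_filing=None):
    if mode == "closed_book":
        # No source text at all: the model must rely on training data.
        return question
    if mode == "oracle":
        # The exact relevant passage is supplied with the question.
        return f"Context:\n{evidence}\n\nQuestion: {question}"
    if mode == "long_context":
        # Nearly the entire filing is included alongside the question.
        return f"Filing:\n{full_filing}\n\nQuestion: {question}"
    raise ValueError(f"unknown mode: {mode}")

p = build_prompt("Did AMD report customer concentration in FY22?",
                 "oracle", evidence="...customer concentration passage...")
print(p.startswith("Context:"))  # True
```

The practical trade-off is visible in the structure: oracle mode needs a human (or a retrieval system) to find the passage first, while long context shifts that burden onto the model's context window.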
GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.
It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.
But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.
Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.
Anthropic’s Claude2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
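The figures above reduce to tallying graded responses into three buckets: correct, incorrect, and refused. A minimal scoring sketch, using toy labels that roughly mirror Claude2's long-context numbers (the grading labels themselves are assumed):

```python
# Minimal sketch of scoring graded responses into the three buckets
# reported above (correct / incorrect / refused); labels are assumed.
def score(labels):
    n = len(labels)
    return {bucket: labels.count(bucket) / n
            for bucket in ("correct", "incorrect", "refused")}

# Toy grades roughly mirroring the long-context Claude2 results.
labels = ["correct"] * 75 + ["incorrect"] * 21 + ["refused"] * 4
rates = score(labels)
print(rates["correct"])  # 0.75
```

Tracking refusals as their own bucket, rather than lumping them in with wrong answers, matters because the cofounders flag the refusal rate as a distinct failure mode.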
After running the tests, the cofounders were surprised about how poorly the models did — even when they were pointed to where the answers were.
“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”
Even when the models performed well, though, they just weren’t good enough, Patronus AI found.
“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.
But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.
“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”
An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing users that AI is being used and of its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.
Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.
Bitcoin briefly dropped below the $90,000 mark on Monday, extending its slide as investors continue to dump growth-oriented assets like crypto and tech stocks.
The price of the flagship cryptocurrency was last lower by 3% at $91,358.66 to start the week, according to Coin Metrics. Earlier, it fell as low as $89,259.00. Bitcoin is down 10% in the past week.
Bitcoin extends its slide as growth-oriented assets continue to get hit
“The need for liquidity is caused by FX spikes because of strong end-of-year U.S. economy numbers, the stock market rallying strong, and there are other places money is needed in the short-term,” said James Davies, co-founder and CEO at crypto trading platform Crypto Valley Exchange. “If we want bitcoin to act like a currency, we need to accept when it does, and this is one of those times. The U.S. Dollar has gotten stronger and everything else including bitcoin is weaker when measured in dollars.”
Investor sentiment was optimistic coming into 2025, with markets looking forward to having a pro-crypto Congress and White House. That hope had outweighed any concern about macroeconomic-related speedbumps, until last week.
Investors are now warning that the first quarter of this year could be more turbulent for crypto than previously anticipated.
Bitcoin’s price grew 120% in 2024 but is down 3% so far in the new year.
Health-care payments company Waystar on Monday announced a new generative artificial intelligence tool that can help hospitals quickly tackle one of their most costly and tedious responsibilities: fighting insurance denials.
Hospitals and health systems spend nearly $20 billion a year trying to overturn denied claims, according to a March report from the group purchasing organization Premier.
“We think if we can develop software that makes people’s lives better in an otherwise stressful moment of time when they’re getting health-care, then we’re doing something good,” Waystar CEO Matt Hawkins told CNBC.
Waystar’s new solution, called AltitudeCreate, uses generative AI to automatically draft appeal letters. The company said the feature could help providers drive down costs and spare them the headache of digging through complex contracts and records to put the letters together manually.
Hawkins led Waystar through its initial public offering in June, where it raised around $1 billion. The company handled more than $1.2 trillion in gross claims volume in 2023, touching about 50% of patients in the U.S.
Claim denials have become a hot-button issue across the nation following the deadly shooting of UnitedHealthcare CEO Brian Thompson in December. Americans flooded social media with posts about their frustrations and resentment toward the insurance industry, often sharing stories about their own negative experiences.
When a patient receives medical care in the U.S., it kicks off a notoriously complex billing process. Providers like hospitals, health systems or ambulatory care facilities submit an invoice called a claim to an insurance company, and the insurer will approve or deny the claim based on whether or not it meets the company’s criteria for reimbursement.
If a claim is denied, patients are often responsible for covering the cost out-of-pocket. More than 450 million claims are denied each year, and denial rates are rising, Waystar said.
Providers can ask insurers to reevaluate claim denials by submitting an appeal letter, but drafting these letters is a time-consuming and expensive process that doesn’t guarantee a different outcome.
Hawkins said that while there’s been a lot of discussion around claims denials recently, AltitudeCreate has been in the works at Waystar for the last six to eight months. The company announced an AI-focused partnership with Google Cloud in May, and automating claims denials was one of the 12 use cases the companies planned to explore.
Waystar has also had a denial and appeal management software module available for several years, Hawkins added.
AltitudeCreate is one tool available within a broader suite of Waystar’s AI offerings called AltitudeAI, which the company also unveiled on Monday. AltitudeCreate rolled out earlier this month to organizations that are already using Waystar’s denial and appeal management software modules, at no additional cost, the company said.
Waystar plans to make the feature more broadly available in the future.
“In the face of all of this administrative waste in health-care where provider organizations are understaffed and don’t have time to even follow up on a claim when it does get denied, we’re bringing software to bear that helps to automate that experience,” Hawkins said.
Through a newly announced collaboration between venture firm General Catalyst and Amazon Web Services, General Catalyst portfolio companies will use AWS’ services to build and roll out AI tools for health systems more quickly. Aidoc, which applies AI to medical imaging, and Commure, which automates provider workflows with AI, will be the first two companies to participate.
No financial terms were disclosed in the announcement.
“Without a strong partner like Amazon and AWS to stand alongside them, to co-develop and support these companies … it’s not going to move as fast as we hope,” Chris Bischoff, head of global health-care investing at General Catalyst, told CNBC in an interview.
Health systems are strained in the U.S., with staff burnout, growing labor shortages and razor-thin margins. These challenges often seem enticing for enterprising tech startups to tackle, especially as the multitrillion-dollar health-care industry dangles the prospect of large financial returns.
Hospitals operate in a complex, technology-wary and highly regulated sector that can be difficult for startups to break into. General Catalyst is hoping to help its companies fast-track the development and go-to-market process by leveraging resources like computing power from AWS.
General Catalyst is no stranger to taking big swings in health-care.
The firm has closed more than 60 digital health deals since 2020, behind only Gaingels and Alumni Ventures, according to a December report from PitchBook. Last January, General Catalyst shocked the industry by announcing that its new business, the Health Assurance Transformation Company, planned to acquire an Ohio-based health system – an unprecedented move in venture capital.
General Catalyst’s “deep understanding” of health systems’ financial and operating realities made it an attractive partner for AWS, Dan Sheeran, AWS’ general manager of Healthcare & Life Science, told CNBC. Sheeran and Bischoff began outlining the collaboration between the two groups after meeting in London around nine months ago.
AWS also has an established presence in the health-care sector. The company offers more health- and life-sciences-specific services than any other cloud provider, according to a release, and it inked other high-profile AI partnerships with GE HealthCare, Philips and others last year.
The partnership between General Catalyst and AWS will stretch over several years, but new tools from Aidoc and Commure are coming in 2025. Aidoc is exploring how it can use the cloud to tap data modalities across pathology, cardiology, genomics and other molecular information, for instance.
Aidoc and Commure were selected to kick off the collaboration because they have both established a product-market fit, are operational and are focused on issues that are a high priority for AWS customers.
“GC has spent a lot of time thinking about how health systems can transform themselves, and we recognize that it’s not going to be through 1,000 companies, and we need solutions that are really enterprise grade,” Bischoff said. “Amazon shares the same vision, so we are starting with these two.”
Though the partnership between General Catalyst and AWS is still in its early days, the organizations said they believe it will help serve as a way to meet the market’s growing demand for new solutions.
“Health system leaders who want to realize the benefits of AI now have an easier way to accomplish that,” Sheeran said.