Patronus AI cofounders Anand Kannappan and Rebecca Qian
Patronus AI
Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.
Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.
Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.
“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”
The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.
The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.
In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.
But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers quickly realized that the numbers in Microsoft’s example were off, and some numbers were entirely made up.
‘Vibe checks’
Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
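The non-determinism described above can be illustrated with a toy sampler. This is a minimal sketch, not how any particular vendor's model works: the candidate outputs and their scores in `LOGITS` are hypothetical, and the `temperature` parameter is assumed to behave as in standard softmax sampling. With a temperature of 0 the sketch always returns the top-scoring output, while at a temperature of 1.0 repeated runs on the same input return different outputs.

```python
import math
import random

# Hypothetical scores a model might assign to three candidate outputs.
LOGITS = {"answer_a": 2.0, "answer_b": 1.6, "refuse": 1.2}

def sample_output(logits, temperature, rng):
    """Pick one output by softmax sampling; temperature 0 means greedy."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring output.
        return max(logits, key=logits.get)
    scaled = {out: score / temperature for out, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {out: math.exp(s - peak) for out, s in scaled.items()}
    outputs = list(weights)
    return rng.choices(outputs, weights=[weights[o] for o in outputs], k=1)[0]

def distinct_outputs(temperature, runs=200, seed=0):
    """Collect the set of distinct outputs seen over repeated runs."""
    rng = random.Random(seed)
    return {sample_output(LOGITS, temperature, rng) for _ in range(runs)}

greedy_outputs = distinct_outputs(temperature=0)
sampled_outputs = distinct_outputs(temperature=1.0)
print("temperature 0:", greedy_outputs)
print("temperature 1:", sampled_outputs)
```

A test suite that checks a single response per prompt would miss this variation, which is one reason repeated, automated evaluation is needed.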
The founders met at Facebook parent-company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.
“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”
Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.
Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.
Here are some examples of questions in the dataset, provided by Patronus AI:
Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
Did AMD report customer concentration in FY22?
What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
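The third question requires arithmetic rather than a lookup. As a minimal sketch, assuming that “COGS % margin” means cost of goods sold divided by net revenue and expressed as a percentage, the calculation looks like this. The function name and the input figures are placeholders for illustration, not Coca-Cola’s reported numbers.

```python
def cogs_percent_margin(cogs, revenue):
    """Return cost of goods sold as a percentage of revenue.

    Assumes both inputs come from the same income statement and
    are expressed in the same units.
    """
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return 100.0 * cogs / revenue

# Placeholder inputs, not taken from any real filing.
example_margin = cogs_percent_margin(cogs=40.0, revenue=100.0)
print(f"{example_margin:.1f}%")
```

A model answering this kind of question has to locate both line items in the filing and then carry out the division, so either step can be the source of an error.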
How the AI models did on the test
Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.
It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.
It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.
But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.
Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.
Anthropic’s Claude 2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It correctly answered 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
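The results above sort every response into one of three outcomes: correct, incorrect, or a refusal to answer. A minimal sketch of how such rates could be tallied is shown below. The label names, the `score` function, and the 20-question sample are hypothetical and are not drawn from FinanceBench or from Patronus AI’s own tooling.

```python
from collections import Counter

OUTCOMES = ("correct", "incorrect", "refused")

def score(labels):
    """Return each outcome's share of the graded answers, in percent."""
    if not labels:
        raise ValueError("no graded answers")
    unknown = set(labels) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {outcome: 100.0 * counts[outcome] / len(labels) for outcome in OUTCOMES}

# Hypothetical grades for a 20-question run, for illustration only.
grades = ["correct"] * 15 + ["incorrect"] * 4 + ["refused"] * 1
rates = score(grades)
print(rates)
```

Reporting refusals separately from wrong answers matters because the two failures carry different risks: a refusal is visible to the user, while a hallucinated figure may not be.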
After running the tests, the cofounders were surprised about how poorly the models did — even when they were pointed to where the answers were.
“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”
Even when the models performed well, though, they just weren’t good enough, Patronus AI found.
“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.
But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry, whether analysts or investors, if AI continues to improve.
“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”
An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing users that AI is being used and describing its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.
Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.
This photo illustration created Jan. 7, 2025, shows an image of Mark Zuckerberg, CEO of Meta, and an image of the Meta logo.
Drew Angerer | AFP | Getty Images
Meta is seeking to stop the promotion of a new memoir by a former staffer that paints the social media company in an unflattering light, including allegations of sexual harassment by the company’s policy chief.
An emergency arbitrator ruled Thursday that Sarah Wynn-Williams is prohibited from promoting “Careless People,” her book that was released Tuesday by Flatiron Books, an imprint of publisher Macmillan Books.
The memoir chronicles Wynn-Williams’ tenure at Facebook from 2011 through 2017. During that time, she became a high-level employee who interacted with CEO Mark Zuckerberg, then-COO Sheryl Sandberg and Joel Kaplan, the company’s current policy chief. In the book, Wynn-Williams alleges that Kaplan made a number of inappropriate comments to her, which she then reported to the company as sexual harassment.
“This is a mix of out-of-date and previously reported claims about the company and false accusations about our executives,” a Meta spokesperson previously said about both her book and complaint.
Wynn-Williams also details in her book the company’s various attempts to enter the Chinese market, including building tools that would censor content to appease the Chinese Communist Party. Wynn-Williams addressed some of these China-specific claims in a whistleblower complaint that she filed in April with the Securities and Exchange Commission, NBC News reported.
The emergency arbitrator ruled in favor of Meta after watching a podcast appearance of Wynn-Williams in which she discussed her memoir and her allegations that Meta was attempting to “shut this book down.”
“The Emergency Arbitrator finds that, after reviewing the briefs and hearing oral argument, (Meta) has established a likelihood of success on the merits of its contractual non-disparagement claim against Respondent Wynn-Williams, and that immediate and irreparable loss will result in the absence of emergency relief,” the filing said.
Additionally, the arbitrator ruled that, to the extent Wynn-Williams can control it, she is prohibited from further publishing or distributing the book and from further disparaging Meta and its officers or repeating previous disparaging remarks. The arbitrator also ruled that Wynn-Williams is to retract her previous disparaging remarks.
The company has previously dismissed Wynn-Williams’ claims as “out-of-date” and said that she was fired for “poor performance and toxic behavior.”
Meta spokesperson Andy Stone shared the emergency arbitrator’s ruling in a post on Threads, saying that it “affirms that Sarah Wynn Williams’ false and defamatory book should never have been published.”
“This urgent legal action was made necessary by Williams, who more than eight years after being terminated by the company, deliberately concealed the existence of her book project and avoided the industry’s standard fact-checking process in order to rush it to shelves after waiting for eight years,” Stone said.
Meta alleged that Wynn-Williams violated the non-disparagement terms of her September 2017 severance agreement, resulting in the company filing an emergency motion on Friday. The emergency arbitrator then conducted a telephone hearing involving legal representatives of Meta and Macmillan Books, but not Wynn-Williams, who did not appear though she was given notice, the filing said.
Wynn-Williams, Flatiron Books and Macmillan Books did not respond to requests for comment.
Lip-Bu Tan appointed chief executive officer of Intel Corporation
Courtesy: Intel
Intel said on Wednesday that it had appointed Lip-Bu Tan as its new CEO, as the chipmaker attempts to recover from a tumultuous four-year run under Pat Gelsinger.
Tan was previously CEO of Cadence Design Systems, which makes software used by all the major chip designers, including Intel. He was an Intel board member but departed last year, citing other commitments.
Tan replaces interim co-CEOs David Zinsner and MJ Holthaus, who took over in December when former Intel CEO Patrick Gelsinger was ousted. Tan is also rejoining Intel’s board.
The appointment closes a chaotic chapter in Intel’s history, as investors pressured the semiconductor company to cut costs and spin off businesses due to declining sales and an inability to crack the booming artificial intelligence market.
Intel shares rose over 12% in extended trading on Wednesday.
Tan becomes the fourth permanent CEO at Intel in seven years. Following Brian Krzanich’s resignation in 2018, after the revelations of an inappropriate relationship with an employee, Bob Swan took the helm in January 2019. He departed two years later after Intel suffered numerous blows from competitors and chip delays. Swan was succeeded by Gelsinger in 2021.
Gelsinger took over with a bold plan to transform Intel’s business to manufacture chips for other companies in addition to its own, becoming a foundry. But Intel’s overall products revenue continued to decline, and investors fretted over the significant capital expenditures needed for such massive chip production, including constructing a $20 billion factory complex in Ohio.
Last fall, after a disappointing earnings report, Intel appeared to be for sale, and reportedly drew interest from rival companies including Qualcomm. Analysts assessed the possibility of Intel spinning off its foundry division or selling its products division — including server and PC chips — to a rival.
In AI, Intel has gotten trounced by Nvidia, whose graphics processing units (GPUs) have become the chip of choice for developers over the past few years.
In January, Intel issued a weak forecast even as it beat on earnings and revenue. The company pointed to seasonality, economic conditions and competition, and said clients are digesting inventory. The prospect of tariffs was adding to the uncertainty, Zinsner said.
Intel said that Zinsner will return to his previous role of CFO. Holthaus will remain in charge of Intel Products.
Intel was removed from the Dow Jones Industrial Average in November and was replaced by Nvidia, reflecting the dramatic change of fortune in the semiconductor industry. Intel shares lost 60% of their value last year, while Nvidia’s stock price soared 171%. At Wednesday’s close, Intel’s market cap was $89.5 billion, less than one-thirtieth of Nvidia’s valuation.
Roomba vacuums by iRobot are displayed at a Best Buy store on January 19, 2024, in San Rafael, California.
Justin Sullivan | Getty Images
Shares of iRobot plunged more than 30% on Wednesday after it said there is “substantial doubt” about its ability to stay in business.
The Roomba maker’s financial outlook has darkened since Amazon abandoned its planned $1.7 billion acquisition of the company in January 2024, citing regulatory scrutiny. Since then, iRobot has struggled to generate cash and pay off debts.
Massachusetts-based iRobot has been restructuring since the Amazon deal plunged into uncertainty. The company has laid off 51% of its workforce since the end of 2023, and iRobot has looked to reignite revenue growth by overhauling its product lineup. The company on Tuesday launched eight new Roombas in the hopes of “better positioning iRobot as the leader in the category that we created,” CEO Gary Cohen said in a statement.
“There can be no assurance that the new product launches will be successful,” iRobot said in its Wednesday earnings statement, citing limited consumer demand, tariff uncertainty and heightened competition.
“Given these uncertainties and the implication they may have on the company’s financials, there is substantial doubt about the company’s ability to continue as a going concern for a period of at least 12 months,” iRobot said in its earnings report.
The company’s fourth-quarter revenue sagged 44% year over year to $172 million, missing estimates of $180.8 million, according to FactSet. The Roomba maker posted a net loss of $77.1 million, or $2.52 per share. Excluding a one-time “manufacturing transition charge,” iRobot had a loss of $2.06 a share, wider than the loss of $1.73 per share projected by analysts surveyed by FactSet.
In July 2023, iRobot took a $200 million loan from the Carlyle Group to fund the company’s operations as a stopgap until the Amazon deal closed. The company amended the loan for a temporary waiver on certain financial obligations, which requires iRobot to pay a fee of $3.6 million.
As part of Wednesday’s report, iRobot said its board has initiated a strategic review of the business and is considering alternatives that could include refinancing its debt and exploring a potential sale. The board hasn’t set a deadline for when its review will conclude, the company said.
The proposed merger, which was announced in late 2022, would have allowed iRobot to scale and better compete with its rivals, Amazon CEO Andy Jassy said. Several of the fastest-growing robotic vacuum businesses are based in China, such as Anker, Ecovacs and Roborock, all of which have eaten into iRobot’s share of the market.
“We abdicate the acquisition, iRobot lays off a third of its staff, the stock price completely tanks, and now, there’s a real question of whether they’re going to be a going concern,” Jassy told CNBC’s Andrew Ross Sorkin in an interview last April.