Patronus AI cofounders Anand Kannappan and Rebecca Qian
Patronus AI
Large language models, similar to the one at the heart of ChatGPT, frequently fail to answer questions derived from Securities and Exchange Commission filings, researchers from a startup called Patronus AI found.
Even the best-performing AI model configuration they tested, OpenAI’s GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI’s new test, the company’s founders told CNBC.
Oftentimes, the so-called large language models would refuse to answer, or would “hallucinate” figures and facts that weren’t in the SEC filings.
“That type of performance rate is just absolutely unacceptable,” Patronus AI cofounder Anand Kannappan said. “It has to be much much higher for it to really work in an automated and production-ready way.”
The findings highlight some of the challenges facing AI models as big companies, especially in regulated industries like finance, seek to incorporate cutting-edge technology into their operations, whether for customer service or research.
The ability to extract important numbers quickly and perform analysis on financial narratives has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year. SEC filings are filled with important data, and if a bot could accurately summarize them or quickly answer questions about what’s in them, it could give the user a leg up in the competitive financial industry.
In the past year, Bloomberg LP developed its own AI model for financial data, business school professors researched whether ChatGPT can parse financial headlines, and JPMorgan is working on an AI-powered automated investing tool, CNBC previously reported. Generative AI could boost the banking industry by trillions of dollars per year, a recent McKinsey forecast said.
But GPT’s entry into the industry hasn’t been smooth. When Microsoft first launched its Bing Chat using OpenAI’s GPT, one of its primary examples was using the chatbot to quickly summarize an earnings press release. Observers quickly realized that the numbers in Microsoft’s example were off, and some numbers were entirely made up.
‘Vibe checks’
Part of the challenge when incorporating LLMs into actual products, say the Patronus AI cofounders, is that LLMs are non-deterministic — they’re not guaranteed to produce the same output every time for the same input. That means that companies will need to do more rigorous testing to make sure they’re operating correctly, not going off-topic, and providing reliable results.
The founders met at Facebook parent-company Meta, where they worked on AI problems related to understanding how models come up with their answers and making them more “responsible.” They founded Patronus AI, which has received seed funding from Lightspeed Venture Partners, to automate LLM testing with software, so companies can feel comfortable that their AI bots won’t surprise customers or workers with off-topic or wrong answers.
“Right now evaluation is largely manual. It feels like just testing by inspection,” Patronus AI cofounder Rebecca Qian said. “One company told us it was ‘vibe checks.’”
Patronus AI worked to write a set of over 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning.
Qian and Kannappan say it’s a test that gives a “minimum performance standard” for language AI in the financial sector.
Here are some examples of questions in the dataset, provided by Patronus AI:
Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
Did AMD report customer concentration in FY22?
What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.
How the AI models did on the test
Patronus AI tested four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 of the questions it had produced.
It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called “Oracle” mode. In other tests, the models were told where the underlying SEC documents would be stored, or given “long context,” which meant including nearly an entire SEC filing alongside the question in the prompt.
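The difference between these settings comes down to what text accompanies the question. The following is a minimal sketch of that idea, not Patronus AI’s actual code: the `Question` class, the `build_prompt` function, the mode names and the prompt wording are all hypothetical, and the setting in which models are told where documents are stored is left out.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One benchmark item (hypothetical structure, for illustration only)."""
    text: str      # the question itself
    evidence: str  # the exact passage that contains the answer
    filing: str    # (nearly) the full text of the SEC filing


def build_prompt(q: Question, mode: str) -> str:
    """Assemble the prompt for one of three illustrative settings."""
    if mode == "closed_book":
        # No source document at all: the model relies on what it already knows.
        return f"Question: {q.text}"
    if mode == "oracle":
        # Only the exact relevant passage is supplied.
        return f"Context:\n{q.evidence}\n\nQuestion: {q.text}"
    if mode == "long_context":
        # Nearly the whole filing is supplied alongside the question.
        return f"Context:\n{q.filing}\n\nQuestion: {q.text}"
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the question text stays the same across settings; only the amount of supporting context changes.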
GPT-4-Turbo failed at the startup’s “closed book” test, where it wasn’t given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times.
It was able to improve significantly when given access to the underlying filings. In “Oracle” mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time.
But that’s an unrealistic test because it requires human input to find the exact pertinent place in the filing — the exact task that many hope that language models can address.
Llama 2, an open-source AI model developed by Meta, had some of the worst “hallucinations,” producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents.
Anthropic’s Claude 2 performed well when given “long context,” where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
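Percentages like these come from sorting each response into one of three outcomes and dividing by the number of questions. Here is a minimal sketch of that tally, assuming each response has already been labeled by a reviewer. The function name `outcome_rates` and the sample labels are hypothetical and are not Patronus AI’s data.

```python
from collections import Counter

OUTCOMES = ("correct", "incorrect", "refused")


def outcome_rates(labels):
    """Return the share of responses falling into each outcome."""
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for outcomes that never occur.
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


# Hypothetical labels for 20 responses, for illustration only.
sample = ["correct"] * 16 + ["incorrect"] * 3 + ["refused"]
rates = outcome_rates(sample)

# The same arithmetic applied to a figure from the article:
# 14 correct answers out of 150 questions in the closed-book test.
closed_book_correct_pct = round(14 / 150 * 100, 1)
```

By the same arithmetic, 14 correct answers out of 150 questions works out to roughly 9%.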
After running the tests, the cofounders were surprised about how poorly the models did — even when they were pointed to where the answers were.
“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”
Even when the models performed well, though, they just weren’t good enough, Patronus AI found.
“There just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong one out of 20 times, that’s still not high enough accuracy,” Qian said.
But the Patronus AI cofounders believe there’s huge potential for language models like GPT to help people in the finance industry — whether that’s analysts, or investors — if AI continues to improve.
“We definitely think that the results can be pretty promising,” said Kannappan. “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”
An OpenAI representative pointed to the company’s usage guidelines, which prohibit offering tailored financial advice using an OpenAI model without a qualified person reviewing the information, and require anyone using an OpenAI model in the financial industry to provide a disclaimer informing users that AI is being used and explaining its limitations. OpenAI’s usage policies also say that OpenAI’s models are not fine-tuned to provide financial advice.
Meta did not immediately return a request for comment, and Anthropic didn’t immediately have a comment.
ServiceTitan, a company that sells software to contractors such as plumbers and roofers, on Monday filed to go public on the Nasdaq under the ticker symbol “TTAN.”
The filing suggests that investors could be getting more interested in next-generation software companies. Just a few, including Reddit and Rubrik, debuted on public markets in the U.S. this year, and chipmaker Cerebras filed for an initial public offering. There were basically no tech initial public offerings in 2021 or 2022 as central bankers pushed up interest rates to fight inflation, making investors less willing to bet on money-losing challengers.
Based in Glendale, California, ServiceTitan offers cloud software for advertising, scheduling jobs, dispatching, producing invoices and taking payments. It had a $35.7 million net loss on $193 million in revenue in the quarter that ended on July 31, according to the filing. Revenue was up about 24% year over year, and the quarterly loss had narrowed from almost $52 million.
ServiceTitan’s revenue growth rate will stand out for people investing in cloud stocks, who have seen rates sag with few new public companies in the sector. The average growth rate for Bessemer’s Nasdaq Emerging Cloud Index, the basis for the WisdomTree Cloud Computing Fund, is 16.6%.
The company was founded in 2007 by Ara Mahdessian and Vahe Kuzoyan, whose fathers were both residential contractors. While most ServiceTitan customers are small and medium-sized businesses, it has started focusing more on selling products to big companies and construction customers, according to the filing.
ServiceTitan plans to keep up to 5% of shares in the IPO for eligible clients, the founders’ friends and family members and others through a directed share program.
Investors include Battery Ventures, Bessemer Venture Partners, Iconiq and TPG. Iconiq on its own controlled 24% of the company’s Class A shares.
Competitors include Salesforce and SAP, along with specialty companies such as HouseCall Pro, Jobber and Workwave.
Goldman Sachs, Morgan Stanley, Wells Fargo and Citigroup are among the company’s IPO underwriters.
Tesla CEO Elon Musk (R) joins former U.S. President and Republican presidential candidate Donald Trump during a campaign rally at the site of his first assassination attempt in Butler, Pennsylvania, on Oct. 5, 2024.
Jim Watson | AFP | Getty Images
Tesla shares jumped on Monday following a report that President-elect Donald Trump’s transition team is planning to make a federal framework to regulate self-driving vehicles a top priority for the U.S. Transportation Department.
As of 6:11 a.m. ET, Tesla stock was up 7.98% in U.S. premarket trading after the release of the Bloomberg News report, which cited unnamed sources familiar with the matter.
CNBC could not independently verify the report and has requested comment from the Trump team and from the National Highway Traffic Safety Administration, a Transportation Department unit tasked with overseeing self-driving technologies.
Musk was a central figure in the business world pushing for Trump’s return to the White House in the lead-up to this month’s elections. The tech billionaire now stands to benefit from the close relationship he has formed with the Republican politician, who previously served a first presidential term between 2017 and 2021.
Last week, Trump picked Musk and former Republican presidential candidate Vivek Ramaswamy to lead the newly minted Department of Government Efficiency, or “DOGE” for short, which he said would end government “bureaucracy,” relax “excess” regulations and cut “wasteful” expenditures.
A federal framework for regulating self-driving vehicles would be a major boon to Musk’s Tesla, which has been promising fully self-driving vehicles for several years but has so far failed to deliver a car capable of being driven autonomously without a human behind the wheel.
The long-term vision for Tesla is to produce a fleet of so-called “robotaxis,” autonomous vehicles that can drive people around without the need for human supervision.
Last month, Musk showed off Tesla’s long-awaited robotaxi: a concept car called the “Cybercab,” a $30,000 two-seater vehicle with no steering wheel or pedals.
Tesla has already been beaten to the punch in the robotaxi race by Google’s Waymo venture, which is among the few companies that have successfully launched self-driving cars on public roads.
Speaking during an event unveiling Tesla’s Cybercab and “Robovan” vehicles, Musk said he expects Tesla to have “unsupervised” Full Self-Driving technology up and running in Texas and California next year in the company’s Model 3 and Model Y electric vehicles.
Full Self-Driving, or FSD, is Tesla’s premium driver assistance system, currently available in a “supervised” version for Tesla electric vehicles. FSD currently requires a human driver at the wheel, ready to steer or brake at any time.
Trump’s transition team is reportedly looking for policy leaders for the Transportation Department to develop a federal regulatory framework for self-driving vehicles, according to Bloomberg.
Candidates include Emil Michael, a former Uber executive, and Republican Representatives Sam Graves of Missouri and Garret Graves of Louisiana, Bloomberg reported.
Thomas Plantenga, CEO of used fashion resale app Vinted, on center stage during Web Summit 2024 in Lisbon, Portugal.
Harry Murphy | Sportsfile for Web Summit | Getty Images
LISBON, Portugal — Tech CEOs in Europe are urging regional countries to take bolder action to tackle Big Tech’s dominance and counter reliance on the U.S. for critical technologies like artificial intelligence after Donald Trump’s electoral win.
The Republican politician’s victory was a key topic among prominent tech bosses at the Web Summit conference in Lisbon, Portugal. Many attendees said they’re unsure of what to expect from the U.S. president-elect, citing this unpredictability as a core challenge at present.
Andy Yen, CEO of Swiss VPN developer Proton, says Europe should echo American protectionism and adopt a more “Europe-first” approach to technology — in part to reverse the trend of the last two decades, during which much of the Western world’s most important technologies, from web browsing to smartphones, have become dominated by a handful of large U.S. tech firms.
VPNs, or virtual private networks, are services that encrypt data and mask a user’s IP address to hide browsing activity and bypass censorship.
“It’s time for Europe to step up,” Yen told CNBC on the sidelines of Web Summit. “It’s time to be bold. It’s time to be more aggressive. And the time is now, because we now have a leader in the U.S. that is ‘America-first,’ so I think our European leaders should be ‘Europe-first.’”
One key push for the past decade from the European Union has been to take legal action and introduce tough new regulations to tackle the dominance of large technology players, such as Google, Apple, Amazon, Microsoft and Meta.
As Trump prepares to come into power for a second term, concerns have now mounted that Europe might rein in its tough approach to tech giants out of fear of retaliation from the new administration.
US Big Tech playing ‘extremely unfairly’
Proton’s Yen, for one, urged the EU not to water down its attempts to rein in America’s tech giants.
“Europe has been thinking in a very globalist mindset. They’re thinking we need to be fair to everybody, we need to open our market to everybody, we need to play fair, because we believe in fairness,” he told CNBC.
“Well, guess what? The Americans and the Chinese didn’t get the memo. They have been playing extremely unfairly for the last 20 years. And now they have a president that is extremely ‘America-first.’”
Mitchell Baker, former CEO of American open internet non-profit Mozilla Foundation, said the EU’s Digital Markets Act (DMA) has led to meaningful changes for the Firefox browser, with activity increasing since Google implemented a “choice screen” on Android phones that enables users to select their search engine.
“The change in Firefox new users and market share on Android is noticeable,” Baker said. “That’s nice for us — but it’s also an indicator of how much power and centralized distribution that these companies have.”
She added, “This change in usage because of one choice screen isn’t the full picture. But it is an indicator of the kind of things that consumers can’t choose and that businesses can’t build successfully because of the way the tech industry is structured right now.”
Thomas Plantenga, CEO of Lithuania-headquartered used clothing resale app Vinted, urged Europe to make the “right choices” to ensure the continent can “fend for ourselves” and does not get “left behind.”
“If you look very realistically at what countries do, they try to take care of themselves and they try to form coalitions to be stronger themselves, and as a coalition be stronger,” Plantenga told CNBC in an interview. “We have a lot of very talented, well-educated people.”
“We need [to] ensure that we can take care of our own safety, that we can take care of our own energy, that we ensure to keep on investing in our education and innovation so that we can keep up with the rest [of the world],” he stressed. “If we don’t, then we’ll be left behind. In every collaboration, it’s always a trade. And if we don’t have much to trade, we become weaker.”
‘AI sovereignty’ now a key battleground
Another theme that attracted much chatter on the ground at Web Summit was the idea of “AI sovereignty” — which refers to countries and regions localizing critical computing infrastructure behind AI services, so that these systems become more reflective of regional languages, cultures and values.
With Microsoft becoming a key player in AI, concerns have surfaced that the maker of the Windows operating system and Office productivity tools suite has secured a dominant position when it comes to foundational AI tools.
The tech giant is a key backer behind ChatGPT maker OpenAI, whose technology it also heavily uses in its own products.
For some startups, Microsoft’s decision to embrace AI has resulted in harmful, anti-competitive effects.
Last year, Microsoft hiked the fees it charges search engines to use its Bing Search APIs, which allow developers access to the tech giant’s backend search infrastructure — in part because of higher costs attached to its AI-powered search features.
“They’re gradually reducing our revenue — we’re still relying on them — and that reduces our capacity to do things,” Christian Kroll, CEO of sustainability-focused search engine Ecosia, told CNBC. “Microsoft is a very fierce competitor.”
CNBC has reached out to Microsoft for comment.
Ecosia recently partnered with fellow search provider Qwant to build a European search index and reduce dependence on U.S. Big Tech to deliver web browsing results.
Meanwhile, the European Union’s AI Act, a landmark artificial intelligence law with global implications, introduces new transparency requirements and restrictions on companies developing and using AI.
The law is likely to have a big impact on predominantly U.S. tech firms, since they’re the ones doing much of the development of, and investment in, AI.
With Trump set to come into power, it’s unclear what that could mean for the global AI regulatory landscape.
Shelley McKinley, chief legal officer of code repository platform GitHub, said she can’t predict what Trump will do in his second term — but that businesses are planning for a range of different scenarios in the meantime.
“We will learn in the next few months what President-elect Trump will say, and in January we will start seeing some of what President Trump does in this area,” McKinley said during a CNBC-moderated panel earlier this week.
“I do think it is important that we all, as society, as businesses, as people, continue to think about the different scenarios,” she added. “I think, as with any political change, as with any world change, we’re still all thinking about what are all of the scenarios we might operate.”