New research finds that large language models (LLMs) like GPT-4 can't compete with human forecasters, and there are reasons to think they won't be able to anytime soon.
There’s no obvious reason why artificial intelligence won’t eventually be able to compete with human forecasters. Forecasting is a hard task for AI because there’s no brute force method for predicting complex domains like politics; it requires judgment. But AI models are competitive in other domains that resist rigorous formal solutions, so it’s possible to imagine they could predict the future competitively too.
One motivation for testing the forecasting abilities of AI models—besides that it would be nice to have an AI oracle that can predict the future—is that forecasting is a stringent test for AI. It’s hard to know in many cases whether an AI model has performed well on a benchmark task because it’s good at that type of task—and is capable of generalizing to out-of-distribution tasks—or because it has learned how to do that specific task as part of its training. But the future isn’t part of any model’s training data. Because the answers to forecasting questions aren’t known ahead of time, they can’t be memorized in advance. Forecasting is a potentially valuable test of AI capabilities because it requires the ability to reason about truly novel situations.
Last year, Philipp Schoenegger and Peter S. Park tested the ability of GPT-4, OpenAI's then-frontier LLM, to produce forecasts on a diverse range of geopolitical topics over the course of a three-month Metaculus forecasting tournament.1 GPT-4 was prompted to role-play a superforecaster and to use the techniques the best human forecasters use. It didn't go well for GPT-4. The model was not only unable to compete with the human crowd; it couldn't even beat the no-information strategy of simply predicting 50% on every question (it forecast only binary questions). GPT-4, in other words, wasn't meaningfully better than a coin flip.
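To see why flat 50% predictions are the no-information baseline, it helps to look at a scoring rule. The sketch below uses the standard Brier score (squared error between forecast and outcome); whether the tournament scored exactly this way is an assumption here, but the logic holds for any proper scoring rule, since a constant 50% forecast earns a constant score no matter what happens.

```python
# Sketch: why a flat 50% forecast is the "no-information" baseline under
# the standard Brier score. Illustrative only; the tournament's exact
# scoring rule may differ.

def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome.
    Lower is better: 0.0 is perfect, 0.25 is the flat-50% baseline."""
    return (forecast - outcome) ** 2

# A flat 50% forecast scores 0.25 regardless of the outcome:
assert brier_score(0.5, 1) == 0.25
assert brier_score(0.5, 0) == 0.25

# A forecaster who leans the right way beats the baseline...
print(brier_score(0.8, 1))  # ~0.04
# ...while leaning the wrong way does worse than never guessing at all.
print(brier_score(0.8, 0))  # ~0.64
```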
Subsequent work has shown that performance can be improved by embedding LLMs in a more structured forecasting system2 or by combining them into an ensemble that effectively creates a digital crowd to compete with the human crowd.3 But while these techniques brought LLMs closer to human performance, the models were still not really competitive. Now the Forecasting Research Institute (FRI) has launched ForecastBench, a project that benchmarks machine learning systems against a small group of human superforecasters and against the general public on an automatically updated set of 1,000 forecasting questions.4 On the current ForecastBench leaderboard, the superforecasters' median is the most accurate by a significant margin, and the ordinary human forecasters' median is the second most accurate. All of the LLMs trail the human groups. Moreover, the top AI models, those that most closely approach human performance, were all allowed to use the human crowd forecasts when making their own. The best-performing model that forecast without any input from human forecasters did significantly worse than either human group. The best AI models seem to be better than a coin flip, but they still can't compete with humans.
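For readers curious what a "digital crowd" looks like mechanically, here is a minimal sketch: pool several models' probability forecasts for a question and take a central tendency. The median aggregation and the example numbers below are illustrative assumptions, not necessarily the exact method of the cited ensemble paper.

```python
from statistics import median

# Illustrative only: one simple way to build a "digital crowd" is to pool
# several models' probability forecasts and take the median. The cited
# ensemble work may aggregate differently; this just shows the idea.

def crowd_forecast(model_probs: list[float]) -> float:
    """Aggregate independent probability forecasts into one crowd forecast."""
    return median(model_probs)

# Twelve hypothetical LLM forecasts for one binary question:
llm_probs = [0.62, 0.40, 0.71, 0.55, 0.48, 0.66,
             0.59, 0.52, 0.45, 0.68, 0.57, 0.61]
print(crowd_forecast(llm_probs))  # ~0.58
```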
AI performance on forecasting tasks will certainly improve. But there are reasons for skepticism that AI will be able to compete with human forecasters soon. LLMs are great at reproducing and remixing the data they've been trained on, but they haven't convincingly demonstrated the ability to reason about the novel situations that forecasting requires. When interrogated, they seem to struggle with basic tasks like recognizing that the probabilities of all possible outcomes must sum to 100%. FRI has also found that the models it studies perform worse on questions that require reasoning about relationships between events, which suggests their causal reasoning abilities may still be fairly limited. LLMs still appear to be better at role-playing novel reasoning tasks than at actually performing them. While it would be great if AI could deliver accurate forecasts at scale, right now you're probably better off asking a smart analyst for advice than relying on automated forecasting.
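That coherence failure, assigning probabilities to mutually exclusive and exhaustive outcomes that add up to more than 100%, is easy to check mechanically. Here is a minimal sketch of such a check; the outcome sets and numbers are hypothetical.

```python
# Illustrative coherence check: probabilities over a set of mutually
# exclusive, exhaustive outcomes should sum to 1.0. The outcome sets
# and numbers here are hypothetical.

def is_coherent(probs: dict[str, float], tol: float = 0.01) -> bool:
    """Return True if the outcome probabilities sum to ~1.0."""
    return abs(sum(probs.values()) - 1.0) <= tol

# A coherent forecast over who wins an election:
print(is_coherent({"candidate A": 0.55, "candidate B": 0.43, "other": 0.02}))  # True

# The failure mode described above: each outcome judged in isolation,
# so the total drifts well past 100%.
print(is_coherent({"candidate A": 0.70, "candidate B": 0.60, "other": 0.05}))  # False
```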
The US election is just 24 days away. At least 4 million people have already voted. Please join me on Thursday, October 17 at 3:30 pm PT/6:30 pm ET for a live chat about the US election with top forecasters Jean-Pierre Beugoms, Scott Eastman, and Atief Heermance. Anyone who is subscribed to Telling the Future can participate, so I hope you will come discuss the election with us. And, as always, if you enjoyed this post, please share it with others!
Metaculus also ran a forecasting contest in Q3 2024 with 55 bots and a $30k prize pool, where the humans came out clearly on top: https://www.metaculus.com/notebooks/28784/ It'd be great if this were also mentioned in the post! This will be repeated in the coming quarters, so we'll see if the bots improve.
Hey Robert, I love your stuff, but I have to forecast that this post will not age well. ;-)
In my humble opinion, you are not giving enough credit to the current rate of change. If current models already come close, they will beat the average human in no time, and the superforecasters very soon as well.
Would be great to have a prediction market on that one. :D