Many crucial policy questions are difficult to forecast credibly because there is no way to measure the accuracy of the forecasts after the fact. But a technique called “reciprocal scoring” can help elicit serious forecasts even when there is no direct way of scoring their accuracy.
It was clear by the middle of February 2020—or should have been clear to people paying close attention—that the novel coronavirus would spread widely throughout the US. The infectious disease experts and medical officials on the “Red Dawn” email thread had begun to sound the alarm by the end of January. When nearly 20% of the people aboard the Diamond Princess were infected by the virus—and 2% of the infected died—the concerns of the Red Dawn experts seemed justified. The rapid spread of the disease in South Korea, Italy, and Spain offered little reason to think that things would be different in the US. A simple, back-of-the-envelope calculation suggested the new disease could cause an order of magnitude more deaths than the seasonal flu. Without effective public health interventions, hundreds of thousands of Americans were likely to die of COVID-19 within a year.
People sometimes wonder—reasonably—how much forecasts actually affect decision-making. I was forecasting the spread of COVID-19 for Good Judgment at the time, and knowing what was about to happen certainly influenced my personal decisions in 2020. I decided to let my gym membership lapse because I was confident the virus already was or soon would be circulating on O‘ahu. I made sure my savings were in fairly safe investments in anticipation of a stock market crash (although I unfortunately did not anticipate how quickly the market would recover). I cancelled my plans to present a paper at an international conference here in Honolulu, even though the organizers assured us they were “virtually certain” to hold the conference in March as planned. The conference organizers might have saved themselves some trouble if they had been better forecasters.
But knowing the coronavirus was likely to spread wasn’t much help in figuring out how to slow or stop the spread of the disease. When I make a forecast, I take into account the likelihood of different policy responses. I don’t typically make recommendations, because I assume that my forecast is unlikely to meaningfully change what happens. The effect is to treat the forecasted outcome as a fait accompli we can do very little to change. If you were a policy-maker, you’d want to know not just what’s likely to happen in some business-as-usual scenario, but what you might be able to do to improve the likely outcome. You don’t just want to know how many COVID-19 deaths there are likely to be; you want to know how to reduce that number.
This may be one reason why judgmental forecasting—in which skilled forecasters assign subjective probabilities to possible future events—has played a relatively small role in public policy debates. Simply predicting what’s likely to happen is useful mainly to the extent that it’s obvious what should be done if it does happen. When the goal is to prevent that thing from happening—as it obviously is with a global pandemic—what you need to know is what you can do to stop it.
One approach is to pose conditional questions. That is, you can ask forecasters what’s likely to happen, all else being equal, if any one of a range of different policies were enacted. You can then compare their answers for each different policy scenario, as well as to their forecast for the base case scenario with no policy intervention. You can in this way essentially forecast the effectiveness of a range of prospective policies. The economist Robin Hanson has proposed using betting markets to evaluate how effective policies are likely to be in achieving our desired goals.1 Conditional forecasting could serve basically the same function. It’s unlikely to completely settle political debates—if only because it is impossible to clearly separate the ends of policy from the means we use to achieve them—but it has the potential to be a powerful tool for evaluating policy.
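To make this concrete, here is a minimal sketch in Python of how conditional forecasts might be compared against a no-intervention baseline. The policy names and numbers are entirely hypothetical, and summarizing each scenario by its median forecast is just one simple choice among many.

```python
# A minimal sketch of comparing conditional forecasts across policy scenarios.
# The policies and numbers below are hypothetical illustrations, not real forecasts.

from statistics import median

# Each forecaster submits an expected outcome (say, deaths over some horizon)
# conditional on a given policy being enacted, all else being equal.
conditional_forecasts = {
    "baseline_no_intervention": [290_000, 310_000, 335_000],
    "mask_mandate":             [240_000, 255_000, 270_000],
    "school_closures":          [250_000, 265_000, 280_000],
    "stay_at_home_order":       [180_000, 195_000, 210_000],
}

baseline = median(conditional_forecasts["baseline_no_intervention"])

# The estimated effect of each policy is the gap between its median
# conditional forecast and the no-intervention baseline.
for policy, forecasts in conditional_forecasts.items():
    effect = median(forecasts) - baseline
    print(f"{policy:26s} median={median(forecasts):>8,.0f}  vs. baseline: {effect:>+9,.0f}")
```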
One problem, however, is that it’s hard to know whether our forecasts of conditional questions are any good, because we can’t say whether forecasts for policy options that are never tried—the counterfactual options—were accurate. One of the main reasons forecasting has improved is that we started measuring our performance. It’s reasonable to think that the same strategies we use to forecast the policies we do implement would work for the options we don’t, but without a way to resolve the questions there’s no sure way to know.
In 2020, there was an experimental tournament to test conditional forecasting questions. The tournament consisted of two independent teams of forecasters with a track record of accuracy in previous forecasting exercises. They were asked to forecast COVID-19 deaths under different policy conditions—mandating the use of masks in public spaces, closing schools or universities, banning gatherings of certain sizes, and so on—as well as in a baseline no-policy scenario. Participants were awarded a bonus for how closely their forecasts correlated with the median of the other team’s forecasts. The idea was that—in the absence of a direct way of measuring the accuracy of counterfactual forecasts—this reciprocal scoring system would elicit good-faith forecasts by measuring each participant against other skilled forecasters. The result was that the two teams produced plausible, stable, and similar estimates of the likely impact of the proposed public health policies—finding in particular that a three-week stay-at-home order was likely to have the biggest impact of any single proposed policy.2
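To illustrate the mechanism, here is a minimal sketch of reciprocal scoring in Python. The quadratic distance penalty, the team rosters, and the probabilities are my own illustrative assumptions rather than the tournament’s exact scoring rule or data; the point is simply that each forecaster is scored against the other team’s median instead of an unobservable outcome.

```python
# A rough sketch of reciprocal scoring. The quadratic penalty and all names and
# numbers here are illustrative assumptions, not the tournament's actual rule or data.

from statistics import median

# Hypothetical probability forecasts from two independent teams for one
# conditional question (e.g., "deaths exceed X if policy Y is enacted").
team_a = {"alice": 0.62, "bob": 0.55, "carol": 0.70}
team_b = {"dana": 0.58, "eli": 0.65, "fay": 0.60}

def reciprocal_scores(own_team: dict, other_team: dict) -> dict:
    """Score each forecaster against the other team's median forecast.

    Because counterfactual questions never resolve, the other team's median
    stands in for ground truth: a smaller squared distance from that median
    yields a higher (less negative) score.
    """
    target = median(other_team.values())
    return {name: -(p - target) ** 2 for name, p in own_team.items()}

print(reciprocal_scores(team_a, team_b))  # Team A scored against Team B's median
print(reciprocal_scores(team_b, team_a))  # Team B scored against Team A's median
```

The incentive to forecast honestly rests on the assumption that the other team is itself skilled and trying to be accurate, in which case your own best estimate is also your best guess at their median.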
Reciprocal scoring could help elicit serious answers to other types of difficult, important forecasting questions. In particular, it can help us estimate the risk of severe catastrophes like a global nuclear war or runaway climate change. Rare or unprecedented events like these are hard to forecast because in most cases they don’t occur—and by the time they do, it may be too late to do much to improve our forecasts. So researchers now propose to use reciprocal scoring to elicit compelling forecasts of which strategies would be effective at mitigating these existential risks.3 As someone who has worked on existential risk, I’m extremely excited by the potential of this work to help us figure out how to avoid future global pandemics and even worse catastrophes.
Forecast Update
At the end of March, I wrote that the likelihood of a negotiated peace between Ukraine and Russia in the near future was low. I estimated there was just an 8% chance of a bilateral ceasefire before June and a 54% chance of a bilateral ceasefire in 2022, although I later revised those chances up slightly. There hasn’t been much apparent progress since then. With just a little over a month left before June and the two sides actively contesting a large part of Ukraine, I think the chance of an agreement before June is small. Since WWII, the median time it took to reach a negotiated settlement in interstate wars was about 7 months—and less than half of all wars ended in an agreement at all.4
My revised forecasts:
3% chance of a bilateral ceasefire agreement across Ukraine before June
41% chance of a bilateral ceasefire agreement across Ukraine in 2022
Good Judgment recently interviewed me for a profile alongside incredible superforecasters Jean-Pierre Beugoms, Kjirste Morrell, and Dan Mayland. As always, if you’re enjoying my work, please help support it by liking it and sharing it with others.