In the second episode of Talking About the Future, I interview Metaculus Project Director and superforecaster Tom Liptay about Metaculus’ new system for scoring forecasts. You can listen to my full conversation with Tom using the audio player above or on most podcast platforms. Excerpts from our conversation, edited for clarity, are below.
Let's start by stepping back and looking at the big picture. What is Metaculus, and what is it trying to do?
TL: Metaculus is an online platform where users can sign up for a free account. They can submit forecasts on a wide range of questions about the world. It typically has more of a science orientation than a geopolitical one. And the key thing we do is we keep score of those forecasts and ultimately give people an accuracy ranking so we can identify who the most accurate forecasters are.
What's the point of doing that? Are you trying to produce useful forecasts? Are you trying to identify good forecasters? Are you trying to teach people forecasting skill? Is it just fun?
TL: All of the above, but some are more important than others. Ultimately, we aspire to improve decision-making on the most consequential, complex topics facing people. So our ideal target audience would be policy makers who are making real decisions that affect the world, and we're trying to help inform them. But in order to get to that place you want to do a bunch of other things, including identifying the best forecasters so that you can provide them with the best, most accurate forecasts and the best commentary.
So ultimately one of the goals—or maybe the primary goal—of Metaculus is to produce some output which is useful to people who are making real world decisions about, say, politics?
TL: Yeah, or make real world policy decisions. In our ideal world, we would be read at the White House....
Why did Metaculus decide to redo the scoring system now? What are you trying to improve about the old scoring system? Were there problems with it?
TL: We were trying to make it simpler and more understandable. We were trying to make it more motivating for forecasters. We were trying to make it more consistent, more coherent, and provide more information on who is more accurate.
The old Metaculus points were great. It was a proper scoring system, which means you were incentivized to always submit your true probability, and there was no way to game it. If you were answering many questions, then in order to get the best possible score, you would submit your true probability. But it was a cumulative score. We added up your Metaculus points on all questions, so the only way to get to the very top of the leaderboard was to answer a lot of questions. Of course, it helped to answer those accurately, but you still needed to answer quite a few.
In our new scoring system, we have two scores. One is a baseline score, which compares you to chance, and where we still do the summing. So you want to be accurate, but you also want to forecast a large volume of questions. And we've introduced a new metric which we call the “peer score,” which compares you to every other forecaster and takes the average. So by doing that, a more time-constrained forecaster—perhaps like you or me—can answer 40 questions in a year and not be at a disadvantage....
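To make the distinction concrete, here is a minimal sketch, in Python, of how a per-question baseline score (relative to chance) and peer score (relative to the other forecasters) could be computed with a log scoring rule. The function names, example numbers, and lack of scaling are my own illustration, not Metaculus's actual implementation, which, as I understand it, also averages your forecast over the time a question is open and applies its own normalization.

```python
import math

def log_score(p: float, happened: bool) -> float:
    """Log score of a binary forecast: ln(p) if the event happened, ln(1 - p) if it didn't."""
    return math.log(p if happened else 1.0 - p)

def baseline_score(p: float, happened: bool) -> float:
    """Score relative to chance (a 50% forecast on a binary question).
    Positive means better than chance; these are summed across questions,
    so answering more questions accurately raises your total."""
    return log_score(p, happened) - log_score(0.5, happened)

def peer_score(p: float, others: list[float], happened: bool) -> float:
    """Score relative to the average log score of the other forecasters on the question;
    these are averaged (not summed) across questions, so sheer volume alone doesn't help."""
    others_avg = sum(log_score(q, happened) for q in others) / len(others)
    return log_score(p, happened) - others_avg

# One resolved question: the event happened, you said 80%, three peers said 60%, 70%, and 90%.
print(baseline_score(0.8, True))                # ~0.47: better than chance
print(peer_score(0.8, [0.6, 0.7, 0.9], True))   # ~0.10: slightly better than the crowd
```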
Were there people that were at the top of the leaderboard who were just answering a lot of questions and then not actually providing much useful forecasting information?
TL: I love this question, because when I joined Metaculus, I would see our top forecaster. I don't want to embarrass him in this interview, but he was crushing it. He was at the top of the leaderboard, consistently beating everyone. It was a question in my mind. Is this guy really an amazing forecaster, or is it just because he's answering lots of questions? And I didn't know the answer to that.
I really didn't even know the answer to that until we fully put together our new system. Then you could go back and look historically, and you could see that there were years where he had the number one score in baseline, which was the sum of scores. That means he was answering a lot of questions, probably all of them, and at the same time he was number one on the peer score, which means he was beating everyone by the biggest margin of any other forecaster. In general, you'd expect someone who was optimizing for one of those strategies to have different behavior, so the fact that he was able to do both simultaneously is incredibly impressive. There was no way to see that with Metaculus Points, and you can see that very clearly with our new system....
In general, how do you measure forecasting ability or forecasting performance? Is there a single metric that can capture it?
TL: It's a great question. What we do is we use a proper scoring method, which means that your incentive, if you're trying to get the best score in expectation over many questions, is to submit your true probability. We view that as an essential principle for Metaculus scoring.
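As a quick illustration of what "proper" means here: under a log score, your expected score is highest when the probability you report is the probability you actually believe. The snippet below is just a numerical check of that property (the function name and the numbers are mine, not Metaculus code), assuming a true probability of 70%.

```python
import math

def expected_log_score(reported: float, true_p: float) -> float:
    """Expected log score of reporting `reported` when the event truly occurs
    with probability `true_p`."""
    return true_p * math.log(reported) + (1.0 - true_p) * math.log(1.0 - reported)

true_p = 0.7
for reported in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(reported, round(expected_log_score(reported, true_p), 4))
# 0.5 -0.6931, 0.6 -0.6325, 0.7 -0.6109, 0.8 -0.639, 0.9 -0.7645
# Reporting 0.7 (your true belief) gives the highest expected score.
```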
But there are many ways to do proper scoring, and depending on your choice of scoring method, you will get a different ordering of who is most accurate. Perhaps the two most common scoring metrics are the Brier score, which is your squared error (that's what Good Judgment Open would use, or what the Good Judgment Project used), and the log score, which Metaculus uses and which is also proper. I don't think you can objectively say one is better than the other. What it really comes down to is that each system punishes certain types of error more than the other one.
The log score is very sensitive near the tail ends of the probability distribution. The example that I like to give is if you say there’s 0% chance of something happening, and then it actually happens, with Brier scoring, you get the worst possible score, but it's actually not that much worse than someone who said there’s a 10% chance that it would happen. With log scoring, the story is completely different. If you say 0% and it happened, you get negative infinity, and you can never ever catch up. To make sure that that never happens, on Metaculus, we make sure that you can never forecast 0% or 100%. You can get down to 0.1% or you can forecast 99.9%, but you cannot go all the way to the extremes, because then you’re wiped out forever. So that’s different from Good Judgment Open, where you can forecast 0%. You can be wrong, and you can come back from that.
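To put numbers on the example Tom gives: with Brier scoring (squared error, lower is better), a 0% forecast on something that happens is the worst possible score but not far from a 10% forecast, while under log scoring the 0% forecast is unrecoverable. A rough sketch with my own helper functions, not the platform's code:

```python
import math

def brier(p: float, happened: bool) -> float:
    """Brier score for a binary forecast: squared error (0 is perfect, 1 is worst)."""
    outcome = 1.0 if happened else 0.0
    return (p - outcome) ** 2

def log_score(p: float, happened: bool) -> float:
    """Log score: ln of the probability you put on what actually happened (higher is better)."""
    return math.log(p if happened else 1.0 - p)

# The event happens. Compare a 0% forecast with a 10% forecast.
print(brier(0.0, True), brier(0.1, True))   # 1.0 vs 0.81: the worst score, but not that much worse than 10%
print(log_score(0.1, True))                 # ~ -2.30
# log_score(0.0, True) is ln(0), i.e. negative infinity (math.log(0.0) raises an error),
# and no later forecast can make up for it. Clamping to the 0.1%-99.9% range keeps the
# worst case finite:
print(log_score(0.001, True))               # ~ -6.91: very bad, but recoverable
```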
I'm inclined to say that makes sense to me epistemologically, just in the sense that I don't think it's ever right to say there is a 0% or 100% chance of something happening. You could ask me about something that's just totally absurd, and I'm going to say it's basically 0%. But there's always a chance—because we're not talking about some kind of necessarily true probability about the world, but about what we can know—that I could be completely deluded. I could be a brain in a vat. I could just be totally confused. I like to think the chance is small, but it's non-zero. So if I actually were to say 0%, not rounding, I'd probably be wrong in terms of what it's possible for me to know.
TL: I agree 100%. Robert Rubin wrote a great book where he talks about exactly this. I have the same mentality. I'm never willing to actually say 0%. I'll say one chance in a billion, but not 0%. Just to highlight, the founders of Metaculus are physicists. They were concerned about the really big questions about humanity's existence. They care a lot about tail risks, extinction events.
Things with very low probabilities. We think.
TL: Hopefully very low probability. So in that context, I think the log score does make more sense, because they care about the tail events and log scores are more sensitive to the tail events. If you're talking about elections, where it's more 50-50, and you want to differentiate forecasting skill, you probably want to choose a Brier score. I think that's more sensitive in that range. There's no right or wrong answer....
You could do something like a World Series or Super Bowl, have direct head-to-head friendly competitions between different platforms like Good Judgment Open or prediction markets like Manifold. Do you think that's a thing that could ever happen? Just an annual best-of competition. It could be like an esport.
TL: Yeah, you could do it between individual forecasters. I think you could do it between platforms. Maybe a more realistic place where you might see this first is with some college initiatives we have going right now. We have students within different colleges forecasting on the same questions, and they can compete against each other to see which college is more accurate. There have already been some pilots done. We've partnered with Optic, and we hope to do more of that in the future.
I think you could do it in high school. You could have high school leagues. I think it'd be really fun to teach forecasting to high schoolers—because they're totally capable of doing it—and have them compete against one another in the equivalent of a debate league or something.
TL: I absolutely love that. I would have loved to have learned my history or current events by like, hey, make a forecast on this question. Have all your students make the forecast, write the rationale. A month later, you find out the answer. That's a great way, at least for me, to stay engaged with the real world.
You can listen to the previous episode of Talking About the Future with Swift Centre Director and superforecaster Michael Story here. The intro music to Talking About the Future is “Catch It” by Yrii Semchyshyn. If you enjoyed this podcast, you can make more interviews like this possible by becoming a paid subscriber. You can also support me by rating and reviewing this podcast and by sharing it with your friends and colleagues.