
Toby Shevlane on Building AI Forecasters

"In a few months time it’s my belief that it will no longer be clear that systems like Mantic are below the performance of the human pros"

In this episode of Talking About the Future, I talk with Mantic CEO and co-founder Toby Shevlane about teaching AI models to perform at the level of top human forecasters. You can listen to our full conversation—which covers a lot more than this post—using the audio player above or on most podcast platforms. Excerpts from our conversation, edited for clarity, are below. If you enjoy our conversation, please share it with others!

Toby, why is judgmental forecasting a hard problem for AI? Why can’t I just take the latest version of Gemini out of the box and ask it to tell me which party is going to win the most seats in the Samoan general election?

TS: Yeah, it's a great question. I think it's partly because it's a research problem, in the sense that it's not a self-contained question. You need to go out and find information in the world, so at the least you need a deep research agent. And as deep research agents, they're actually probably not bad right now at forecasting. I mean, they've gotten much better in the last year or so at making these kinds of predictions. We started about a year ago, and at that time it felt like the models, when it came to the point of making the prediction, giving your probability estimate, weren't really trying very hard: maybe ignoring important information, maybe just not thinking deeply enough about the problem. Now that we have these reasoning models, they're actually much better at the task. So I think just out of the box, a single call to an LLM to say, give me the answer to this question, won't be well informed enough, but if you add the deep research aspect, then they're really not bad, I think. I'd say the problem we're trying to solve isn't how to make a pretty good forecast. We're trying to solve the problem of pushing out the frontier of how accurate that forecast can be....
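To make that contrast concrete, here is a minimal sketch in Python of a bare single-call forecast versus a research-augmented one. The `call_llm` and `search_news` helpers are hypothetical placeholders for whatever model API and retrieval source you use; this is not Mantic's stack.

```python
# Minimal sketch: a bare single-call forecast versus a research-
# augmented one. call_llm() and search_news() are hypothetical
# placeholders, not Mantic's stack; wire in your own provider.

def call_llm(prompt: str) -> str:
    """Send a prompt to some LLM and return its text response."""
    raise NotImplementedError  # e.g. your model provider's API

def search_news(query: str, max_results: int = 10) -> list[str]:
    """Return recent article snippets relevant to the query."""
    raise NotImplementedError  # e.g. a news or web-search API

def bare_forecast(question: str) -> str:
    # One call, parametric knowledge only: stale past the training cutoff.
    return call_llm(f"Give a probability between 0 and 1 for: {question}")

def research_forecast(question: str) -> str:
    # Retrieve fresh context first, then ask for the probability.
    context = "\n".join(search_news(question))
    return call_llm(
        f"Recent reporting:\n{context}\n\n"
        f"Using this, give a probability between 0 and 1 for: {question}"
    )
```

The second function is the crude core of a deep research agent: retrieve first, predict second.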


Does your forecasting AI currently know that the probability of all the possibilities has to add up to 100%? Does it understand Bayesian updating? Would it pass a test if I tried to poke holes in its reasoning and find logical inconsistencies and scope fallacies and things like that?

TS: Yes and no. There's this interesting paper from Daniel Paleka about consistency evaluations of LLM forecasting systems.1 The idea there is that instead of evaluating the AI system's forecasting ability by getting it to make a bunch of predictions and then scoring their accuracy over time, which you may not have time to do, you could instead ask it to answer some forecasting questions where there's some kind of logical relationship between them, and see if it violates that logical consistency. For example, take a UK election. You could ask, what's the likelihood that Reform will win the next UK general election? They're doing very well in the polls at the moment. You could ask that question in isolation, but then you could also ask, what's the likelihood of the Labour Party winning the next general election? And ask that question in isolation, and do the same for all the political parties. Then, to be safe, to make it collectively exhaustive, you could add: and some other party that's not those. You really want all those separate forecasts, all the probabilities you gave, to add up to one, like you said. And I don't think that's true normally for an AI system making these kinds of predictions. I wonder if it's true for humans. Maybe not. I actually think that's a really useful jumping-off point for doing research to improve the accuracy, funnily enough. Because, if you think about it, if they're violating these kinds of logical consistency rules, then you're leaving something on the table, right?...
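To illustrate the kind of check described here, below is a small Python sketch that asks for each mutually exclusive outcome plus an "other" bucket and tests whether the probabilities sum to one. The `forecast` stub and its numbers are made up for illustration; it stands in for whatever system is under test.

```python
# Sketch of a MECE (mutually exclusive, collectively exhaustive)
# consistency check for a forecasting system. `forecast` is a stand-in
# for the system under test; here it returns made-up numbers.

def forecast(question: str) -> float:
    """Placeholder: return the system's probability (0-1) for a question."""
    illustrative = {"Reform": 0.34, "Labour": 0.30, "Conservative": 0.18,
                    "Liberal Democrat": 0.08, "other": 0.20}
    return next(p for name, p in illustrative.items() if name in question)

parties = ["Reform", "Labour", "Conservative", "Liberal Democrat"]
questions = [f"Will {p} win the most seats at the next UK general election?"
             for p in parties]
questions.append("Will some other party win the most seats?")

# Ask each question in isolation, as the consistency eval prescribes.
probs = [forecast(q) for q in questions]
total = sum(probs)

# A coherent forecaster's probabilities should sum to ~1. A big gap
# means accuracy is being left on the table.
if abs(total - 1.0) > 0.05:
    print(f"Inconsistent: probabilities sum to {total:.2f}")
    probs = [p / total for p in probs]  # cheapest repair: renormalize
```

Renormalizing is the cheapest repair; as Toby notes, the violation itself is also a useful signal for improving accuracy.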

October 13, 2025 tweet by Mantic reading, "Inside look: our quest to beat the top human forecasters." Below is a screenshot of Toby Shevlane at a table from an attached video.

How do you design AI to predict the future accurately? Can you walk me through your approach to training, prompting, scaffolding, whatever it is?

TS: Yeah, definitely. With AI systems there's a potential problem with forecasting, which is that when you ask a question, the answer isn't known yet. You have to wait. If I ask this question about who's going to win the next UK general election, we might not know that for another three or four years. You don't really want that in your research iteration loop: making a change and then waiting three or four years to see if the change was good. Ideally, you could be running experiments where you can instantly score the change you've made and compare it against the predictions our baseline system was making this morning. The way you get around this problem, which I think is actually a massive opportunity, is to do backtesting. The catch is that you can't go back too far, because if you go back two or three years and ask a question from the perspective of 2022, the AI system might say, oh, that's easy, I already know the answer, because I'm powered by lots of LLMs that know everything. But you can go back to the knowledge cutoff of the LLM. Often that's around a year, so you can ask a forecasting question from the perspective of about a year ago, where you already know the answer, and then you can instantly score the accuracy of the prediction....
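Here is a minimal backtesting loop sketched in Python under the constraints Toby describes: only questions posed from after the model's knowledge cutoff, with outcomes that have since resolved, scored immediately with a Brier score. The cutoff date and the `forecast_as_of` stub are illustrative, not Mantic's internals.

```python
from dataclasses import dataclass
from datetime import date

# Sketch of a backtesting loop: only ask questions from after the
# model's knowledge cutoff whose outcomes are now known, so every
# prediction can be scored instantly.

KNOWLEDGE_CUTOFF = date(2024, 10, 1)  # hypothetical model cutoff

@dataclass
class Question:
    text: str
    asof: date       # the date the forecast is made "from"
    outcome: bool    # known resolution; the whole point of backtesting

def forecast_as_of(text: str, asof: date) -> float:
    """Placeholder engine: must only see information published before
    `asof`, otherwise the backtest is contaminated by hindsight."""
    return 0.5  # stub; wire in the real prediction engine here

def brier(p: float, outcome: bool) -> float:
    return (p - float(outcome)) ** 2  # lower is better; 0.25 = coin flip

def backtest(questions: list[Question]) -> float:
    # Skip anything from before the cutoff: the model may simply
    # remember the answer from pretraining, which proves nothing.
    valid = [q for q in questions if q.asof > KNOWLEDGE_CUTOFF]
    scores = [brier(forecast_as_of(q.text, q.asof), q.outcome)
              for q in valid]
    return sum(scores) / len(scores)
```

Comparing a change against this morning's baseline then just means running both engines through `backtest()` on the same question set.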


TS: Then on top of that, I think there are really three different directions for improving the accuracy of the predictions, which are data, scaffolding, and model training. We've talked already a bit about model training: we make tens of thousands of forecasting questions from the perspective of the past and then use reinforcement learning to train the model to be good at making the prediction. In a way, making the prediction is the last part of the workflow. The overall prediction engine performs like a deep research agent that breaks down the problem and does lots of research. So from that perspective, you have to think about the scaffolding and the data.

When I say data here, I mean separate from the training data for the reinforcement learning project; I mean data for information retrieval. So it's like, okay, I've got a forecasting question about US politics, I need to go off and find lots of information about US politics right now. Obviously, one way of making better predictions is just being better informed. So what we need to do is onboard more and more sources of data to the prediction engine, which we're doing. You can take some low-hanging fruit by using very general data sources that are useful across many different questions, like the news, like economic data, and so on. But then you can really go beyond that. I think there's a long tail of data sources that can potentially be helpful. If you think about what hedge funds do, they're constantly spending millions of dollars a year on data. It can be niche sometimes: how many cars are in the car park of this supermarket, that kind of stuff. In the ideal world, our prediction engine is very well informed because we've onboarded this very wide collection of data sources.

And then the scaffolding: what I mean by that is, because we're not just making one call to an LLM, we're making hundreds of calls to an LLM per prediction, how do you organize that whole workflow? The research process, the process of breaking down the problem, the process of bringing it back to a prediction, adjusting the prediction, and so on?...
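To make the scaffolding idea concrete, here is a hedged Python sketch of one way a decompose, research, and synthesize workflow could be organized, with many LLM calls per prediction. Every helper here is a hypothetical placeholder; Mantic's actual engine is not public.

```python
# Hypothetical sketch of a forecasting scaffold: decompose the question,
# research each sub-question, then synthesize and adjust a probability.
# Every helper is a placeholder; this is not Mantic's engine.

def call_llm(prompt: str) -> str:
    """Placeholder for a model call."""
    raise NotImplementedError

def retrieve(query: str, sources: list[str]) -> list[str]:
    """Placeholder for pulling snippets from onboarded data sources."""
    raise NotImplementedError

SOURCES = ["news", "economic_data", "polling"]  # a long tail in practice

def predict(question: str) -> float:
    # 1. Break the problem down (one or more LLM calls).
    subs = call_llm(f"List the key sub-questions for: {question}").splitlines()

    # 2. Research each sub-question: retrieve, then summarize (many calls).
    notes = []
    for sub in subs:
        snippets = retrieve(sub, SOURCES)
        notes.append(call_llm(f"Summarize this for a forecaster: {snippets}"))

    # 3. Draft a probability, then critique and adjust it (further calls).
    draft = call_llm(
        f"Question: {question}\nEvidence: {notes}\n"
        "Give a probability between 0 and 1."
    )
    final = call_llm(
        f"Critique and adjust this forecast: {draft}\n"
        "Reply with only the final probability."
    )
    return float(final)
```

Even this toy version makes dozens of model calls per question; the engineering problem is organizing hundreds of them productively.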

October 22, 2025 tweet by Mantic reading, "The Market Pulse competition is for finance predictions: company earnings, treasury yields etc. We competed in Q3 and finished: 19th of 122 entrants; highest ever AI; 2.3x the score of the next best AI. 🎯 Our best prediction was Nvidia forward guidance on margins (bullish)."

What kind of progress have you seen? I know you did well in the Metaculus Cup. How is it performing so far?

TS: Yeah, so we've made a lot of progress in the last few months, I'd say. From when we started the company, we were competing in the AI-only tournaments on Metaculus, and we were doing quite well in those. We won the top prize money in Q1, but then we didn't do as well as we wanted in Q2. That spurred a lot of hard work, and I think we improved the performance a lot starting from around July this year. Around that time, we started to enter the human tournaments: the Metaculus Cup, which has over 500 people in it, runs three times a year, and seems like one of the flagship tournaments on Metaculus; and the Market Pulse competition, which is about finance questions, also on Metaculus and also primarily for humans. We did quite well in both. In the Metaculus Cup, to paint a picture here: the last time the equivalent competition ran, in Q2, there was a question about how well the best AI system would do relative to the top five humans. Ultimately, that resolved to, I think, about a third of the score. Then there was the same question again this time around, the summer we participated in, and the community prediction started off similar: it thought the best AI system would get about a third of the score of the best humans, a little bit more. Without Mantic in the competition, that would actually have been a little overoptimistic; I think the second-best AI, a Gemini-powered system, got about 32%. For us, it was 83% of the score. So we actually came quite close to the score of the top five humans in the competition. You can think of it as a kind of break from the trend, a step up in performance that wasn't really expected, even by us, really....

Do you have a prediction? When are you going to be able to beat me?

TS: Well, I mean, are you participating in any of the tournaments we are? If you want the answer to that, I'd encourage you to take part in the competitions we're in. But we've already beaten a few pro forecasters in some of these competitions. If the prediction is about when we're going to become number one, that's quite hard to say, because there's obviously quite a bit of noise in exactly who becomes number one across the different tournaments. But in a few months' time it's my belief that it will no longer be clear that systems like Mantic are below the performance of the human pros....


Thank you for listening to Talking About the Future! Related episodes include “Michael Story on Making Useful Forecasts” and “Josh Rosenberg on Forecasting Research.” If you enjoyed this post I’d be grateful if you shared it with others or liked it by clicking the heart button below. And please subscribe if you haven’t already!

