I. The Annual Forecasting Contest
…is one of my favorite parts of this blog. I get a spreadsheet with what are basically takes - “Russia is totally going to win the war this year”, “There’s no way Bitcoin can possibly go down”. Then I do some basic math to it, and I get better takes. There are ways to look at a list of 3300 people’s takes and do math and get a take reliably better than all but a handful of them.
Why is this interesting, when a handful of people still beat the math? Because we want something that can be applied prospectively and reliably. If John Smith from Townsville was the highest scoring participant, it matters a lot whether he’s a genius who can see the future, or if he just got lucky. Part of the goal of this contest was to figure that out. To figure out if the most reliable way to determine the future was to trust one identifiable guy, to trust some mathematical aggregation across guys, or something else.
Here’s how it goes: in January 2023, I asked people to predict fifty questions about the upcoming year, like “Will Joe Biden be the leading candidate in the Democratic primary?” in the form of a probability (eg “90% chance”). About 3300 of you kindly took me up on that (“Blind Mode”).
Then I released the list of 3300 x 50 guesses, and asked people to analyze them with the aggregation algorithm of their choice to produce what they thought was the best possible list. 460 of you took me up on that (“Full Mode”).
Then I waited until 2024 and sent everything to Eric Neyman, who’s better at math than I am. He used the Metaculus scoring function to assess everyone’s accuracy. Thanks to Eric (and to Sam Marks, who helped last time around) for taking care of this.
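For readers curious what a scoring function does here, below is a toy log-score sketch in Python. To be clear, this is an illustration for intuition only, not the actual Metaculus function Eric used (and the contest normalized scores so that the median participant comes out at exactly 0; this toy version scores against a 50% baseline instead). The key property is the same: confident right answers earn points, and confident wrong answers lose a lot more.

```python
import math

def log_score(prob: float, happened: bool) -> float:
    """Toy log score against a 50% baseline (not the contest's exact function).
    Positive if your forecast beat a coin flip, negative if it did worse;
    confident wrong answers are punished much harder than timid ones."""
    p = prob if happened else 1.0 - prob
    return math.log2(p) + 1.0   # log2(p) - log2(0.5); a 50% forecast scores exactly 0

def total_score(probs, outcomes):
    """A contestant's total is just the sum over all fifty questions."""
    return sum(log_score(p, o) for p, o in zip(probs, outcomes))

# A 90% forecast gains ~0.85 if the event happens, loses ~2.32 if it doesn't.
print(log_score(0.90, True), log_score(0.90, False))
```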
II. And The Winners Are . . .
For Blind Mode - where you had to rely on your wits alone and couldn’t spend more than five minutes per question - the winners are:
Small Singapore gave me no information except this pseudonym and won’t answer any emails. I don’t even know how to give them their prize money. Please email me at scott@slatestarcodex.com if this is you.
Ian, again, gave no information except a name and email address. If your name is Ian and you think you might have won this contest, please check your spam folder.
Vaclav Rozhon is a PhD student in theoretical computer science, and creates algorithm videos on the YouTube channel Polylog. He is often around Zurich or Prague and says he is happy to meet other math nerds or young parents.
Adam Unikowsky studied physics and EECS as an undergrad, then became a lawyer specializing in appellate & Supreme Court litigation. He has a Substack specializing in legal issues. He adds: “I haven't really done any forecasting before, I just follow the news.”
Kiran Saini is a trainee surgeon in Oxford, and has a forthcoming book about core surgical training. He runs an impact-focused charity called OxPal that helps train doctors in Palestine. He says “I have no forecasting experience, but have long been interested in forecasting theory.”
And there was also Full Mode, where you could read everyone else’s predictions first, check prediction markets, apply whatever algorithms you wanted, and take as long as you needed. While the Blind Mode winners were amateurs or completely unidentifiable, the Full Mode winners were mostly long-time forecasting veterans.
Douglas Campbell is an economics professor, former member of President Obama’s Council of Economic Advisers, and analyst for the Democratic National Committee. He currently runs Insight Prediction, a cryptocurrency-based prediction market. Despite owning a prediction market, he gained no advantage from it: our questions didn’t overlap with his. He still got the single highest score of anyone in this tournament.
Wilson Chan has an academic background in political science, but works as a software engineer at Walmart. He says he’s particularly interested in international relations and public policy and that might have helped with his predictions. He enjoys tennis, jazz music, and Rutgers football, and can be reached on twitter at @wc1766.
Leonard B. lives in Oregon, and works in real estate development and asset management. He started forecasting during the pandemic, has qualified as a "superforecaster" since 2022, and has recently been doing some work at the Swift Centre For Applied Forecasting. He's "lbiii" on various forecasting platforms (especially Metaculus) and says "I like to hear about cool projects to get involved in, and am especially keen to connect with folks who are working to make forecasting more visible and decision-relevant to policymakers - reach out to possiblylenny@gmail.com"
Ezra Karger is an economist at the Federal Reserve Bank of Chicago, and the research director at the Forecasting Research Institute.
Andrey S is a psychologist in Israel with a background in computer science. He started forecasting on Metaculus a few years ago, and describes himself as "always interested in learning and expanding my point of view".
Max Langenkamp works on hardware and policy for biosecurity at SecureDNA. He says he’s been keeping track of his forecasts on private questions for several years but “isn’t motivated” by most prediction market questions. He blogs about “meaning and enactivism” at Unruly Sun.
Here are some other scores I found interesting:
Adam, a software product manager from the US, got 8th place in Blind Mode. He wins the prize for “best score of anyone who answered all fifty questions”.
Eric Neyman handled the scoring for me this year. He got 48th place in Blind Mode, putting him in the 98.5th percentile. Suspicious!
Peter Wildeford, a superforecaster who got 20th place last year, got 12th place in Full Mode and 65th in Blind Mode this year, putting him in the 98th-99th percentile.
Metacelsus, a popular ACX commenter and author of the blog De Novo, got 20th place in Full Mode and 105th in Blind Mode, putting him in the 95th-97th percentile.
I got 400th place in Blind Mode, putting me in the 88th percentile.
III. What Did We Learn?
Okay, fine, but you don’t know most of these people. The really interesting question is how individuals like these compare to prediction markets, experts, and the wisdom of crowds. How much of their success is luck vs. skill? And if we have data like this next time, how do we best predict the future?
Here’s what I’ve got:
Going over this bit by bit:
Median participant: Score of 0 and 50th percentile by definition. Is this the median participant in Blind Mode (due date in January, couldn’t check others’ guesses, < 5 minutes of research per question) or Full Mode (due date in February, could check others’ guesses, unlimited research)? It doesn’t matter! For some reason, these two contests had almost exactly the same median score. I’m unprincipledly lumping them together for the rest of the discussion - when I cite prediction market numbers, they’ll be from somewhere in the middle of their January and February scores.
50% on everything: If you literally guessed 50% on every question, you would have done very slightly better than our average participant - that is, someone who got the mean score on each question.
Median superforecaster: 56 people who had been previously declared “superforecasters” (usually by doing very well in a previous tournament) were kind enough to participate. These people did better than average, but not by too much - the median superforecaster scored in the 70th percentile of all participants.
Median 2022 winner: Did our winners win by luck or skill? One way of assessing this is to see how the 2022 winners did this year. Of the 15 top-scoring 2022 participants, 5 of them foolishly decided that instead of resting on their laurels they would try again this year. On average, they scored in the 88th percentile - ie 395th place. I conclude that overall, most winners are around the 90th percentile of skill - but it’s luck that brings them the rest of the way to the leaderboard.
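A quick way to see why this pattern points to luck rather than skill (a toy simulation of my own, not the analysis Eric ran): give every simulated forecaster a fixed skill level plus year-to-year noise, take one year’s top 15, and check where they land the next year. With luck roughly comparable in size to skill differences, they end up well above average - mid-to-high-80s percentile in a typical run - but nowhere near the top again.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000                            # simulated forecasters
skill = rng.normal(0, 1, n)         # fixed "true skill" (assumed spread)
luck = 1.0                          # assumed year-to-year noise, same size as skill spread

year1 = skill + rng.normal(0, luck, n)
year2 = skill + rng.normal(0, luck, n)

top15 = np.argsort(year1)[-15:]                     # year-one leaderboard
pct = (year2[top15, None] > year2).mean(axis=1)     # their year-two percentiles
print(f"Average year-two percentile of year-one top 15: {pct.mean():.0%}")
```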
Manifold Markets: Manifold, a popular play-money prediction market site, kindly agreed to open markets on our fifty questions so we could compare them to participants. The markets got between 80 and 1500 participants, averaging around 150. Their forecast, had it been a contestant, would have placed in the 89th percentile. This would be good for an individual, but it’s surprisingly bad for an aggregation method - in fact, it’s worse than taking the median of a randomly selected group of 150 participants! The market mechanism seems to be subtracting value! Someone might want to double-check this (one way to do so is sketched after the next item).
Participant aggregate: This is the “wisdom of crowds” one. If you average the guess of every participant (eg if someone says 80% chance Biden leads, and another says 90% chance, then you go with 85%), you usually do better than the vast majority of individuals. In this case, the aggregate was 95th percentile, beating out superforecasters and Manifold.
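Here is a minimal sketch of that crowd-averaging step, plus the random-subset comparison promised in the Manifold item above. The `forecasts`, `outcomes`, and `manifold` arrays are hypothetical stand-ins, and the scoring function is a vectorized version of the toy log score from earlier - not the contest’s actual data format or Eric’s code.

```python
import numpy as np

def score(probs, outcomes):
    """Toy log score vs. a 50% baseline, summed over questions."""
    p = np.where(outcomes == 1, probs, 1.0 - probs)
    return (np.log2(p) + 1.0).sum()

# forecasts: (n_participants, n_questions) array of probabilities in (0, 1)
# outcomes:  (n_questions,) array of 0/1 resolutions
# manifold:  (n_questions,) array of Manifold's probabilities on the same questions
def crowd_vs_market(forecasts, outcomes, manifold, group_size=150, n_trials=1000, seed=0):
    rng = np.random.default_rng(seed)

    # Wisdom of crowds: average everyone's probability on each question,
    # then score that single aggregate forecast like any other contestant.
    crowd_score = score(forecasts.mean(axis=0), outcomes)
    market_score = score(manifold, outcomes)

    # Double-check on the Manifold result: score the per-question median of many
    # randomly drawn groups of ~150 participants and see how often it beats the market.
    group_scores = [
        score(np.median(forecasts[rng.choice(len(forecasts), group_size, replace=False)],
                        axis=0), outcomes)
        for _ in range(n_trials)
    ]
    beat = np.mean(np.array(group_scores) > market_score)
    print(f"crowd average: {crowd_score:.2f}  Manifold: {market_score:.2f}  "
          f"random {group_size}-person medians beating Manifold: {beat:.0%}")
```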
Superforecaster aggregate: If you just average the guesses of superforecasters, you do even better. This isn’t trivial - superforecasters are a smaller crowd than the set of all participants - but in this case the higher-quality data trumped the larger crowd size.
Samotsvety: Samotsvety is a well-known forecasting team that usually wins these kinds of things. You can read more about them here. They scored 98th percentile, better than the aggregate of all other superforecasters. There are a few asterisks on this result: first, it wasn’t exactly a team effort - one of their forecasters did the work and “ran it by” everyone else without getting any objections. Second, for complicated legal reasons that they explained and which satisfied me, they couldn’t enter the contest proper and had to send me their guesses later, so I had to take it on trust that they were made in January along with everyone else’s.
Metaculus: A “forecasting engine” that serves the same role as a prediction market but operates slightly differently. They ask everyone to predict each question, then aggregate the answers, weighting forecasters by past performance via a proprietary algorithm. Metaculus scored in the 99.5th percentile of our contest and was the top performer other than random individuals who might have just gotten lucky.
Ezra Karger: …is a possible exception to the above claim. He’s a non-random individual - director of the Forecasting Research Institute - and has previously placed very highly in contests like these (he placed 7th in last year’s ACX contest). Based on this, I suspect his performance was mostly repeatable skill and not just luck. He outscored all but four of our 4,215 Blind Mode and Full Mode participants, which puts him above the 99.9th percentile. Since he entered Full Mode, he was allowed to do complicated technical things, and he described his method as:
I began by collecting data from Manifold Markets for these questions. I then compared those forecasts to the forecasts of superforecasters in the blind data, subset to those who had given forecasts on the S&P500 and Bitcoin questions that were reasonably consistent with the efficiency of markets; I subset to those who forecasted between 30% and 80% for the probability that the S&P500 and Bitcoin would increase during 2023, which were the only reasonable predictions by the time blind mode ended in mid-January. I then used my own judgment to tweak forecasts where I strongly disagreed with the prediction markets and the superforecasters (for example, I was more than 15 percentage points away from the average of Manifold Markets and the efficient-market-believing superforecasters on questions 17, 19, 21, 30, 34, and 50). I paid especially close attention to questions where late-breaking news made the superforecasters' forecasts less relevant (and I downweighted their forecasts on those questions accordingly).
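To make those steps concrete, here is a rough sketch of that kind of pipeline. The thresholds (30-80% on the S&P 500 and Bitcoin questions, the 15-percentage-point disagreement flag) come from Ezra’s description, but the DataFrame layout and column names are hypothetical, and the final judgment-call step can’t really be automated.

```python
import pandas as pd

# blind:    hypothetical DataFrame of Blind Mode superforecaster entries, one row per
#           forecaster, one column per question, including 'q_sp500_up' and 'q_btc_up'
# manifold: hypothetical Series of Manifold probabilities indexed by question column name
def ezra_style_forecast(blind: pd.DataFrame, manifold: pd.Series, question_cols: list) -> pd.Series:
    # Keep only superforecasters whose market-sensitive answers were roughly
    # consistent with market efficiency: 30-80% on the S&P 500 and Bitcoin questions.
    consistent = blind[
        blind["q_sp500_up"].between(0.30, 0.80) & blind["q_btc_up"].between(0.30, 0.80)
    ]

    # Baseline: average of the market and the filtered superforecasters.
    base = (consistent[question_cols].mean() + manifold[question_cols]) / 2

    # Flag questions where your own judgment sits more than 15 percentage points
    # from that baseline - per Ezra, those are the ones to revisit by hand,
    # especially where late-breaking news made the January forecasts stale.
    my_view = base.copy()                       # placeholder for manual judgments
    flagged = (my_view - base).abs() > 0.15
    print("Questions to revisit by hand:", list(base.index[flagged]))
    return base.where(~flagged, my_view)
```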
Small Singapore won Blind Mode. As I said before, they’re a total mystery to me and I don’t know if they won by luck or not.
Douglas Campbell runs a prediction market, which I guess also makes him non-random, but I hadn’t previously heard of him being an exceptional forecaster himself, so I don’t know how much to weight this. He describes his method as:
I mostly didn't research anything too much. But, I did consult Metaculus for a few of these.
(I kind of want to make a Virgin vs. Chad meme comparing his answer with Ezra’s, but I’ll restrain myself out of respect for the dignity of our participants.)
IV. Out Of Distribution Events
Another fun thing we can do with these data is see which 2023 events were most vs. least surprising:
The first colored column represents average score on each question. A more negative number means that more people got the question wrong (gave a low probability for something that happened, or a high probability for something that didn’t).
The second colored column represents correlation between each question and overall score. A question where good forecasters beat bad forecasters is positive; a question where bad forecasters beat good forecasters is negative.
Why would bad forecasters ever beat good forecasters? This means an event was unlikely, but happened anyway. For example, if people were asked to predict if some random person would win the lottery, smarter people would be more likely to predict no. If by coincidence he did win the lottery, then smarter people would have lower scores than dumber people.
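For anyone who wants to recompute these two columns from their own copy of the data, here is a sketch using the same hypothetical `forecasts`/`outcomes` layout and toy log score as the earlier sketches; the contest’s exact scoring and correlation choices may differ in detail.

```python
import numpy as np

def question_diagnostics(forecasts, outcomes):
    """forecasts: (n_participants, n_questions) probabilities; outcomes: 0/1 resolutions."""
    p = np.where(outcomes == 1, forecasts, 1.0 - forecasts)
    q_scores = np.log2(p) + 1.0               # toy per-participant, per-question log scores

    # Column 1: average score on each question. More negative = more people got it wrong.
    avg_score = q_scores.mean(axis=0)

    # Column 2: correlation, across participants, between score on this question and
    # overall score. Negative = "bad" forecasters beat "good" ones, i.e. an unlikely
    # event happened anyway (the lottery-winner case described above).
    total = q_scores.sum(axis=1)
    corr = np.array([np.corrcoef(q_scores[:, j], total)[0, 1]
                     for j in range(q_scores.shape[1])])
    return avg_score, corr
```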
I’m torn on which of these better matches our intuitive conception of “surprising event”, but both methods suggest forecasters were very surprised that Bitcoin ended the year over $30,000 (it started the year around $16,500 and ended at $43,000). Bitcoin is now up to $68,000, which I imagine would have been even more surprising to these people!
(weirdly, good forecasters were more likely than bad forecasters to believe Bitcoin would go up at all, but less likely to believe it would go up as much as it did)
Other resolutions that took people by surprise: that Starship didn't reach orbit, that inflation dropped so fast, and that Joe Biden's approval rating stayed as low as it did.
The least surprising thing about 2023 was that nobody used a nuclear weapon.
V. Takeaways And Thanks
My main takeaway is that Metaculus beats prediction markets, superforecasters, wisdom of crowds, and (probably, most of the time) Samotsvety. Based on the performance of last year’s winners, most people who outperform Metaculus do so by luck and will regress to the mean next year. This contest leaves open the possibility that a small number of people (maybe including Ezra Karger) might be able to consistently get super-Metaculus performance - it just takes more than one contest to identify them.
This doesn’t mean that most prediction markets and superforecasters are useless. It just means that their benefit comes from being faster and easier to invoke than Metaculus, not from being more accurate.
Metaculus is hosting a 2024 version of this contest, which, due to my delay in getting this post up, is already closed. I’ll let you know how it goes. And hopefully I’ll have enough time next year to be more involved in the 2025 version.
Thanks to everyone who participated in this contest. Extra thanks to Christian Williams from Metaculus and the Manifold team for getting their respective sites involved, to Jonathan Mann and Samotsvety for willingly submitting to testing, and to Eric Neyman for calculating the scores.
If you included an ID key in your entry, you can find your score here: