175 Comments

Congratulations to the winners 😊


How do we see our results if we didn't do an ID key? Can't remember if I did that...


I also did some in-depth analysis of the Blind Mode data and turned the results into an interactive app at https://jbreffle.github.io/acx-app

The associated code is available at https://github.com/jbreffle/acx-prediction-contest

I still have a few final analyses left to add to the app, but all of the results can be viewed in the notebooks in the repo.


I was pretty convinced I didn't have an ID key, but I'm 90% sure now I was "correcthorsebatterystaple" like the XKCD comic. If this isn't me and is actually someone else, let me know. If it is me, though, glad I finished *slightly* positive.


Maybe I'm missing something, but shouldn't Ezra Karger outperforming Metaculus take into account that he has access to Metaculus's current state? Much like human + engine outperformed engine alone in chess for a long time, it's reasonable to expect a skilled individual to improve on the market in predictable rare cases.


The completely blind average of all guesses placing at the 95th percentile is astounding to me, and is the most convincing argument I've ever seen for relying on the "wisdom of the crowds" for predictions.


Dang, I don't think I gave an ID key, just my email. I have the original form saved as a screenshot too, but not my answers. Curious how I did.


Any chance of releasing an .xls to calculate scores for those of us who have our answers saved (e.g., via the all-submissions file) but didn't include / don't remember our ID keys?


Pursuant to my post here: https://old.reddit.com/r/slatestarcodex/comments/1b6akix/what_are_your_favourite_ways_to_combine/

Is there an opportunity to stake someone/a group of forecasters money to enable them to bet in prediction markets? (I'm willing to receive a lower risk adjusted return in exchange for helping strengthen prediction markets)

Is anyone else doing this? If so, I'd be curious to know if there are any written details online.


> This means an event was unlikely, but happened anyway.

I understand that, in the long term, "black swan" events like this will average out. If everyone is predicting "business as usual", and you are predicting "asteroid strike" year after year, then if an asteroid strike actually happens you will appear to be a prophet just by luck.

On the other hand, though, "business as usual" is not interesting. Anyone can predict that simply by assuming that tomorrow will be exactly like today (adjusted for inflation); it doesn't take a superforecaster or an AI, just copy/paste. The whole point of prediction markets etc. is that they should be able to accurately predict *unusual* events, isn't it?


Instead of "50% on everything", I'd use the binomial distribution for probability density, or something similar, and sample randomly from that.
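A minimal sketch of what such a baseline could look like, assuming (purely for illustration) a Beta distribution over probabilities and a plain log score; the distribution choice and parameters here are made up, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_baseline(n_questions, a=2.0, b=2.0):
    """Draw one probability per question from a Beta(a, b) distribution
    instead of assigning a flat 0.5 to everything."""
    return rng.beta(a, b, size=n_questions)

def mean_log_score(probs, outcomes):
    """Average log score: log(p) when the event happened, log(1 - p) when not."""
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

outcomes = rng.integers(0, 2, size=50)  # stand-in resolutions
print("flat 50%:", round(mean_log_score(np.full(50, 0.5), outcomes), 3))
print("sampled: ", round(mean_log_score(random_baseline(50), outcomes), 3))
```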


I successfully managed to figure out where I kept my ID key, but it doesn't appear on the list :(


Any chance you can include a ranking as part of the ID keys as well? I seem to have done pretty well and want to stroke my ego further...


So, why is it that the median participant scored worse than random?

Is it mostly explained by the fact that the world is full of people trying to convince people of untrue things? Is it that society collectively constructs a lot of illusions which aren't necessarily intentional? Is it the result of some common property of the particular questions that were asked? Some counterintuitive statistical thing that would show up even in a world where everyone was putting a lot of effort into avoiding bias and nobody was deceptive?


To help find Small Singapore, I forwarded the screenshot to our local @LessWrong_Singapore group on Telegram.


Congrats to those who won! I'm just glad I did a bit better than the median and better than just guessing 50% on everything.


"Why would bad forecasters ever beat good forecasters? This means an event was unlikely, but happened anyway."

It can (I think) also mean that the likely outcome happened, but good forecasters were less confident in it than bad forecasters. That seems to have happened to the questions about whether Sevastopol, Luhansk (Russian controlled), or Zaporizhzhia (Ukrainian controlled) would change hands—none of them did, and bad forecasters beat good forecasters on each.


One of the questions asks about a deepfake attempt that made front page news, and it resolved yes. Could someone link the story?


You could release a redacted version of Small Singapore's email. Like **anderson*****@gmail.com


You should also ask a fun essay question: What is the most surprising thing that will happen in this coming year?

Granted, there's no way to objectively evaluate the responses, and it would be tedious to read through a zillion wild guesses. But so what? A year later, just let people grade themselves in the comments section by evaluating their predictions.


There are lots of email addresses in the ID keys file; not sure if those folks realized what they were doing. @Scott, feel free to delete this comment if you were hoping for security through obscurity.


I used my email fully expecting to be able to see my score... and so did the 4 friends I convinced to do this.

Pretty please, can we get some version that includes the emails? You could hash the emails with high collision (I'll email a proposed solution for review).
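One way that could look, as a minimal sketch: a salted SHA-256 truncated to a few hex characters, so many addresses collide and a published code can't be reversed, while anyone who knows the (published) salt can recompute their own. The salt and truncation length here are made up:

```python
import hashlib

def short_hash(email: str, salt: str = "acx-2023", digits: int = 4) -> str:
    """Truncated, salted SHA-256: many emails map to the same short code,
    so the code can't be reversed, but owners can recompute theirs."""
    h = hashlib.sha256((salt + email.strip().lower()).encode()).hexdigest()
    return h[:digits]

print(short_hash("someone@example.com"))  # 4 hex digits -> only 65,536 buckets
```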


I know a person who, against all the apparent odds and all the pundits, made a lot of money on the initial Trump win. Many other issues, but his ASD certainly helps him see through the groupthink. I wonder if this is a common thread. Algorithms can't predict that.


> weirdly, good forecasters were more likely than bad forecasters to believe Bitcoin would go up at all, but less likely to believe it would go up as much as it did

For the 2+ Bitcoin questions, I'm trying to visualize a possible spread of forecasters on a single scale, from bearish to bullish, based only on this sentence. Is it that bad forecasters tended to be either bearish or ragingly bullish, while good forecasters tended to be somewhat bullish?

If so, a simple story occurs to me. Good forecasters thought that speculators would return to the market before too long, but underestimated how many. Relative to them, bad forecasters had more dramatic expectations for emotive things, but were split between "it was a bubble that will never re-inflate" and "of course we're still heading for the moon, fairly soon".

For the record, I don't think stories like that have much value if we can't come up with them in advance, but at least they might satisfy curiosity.

I give a 70% chance that I'm among the bad forecasters overall, but I don't recall how I answered for Bitcoin, or for anything, when I did blind mode.


Has anyone performed a factor analysis on this data? In general for forecasting tournaments, it might be a good idea to have coverage of a variety of factors corresponding to different world domains to forecast. It could help disentangle luck vs general forecasting skill vs specific domain knowledge.

I suppose I'm the psychometrics rat so maybe it's my job to do the factor analysis... 😅
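For anyone who wants to try it, a minimal sketch with scikit-learn, assuming a forecaster-by-question matrix of per-question scores (the shapes and data here are stand-ins):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical matrix: one row per forecaster, one column per question,
# entries are per-question scores (random stand-in data here).
rng = np.random.default_rng(0)
scores = rng.normal(size=(3000, 50))

fa = FactorAnalysis(n_components=5, random_state=0)
fa.fit(scores)

# components_[k, j]: how strongly question j loads on latent factor k.
# Questions clustering on a factor might share a domain (crypto, war, AI, ...).
print(fa.components_.shape)  # (5, 50)
```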


This is an awful question (more math, probably more chance of random results), but what if the skill of forecasting is more specialized? Some people might be good at particular regions or subjects, but not great forecasters in general.


Taking at face value the result that the market mechanism subtracted value (compared with a straightforward average of participants), I have a hypothesis as to why this might be, which is that markets can be surprisingly concentrated, so that a lot of weight is being given to the view of one individual, undermining the benefit of aggregation.

At one point a single user owned a majority of the Manifold Yes shares on question 14. Essentially, a single person who was rich in mana became convinced (wrongly, as it turned out) that the question must resolve to "yes", and pushed the prediction to 90% or so. As I've noted now several times, a single user currently holds 28% of the Trump Yes shares on Polymarket: since late January, they've returned to the market whenever the price has dipped to 53c (which reflects a higher probability than assigned to this outcome by any other prediction site, although not by much). Looking at other candidates, many of them also show 1 or 2 users owning a similarly high proportion of the shares.

These people are no doubt confident in their views, but even if we think that very confident people are in general more likely than random to be right, they are still single individuals, and the whole point of the "wisdom of crowds" is that the aggregate view usually beats even the wisest individual.

Part of the attraction of markets is that they exclude blowhards: people will confidently assert that their favourite sports team or politician is all but certain to win, but they typically won't bet on that belief, because at some level they know it's false. But, once you have filtered to people who are participating in a prediction market, you have already excluded those people. So the question is, should we expect weighting (in effect) by the size of bets to give a better result than taking a raw average?

At first glance, one certainly might say that a person who is willing to bet £1,000 on an outcome is more confident than a person who is willing to bet £10. But there are many other reasons why somebody might not be willing to bet £1,000. They might not have £1,000 spare, or they might be sufficiently risk averse that they will not tolerate the significant chance of losing the bet, even at favourable odds. This only becomes worse when one considers people who are willing to bet £1,000,000: the vast majority of people would not make a bet of that size at any odds on any proposition. It seems reasonable to guess that the sort of person who is willing to bet £1,000,000 on the outcome of the US presidential election is the same sort of person who is willing to bet £1,000,000 on which horse will win the Grand National, rather than the sort of person who makes extremely accurate predictions.

There is also the problem that it is characteristic of poor prediction agents that they are overconfident: they say an outcome is 99% likely when it is really only 80% likely. That raises the possibility that giving more weight to the views of people who are very confident (as reflected by their bet sizing) is subtracting value.


Can we see the 'surprising' spreadsheet but with a y/n column for whether the events actually did happen, for people who don't follow the news that closely?

I'm surprised that (if I'm reading it right) people are surprised that Ukraine doesn't control Sevastopol yet. IIRC the war was already a stalemate by the start of 2023, and I'd have thought the readers skew slightly right wing and are therefore slightly more likely to overestimate Russia's chances.


"Other resolutions that **book** people by surprise" error


Does anyone know if "macro" public intellectuals (guys like Balaji, Zeihan, Samo Burja) ever participate in stuff like this? If so, how do they do? If not, should we take it as a sign they're not as confident in their proclamations as they claim to be? Or that one year isn't a long enough time horizon for "macro" trends to apply? Or that they don't think it's worth their time?


Scott, on the ID key spreadsheet there are some email addresses used as the code, which people may not want to be visible. There's also one that has a note as part of the key asking you not to share the rest of their key.


Dammit, I lost my ID key. I can find myself based on my answers in the xls for the second round, but not my score in the latest xls. I was quite hopeful I'd have a good score, but apparently I'm overestimating myself :D


Is the scoring system the new Metaculus Peer Score? In last year's contest, the aggregate was only at the 84th percentile; I assume some of the improvement is that there were >6x as many forecasters, but does the new scoring system also have properties that make it harder to outperform an average of forecasts?
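For reference, a rough sketch of a peer-score-style calculation (your log score minus the average log score of everyone else on the same question); Metaculus applies its own scaling constants, which are omitted here, so treat this as an approximation rather than the contest's exact formula:

```python
import numpy as np

def log_score(p, outcome):
    """Log of the probability assigned to whichever outcome happened."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return np.log(p) if outcome else np.log(1 - p)

def peer_score(my_p, others_p, outcome):
    """My log score minus the average log score of the other forecasters."""
    return log_score(my_p, outcome) - np.mean([log_score(p, outcome) for p in others_p])

print(peer_score(0.8, [0.5, 0.6, 0.9, 0.3], outcome=1))
```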


I made a spreadsheet to track the performance of the mean and median aggregates against the Manifold probabilities at the time of the Blind Mode cutoff. I didn't get that the average did better than Manifold like the above chart shows: I got that the mean aggregation was the worst, Manifold was in the middle, and the median was best (but these last two were very similar, with less than a 1:2 likelihood ratio between the median and Manifold).

I would be curious to know if I did something wrong here, or if not, what the discrepancy is between how I processed the data and how the chart above was made.

https://docs.google.com/spreadsheets/d/1SQW-QGcMkwTTnguIHHqkn98ru9M1oBnhtmCrz2EMbw4/edit?usp=sharing
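For comparison, this is roughly how such a check could be framed in code, with stand-in numbers (the real inputs would be the Blind Mode forecasts, Manifold's cutoff probabilities, and the resolutions); discrepancies against the post's chart could come from the exact score function, question weighting, or handling of missing answers:

```python
import numpy as np

def mean_log_score(probs, outcomes):
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

rng = np.random.default_rng(0)
forecasts = rng.uniform(0.05, 0.95, size=(3000, 50))  # forecasters x questions
manifold = rng.uniform(0.05, 0.95, size=50)           # Manifold at the cutoff
outcomes = rng.integers(0, 2, size=50)                 # resolutions

for name, probs in [("mean aggregate", forecasts.mean(axis=0)),
                    ("median aggregate", np.median(forecasts, axis=0)),
                    ("Manifold", manifold)]:
    print(name, round(mean_log_score(probs, outcomes), 3))
```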


I'd like to see how the rankings vary using different scoring methods... e.g. impute a forecast of 0.5 for missing forecasts and then calculate mean of log score on all questions. I expect a fairly high correlation, but some changes in relative rank among the top finishers.
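A minimal sketch of that scoring variant (a plain mean log score with 0.5 imputed for skipped questions; whether this matches the contest's exact function is an assumption):

```python
import numpy as np

def score_with_imputation(probs, outcomes, fill=0.5):
    """Mean log score over all questions, imputing `fill` for any missing
    forecast (NaN), so skipping a question counts the same as answering 50%."""
    probs = np.where(np.isnan(probs), fill, probs)
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

outcomes = np.array([1, 0, 1, 1, 0])
answers = np.array([0.9, 0.2, np.nan, 0.7, np.nan])  # two skipped questions
print(score_with_imputation(answers, outcomes))
```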


I feel like the thing I actually want to compare the best forecasters to is something that I don't know exactly how to mathematically define, and it's something like 'For a data set of this size, if you took the mean response on each question and the variance around each question, and everyone guessed a number from that distribution, how well would the best performer from that process do?'

I.e., intuitively: if we model everyone in the study as having the same level of knowledge plus some random biases that shift their answer a bit on each question, and look at the luckiest responses from that distribution whose random variance happened to match reality the closest, how well do they do compared to the best real-world responders?

Basically, I think comparing the best people to 'here's what you get if you put 50% for everything' or even 'here's the average for the whole sample' doesn't mean very much, because you're comparing a population mean to an outlier. It's possible for the best outlier from a large sample to do very very well just by chance, although that becomes a less likely explanation the more times they do well.

So, to tell whether the top responders had better abilities or just got lucky, you need to ask what score the best outlier from a process like this should produce, if everyone has the same ability plus a plausible amount of random noise.
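A rough sketch of that null-model simulation, with stand-in inputs (in practice the per-question means and spreads would be estimated from the Blind Mode data):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_null_score(question_means, question_sds, outcomes, n_forecasters=3000):
    """Simulate forecasters with identical knowledge plus per-question noise,
    and return the best mean log score that population achieves by luck alone."""
    noise = rng.normal(0.0, question_sds, size=(n_forecasters, len(outcomes)))
    probs = np.clip(question_means + noise, 0.01, 0.99)
    logs = np.where(outcomes == 1, np.log(probs), np.log(1 - probs))
    return logs.mean(axis=1).max()

outcomes = rng.integers(0, 2, size=50)
means = rng.uniform(0.3, 0.7, size=50)   # stand-in crowd means per question
sds = np.full(50, 0.15)                  # stand-in per-question spread
print(best_null_score(means, sds, outcomes))
```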


I don't get why you wouldn't just publish results with pseudonyms, if people want to look themselves up.

Can we get (anonymized?) answers for the non-blind stage?


I'm probably not that Ian, and didn't see anything in my spam folder.

Just on the off chance, though, the general format of my email is 'ian.[last name][two digit numeral]@gmail.com'. Is that at all like the email you have for the mystery Ian?


Was the high negative score on China COVID cases due to blind mode participants not knowing that China had already stopped reporting cases? It seems like a weird question to ask either way.


Thank you for organizing this, Scott! It was a great pleasure to participate — I ended up writing up a reflective post partly in honor of this tournament https://maxlangenkamp.substack.com/p/on-predictions


My guess about Small Singapore is that they're some mover and shaker in a tiny territory that aspires to be like Singapore. Well done, anyway.


Since a bunch of people already asked, I simply looked it up: the Metaculus scoring function punishes bad predictions more harshly than it rewards good predictions. This means it's quite easy to score worse than "all 0.5", and that is a very simple explanation for the median participant performing worse than "random": just a single super-confident wrong prediction can put you far behind, even if you otherwise make a lot of decent predictions.

Btw, if you think this is bad and should be changed: it seems to be a result of making the scoring "proper", that is, making sure that reporting the true likelihood gets the best expected score. Without this property, the best score on an unlikely event would come from forecasting 0%, even if the true likelihood is higher than 0%, which is even more degenerate. So it's probably just a quirk of these scoring systems we'll have to live with.
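To make the asymmetry concrete, a small sketch using a plain log score (the Metaculus function is a scaled variant, but the shape is the same): relative to answering 0.5, a confident correct prediction gains much less than a confident wrong one loses.

```python
import math

def log_score(p, outcome):
    # Log of the probability assigned to whichever outcome happened.
    return math.log(p) if outcome else math.log(1 - p)

baseline = log_score(0.5, 1)           # -0.693, same for either outcome
gain = log_score(0.95, 1) - baseline   # confident and right: about +0.64
loss = log_score(0.95, 0) - baseline   # confident and wrong: about -2.30
print(round(gain, 2), round(loss, 2))
```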


What do the numbers [0 .. 0.4] labeling the X-axis on the first graph mean?
