Congratulations to the winners 😊
How do we see our results if we didn't do an ID key? Can't remember if I did that...
I second this. I suppose I used something with my name in it, but I don't see it. I also don't see any email confirmations, but maybe I just don't know what to search for. Scott can presumably find me easily, since there can't be that many Danes in their 30s in this sample.
Alternatively, he could release a spreadsheet that has all the demographic and survey data next to each score ... I could find myself easily that way as well
I'm in the same situation as the only Tatar in the dataset :D
What would be cool is emailing everyone their results ... but that would involve a bit of scripting work, which is a lot to ask given the effort already put in
I, and other programmers who really want our scores, will probably happily oblige.
Send me an email at scott@slatestarcodex.com if you actually want to do this.
Thanks for doing (all) this
please, this would be amazing.
Ok! I have a working solution. Waiting on Scott's approval so I can say "Hold onto your butts" as I press enter and send 3700 emails.
Hi! Any news on Scott's approval?
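For anyone curious what the "bit of scripting work" amounts to, here is a minimal sketch of such a mailer. It assumes a CSV of email/score/percentile rows and ordinary SMTP credentials; the host, credentials, file name, and column names are placeholders, not the actual solution.

```python
# Sketch of a score mailer. Assumes a CSV with columns "email", "score",
# "percentile" and standard SMTP access. All names below are illustrative
# placeholders, not the real implementation.
import csv
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.com"  # placeholder
SMTP_USER = "bot@example.com"   # placeholder
SMTP_PASS = "app-password"      # placeholder

def send_scores(csv_path: str) -> None:
    with smtplib.SMTP_SSL(SMTP_HOST) as server:
        server.login(SMTP_USER, SMTP_PASS)
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                msg = EmailMessage()
                msg["From"] = SMTP_USER
                msg["To"] = row["email"]
                msg["Subject"] = "Your ACX 2023 prediction contest score"
                msg.set_content(
                    f"Score: {row['score']}\nPercentile: {row['percentile']}"
                )
                server.send_message(msg)

if __name__ == "__main__":
    send_scores("blind_mode_scores.csv")  # hypothetical file name
```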
Should have done an ID key (but you were 193rd in Blind and 23rd in Full).
Ha, meant to thank you here, but just general thanks everywhere I guess
I created a potential solution here; I'm putting it up for discussion so we can figure out whether it's a bad idea.
https://www.lesswrong.com/posts/znwEWBwHkpMfAKKCB/making-2023-acx-prediction-results-public
Isn't it safer and better to just e-mail everybody their score (and, while we're at it, their answers as well)?
Yes! But it seems substantially harder and out of my area. And IMO my solution is about as safe as the applet from previous years.
If the email option is the only acceptable solution I can probably figure it out within the next week.
I also did some in-depth analysis of the Blind Mode data and turned the results into an interactive app at https://jbreffle.github.io/acx-app
The associated code is available at https://github.com/jbreffle/acx-prediction-contest
I still have a few final analyses left to add to the app, but all of the results can be viewed in the notebooks in the repo.
this stuff is great! do you have a spreadsheet with all the individual entries ~graded~ by chance?
I just added a table of final scores (with the associated predictions) that can be downloaded as a csv on the "Prediction markets" page of the app. However, I was using the Brier score for my analysis, so when I have some more time I'll need to update it to use the Metaculus scoring function that was actually used.
Something looks fishy. The Rank, Percentile and Brier score are all sorted with the best predictors first; but the @1.WillVladimirPutinbePresidentofRussia column is also sorted in descending order. You probably accidentally sorted that column independently of the rest of the table, and so the answers don't correspond to the ranks in the first three columns. At that point it's also worth double-checking that the rest of the columns are actually the predictions associated with the scores.
The first row has a whole lot of 99% and 1% predictions, not what I'd expect from the best predictor. Is that evidence that variance+luck wins, or are they not actually the answers of the winner?
Oops, you're right. I made a silly mistake when I made that table this morning. Will fix later.
I guess it is fixed now?
Yep! But I still need to implement the Metaculus scoring function.
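In the meantime, for anyone checking their row while the table still uses Brier: it's just the mean squared distance between your probabilities and the 0/1 outcomes, lower is better. A minimal sketch with illustrative numbers:

```python
# Minimal sketch of the Brier score: mean squared error between predictions
# in [0, 1] and resolved outcomes in {0, 1}. Lower is better; 50% on every
# question scores 0.25. Numbers below are illustrative.
def brier(predictions: list[float], outcomes: list[int]) -> float:
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)

print(brier([0.5, 0.5], [1, 0]))    # 0.25
print(brier([0.8, 0.1], [1, 0]))    # 0.025
print(brier([0.99, 0.99], [1, 0]))  # ~0.49, one confident miss dominates
```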
Interesting that there seems to be a bimodal distro in a lot of the covid questions - not just with casuals, but even with the superforecasters!
I also noticed what appears to be an interesting trend: politically conservative people have the strongest of all the correlations with getting predictions right. The very highest absolute R value is on the DonaldTrump question, and it is positive. I'm not sure which direction the question goes, but if I'm reading the data right, the vast majority of respondents voted 1 or 2, so I'm guessing that's the side that _dislikes_ Trump, as I don't think ACX readers are big Trump fans. Similar reasoning goes for the next two highest positive correlations, politicalaffiliation and globalwarming. Are the conservatives getting something right that the rest of us are in the dark about? Or maybe 2023's questions just happened to be relatively easy for conservatives, and it might be different in other years? Either way, the highest R value is still less than .22, so it doesn't mean an enormous amount.
Superforecasters seek out other views, and if ACX readers lean left, a conservative ACX reader would be doing just that.
It probably helps that Scott is a lot nicer to conservatives than most liberals are, especially these days.
Interestingly, there are positive correlations with PTSD, use of LSD, BMI, and alcoholism (maybe the Cassandra archetype had something to it?), and negative correlations with SAT scores (math more than verbal), immigration (is that supporting immigration or being an immigrant?), and forecasting experience...and trusting the mainstream media, which I guess makes sense. :)
Sorry for the confusion, but a positive correlation actually means worse prediction performance--see my other comment in this thread.
So it's the standard 'liberals and smart and healthy people do better'. Lame!
Tentative explanation for the political correlation: people who read ACX or related sites for the rationalism did better than people who read it for the politics; the former are overwhelmingly liberal or left-wing like people in most intellectual pursuits, while the latter are more mixed.
After I wasted a bunch of time trying to explain correlations that turned out to have a minus sign attached to the way I was reading them (in short, I made the pessimal interpretation), I'm not up for a second try.
sometimes, the truth is difficult, painful, or horrifying
sometimes it's just boring
Indeed, you would expect it to be boring much more often than not, since deviating from expectations is what makes it not boring.
I used the Brier score, where a lower score is better (Scott hadn't announced the scoring method that would be used, and it turns out they used the Metaculus scoring function instead). So positive correlations actually mean worse prediction performance. I realize now that I only describe that on the "Simulating outcomes" page of the app--I'll add a brief explanation to the other pages now.
Oops! My bad. I should have read the rest of the sections more carefully, I think. But that explains a lot lol
I'm guessing that the big pro-conservative surprises were Trump getting indicted and Ukraine's offensive failing.
This is really interesting, thanks!
Love this. In the Supervised Aggregation tab, what are the axes on the graphs (the ones that show which features correlate with best Brier)? IIRC the questions from the original survey have a lot of different endpoints so it may not be possible to summarize them concisely here.
I should try to add those labels back in to the x-axis where it is feasible.
This function https://github.com/jbreffle/acx-prediction-contest/blob/f192fa2e096617c2ea6d18fa380ad7264cdfe8e9/src/process.py#L64C5-L64C19 is where I define the feature values. Many of the survey questions were just ratings on a scale of 1 to 5. For defining medical diagnoses check the variable diagnosis_dict. I also did some feature engineering for some of the more complex survey questions.
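As a toy illustration only, since the real mapping is in the linked process.py and the column names and codings here are hypothetical, the kind of transformation being described looks roughly like this:

```python
# Toy illustration of turning survey responses into numeric features.
# Column names and category codings are hypothetical, not the ones in the
# actual repo; see process.py linked above for the real mapping.
import pandas as pd

# Hypothetical ordinal coding for a diagnosis-style question.
diagnosis_map = {
    "I have a formal diagnosis": 2,
    "I self-diagnose": 1,
    "No": 0,
}

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    features = pd.DataFrame(index=df.index)
    # 1-5 rating questions can be used (nearly) as-is.
    features["PoliticalAffiliation"] = pd.to_numeric(
        df["PoliticalAffiliation"], errors="coerce"
    )
    # Diagnosis-style questions get an ordinal coding.
    features["PTSD"] = df["PTSD"].map(diagnosis_map)
    # Free-numeric questions may need light cleaning.
    features["SATMath"] = pd.to_numeric(df["SATMath"], errors="coerce")
    return features
```

After that, something like `features.corrwith(brier_scores)` (with `brier_scores` an assumed per-participant Series) reproduces the per-feature correlations discussed above, keeping in mind that lower Brier is better, so a positive correlation means worse performance.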
The Blind Mode raw data thing on the main page throws an error when opening it.
Thanks! I'll try to figure out what's going wrong.
This is awesome. Any idea where we can download the raw data for "full mode"?
He hasn't made that available, as far as I know.
Another problem: in light mode, the "Predictions by experience" charts show the "All participants" line in white on white background.
Thanks, I updated the theme. I hadn't considered the plight of light-mode users.
I was pretty convinced I didn't have an ID key but I'm 90% sure now I was "correcthorsebatterystaple" like the XKCD comic. If this isn't me and someone else let me know. If it is me though, glad I finished *slightly* positive.
Maybe I'm missing something, but shouldn't Ezra Karger outperforming Metacalculus take into account that he has access to the Metacalculus current state? Much like human + engine outperformed engine for a long time, it's reasonable to expect a skilled individual to improve over the market in predictable rare cases.
That's a good point, thanks (although most other Full Mode participants weren't able to outperform Metaculus).
I'm glad I'm not the only one who has been reading it as "Metacalculus". Reading the post today was the first time I noticed it was "Metaculus", or maybe it's a weird Mandela effect.
May I?
"The Mandela effect: The confusion you feel over being told that [though you don't remember this] you once thought Mandela was dead when he actually wasn't."
That depends on whether the human can recognize when the engine is making better predictions and get out of the way. And also if the human can say "No, I know better" correctly.
This appears to be difficult, and possibly not reliably trainable.
Agreed it's a real and important skill to outperform the engine/market even in small and subtle ways. That said, I know some of the prediction markets are reliably off in consistent ways. Some issues are related to transaction fees making bid-ask spreads bigger than one would think, while others are related to the opportunity cost of waiting until the market clears. Seems one could come up with reliable ways to massage a more accurate reading out of the market. Otoh, saying "the market is usually right but it's being v stupid here" is hard in general, though again not always when some whale is pumping one side or the other.
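As a toy version of that kind of massaging (the fee and the quotes are made up, not any particular market's rules): read the probability off the bid-ask midpoint rather than the last trade, and treat anything within the spread plus fees as a zone the market has no incentive to correct.

```python
# Toy sketch: turn a market's bid/ask quotes into a probability estimate plus
# an "indifference band", accounting for a flat transaction fee. The fee
# structure and numbers are illustrative assumptions.
def implied_probability_range(bid: float, ask: float, fee: float = 0.01):
    midpoint = (bid + ask) / 2
    # Prices inside the spread (plus fees) leave traders no profitable
    # correction, so the market can't be expected to be sharper than this.
    low = max(0.0, bid - fee)
    high = min(1.0, ask + fee)
    return midpoint, (low, high)

mid, (low, high) = implied_probability_range(bid=0.12, ask=0.18)
print(f"midpoint {mid:.2f}, indifference band {low:.2f}-{high:.2f}")
# midpoint 0.15, indifference band 0.11-0.19
```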
The completely blind average of all guesses placing in the 95th percentile is astounding to me, and is the most convincing argument I've ever seen for relying on the "wisdom of the crowds" for predictions.
Agreed, it's seriously amazing. I wonder to what extent crowds are good at forecasts as opposed to other problems. I would guess estimation questions, like how far Moscow is from Paris (as in the last survey), would be similarly good, but maybe there are also areas where crowds just aren't as good.
People often claim groups are really bad at thinking. Congresses and collective organizations punish dissent and create groupthink. How can this fit together with the wisdom of crowds?
Is it as simple as anchoring? If everyone writes down their guess first, a crowd is great, but if people all say their guess one after another, do the later ones anchor on the first ones?
Or, normally, the higher status people predict whatever will make them look best, and then everyone else follows them because of anchoring and because people aren't anonymous in their predictions. So even if they know the high-status person is wrong, they don't want to go against the mainstream?
Or maybe groups share similar beliefs from the start; all the people who love communism come together, so they make bad forecasts because they all think the same from the start?
Or maybe it's just false that groups make bad decisions, and actually they are almost always better than individuals? Maybe political parties seem stupid, but if we were to look at individuals, they would be even worse.
I think it's probably the "doesn't want to go against the high-status person and get punished" and the anchoring thing. And this kind of global forecasting is probably where experts just aren't that good, so crowds are good in comparison.
If I imagine Walmart trying to decide if they should open a new store somewhere or start selling a certain product, would an anonymous poll of all employees outperform the CEO, two data scientists, and three managers just making the decision?
I suspect crowds won't be as good (compared to the experts) in that situation as here because the questions are easier. It is possible to be an expert CEO, but being an expert forecaster is really hard.
In the end, I really don't know. I could imagine the Walmart employees consistently being better than the CEO or worse. More experiments are needed.
"Congresses and collective organizations punish dissent and create group think. How can this fit together with the wisdom of crowds?"
Anchoring, plus many political bodies have whips (it's a person, not an instrument) and other measures to ensure party loyalty. This can include things like your compensation next year (I assume that the head of a committee at least gets a bigger expense account than a private member).
The average pol answers to a large number of people. It's possible that he or she would do better if it were a one-term, no-recall position.
Note that these prediction markets may involve small amounts of money. Getting it wrong (or not going along) doesn't mean you're unemployed next year and unemployable.
Someone should look at whether medians are even better, as they are less influenced by extreme guesses of idiots.
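That comparison is cheap to run on the Blind Mode file once outcomes are in hand. A sketch, assuming one row per participant and one probability column per question, with columns indexed by question name (the actual layout may differ):

```python
# Sketch: compare mean vs. median aggregation of all Blind Mode forecasts.
# Assumes `df` holds one row per participant and one probability column per
# question, and `outcomes` is a 0/1 Series aligned on the same question index.
import pandas as pd

def compare_aggregates(df: pd.DataFrame, outcomes: pd.Series) -> pd.Series:
    mean_forecast = df.mean(axis=0)      # "wisdom of the crowd" average
    median_forecast = df.median(axis=0)  # robust to extreme 1%/99% guessers

    def brier(forecast: pd.Series) -> float:
        return float(((forecast - outcomes) ** 2).mean())

    return pd.Series({
        "mean_brier": brier(mean_forecast),
        "median_brier": brier(median_forecast),
    })
```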
Dang, I don't think I gave an ID key, just my email. I have the original form saved as a screenshot too, but not my answers. Curious how I did.
Same!!!
I have the file Scott shared of "all submissions" - happy to share! If you're curious you can very likely figure out which submission was yours from the demographic Qs and then see your answers... but without a master answer file I'm not sure how to get from my answers to my score :/
do you have the file for both blind and full mode? or just blind mode?
I think just blind but can confirm tomorrow. Don’t have access to anything special - I’m sure Scott shared it at the time of the contest - but I have it handy and can’t recall if I had an ID. Am sure I’m not alone on this one!
oh nvm! I have that as well
I don't even recall what identifiers, if any, I included, and I certainly didn't save my answers. I also presume that I did 'poorly' -- I answered all the questions as fast as I possibly could and intended my contribution to go toward the aggregate data. But now I am both curious and amused that I apparently didn't accurately forecast the consequences of my own actions.
The text read:
"If you plan to take the ACX Survey, but do NOT agree to include your email on the ACX Survey even with a guarantee that it will be kept private, but you do want to let me correlate your responses here with your responses there - then you can instead generate a password, put it here, save it for two weeks until the ACX Survey comes out, and then put it there. You may want to use https://www.correcthorsebatterystaple.net/index.html to get a fresh password.
"If you don't meet all the criteria in the paragraph above, or if you don't know if you meet them, you can skip this step."
That seemed to make it clear that you only needed to provide a key if you did not include your email. If this was going to be used as a way of distributing results, it would have been nice to let us know. It also seems like a bad idea to release the keys in plain text.
Yeah, the issue is that the guy who I had make the applet that let you plug in your email and see your score last year didn't have time to do that this year, I don't know how to do it myself, and I don't have another good privacy-preserving way to do this. Someone above sort of offered to maybe do this, if they do I'll let people know.
Aha, thank you! I spent forever poring over the keys in certainty that I would've chosen something transparently associated with me, like "dreev" or "beeminder" or "yootles", since I'm perfectly happy for anyone to know how bad I am at prediction. Then I was about to suggest that something had gone wrong with the file since only 257 people seemingly had a key.
So this explains it! Those not concerned about anonymity weren't expected to add a key.
I'm also happy to volunteer to make a privacy-preserving way to get people their scores.
Any chance of releasing an .xls to calculate scores for those of us who have our answers saved (e.g., via the all-submissions file) but didn't include / don't remember their IDs?
Pursuant to my post here: https://old.reddit.com/r/slatestarcodex/comments/1b6akix/what_are_your_favourite_ways_to_combine/
Is there an opportunity to stake someone (or a group of forecasters) money to enable them to bet in prediction markets? (I'm willing to receive a lower risk-adjusted return in exchange for helping strengthen prediction markets.)
Is anyone else doing this? If so, I'd be curious to know if there are any written details online.
> This means an event was unlikely, but happened anyway.
I understand that, in the long term, "black swan" events like this will average out. If everyone is predicting "business as usual", and you are predicting "asteroid strike" year after year, then if an asteroid strike actually happens you will appear to be a prophet just by luck.
On the other hand, though, "business as usual" is not interesting. Anyone can predict that simply by assuming that tomorrow will be exactly like today (adjusted for inflation); it doesn't take a superforecaster or an AI, just copy/paste. The whole point of prediction markets etc. is that they should be able to accurately predict *unusual* events, isn't it?
Depends how valuable/actionable knowing about a given event in advance is.
Agreed; I was thinking of important events, such as the aforementioned asteroid strike.
I think the point here is, the event was still unlikely *even given all available information*; it wasn't merely "unusual". If all available information points to an event being unlikely, predicting such an event makes you a bad forecaster, even if it turns out to come true.
Instead of "50% on everything", I'd use the binomial distribution for probability density, or something similar, and sample randomly from that.
Is there any scoring system where you would choose 50% on everything?
Why would that be better than 50% on everything? It would give your score more variance, but probably make it worse overall...
I successfully managed to figure out where I kept my ID key, but it doesn't appear on the list :(
Same here. Scott, are you sure the attached file is complete?
It should be. I'm hoping I can get someone to help me figure out a better method to get people their scores.
Any chance you can include a ranking as part of the ID keys as well? I seem to have done pretty well and want to stroke my ego further...
So, why is it that the median participant scored worse than random?
Is it mostly explained by the fact that the world is full of people trying to convince people of untrue things? Is it that society collectively constructs a lot of illusions which aren't necessarily intentional? Is it the result of some common property of the particular questions that were asked? Some counterintuitive statistical thing that would show up even in a world where everyone was putting a lot of effort into avoiding bias and nobody was deceptive?
I'll go with counterintuitive statistical things. It's really really hard to viscerally feel the difference between 5% and 0.5%, and between 50% and 70%. Without looking at the data, I guess that people put their probability into the 70-90% range way more often than is rational.
I think this range roughly optimizes "feeling good about your prediction ex-post". If you are right, it doesn't feel like you left much money on the table. If you are wrong, you can claim that you were "unsure".
I wonder if there's a way to teach people to assess the meanings of percentages better, and whether those people would score higher. Then you would expect an aggregate of those people's predictions to score better than an aggregate of everyone's.
I expect it's because the scoring rule punishes extreme wrong predictions more than it rewards extreme right predictions, such that getting just a few things very wrong wrecks you. The 50% guesser only ever gains or loses a few points on every question, so it has no risk of taking a big hit like that.
(Epistemic status: I didn't fully understand the linked Metaculus page https://www.metaculus.com/help/scoring/)
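My rough understanding of the Metaculus-style log scoring (an assumption on my part, so take the exact numbers with salt): points scale with log2 of the probability you put on whatever actually happened, rescaled so that 50% scores zero. That makes the asymmetry concrete: being confidently right gains a bounded amount, being confidently wrong loses far more, and the flat 50% guesser never loses anything.

```python
# Sketch of why a log-type score punishes confident misses so harshly.
# Uses a log2-based "50% scores zero" convention as an illustration; the
# contest's exact scoring function may differ.
import math

def points(p_assigned_to_outcome: float) -> float:
    return 100 * (math.log2(p_assigned_to_outcome) + 1)

print(points(0.5))   #    0.0  -- the flat 50% guesser, every time
print(points(0.90))  #  ~84.8  -- fairly confident and right
print(points(0.99))  #  ~98.6  -- very confident and right: modest extra gain
print(points(0.10))  # ~-232   -- fairly confident and wrong
print(points(0.01))  # ~-564   -- very confident and wrong: a catastrophic hit
```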
I don't know how many of the participants went for a 'max my variance' strategy (all 1% and 99% predictions), but there were a fair number. That would explain part of it. More generally, even correct use of game theory (trading some expected accuracy for more variance) would tend to reduce your average score even as it, perhaps, increases your chances of a first-place finish. It's harder to see how much of that was going on, since it wouldn't stand out as much, but any significant number of 1% and 99% choices might reflect an attempt to do that.
I mean, 50% isn't really random. We're not talking about a coinflip, we're talking about a long list of independent and disparate questions, with a huge range of unknown actual distributions.
The real question is 'why was 50% across the board such a good answer to give on this battery of prediction questions?'
I assume the answer is something like 'Scott tried to include a good balance of things that seemed likely vs unlikely' or 'an artifact of the scoring system means that 50% loses the least points whenever you are wrong and that can make 50% score higher than your best guesses even if you have some knowledge' or something like that.
Would 45% across the board, or 59%, yield a better percentile? What would that say about the question bank?
To help find Small Singapore, I forwarded the screenshot to our local @LessWrong_Singapore group on Telegram.
Btw, how can I get into the group? I'm in Singapore, too.
Congrats to those who won! I'm just glad I did a bit better than the median and better than just guessing 50% on everything.
"Why would bad forecasters ever beat good forecasters? This means an event was unlikely, but happened anyway."
It can (I think) also mean that the likely outcome happened, but good forecasters were less confident in it than bad forecasters. That seems to have happened to the questions about whether Sevastopol, Luhansk (Russian controlled), or Zaporizhzhia (Ukrainian controlled) would change hands—none of them did, and bad forecasters beat good forecasters on each.
You would need to be able to make a strong case for these outcomes being genuinely uncertain in order for them to be good examples. I'm not sure that's the case here. 'Russia continues to hold Sevastopol' was, as I recall, one of my highest-probability predictions, and I don't, and didn't, feel that it was particularly overconfident.
I don't know that they were genuinely uncertain, what I see is that in the table, they each have negative numbers in the last column, which means that bad forecasters beat good forecasters, which I think means that good forecasters were less confident. Why that was the case, I don't know.
It seems like the right way to ask "which events were the most surprising?" is the very simple method of ranking all the events by the difference between their posterior probability (either zero or one) and their average forecast probability. That's what it means for an event to be surprising. I don't understand why Scott presented the mysterious metric instead.
He presented both metrics.
Not quite; if I understood the post correctly, the other metric he presented was "number of people who got the call wrong", with the amount by which they got it wrong not taken into account. ("A more negative number means that more people got the question wrong.")
But to measure how surprising an event is, you actually do want to consider how confident people were that it would happen.
"The first colored column represents average score on each question." That suggests the amount they got it wrong by is taken into account. The next sentence, the one you quoted, is a simplification.
First, we can easily observe that if confidence is in fact taken into account, the second sentence is not a simplification; it's just false. There is causality between the two numbers, but it runs in the opposite direction to what is claimed.
The second issue is not so simple - Scott doesn't specify his scoring rule other than to say that it's the Metaculus scoring rule while linking to a page that explains that Metaculus uses multiple scoring rules, and that is vague on how to calculate them.
But if we assume that he's using the log score, or the baseline score (which is stated to be a rescaling of the log score), then it would be true that "amount of wrongness" is taken into account, in an obscure way, but it would be false that Scott presented the obvious metric I suggested. The average log score on a question does not follow the same ordering as the average probability assigned to it: one question can have a higher average log score than another while having a lower average assigned probability. So it is not even capable of answering "which events were the most surprising?" according to the average-probability-assigned metric, and it is even _less_ capable of measuring _how_ surprising events were according to that metric.
And Scott has done this even though he's also noted explicitly that he calculated average assigned probability for every question. Why?
The score is a device intended to elicit honest probability estimates. It is not itself a measure of likelihood - that's what the probability estimate is! To measure surprisingness, you want to use the probability estimates.
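In code, the proposed metric is just the gap between the resolution and the crowd's average assigned probability. A sketch, with made-up numbers and illustrative question labels:

```python
# Sketch of the proposed surprise metric: rank questions by the gap between
# what happened (0 or 1) and the average probability forecasters assigned.
# Question labels and probabilities below are illustrative, not real data.
avg_forecast = {"Putin still president": 0.92, "Question X resolves yes": 0.25}
resolution = {"Putin still president": 1, "Question X resolves yes": 1}

surprise = {q: abs(resolution[q] - avg_forecast[q]) for q in avg_forecast}

# Most surprising events first.
for q, s in sorted(surprise.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{s:.2f}  {q}")
```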
"First, we can easily observe that unless confidence is not taken into account, the second sentence is not a simplification; it's just false. There is causality between the two numbers, but it runs in the opposite direction to what is claimed."
That would be the case for Brier score, but I assume not for the Metaculus score. I assume Scott included the second sentence to make it clear in what direction the correlation between the score and accuracy goes.
And more to the point - 'good' and 'bad' forecasters in this analysis are only relative to this specific list of questions.
It's totally possible that the questions fall into clusters where one group of people or another is more likely to be right. Like, it could be political vs economic questions, or red vs blue tribe issues, or etc.
For example, if 70% of the questions were mostly political and 30% were mostly economic, then people who mostly follow and know about politics would be 'good' forecasters and people who mostly follow and know about economics would be 'bad' forecasters.
In that case, we'd be saying 'bad forecasters beat good forecasters on all these questions about the economy! Did the economy do some crazy surprising thing this year that no one could have expected?'
No, you just implicitly defined 'bad forecasters' as 'people who mostly know about the economy', it's not weird that they were most accurate on those questions.
This is example #930485 of how this stuff is all actually really complicated and hard to talk about, and it's easy to follow your intuitions off a cliff if you're not careful.
This particular point would apply if attention to different topics is anti-correlated, at least among people who participate in the contest.
Well, specifically when talking about outliers, since we're comparing best vs worst.
I think this makes sense... there could be a general factor of 'education' or 'intelligence' or w/e across the entire world population that makes people who know more than average about politics also know more than average about economics.
But limiting the sample to readers of this blog who want to participate in a prediction markets experiment probably flattens most of that variance already. After which, it seems pretty reasonable to expect that among the *best* scorers on politics vs the *best* scorers on economics, those outliers would be people with a particular dedication to one topic that trades off against similarly extreme performance on the other.
(maybe economics and politics are bad examples because they're intertwined, it could be something like 'politics vs celebrities' or 'red tribe sacred cows vs blue tribe sacred cows' or 'STEM major type of questions vs Humanities major types of questions' or etc, depending on the actual list of questions)
One of the questions asks about a deepfake attempt that made front page news, and it resolved yes. Could someone link the story?
Dunno what was decisive for Scott, but there are a bunch of examples linked in the comments of the Manifold market: https://manifold.markets/ACXBot/47-will-a-successful-deepfake-attem
You could release a redacted version of Small Singapore's email. Like **anderson*****@gmail.com
Their email was SmallSingapore@[redacted].com
You should also ask a fun essay question: What is the most surprising thing that will happen in this coming year?
Granted, there's no way to objectively evaluate the responses, and it would be tedious to read through a zillion wild guesses. But so what? A year later, just let people grade themselves in the comments section by evaluating their predictions.
Have there been any studies of five or ten year predictions? It seems like Tetlock's superforecasters have pretty much cracked the code on how to be above average on 12-month predictions (e.g., just because you are pretty sure something will eventually happen doesn't mean it will happen in the next year). But it could be that 5 or 10 year long forecasts require a different set of techniques.
That does sound cool.
There are lots of email addresses in the Idkeys file, not sure if those folks realized what they were doing. @Scott Feel free to delete this comment if you were hoping for security through obscurity.
There's also one (not an email address) that includes the tag "(pw; do not show nor my email)"
I used my email fully expecting to be able to see my score... and so did the 4 friends I convinced to do this.
Pretty please, can we get some version that includes the emails? You could hash the emails with a high collision rate (I'll email a proposed solution and get it reviewed).
+1 to this
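For what it's worth, a sketch of what the "hash with high collisions" idea might look like (the salt, prefix length, and names are assumptions, not the actual proposal): publish only a short, salted hash prefix next to each score, so you can find your own row while a match proves nothing about anyone else's participation.

```python
# Sketch of the "hash with deliberate collisions" idea: publish a short,
# salted hash prefix next to each score. Short prefixes collide often, so a
# match doesn't prove a given email participated, but participants can still
# locate their own row. Salt and prefix length are illustrative assumptions.
import hashlib

SALT = b"acx-2023-contest"  # would be published alongside the table

def email_tag(email: str, prefix_hex_chars: int = 4) -> str:
    digest = hashlib.sha256(SALT + email.strip().lower().encode()).hexdigest()
    return digest[:prefix_hex_chars]  # 4 hex chars -> only 65,536 buckets

# A participant recomputes their own tag and looks it up in the released table:
print(email_tag("someone@example.com"))
```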