175 Comments

Congratulations to the winners 😊


How do we see our results if we didn't do an ID key? Can't remember if I did that...


I also did some in-depth analysis of the Blind Mode data and turned the results into an interactive app at https://jbreffle.github.io/acx-app

The associated code is available at https://github.com/jbreffle/acx-prediction-contest

I still have a few final analyses left to add to the app, but all of the results can be viewed in the notebooks in the repo.


I was pretty convinced I didn't have an ID key, but I'm 90% sure now I was "correcthorsebatterystaple" like the XKCD comic. If this isn't me and is actually someone else, let me know. If it is me, though, glad I finished *slightly* positive.


Maybe I'm missing something, but shouldn't Ezra Karger outperforming Metaculus take into account that he has access to Metaculus's current state? Much like human + engine outperformed engine alone in chess for a long time, it's reasonable to expect a skilled individual to improve on the market in predictable rare cases.


The completely blind average of all guesses placing at the 95th percentile is astounding to me, and is the most convincing argument I've ever seen for relying on the "wisdom of the crowds" for predictions.


Dang, I don't think I gave an ID key, just my email. I have the original form saved as a screenshot too, but not my answers. Curious how I did.


Any chance of releasing an .xls to calculate scores for those of us who have our answers saved (e.g., via the all-submissions file) but didn't include / don't remember our ID keys?


Pursuant to my post here: https://old.reddit.com/r/slatestarcodex/comments/1b6akix/what_are_your_favourite_ways_to_combine/

Is there an opportunity to stake someone/a group of forecasters money to enable them to bet in prediction markets? (I'm willing to receive a lower risk adjusted return in exchange for helping strengthen prediction markets)

Is anyone else doing this? If so, I'd be curious to know if there are any written details online.


> This means an event was unlikely, but happened anyway.

I understand that, in the long term, "black swan" events like this will average out. If everyone is predicting "business as usual", and you are predicting "asteroid strike" year after year, then if an asteroid strike actually happens you will appear to be a prophet just by luck.

On the other hand, though, "business as usual" is not interesting. Anyone can predict that simply by assuming that tomorrow will be exactly like today (adjusted for inflation); it doesn't take a superforecaster or an AI, just copy/paste. The whole point of prediction markets etc. is that they should be able to accurately predict *unusual* events, isn't it?


Instead of "50% on everything", I'd use the binomial distribution for probability density, or something similar, and sample randomly from that.
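A minimal sketch of what such a baseline could look like, assuming (purely for illustration) a Beta distribution over probabilities and a plain log score; the distribution choice and parameters here are made up, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_baseline(n_questions, a=2.0, b=2.0):
    """Draw one probability per question from a Beta(a, b) distribution
    instead of assigning a flat 0.5 to everything."""
    return rng.beta(a, b, size=n_questions)

def mean_log_score(probs, outcomes):
    """Average log score: log(p) when the event happened, log(1 - p) when not."""
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

outcomes = rng.integers(0, 2, size=50)  # stand-in resolutions
print("flat 50%:", round(mean_log_score(np.full(50, 0.5), outcomes), 3))
print("sampled: ", round(mean_log_score(random_baseline(50), outcomes), 3))
```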


I successfully managed to figure out where I kept my ID key, but it doesn't appear on the list :(


Any chance you can include a ranking as part of the ID keys as well? I seem to have done pretty well and want to stroke my ego further...


So, why is it that the median participant scored worse than random?

Is it mostly explained by the fact that the world is full of people trying to convince people of untrue things? Is it that society collectively constructs a lot of illusions which aren't necessarily intentional? Is it the result of some common property of the particular questions that were asked? Some counterintuitive statistical thing that would show up even in a world where everyone was putting a lot of effort into avoiding bias and nobody was deceptive?


To help find Small Singapore, I forwarded the screenshot to our local @LessWrong_Singapore group on Telegram.


Congrats to those who won! I'm just glad I did a bit better than the median and better than just guessing 50% on everything.


"Why would bad forecasters ever beat good forecasters? This means an event was unlikely, but happened anyway."

It can (I think) also mean that the likely outcome happened, but good forecasters were less confident in it than bad forecasters. That seems to have happened to the questions about whether Sevastopol, Luhansk (Russian controlled), or Zaporizhzhia (Ukrainian controlled) would change hands—none of them did, and bad forecasters beat good forecasters on each.


One of the questions asks about a deepfake attempt that made front page news, and it resolved yes. Could someone link the story?


You could release a redacted version of Small Singapore's email. Like **anderson*****@gmail.com


You should also ask a fun essay question: What is the most surprising thing that will happen in this coming year?

Granted, there's no way to objectively evaluate the responses, and it would be tedious to read through a zillion wild guesses. But so what? A year later, just let people grade themselves in the comments section by evaluating their predictions.


There are lots of email addresses in the ID keys file; not sure if those folks realized what they were doing. @Scott, feel free to delete this comment if you were hoping for security through obscurity.


I used my email fully expecting to be able to see my score... and so did the 4 friends I convinced to do this.

Pretty please, can we get some version that includes the emails? You could hash the emails with high collision (I'll email a proposed solution for review).
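One way that could look, as a minimal sketch: a salted SHA-256 truncated to a few hex characters, so many addresses collide and a published code can't be reversed, while anyone who knows the (published) salt can recompute their own. The salt and truncation length here are made up:

```python
import hashlib

def short_hash(email: str, salt: str = "acx-2023", digits: int = 4) -> str:
    """Truncated, salted SHA-256: many emails map to the same short code,
    so the code can't be reversed, but owners can recompute theirs."""
    h = hashlib.sha256((salt + email.strip().lower()).encode()).hexdigest()
    return h[:digits]

print(short_hash("someone@example.com"))  # 4 hex digits -> only 65,536 buckets
```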


I know a person who, against all the apparent odds and all the pundits, made a lot of money on the initial Trump win. Many other issues, but his ASD certainly helps him see through the groupthink. I wonder if this is a common thread. Algorithms can't predict that.


> weirdly, good forecasters were more likely than bad forecasters to believe Bitcoin would go up at all, but less likely to believe it would go up as much as it did

For the 2+ Bitcoin questions, I'm trying to visualize a possible spread of forecasters on a single scale, from bearish to bullish, based only on this sentence. Is it that bad forecasters tended to be either bearish or ragingly bullish, while good forecasters tended to be somewhat bullish?

If so, a simple story occurs to me. Good forecasters thought that speculators would return to the market before too long, but underestimated how many. Relative to them, bad forecasters had more dramatic expectations for emotive things, but were split between "it was a bubble that will never re-inflate" and "of course we're still heading for the moon, fairly soon".

For the record, I don't think stories like that have much value if we can't come up with them in advance, but at least they might satisfy curiosity.

I give a 70% chance that I'm among the bad forecasters overall, but I don't recall how I answered for Bitcoin, or for anything, when I did blind mode.


Has anyone performed a factor analysis on this data? In general for forecasting tournaments, it might be a good idea to have coverage of a variety of factors corresponding to different world domains to forecast. It could help disentangle luck vs general forecasting skill vs specific domain knowledge.

I suppose I'm the psychometrics rat so maybe it's my job to do the factor analysis... 😅
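For anyone who wants to try it, a minimal sketch with scikit-learn, assuming a forecaster-by-question matrix of per-question scores (the shapes and data here are stand-ins):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical matrix: one row per forecaster, one column per question,
# entries are per-question scores (random stand-in data here).
rng = np.random.default_rng(0)
scores = rng.normal(size=(3000, 50))

fa = FactorAnalysis(n_components=5, random_state=0)
fa.fit(scores)

# components_[k, j]: how strongly question j loads on latent factor k.
# Questions clustering on a factor might share a domain (crypto, war, AI, ...).
print(fa.components_.shape)  # (5, 50)
```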


This is an awful question (more math, probably more chance of random results), but what if the skill of forecasting is more specialized? Some people might be good at particular regions or subjects, but not great forecasters in general.


Taking at face value the result that the market mechanism subtracted value (compared with a straightforward average of participants), I have a hypothesis as to why this might be, which is that markets can be surprisingly concentrated, so that a lot of weight is being given to the view of one individual, undermining the benefit of aggregation.

At one point a single user owned a majority of the Manifold Yes shares on question 14. Essentially, a single person who was rich in mana became convinced (wrongly, as it turned out) that the question must resolve to "yes", and pushed the prediction to 90% or so. As I've noted now several times, a single user currently holds 28% of the Trump Yes shares on Polymarket: since late January, they've returned to the market whenever the price has dipped to 53c (which reflects a higher probability than assigned to this outcome by any other prediction site, although not by much). Looking at other candidates, many of them also show 1 or 2 users owning a similarly high proportion of the shares.

These people are no doubt confident in their views, but even if we think that very confident people are in general more likely than random to be right, they are still single individuals, and the whole point of the "wisdom of crowds" is that the aggregate view usually beats even the wisest individual.

Part of the attraction of markets is that they exclude blowhards: people will confidently assert that their favourite sports team or politician is all but certain to win, but they typically won't bet on that belief, because at some level they know it's false. But, once you have filtered to people who are participating in a prediction market, you have already excluded those people. So the question is, should we expect weighting (in effect) by the size of bets to give a better result than taking a raw average?

At first glance, one certainly might say that a person who is willing to bet £1,000 on an outcome is more confident than a person who is willing to bet £10. But there are many other reasons why somebody might not be willing to bet £1,000. They might not have £1,000 spare, or they might be sufficiently risk averse that they will not tolerate the significant chance of losing the bet, even at favourable odds. This only becomes worse when one considers people who are willing to bet £1,000,000: the vast majority of people would not make a bet of that size at any odds on any proposition. It seems reasonable to guess that the sort of person who is willing to bet £1,000,000 on the outcome of the US presidential election is the same sort of person who is willing to bet £1,000,000 on which horse will win the Grand National, rather than the sort of person who makes extremely accurate predictions.

There is also the problem that it is characteristic of poor prediction agents that they are overconfident: they say an outcome is 99% likely when it is really only 80% likely. That raises the possibility that giving more weight to the views of people who are very confident (as reflected by their bet sizing) is subtracting value.


Can we see the 'surprising' spreadsheet but with a y/n column for whether the events actually did happen, for people who don't follow the news that closely?

I'm surprised that (if I'm reading it right) people are surprised that Ukraine doesn't control Sevastopol yet. IIRC the war was already a stalemate by the start of 2023, and I'd have thought the readers skew slightly right wing and are therefore slightly more likely to overestimate Russia's chances.


"Other resolutions that **book** people by surprise" error


Does anyone know if "macro" public intellectuals (guys like Balaji, Zeihan, Samo Burja) ever participate in stuff like this? If so, how do they do? If not, should we take it as a sign they're not as confident in their proclamations as they claim to be? Or that one year isn't a long enough time horizon for "macro" trends to apply? Or that they don't think it's worth their time?


Scott, on the ID key spreadsheet there are some email addresses used as the code, which people may not want to be visible. There's also one that has a note as part of the key asking you not to share the rest of their key.


Dammit, I lost my ID key. I can find myself based on my answers in the xls for the second round, but not my score in the latest xls. I was quite hopeful I'd have a good score, but apparently I'm overestimating myself :D


Is the scoring system the new Metaculus Peer Score? In last year's contest, the aggregate was only at the 84th percentile; I assume some of the improvement is that there were >6x as many forecasters, but does the new scoring system also have properties that make it harder to outperform an average of forecasts?
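For reference, a rough sketch of a peer-score-style calculation (your log score minus the average log score of everyone else on the same question); Metaculus applies its own scaling constants, which are omitted here, so treat this as an approximation rather than the contest's exact formula:

```python
import numpy as np

def log_score(p, outcome):
    """Log of the probability assigned to whichever outcome happened."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return np.log(p) if outcome else np.log(1 - p)

def peer_score(my_p, others_p, outcome):
    """My log score minus the average log score of the other forecasters."""
    return log_score(my_p, outcome) - np.mean([log_score(p, outcome) for p in others_p])

print(peer_score(0.8, [0.5, 0.6, 0.9, 0.3], outcome=1))
```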


I made a spreadsheet to track the performance of the mean and median aggregates against the Manifold probabilities at the time of the Blind Mode cutoff. I didn't get that the average did better than Manifold like the above chart shows: I got that the mean aggregation was the worst, Manifold was in the middle, and the median was best (but these last two were very similar, with less than a 1:2 likelihood ratio between the median and Manifold).

I would be curious to know if I did something wrong here, or if not, what the discrepancy is between how I processed the data and how the chart above was made.

https://docs.google.com/spreadsheets/d/1SQW-QGcMkwTTnguIHHqkn98ru9M1oBnhtmCrz2EMbw4/edit?usp=sharing
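For comparison, this is roughly how such a check could be framed in code, with stand-in numbers (the real inputs would be the Blind Mode forecasts, Manifold's cutoff probabilities, and the resolutions); discrepancies against the post's chart could come from the exact score function, question weighting, or handling of missing answers:

```python
import numpy as np

def mean_log_score(probs, outcomes):
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

rng = np.random.default_rng(0)
forecasts = rng.uniform(0.05, 0.95, size=(3000, 50))  # forecasters x questions
manifold = rng.uniform(0.05, 0.95, size=50)           # Manifold at the cutoff
outcomes = rng.integers(0, 2, size=50)                 # resolutions

for name, probs in [("mean aggregate", forecasts.mean(axis=0)),
                    ("median aggregate", np.median(forecasts, axis=0)),
                    ("Manifold", manifold)]:
    print(name, round(mean_log_score(probs, outcomes), 3))
```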


I'd like to see how the rankings vary using different scoring methods... e.g. impute a forecast of 0.5 for missing forecasts and then calculate mean of log score on all questions. I expect a fairly high correlation, but some changes in relative rank among the top finishers.
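A minimal sketch of that scoring variant (a plain mean log score with 0.5 imputed for skipped questions; whether this matches the contest's exact function is an assumption):

```python
import numpy as np

def score_with_imputation(probs, outcomes, fill=0.5):
    """Mean log score over all questions, imputing `fill` for any missing
    forecast (NaN), so skipping a question counts the same as answering 50%."""
    probs = np.where(np.isnan(probs), fill, probs)
    probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return np.mean(np.where(outcomes == 1, np.log(probs), np.log(1 - probs)))

outcomes = np.array([1, 0, 1, 1, 0])
answers = np.array([0.9, 0.2, np.nan, 0.7, np.nan])  # two skipped questions
print(score_with_imputation(answers, outcomes))
```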


I feel like the thing I actually want to compare the best forecasters to is something that I don't know exactly how to mathematically define, and it's something like 'For a data set of this size, if you took the mean response on each question and the variance around each question, and everyone guessed a number from that distribution, how well would the best performer from that process do?'

I.e., intuitively: if we model everyone in the study as having the same level of knowledge plus some random biases that shift their answer a bit on each question, and look at the luckiest responses from that distribution whose random variance happened to match reality the closest, how well do they do compared to the best real-world responders?

Basically, I think comparing the best people to 'here's what you get if you put 50% for everything' or even 'here's the average for the whole sample' doesn't mean very much, because you're comparing a population mean to an outlier. It's possible for the best outlier from a large sample to do very very well just by chance, although that becomes a less likely explanation the more times they do well.

So, to tell whether the top responders had better abilities or just got lucky, you need to ask what score the best outlier from a process like this should produce, if everyone has the same ability plus a plausible amount of random noise.
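A rough sketch of that null-model simulation, with stand-in inputs (in practice the per-question means and spreads would be estimated from the Blind Mode data):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_null_score(question_means, question_sds, outcomes, n_forecasters=3000):
    """Simulate forecasters with identical knowledge plus per-question noise,
    and return the best mean log score that population achieves by luck alone."""
    noise = rng.normal(0.0, question_sds, size=(n_forecasters, len(outcomes)))
    probs = np.clip(question_means + noise, 0.01, 0.99)
    logs = np.where(outcomes == 1, np.log(probs), np.log(1 - probs))
    return logs.mean(axis=1).max()

outcomes = rng.integers(0, 2, size=50)
means = rng.uniform(0.3, 0.7, size=50)   # stand-in crowd means per question
sds = np.full(50, 0.15)                  # stand-in per-question spread
print(best_null_score(means, sds, outcomes))
```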


I don't get why you wouldn't just publish results with pseudonyms, if people want to look themselves up.

Can we get (anonymized?) answers for the non-blind stage?


I'm probably not that Ian, and didn't see anything in my spam folder.

Just on the off chance, though, the general format of my email is 'ian.[last name][two digit numeral]@gmail.com'. Is that at all like the email you have for the mystery Ian?


Was the high negative score on China COVID cases due to blind mode participants not knowing that China had already stopped reporting cases? It seems like a weird question to ask either way.


Thank you for organizing this, Scott! It was a great pleasure to participate — I ended up writing up a reflective post partly in honor of this tournament https://maxlangenkamp.substack.com/p/on-predictions


My guess about Small Singapore is that they're some mover and shaker in a tiny territory that aspires to be like Singapore. Well done, anyway.


Since a bunch of people already asked, I simply looked it up: the Metaculus scoring function punishes bad predictions more harshly than it rewards good predictions. This means it's quite easy to score worse than "all 0.5", and that is a very simple explanation for the median participant performing worse than "random": just a single super-confident wrong prediction can put you far behind, even if you otherwise make a lot of decent predictions.

Btw, if you think this is bad and should be changed: it seems to be a result of making the scoring "proper", that is, making sure that reporting the true likelihood gets the best expected score. Without this property, the best score on an unlikely event would come from forecasting 0%, even if the true likelihood is higher than 0%, which is even more degenerate. So it's probably just a quirk of these scoring systems we'll have to live with.
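To make the asymmetry concrete, a small sketch using a plain log score (the Metaculus function is a scaled variant, but the shape is the same): relative to answering 0.5, a confident correct prediction gains much less than a confident wrong one loses.

```python
import math

def log_score(p, outcome):
    # Log of the probability assigned to whichever outcome happened.
    return math.log(p) if outcome else math.log(1 - p)

baseline = log_score(0.5, 1)           # -0.693, same for either outcome
gain = log_score(0.95, 1) - baseline   # confident and right: about +0.64
loss = log_score(0.95, 0) - baseline   # confident and wrong: about -2.30
print(round(gain, 2), round(loss, 2))
```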


What do the numbers [0 .. 0.4] labeling the X-axis on the first graph mean?
