So, now we need a meta alcohol account to participate?
It appears so. I have not participated this year for that reason.
Same, I will skip rather than setting up another account on the internet.
Meta alcohol is excellent
I am baffled by the website on mobile and have given up
Fair warning. I did 2 of Tetlock's forecasting contests. I found myself bothering my friends who were subject matter experts and spent way too much time on it (tho kinda proud I was top 25 in the COVID one, until I quit).
Some people like the grind. I don't.
The Tetlock contest is a huge amount of work because you can keep updating all year long. I didn't read the rules for this one: Is it just one time winging it or are you expected to grind for 11.4 months?
I didn't read them either. Turned the page on forecasting contests. It is like a job after a while.
This one is just spot scoring - whatever your prediction is on Jan 31 is what gets evaluated; no need to keep updating.
Thanks.
I saw this on Metaculus before it got posted here and wondered if I had missed something. Glad to see this is up now, the prediction contests are one of my favorite parts of ACX.
May as well roll the dice. Anyone know of "group efforts" where people share their political shitposting and some basic research for every topic (off site)?
I recently noticed a potential contradiction in Metaculus's rules:
- In the official Terms of Use (see https://www.metaculus.com/terms-of-use/ ) it states that you are not allowed to view the source code or make changes to it.
- HOWEVER, there is actually a GitHub repository (at https://github.com/Metaculus/metaculus ) which is released under the BSD license, and which I presume is more-or-less the production website. So that would seem to imply that you *can* view the source code and also make changes to it.
So which of these is correct?
So first of all I imagine that Terms of Use line is absolutely not ever going to be enforced.
But like, I do not see a contradiction here at all? Just because you can doesn't mean you're allowed to. Maybe they only want people who don't have a Metaculus account to contribute to the website—I don't think that's very likely, but it's not self-contradictory.
Probably that clause was written before Metaculus got open-sourced and was then forgotten about. I've opened an issue: https://github.com/Metaculus/metaculus/issues/2036
I thought I'd give it a try, but...."Wrong captcha" every time I try to sign up.
Tried both Edge and Firefox.
Scott said bots would be allowed to compete this year. Have you tried getting the captchas wrong?
Actually, there is no captcha other than the Cloudflare Turnstile widget that non-interactively approves itself.
https://developers.cloudflare.com/turnstile/concepts/widget/#non-interactive
Hey sorry about this, we had a bug last night that was preventing signups that has now been fixed. If you try to register again it should work, but if it doesn't please let me know!
Ouch! I've had the exact same sort of "oh noes we had a push for signups and signups are broken" type of bug before and it sucks!
Yeah not a fun one to wake up to!
I'm sorry you had to find out this way.
Could someone explain how scoring works or send me a link to an explanation? I find the scoring details inadequate as someone who's never done a prediction competition and has no clue how scores are calculated. I just have a slight suspicion that different scoring methods could favor different answering strategies.
The score is the (natural) logarithm of the probability which you assigned to the actual outcome. So, for example, if you said event A will happen with 70% probability, then:
If A happens, you assigned 0.7 to the actual outcome, so your score is –0.357.
If A doesn't happen, you assigned 0.3 to the actual outcome (the complement of 0.7), so your score is –1.2.
The first score is higher (keep in mind they're both negative numbers), so that would be better.
The logarithmic score is always negative, except if you predicted 100% for an event that did actually happen, in which case the score is 0. But you should never do that, because if the event doesn't happen, you assigned 0 to the actual outcome, and the logarithm of 0 is negative infinity, so you've immediately lost forever.
People find it a bit weird that all scores are negative, so Metaculus introduced the "peer score", which centers scores on the average: anyone with a better-than-average score gets a positive peer score. This doesn't change the ranking of the predictions at all.
Logarithmic scoring is a so-called *proper scoring rule*: a scoring rule mathematically designed so that it's in your interest to report your true probability.
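A minimal sketch in Python of the binary case as described above (the function names and the peer-score centering here are my own illustration, not Metaculus's actual code):

```python
import math

def log_score(p_event: float, happened: bool) -> float:
    """Natural log of the probability assigned to the actual outcome."""
    p_outcome = p_event if happened else 1.0 - p_event
    if p_outcome == 0.0:
        return float("-inf")  # predicted 0 for what happened: unrecoverable
    return math.log(p_outcome)

# The 70% example from above:
print(log_score(0.7, happened=True))   # -0.357 (A happened)
print(log_score(0.7, happened=False))  # -1.204 (A didn't happen)

def peer_scores(log_scores: list[float]) -> list[float]:
    """Center everyone's log score on the group average; above-average
    predictors come out positive, and the ranking is unchanged."""
    avg = sum(log_scores) / len(log_scores)
    return [s - avg for s in log_scores]

print(peer_scores([-0.357, -1.204, -0.105]))  # [0.198, -0.649, 0.450]
```

Subtracting the group average from every score shifts all the numbers but preserves their order, which is why the peer score changes nothing about rankings.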
Thanks.
I've read that Brier scores are also a proper scoring rule. If this is true, why do you think Metaculus prefers log/peer scores? And why do others prefer Brier?
> So by our metric, the log scoring rule is the best of the three commonly used rules at incentivizing precision.
from https://ericneyman.wordpress.com/2020/04/24/scoring-rules-part-3-incentivizing-precision/
Log scoring is the only scoring rule that satisfies both of the following conditions:
- Additivity: your score for a composite prediction of independent events is the sum of your scores for the components - that is, if we are flipping two independent coins, it doesn't matter if we score them separately as two events with two outcomes each, or as one event with four outcomes.
- Monotonicity: if the probability that Alice assigned to the outcome that actually happened is higher than the probability that Bob assigned to it, then Alice will score higher than Bob; you can never increase your score by making a worse prediction.
To see this, observe that in order for a score to be monotonic, it must be a monotonic function of the probability you assigned to the observed event - in other words, the simplest possible scoring system is "Alice said that there was a 0.3 chance of seeing the combination of things we actually saw, Bob said there was a 0.2 chance, and so Alice scores higher than Bob", and all other monotonic scores are just recalibrations of that. "Observed probability" is multiplicative: for independent things P(a,b) = P(a)P(b), and the way to rescale a multiplicative function into an additive one is to take the logarithm, which gives us the log score.
Brier scores are additive but not monotonic; to be honest I've never been sure why people use them either - from a purely mathematical perspective log score is clearly the "correct" way to evaluate predictions, but it may be that there are messy real-world applications where that isn't what you want.
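To make the two conditions concrete, here's a small numeric check in Python (my own illustration, with made-up probabilities): the first part verifies additivity of the log score for two independent coins, and the second shows the sense in which the Brier score is not monotonic, since two forecasters who assign the same probability to the realized outcome can get different Brier scores.

```python
import math

# Additivity of the log score: heads at 0.6 on coin 1, 0.7 on coin 2;
# both coins land heads.
p1, p2 = 0.6, 0.7
print(math.log(p1) + math.log(p2))  # scored as two separate questions
print(math.log(p1 * p2))            # scored as one four-outcome question
# Both print about -0.868: the log score doesn't care how you carve it up.

# Non-monotonicity of the Brier score: a three-outcome question where
# outcome 0 occurs. Alice and Bob both put 0.5 on outcome 0 but spread
# the remainder differently.
def brier(probs, winner):
    """Multi-outcome Brier score (lower is better)."""
    return sum((p - (1.0 if i == winner else 0.0)) ** 2
               for i, p in enumerate(probs))

alice = [0.5, 0.5, 0.0]
bob = [0.5, 0.25, 0.25]
print(brier(alice, 0), brier(bob, 0))  # 0.5 vs 0.375: Bob scores better
# despite assigning the same 0.5 to what happened; the log score would
# give both of them ln(0.5).
```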
Thanks, this is extremely helpful. Now I know not to put 100% for anything 😅
Proofreading: the parenthetical (no “AI benchmark #44523, really!) is missing its closing quotation mark.
Good questions. I'm stumped by all of them.
I'd recommend changing the wording of "Will Iran possess a nuclear weapon before 2026?" to "Will Iran test a nuclear weapon before 2026?" That would be less arguable.
I'd like to see a separate forecasting contest in which contestants submit 250-word essays forecasting something that will turn out to be a very big deal in 2025 but isn't specifically asked about in any of the regular quantitative questions, with Scott's choice of the winner being final.
These kinds of Tetlockian contests, such as Scott's collaboration with Metaculus, are excellent, but there is also something to be said for the flash of prophetic intuition in which somebody comes up with a forecast so few people are thinking about that no questions about it are ever framed.
For example, in 1790, when the French Revolution seemed to be proceeding constructively, British politician Edmund Burke forecast that it would lead to the execution of the monarchs, terror, inflation, and end in a military dictatorship (e.g., Bonaparte six or seven years later). That's a famously impressive forecast.
Of course, these kinds of forecasts are hard to score fairly. Fortunately, we possess in Scott a famously fair-minded host, so I would value his judgment.
Another problem is that it's harder to get the timing right on off-the-wall forecasts than on Tetlockian ones. A lot of becoming a Tetlockian super-forecaster is not underestimating how long things can bump along before they become a crisis.
For example, a massive war between Turkey and Greece over Cyprus, with disastrous consequences for NATO, is not likely to happen in 2025. But reasonable people can disagree over how worrisome it is on a longer timespan. Tetlockian super-forecasters are good at reading articles by experts on Cyprus saying, "Hey, everybody, it's important to pay attention to my topic of expertise for reasons," remembering that those experts said the same thing many times in the past when it turned out it wasn't hugely important to listen to them, and thus properly discounting the likelihood that Cyprus will matter in the next 12 months.
But that doesn't mean a Cyprus crisis might not happen in, say, my ever-shrinking years left.
So, I'd also like to see Scott scan over 2025 forecast essays again in 2030 and see if any of the losers in 2026 now look like premature prophets.
Manifold has these "add your own answer" questions. They're "headline-sized" rather than 250-word essays, though. Here is an example: https://manifold.markets/Bayesian/what-will-happen-during-trumps-seco
Hmm... I'm frustrated by the 12 month limit (if I'm understanding it).
I just did a check of how well ChatGPT o1 is currently doing, see https://www.astralcodexten.com/p/open-thread-365/comment/87433836
>tl;dr: ChatGPT o1, 1/18/2025, 7 questions; results:
>a) correct
>b) partially correct (initially evaded answering part of the question, 1st prod gave wrong answer, 2nd prod gave right answer)
>c) mostly correct (two errors)
>d) correct
>e) initially incorrect, one prod gave correct result
>f) misses a lot, argues incorrectly that some real compounds don't exist
>g) badly wrong
From the rate of improvement that I've been seeing, I'm roughly 75% confident that by 1/1/2027 ChatGPT should get all 7 questions right. Kind of a "What is an AGI to me?" (to the tune of "What is America to me?")
I think of this as roughly what a bright, conscientious Chemistry and Physics undergraduate should be able to do (with internet access, which current ChatGPT has, IIRC). This isn't everything: incremental learning is important too, and so is data efficiency.
edit: I just saw https://openai.com/index/announcing-the-stargate-project/ which explicitly aims for AGI, and has explicit White House endorsement. So I'm bumping up my odds guess from 75% to 80%
Similar to my idea: https://www.astralcodexten.com/p/who-predicted-2022/comment/12164323
Is it not possible to use the mobile website to make predictions? I tried signing in via my Google account successfully but I can’t understand how I’m supposed to make predictions in the contest. Possibly I’m just an idiot…
You should be able to make predictions on mobile by moving around the slider under the question. Sadly as far as I know you can't just type in a number, which might be easier.
Don't know about anyone else, but I have a lot more uncertainty on the questions this year compared to last. I hate predicting at 50% but there are several where I'm close to that.
No thank you. I don't play fantasy football either.
Metaculus became unusable for normal people after their UI redesign. The mobile experience is atrocious to the point of looking like it's a broken webpage. Their business model is explicitly designed to harvest predictor intelligence. $10,000 * .001 probability of success isn't worth the headache. I was top 10 for years and won't return until their UI and compensation models change.
They changed the prize payout so you no longer have to gamble to be in the money. It's more like $10,000 * 1.0% = $100 if you're in the top 20, tapering down to ~$30 if you're in the top 100. Last year 1300-1400 people participated, and this year probably over 3000.
I liked the previous contests. This time, it was, well, less fun.
Before, it was "think, click, and go to the next question"; now it was ... "find your way, register ..."
Ok, did that. Then: "think, click, wait for processing (just a few secs, but hey, who likes that, if it was meant to be fun and not grinding), then find a way to get to the next question ... there's no single button for that. Tried some ways; the best was to press that return-arrow ... then wait again for the questions to appear, find the next one ... click, wait, and only then repeat the procedure ..." I cut back on the "thinking part" to compensate, and man, was I glad when it was over!
I do hope the collaboration will bring better statistical insights than the old one - I am sure Scott saved and will save some of his valuable time; I sure wasted some of mine, while being absolutely sure I have no chance of getting near any prize money.
I did this because I am an embarrassing fan-boy/disciple. Many are not. Will the number of people who take the full contest go up or down? I bet 95% on: down.
I'm still looking for people who want to talk about their answers before the contest closes.