What about the other two marked ambiguous? They also sounded concrete enough that it's surprising that they are not scored. Such predictions seem worth commenting on.
"53. At least 7 days my house is orange or worse on PurpleAir.com because of fires: 80%"
"106. I have a queue of fewer than ten extra posts: 70%"
Also, I think that there are errors in the aggregation. I count 26 questions at 20/80, but you score 20/23, which suggests that not only did you not count the medical records system question and orange/purple, but you also didn't count a question marked resolved.
So P(major rationalist org leaves Bay Area) = P(MIRI relocates to WA or NE) + P(MIRI relocates somewhere else) = 0.6. So doesn’t that imply that the probability of a major rationalist org besides MIRI leaving is 0%, since all of the probability mass is concentrated in MIRI?
Or less than 10%, since I'm binning these by 10%s, or MIRI is most likely to leave so in any world where other orgs leave MIRI has left too. But I agree I probably messed up there.
Why did you go for Oroxylum over other nootropics that are more popular and have similar purported effects? Phenylpiracetam, bromantane, modafinil (and similar non-prescription finils), ALCAR and nicotine gum/patches are a few off the top of my head that are considerably more common. Did you try several before ending up with Oroxylum? Or did the pharmacological properties of Oroxylum look particularly promising? Sorry for being so inquisitive, but I'm really interested since this is a relatively unknown product and you chose it over a number of other options.
None of the things you mentioned are dopamine reuptake inhibitors. There are not many OTC supplements that fall in that category. It seems Rhodiola rosea does to some extent, but it also has a smattering of other effects too. Modafinil does, but its primary effect is to work on orexin to boost wakefulness, with a really long half-life of like 12-15 hrs. A lot of nootropics are advertised / talked about as "legal Ritalin" but work through other mechanisms. So if you're looking for "legal Ritalin" this may be your best choice. The BDNF-boosting effect seems like a nice extra here, which may help with depression in addition to helping with memory consolidation. Note that whether Ritalin is a good idea is highly individual-specific, and proper dosage is critical so you fall on the right spot on the Yerkes-Dodson curve.
Well, being a dopamine reuptake inhibitor isn't essential to being a stimulant. Nicotine is likely a more effective stimulant than Oroxylum (obviously without having tried it) despite only impacting dopamine indirectly. Bromantane is a dopaminergic drug, despite not being a DRI. Caffeine is the stereotypical stimulant, and also only has an indirect effect on dopamine.
Yeah, there are other dopaminergic drugs, but I don't know of many OTC nootropics that selectively increase dopamine (although I haven't looked extensively). I have heard of people responding differently to Ritalin than to Adderall (which increases dopamine), but that's just anecdotal - not sure what the mechanism would be. Nicotine isn't really a pure stimulant BTW - it has some stimulant properties but also has some relaxing properties too. Nicotine seems to modulate a lot of neurotransmitters - restoring things that are out of range to in range.
I'm just saying that any definition of stimulant which excludes the #1 stereotypical stimulant, caffeine, isn't much of a definition. :) Ritalin isn't exclusively a dopamine reuptake inhibitor either, its strong effect on norepinephrine reuptake probably causes the heavy anxiety/motivation increasing effects compared to amphetamine (especially dextroamphetamine).
49 58 "We have a debate every year over whether 50% predictions are meaningful in this paradigm; feel free to continue it." Scott, this is why you are so wonderful and make me happy when I read your stuff. keepin it real
Let's assume you make three 70% predictions for next year: 1. Aliens will not invade Earth next year. 2. Greuther Fuerth (soccer team) will not win the Bundesliga (they are currently last). 3. Bitcoin will cost more than 10 million USD. If BTC doesn't go to the Moon and the aliens don't come to Earth, your calibration will show that you got 66.7% right in the 70% bracket. But in fact all three of your predictions would be terrible: 70% for two almost certain things and one almost impossible thing. Another way predictions can be bad but well calibrated is if they are meaningless - for example: 1. Kanye West will win the USA presidency: 0-10%. 2. I will play "heads or tails" and will get "tails": 50%.

Can 50% predictions be good and valuable? Of course they can - if you know that there is a 50% chance Bitcoin will cost more than $200,000, that is very valuable knowledge and a valuable prediction. If you think there is a 50% chance that a weak team will win against a huge favourite, you can also win money on that. If you are pursuing public intellectual reputation rather than money, the mechanics are still the same - for a prediction to be valuable it has to be matched against both reality and the opinions of other people.

So here is how to do this early-prediction thing much better:
1. Make a list of things to predict, like the one here ("1. Biden approval rating (as per 538) is greater than fifty percent", etc.), but without probabilities.
2. Then, in the survey, ask the audience of your blog to give their probabilities for those things.
3. Then write your predictions.
4. At the end of the year, for every difference between your prediction and the average audience prediction, assume a virtual bet with odds in the middle of the two predictions. For example, if you think something is 70% and the audience thinks it's 50%, then with a middle of 60% the odds for your bet are 1.8.
5. Calculate the ROI/winrate of your predictions against the blog audience.
Isn't your prescription... exactly what happened at the end of this post...? Not against his readers, but he compared his success against Zvi and the market, which seems like exactly what you want (and with a scoring system that avoids a lot of the complexities you describe).
Also the idea that calibration is gameable by putting 70% on two sure things and one low probability thing is true, but only if Scott's goal is "show off how well calibrated I am". I, and I suspect most readers, have enough faith to believe his goal is to compete with himself and improve, not to fake success. As long as he (or anyone making these predictions) is being honest about their beliefs, the gameability shouldn't be an issue. Or at least it should be pretty unlikely, it'd come down to dumb luck to have 30% of your 70% predictions be extremely unlikely and 70% to be extremely likely.
Not exactly. He's comparing himself there to an expert forecaster and an entire market of experts, that you have good reason to think are going to be more accurate than totally uninformative picks. What you want to do is compare your predictive success to a base rate. The base rate isn't other experts. It's an uninformed prediction. You see this often with machine learning model scoring, where accuracy rates are compared to a model that selects randomly or without training.
In this case, something like "relations between country X and country Y will be worse than each of the last five years" would have an uninformed base rate of 1 in 6. "Team X will win the Super Bowl" is 1 in 32. "President X has approval over 50%" should probably just be 50/50 if you assume with a complete lack of information that approval is uniformly distributed, but you might skew to match some historical average of all presidential approval ratings, which is probably not 50%.
But the basic point is to compare your performance to a predictor that isn't really trying, not to compare yourself to other experts. You can still do that, but their scores should also be similarly adjusted.
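To make that concrete, here is a minimal sketch (made-up numbers, using the Brier score as the accuracy measure) of scoring a forecaster against an uninformed baseline:

```python
# Minimal sketch: compare a forecaster's accuracy to an "uninformed" baseline.
# All probabilities and outcomes below are invented for illustration.

def brier(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

forecaster = [0.8, 0.7, 0.2, 0.9, 0.6]    # the forecaster's stated probabilities
baseline   = [1/6, 1/32, 0.5, 0.5, 0.5]   # "not really trying": base rates / coin flips
outcomes   = [1, 0, 0, 1, 1]              # what actually happened

print("forecaster:", brier(forecaster, outcomes))
print("baseline:  ", brier(baseline, outcomes))
```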
I mean, yeah, we're able to work it out with enough effort. Doesn't mean it's a bad idea to reduce the amount of effort required (and head off some actual misunderstandings at the same time).
Works for French readers too. Pedantic point: the ISO standard is very complex and you can't access it without paying. However, you can use RFC 3339, which is a simple subset of that. Ideally, use YYYY-MM-DD. The hyphens and the year coming first will signal to people that it's the ISO/RFC format.
Actually, it's not true that the standard is complex and you can't access it without paying. The whole point of a standard is to get everyone to use it, which requires that they know what it is, which they will not if it is complex and you have to pay to find out what it is. Therefore the "ISO standard" is not actually a standard.
Similarly, the actual C programming language standard is the supposed "draft" that can be downloaded for free, not the "official" version that you have to pay for.
Of course, despite this, everyone should use YYYY-MM-DD format for dates. Writing a date like 10/11/12 is ridiculous.
You can very easily use the C programming language standard (the actual one, not the draft) without paying for it, or even reading it. You just need to use a compiler that adheres to it. (And ideally one that gives you decent messages when you write non-standard code so you can fix the problem more easily.)
You rely hundreds of times a day on many other standards which are neither simple nor free to access, from your power plugs to the buildings you enter to the roads you use. Simply plugging a phone charger into a wall and your phone relies on many dozens of standards, some of which are pretty darn complex.
And yes, everybody should use YYYY-MM-DD, regardless of arguments about what a standard is or isn't, for a whole host of good reasons.
A standard needs to be known by those who use it. For power plugs, those would be the people who make plugs, and the people who make outlets, who may be willing to pay to read the standard. The rest of us just need to know how to use these plugs, not the exact distance between the prongs.
But everyone reads and writes dates.
And every C programmer needs to know how the language works. C is not a language where "it worked when I tried it with gcc 9.3 on an ARM v7 processor" is any sort of guarantee that it will work with another compiler on another processor. And if a compiler writer noticed that the "draft" and "official" standards differed in some meaningful way, they would be well advised to implement the version in the "draft", since that's what the programmers read.
Everyone reads and writes dates, not everyone reads and writes ISO dates. The reason I mentioned RFC 3339 is that it is a standard that you can easily adhere to. Precisely, YYYY-MM-DD is a subset of both ISO 8601 and RFC 3339, meaning that you're compatible with the "standard as in regulation" way to do dates, and you're also accessible to most people in the world.
Not everybody needs to or should be writing full ISO timestamps. The YYYY-MM-DD part of the ISO standard is just fine for the majority of situations, and you don't need to know the rest of the standard to do that.
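For what it's worth, a minimal Python example of emitting exactly that format (and a full RFC 3339-style timestamp, if you ever need one):

```python
from datetime import date, datetime, timezone

print(date.today().isoformat())   # e.g. 2022-01-24  (plain YYYY-MM-DD)

# Full timestamp, e.g. 2022-01-24T18:05:42+00:00
print(datetime.now(timezone.utc).isoformat(timespec="seconds"))
```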
If you're using "it worked" as opposed to turning on and reading the warnings from your compiler, you're so likely to make an error you miss when you think you're following the standard that the standard will make little difference. And no, compiler writers should not implement errors in the draft that have been fixed in the actual standard.
There are numerous aspects of C that you need to know but which you will not find out about from paying attention to compiler warnings. To pick two examples from many - a variable declared as "char" may be either signed or unsigned, depending on the implementation, and overflow in an arithmetic operation on signed integers is "undefined behaviour" (anything can happen, often without a warning), whereas overflow in operations on unsigned integers is well-defined (modular arithmetic).
(This is in a separate comment because I have the feeling that any comments with URLs are held for spam prevention, and perhaps even lost from time to time.)
That is why I only talked about "the ISO standard" and not "the standard". It is a fact that the ISO standard for dates (ISO 8601) is complex, and it is a fact that you (or someone) needs to pay to access it.
> Similarly, the actual C programming language standard is the supposed "draft" that can be downloaded for free, not the "official" version that you have to pay for.
I don't think this is true. If you sell a "standard-compliant" compiler, you'll have to buy the standard, and you can't rely on the drafts. I don't pay for the standards, so I'm not aware of any differences from the drafts. I hope there aren't any, since lots of people rely on them. But that's just how I, someone who doesn't use C for business, do things. If we ever use C at work, you can be sure I'll ask my employer to pay for the standard.
When it comes to things over which you have some control, like personal projects, how much overlap is there between predicting and planning? For example 96, 98, 100, 103-105; when you were making those predictions, were you setting priorities for the year?
Also, did you feel more pressured to follow through on things you rated as more likely to do, and less pressure on things that you didn't really think you were gonna get to anyways?
Personal predictions are meaningless. I am about to cross the street -- prediction: I have crossed the street by tomorrow. 50% predictions are also meaningless. I was ridiculed here for astrology: I have far fewer predictions published with timestamps on world affairs, but all but one have turned out to be correct. What do you base your predictions on?
Also, astrologers are expected to be 100% correct all the time, and since they are not, the whole discipline is dismissed. Amazing how many here got seriously involved with these predictions without ever questioning the epistemology...
Not at all. I would be interested in seeing you log your predictions the way Scott is doing here.
I agree that Scott should probably avoid predicting things that are in his control, or if it's personally interesting to him, he should score it separately.
If you have a web page where you document your predictions I'd be interested in seeing it.
personal predictions are not useful as evidence for/against someone’s prowess as a forecaster. but they can be useful in finding out how well you know yourself. of course, if you cheat, you get less out of it, but why would you cheat if you’re doing it for your own benefit anyway?
Yeah a lot of the criticisms, both for personal predictions and for 50% predictions, only apply if you assume Scott is acting in bad faith. As long as he's being honest with his predictions, and is primarily trying to learn rather than trying to prove something, both are totally fine.
I think the problem with Scott's approach is that he lumps local predictions (personal, work, blog, community) together with non-local predictions (politics, econ/tech, COVID-19). The one set of predictions tells him how well he knows and can predict things around him. The other set tells him how good of a handle he has on news-like stuff. One he is intimately involved in, the other he's mostly a spectator.
There's a failure mode that lumps together things that aren't related and makes faulty conclusions based off of hidden assumptions. I think putting a number on a thing (X% probability of Y happening) makes that failure mode easier to fall into, because it hides your assumptions. You see two numbers, both probabilities, and assume you can lump them together since they now have the same format. But a prediction about the date of your future wedding is fundamentally different from a prediction about the price of Bitcoin. It's not a question of whether you're predicting in good faith, but about whether you've accidentally fooled yourself by hiding your assumptions from yourself.
That will then lead you to calibrate based on faulty assumptions. Garbage-in/garbage-out, as it were. Here, Scott has ~30% predictions that are public/non-local, versus ~70% that are personal/local. I'm not sure he has enough predictions in either sub-set to do the separate calibrations well, making the exercise difficult to draw conclusions from. I would certainly not be confident in this kind of calibration exercise.
You're making a huge leap here from the obviously true [[these two styles of predictions involve non-identical skill sets and could have very different underlying calibration levels]] to [[these two styles of predictions are so fundamentally different that they don't share any skills]]. I agree that it's totally plausible to be well calibrated on one and badly calibrated on the other, but that doesn't make bucketing them together useless... "garbage-in/garbage-out" is way too harsh, it's something much more akin to "imperfect inputs, be careful what you do with your outputs"
And the obvious solution to this is to also check calibration on the buckets separately. I suspect Scott does this. But then you also say that's useless because he doesn't predict enough things? You'll always be able to slice the buckets more, e.g. "you can't lump politics predictions with medical predictions!" or "you can't lump predictions about your own wedding with predictions about your friends' relationships!". It seems like "this exercise is useless if you slice things in ways I don't like" and "this exercise is useless if you have buckets that are smaller than I like" will make this perpetually impossible.
But I agree with a scaled back version of your claim, if Scott hasn't bothered also looking at his calibration on the buckets separately, that seems like a useful exercise.
You're right that I'm a bit too overzealous in my reply. I'll take the correction.
We can test the hypothesis, "non-local predictions are fundamentally different from local predictions" by breaking down the predictions into the two buckets. I already predicted the N is probably not large enough to make any definite conclusions from them, and looking at the numbers (~30 predictions vs. ~70 predictions) that's probably true. And although we could probably sum up across multiple years, that's more time than I want to invest. I'm willing to break down one year of predictions, so let's call this a starting point:
Local predictions (community, personal, work, blog):
Stated confidence || Fraction correct
50% || 57%
60% || 47%
70% || 71%
80% || 100%
90% || 89%
95% || 100%
It looks like Scott is much better calibrated at predicting personal events than at predicting non-local events. He has a lot of imprecision in his ability to forecast non-local events that appears to be masked by the much larger number of local events he predicts (>2x more).
Again, I'm not claiming he's doing this intentionally to try and make his results look more calibrated. I think he's honestly trying to improve his ability to assign probability to future events. I'm just saying that it's easy to fool yourself with false assumptions, and that assigning a probability estimate to a thing can introduce false assumptions without you realizing what you're doing.
You're right that it's possible to create different categories ad infinitum, but that's attributing to me an intent to discredit Scott's approach in a way that's uncharitable and not what I was trying to do. I'm not interested in an infinite sub-division of predictions. For me, the idea of tracking the calibration matters because of the predictions themselves. I care about the predictions and understanding my uncertainty, not about whether I can get numbers on a graph to line up with a low variance from an expected trendline. In that sense, it's perfectly valid to sub-divide when lumping predictions together hides my ability to understand predictions and uncertainty. Sub-dividing is necessary to answer questions like:
"Do I have a realistic assessment of whether my goals for next year are achievable?"
"Am I overconfident in my understanding of the things I read about in the news?"
"Do I understand the relationships around me as well as I think I do?"
Those are all separate questions. And making a prediction like, [friend has gotten a job] does a lot to answer one question and nothing to answer another. Lumping 100 predictions together can easily bury the signal with a bunch of unrelated noise.
If the question is, "Will there be a major conflict between the US and Russia?" my interest in being well-calibrated on predicting that kind of question is fundamentally different than for "[friend moves back to Indiana]". Those two predictions are obviously unrelated. Yet Scott lumped those two questions together and applied a mathematical formula to them. I did not. I questioned whether it was wise. And, as you pointed out, went on to declare with equal zeal that I have no confidence in that approach.
Scott, considering you've been doing this a while, have you graded the accuracy of your predictions over time by thematic basket? It would be interesting to know if you're notably better (or even significantly better) at predicting things about your friends than about the economy, or about meta rather than yourself, and this would presumably highlight strengths and weaknesses that might help improve future predictions.
I look forward with voyeuristic glee to this year's prediction of the likelihood of you remaining married! Although possibly, assuming your wife reads the predictions, giving remaining married a high probability would in fact make it more likely to happen?
I would argue that all of the conflict ones should be graded true. Russia has over a hundred thousand troops on the border openly, so that’s quite the intimidation tactic regardless of shots fired, same with China’s flights at the edge of Taiwan air defense. Israel had the largest number of rocket attacks in at least several years, if not the decade plus in 2021, though that has been quieter recently after the retaliation strikes.
I think large numbers of troops (smaller earlier in the month) around Ukraine is still less of a flare-up than 2014, with the annexation of Crimea and open warfare in the Donbass. Likewise, there doesn't seem anything exceptional about the 2021 upswing in violence between Israel and the Palestinian militants compared to previous years (certainly Israel has been more engaged in the past).
Taiwan doesn't seem as clear cut, but I guess Scott's assessment criteria were for escalation of actions rather than just increased levels of the same actions as before.
So do I take from this that you expect Andrew Yang to be mayor of New York at some undetermined date in the future? Because this time round, I didn't rate his chances at all (and this seems to have been borne out), and by the bitter complaining about racist political cartoons, I don't think he'll run again or at least not soon.
In fact because the NYC mayor's term runs to the end of the year, that was a terrible prediction: Even if Yang had won, it most likely would have been false.
No, but I would call it worse than a referendum causing a border shift.
(Yes, I realize that voter turnout percentage is likely fudged and that there was Russian military presence, but polling for years showed that a referendum would likely favor Crimea joining Russia.)
For curiosity's sake, in which way did 84, "I have switched medical records systems", resolve ambiguously? Mid-transition? Changed to a later version with significant redesign but the same vendor? Uses the old system but grumbles more and does some of the work on paper? Or what?
Scott, maybe you shouldn't include predictions on something you can personally influence to a significant degree in your set. Or at least you should categorize them and judge their reliability in a totally different category, apart from others. Because this lends itself to "hacking", and a smart guy like you can "hack" his life enough that he gets precisely the mix of results he needs.
Look, 50% predictions are fine, the only question is: what do you learn if you are miscalibrated at 50%? It depends on how you determine which way to word the prediction ("X will happen" vs "X won't happen").
- If you determine it by flipping a coin, you learn nothing. Or maybe that your coin is biased.
- If you determine it by asking a friend, you learn something about your friend.
- If you phrase it positively rather than negatively, you learn that "things" are more likely to happen than you think. This is pretty vague.
IMO the obvious solution is to treat it as 50+epsilon: if you HAD to pick, would you pick X or not X? Then if you get 90% right, you learn that you are underconfident, just like every other bucket.
I've advocated binning inverse predictions before for lists from prediction markets, to improve readability in those lists and avoid similar things getting sorted far away just because of phrasing.
For reading lists of pending predictions I think it's really helpful.
For calibration I have a small worry that it could obscure some biases. If you are underconfident at 80% and overconfident at 20% (you hedge towards coinflips), those would wash. Even if you are just underconfident at 80%, a well calibrated 20% will still dilute that signal.
However, binning might be necessary. Maybe you just don't have enough predictions in a bin to analyze otherwise.
In which case... eh, bin, probably fine? You just need to do occasional spot checks to ensure there's no weird underconfidence or overconfidence above or below the 50% mark. (You could normalize your initial predictions so that they are only above 50% in the first place, dodging this whole issue from the beginning.)
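(A minimal sketch of that normalization, with made-up predictions: anything stated below 50% gets flipped to its complement, with the outcome inverted to match.)

```python
# Made-up (probability, came_true) pairs.
predictions = [(0.20, False), (0.80, True), (0.30, True), (0.95, True)]

# Flip anything stated below 50% to its complement, inverting the outcome to match,
# e.g. a 20% prediction that didn't happen becomes an 80% prediction that did.
normalized = [(p, hit) if p >= 0.5 else (1 - p, not hit) for p, hit in predictions]
print(normalized)
```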
If you find your bins are consistently too thin, I wonder if you could just group a few years together, at the risk of obscuring any changes in prediction style year to year.
PS - I'm waiting for the reply "yes, this was all discussed and resolved back in the comment threads years ago..." I'm at maybe 70% that Scott and all the smart people here have probably been down this road before. If so, sorry for retreading.
Wait, I thought #55 happened? (And I need to add to the congratulations?)
Not in 2021, I don't think
You’re right.
Presumably the wedding was scheduled in 2022 to help improve the calibration score.
I can totally see that happening.
Except that he was overconfident in the 80% bin, and it was a 20% prediction, so getting it wrong would have improved his calibration.
"as of 1/1/22"
You’re right also. I’d delete my post but I’ll just own the embarrassment.
Conspiracy theory: he delayed the wedding to make his prediction score better
This was my thought as soon as I read that prediction. Scott, we are on to you.
I guess if predictions are within your will to alter, it's not a great way to determine a brier score.
Or within your will to altar, in this case
clever
I'd unsubscribe in frustration if for that reason he had also resolved as false: "I post my scores on these predictions before 3/1/22"
Yeah this is like when I made a prediction about "will I have kissed (the person I was romantically interested in) by a particular date", and then... I was off by exactly one day. Grr.
I was about to complain that you resolved the prediction results prediction wrong but Murka.
"Biden approval rating (as per 538) is greater than fifty percent: 80%"
Why were you so confident in this prediction? Maybe I missed a post where you explained.
Every modern president except for Ford, Reagan, and Trump has had >50% positive approval rating at the end of their first year.
Biden should also get a lift from the post-COVID recovery.
Lol
Makes sense!
On the other hand, Biden barely started his term above 50%, and in April when these predictions were made he was hovering in the 53% range.
Combined with the general downwards trend of Presidential approval ratings through their term it shouldn't be surprising for him to lose 3% over eight months.
(Actually he lost 10%, which is more surprising)
With the growing trend of polling stability, presidents fall off less than they used to. I think Trump lost more than 3% from the beginning of his presidency, but I'm pretty sure that between June 2017 and November 2020, Trump was entirely within a 3% range (except maybe for a couple of days around an exceptional event or two).
I am not a fan of Biden at all, but given the tribal nature of US politics, I'd have predicted the same at the beginning of the year. The kind of people most pollsters talk to usually lean slightly to the left, and having a freshly elected Democratic president over 50% would not surprise me. Of course, that was before I witnessed what the Biden administration actually looks like. I expected Biden to be bad, but regular-level bad, and that would have kept him over 50% with tribal sentiment, some independent goodwill, and the composition of the polls. But it turned out to be much worse than that, so even the tribal sentiment couldn't hold him above water.
He's just your normal Obama style neoliberal ghoul, but he can't even throw us some lines about Hope and Change and such.
What a buzzkill.
I was going to say too bad about #96, because I was going to give a friend the book as a gift, then decided to just point them at the website.
What do you think about how Manifold.markets has evolved since you first covered it? Will you be posting anything on it for this year's predictions?
(Manifold Markets was briefly known as Mantic Markets)
Haha, thanks for asking, Lars! We'd love to host Scott's predictions on Manifold Markets, so that the ACX audience can come and directly place their own bets.
I haven't been following it as much as I should. I'm checking it right now and see a 17% chance that Trump will be president next month, which suggests something has gone wrong.
Well, someone is throwing away money/trying to monetarise their involvement in a coup perhaps?
I mean I will offer 10% odds for quite a lot of money if they want to do that, come find me.
There's two main challenges I see Manifold having right now:
1) "Trump has a 17% chance of being president next month"
The main thing with Manifold Markets is that its design basically lets anyone post a prediction, and empowers that same user to rule on its resolution. So for that particular prediction, pretty much nobody believes that's going to happen; what you're seeing is essentially a bet on whether *that particular user* is going to close the bet honestly. Based on some comments the user has made, a lot of people are somewhat skeptical.
It'll be interesting to see how this develops long term as certain users accrue a site reputation for resolving issues honestly and others don't. The main critique I've seen from skeptics of Manifold's approach is that you can't trust people to resolve their own markets, but Manifold presumably thinks it can work once you run it long enough, get enough scale, and get the reputation and community features in.
2) Not enough incentive/liquidity to push odds right to the edges
If I post a question on Manifold like "Will the sun explode tomorrow," with 24 hours to resolve, chances are it will only go down to about 5%. That's not because people believe the Sun still has a chance to explode, but because there's not enough incentive at the margins to push the question to 0% with a negative bet. You get better returns on investment for making contrarian bets, so you have to tie up a lot of capital for a small return just to push the remaining marginal percentage all the way down, and there's better things you could spend your fake money on. So even though there's a bit of "free money" left on the table for a sure bet like the sun not exploding tomorrow, it's not quite worth it to tie up the capital.
I wonder how much of this is a liquidity/volume problem, and how much is an incentives issue. Part of it is certainly a time preference thing -- if there's a sure bet ("will the sun explode next year") trading at 5%, then I'm even less likely to try to drive it down to zero because I don't want my capital tied up for a whole year just to get a meager return.
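A rough illustration of the incentive problem (simplified fixed-payout arithmetic, not Manifold's actual market mechanism): if a market that should be ~0% is sitting at 5%, betting it down is roughly like buying "NO" at 95.

```python
stake = 95             # play-money locked up buying NO at an implied 95%
payout = 100           # returned when the market resolves NO
profit = payout - stake
print(profit / stake)  # ~0.053: about a 5% return for the entire lock-up period
```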
Nevertheless, I think Manifold is pretty fun and I enjoy the site and I'm willing to follow along to see if they can figure all this stuff out. It's become part of my daily site checking habit, which is a rare thing to be able to say.
As larsiusprime said, that is the chance the creator, Dr P, would be dishonest. We also have a derivative market: https://manifold.markets/RavenKopelman/will-dr-ps-question-about-trump-bei
However, surfacing any internet person's market is still a bit of a problem, because dishonest market creators could dupe some into losing money.
We are almost ready with our solution which is: curated subsets of markets! Anyone will be able to form a sub-community on a selection of markets, usually on a particular topic.
Each sub-community gets its own social feed and leaderboards. And, they will be able to exclude bad faith actors, so the markets you see will be clear and accurate.
I'm really excited about the future of Manifold Markets! I think we can make forecasting ubiquitous through ease of use and solid incentive structures.
Nixonland when
What I do with my predictions, to solve the 50% issue, is to predict a probability of 49% or 51% but never 50%. (my predictions are at: https://pontifex.substack.com/p/predictions-for-2022 )
When I score these at the end of the year I'm going to put them in buckets, 10 points wide. E.g. there'll be a 20-30% bucket. If I make 3 predictions at 20%, 4 at 24% and 5 at 30%, then I expect 3*20+4*24+5*30 = 306%, so about 3 of them should come true. The reason I plan to do it this way is so that I don't have to restrict myself to %ages that are multiples of 10. E.g. I might have an event with a 3% probability. I may write some Python software to draw pretty graphs of the results; if so I'll put it on Github so others can use it.
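(A minimal sketch of that bucketing in Python, with made-up predictions; where an exactly-30% prediction lands is a convention choice.)

```python
from collections import defaultdict

# Made-up (stated probability, came true?) pairs.
predictions = [(0.20, True), (0.24, False), (0.24, True), (0.24, True),
               (0.30, False), (0.03, False)]

buckets = defaultdict(list)
for p, hit in predictions:
    buckets[int(p * 10) * 10].append((p, hit))   # half-open 10-point bins: 0-10, 10-20, ...

for lo in sorted(buckets):
    ps, hits = zip(*buckets[lo])
    expected = sum(ps)   # e.g. 3*0.20 + 4*0.24 + 5*0.30 in the example above
    print(f"{lo}-{lo + 10}%: expected ~{expected:.2f} to come true, actually {sum(hits)}")
```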
Why can't you just do 50% scores this way?
binary predictions don't have a built-in "direction". "there will be a SCOTUS appointment this year" vs "there will be no SCOTUS appointments this year". for the exercise scott is doing, if he found his 50% predictions to score positively 30% of the time, he can't meaningfully correct it. the only way to make this exercise meaningful (as scott does it) is to add some kind of confidence, like 55% vs 45%, or to add/remove conditions until he can make a more confident prediction.
(this doesn't necessarily apply in other contexts, like if a prediction market's value is 35%, a 50% bet says the market is overconfident. or if a prediction has multiple expected outcomes like "biden/warren/bernie/hillary will win the democratic nomination". but for a scott-compared-with-scott exercise, it's mostly meaningless.)
If you're looking to join a D&D campaign, we can make that happen. We're in the same timezone! :D CSLewin@gmail.com
"21. Google widely allows remote work, no questions asked: 20%"
As someone who didn't apply to work at Google recently because of their lack of commitment to remote work in the future, I think you've graded this one wrong. Maybe this is the case for current employees that are currently working remote, but it's not the policy for people looking to join Google in general.
I wondered at this as well. Google has basically let you work remotely no questions asked during the pandemic (except for when immigration laws get in the way), but that's very much not the case for the post-pandemic regime.
Seconding this for the same reason. Had a nibble late last year from one of their recruiters; when I asked about their remote policy, it was something like "we're allowing it for now to whomever on Covid grounds, but expect to return to the office as the pandemic recedes." Which was a dealbreaker for me.
Yeah I don't understand Scott's phrasing.
If the question is about *right now*, then yes, Google has postponed the RTO due to the omicron wave. And many employees informally assume there will be continued flexibility indefinitely.
But their official stance is absolutely that you are expected to return to the office (several days a week). (You can specifically request to be transferred to be a remote worker, but that's different).
Anyways, if the question was really intended as "in January of 2022, what will Googlers be doing", then sure, but I assume the intention was about Google's policy. And the answer there is clear cut (it could *change*, but Google's plans to return to the office are explicit).
+1 as a Google employee, you can work remote but questions are very much asked (and you need explicit approval).
Would have posted this. I finally received a reply from a Google recruiter just half an hour ago, after telling her earlier I would only apply if I could work remotely, and her answer was I can move to Pittsburgh, which is close to State College, a location my wife can work from without having to also change employers at the same time as me. It still doesn't seem they're willing to hire people who only want to work remote. And I don't particularly want to move to Pittsburgh.
Pittsburgh is also not all that close to State College.
Pittsburgh is close to State College in the same sense that Los Angeles is close to San Diego.
That said, Pittsburgh is excellent and you should give it a fair shake.
Agreed. I moved to Pittsburgh from the Bay Area and it's awesome.
It would be a lot clearer if you binned the predictions by confidence first and then graded them.
Genuinely sad about the "non-unsong book" one, I absolutely love Unsong and really hope you keep writing books. Impressive results though, nice!
What's your exercise routine?
It is always good to see concrete predictions being made and scored!
For fun, you could let Zvi see yours with explanation, you see Zvi's with explanation, you update with explanation, Zvi updates with explanation, you see Zvi's etc. etc. until you reach an endpoint. Then both look at the market and update. We could see if dialogue causes you to converge on the correct answers. But that would be pretty time consuming.
Sorry if this is a personal question, but I'm curious what you take oroxylum for and whether you think it's helpful.
2014: Bitcoin will end the year higher than $1000: 70%
2015: Scott "Bitcoin will end the year higher than $200: 95%"
2016: Scott "Bitcoin will end the year higher than $500: 80%"
2017: Scott "Bitcoin will end the year higher than $1000: 60%."
2018: Scott "Bitcoin is higher than $5,000 at end of year: 95%. Bitcoin is higher than $10,000 at end of year: 80%. Bitcoin is lower than $20,000 at end of year: 70%."
2019: Scott "Bitcoin above 1000: 90%. Bitcoin above 3000: 50%. Bitcoin above 5000: 20%"
2020: Scott "Bitcoin is above $5,000: 70% …above $10,000: 20%"
2021: Scott "Bitcoin above 100K: 40%"
How did the oroxylum work out? Does it feel at all subjectively similar to other dopamine reuptake inhibitors?
What is oroxylum for?
*30. Some new variant where no existing vaccine is more than 50% effective: 40%*
How did this resolve positive? Are you (and were you) intending against detectable infection (my guess), transmissible infection, hospitalization, or death? It might be worthwhile to explicitly mention in-line whether you believe that vaccines are 50% effective against death; even though I hate the idea of enforcing hymn-like repetition of the party line, it may be worthwhile to disambiguate so people don't update in the wrong direction~
That isn't really my standard or the standard way I read Scott? I agree that I thought about it for literally 30 seconds, figured out what he meant, and moved on. The road to legibility isn't to make it so that a path leads to your meaning being understood, but to make it so all paths etc.
Idk if it's offensive for me to ask how calibrated you are about people understanding what you say to them - I am not a regular poster. You have read the sequences and still think I am overstating miscommunication, even given your own acknowledgement of how useful "context" is?
He's just asking Scott to do the simple thing of clarifying precisely what the question was (and perhaps even linking to a source for the confirmation, if at all questionable).
I'm asking "how many people will update towards 'the vaccine isn't worth it for some people'" in the world with the post as-is vs the post as-revised. Your initial response indicated indifference to the question, which is probably excusable but isn't a normal way I think. Whether or not the misunderstanding has impact in terms of lives (which I would not defend), I think there are epistemology reasons to clarify.
Appreciate your not taking offense!
I did just update my comment to clarify that I read it the same way you did
I agree that taking some effort to avoid the illusion of transparency would be worth it!
Three doses Pfizer/Moderna seems to give about 70% infection protection. So no, I don't get that one either.
Yes, I agree this was phrased ambiguously. I decided to resolve it based on protection against infection, and found some studies showing vaccines were less than 50% effective against infection with Omicron.
The best sources I have seen indicate something like 70% protection from three doses (20% from two, and presumably how fresh the vaccination is matters here). So I would say the prediction should have been evaluated the other way.
Unless you want to get boosted every 3 months forever, there are now very strong reasons to believe current vaccines are well below 50% effective at preventing infection with Omicron.
I would be perfectly fine with that, tbh (barring heretofore unknown complications, naturally).
Yeah. Like if I could take a shot every three months and never get a cold again sign me up.
There are roughly two hundred common cold viruses, so how do you feel about two hundred shots every three months?
OK, the two hundred viruses aren't uniformly distributed, and probably some of them are similar enough to each other to be covered by a single vaccine, so how do you feel about twenty shots every three months to probably avoid the common cold? And how does your immune system feel about being asked to provide that many antibodies on a continual basis?
And while I agree that Scott should have worded that prediction more clearly, I think the general understanding of vaccines is that, unless stated otherwise, they are intended to provide lasting immunity to infection.
What if that shot made you sick for a few days?
As always, I think it would be much clearer if you phrased all your "predictions" so that they all have greater than 50% confidence. A prediction that something is 10% likely is actually a strong prediction that it *won't* happen.
A prediction of which team is 10% likely to win the NBA championships a decade hence is actually a fairly strong prediction that it *will* happen, since the baseline is quite a bit below 10%.
Well, I see what you're saying, but that doesn't really match with how he grades himself. Even given what you say, you would still bet against that team vs all the rest, since otherwise you would expect to lose money.
I basically did this in the scoring; it's just semantics, and from a semantic point of view I thought it would be confusing to have predictions like:
US does not win the Olympics: 90%
Russia wins the Olympics: 60%
India does not win the Olympics: 99%
A minor suggestion, please use green text for "successful" predictions, red text otherwise and keep the bold/italic convention. Right now I can't just scan to get a sense of how right/wrong you were.
I don't think it's desirable to be able to quickly glance at it and see "ahh, majority green, he's usually right". If all his predictions were 90%, it could look mostly green and still be badly calibrated, and if all his predictions were 10%, it could look mostly red and still be well calibrated. It's probably a good thing to need to pause on each one and take a moment to actually see the probability assigned to it.
50% is fine and useful.
You are making more than one such guess, and taken together they help with calibration.
Second, it stakes out a directional position. You don't know the outcome of a given prediction, but you're still "leaning" one direction or the other.
So it informs you about your biases in terms of which way you lean and whether those leans trend one way or the other. If you found that you got 70% or 30% of your 50% guesses correct, then you'd know to trust those types of reasoning, feels, etc. more or less. This makes them very useful, since the entire purpose of the exercise is to calibrate your guesses.
The narrow focus on the value of a 50% guess on an individual statement is... equally narrow in focus, and misses the point of this entire exercise.
Excluding 50% guesses does more harm to the calibration curve than including them. There is a tautological element in that any defined statement is a prediction phrased in one direction or the other. Obviously language can be used in ambiguous ways, but that's not the case here; where you make such errors, or the outcomes themselves are unclear, you throw those statements out.
"The value of such-and-such poll is ABOVE 50%" is not "truly" a 50% statement in a sense, since it is the opposite of the statement that the number will be BELOW 50%. But let's not get too pedantic here and fixate on meaningless trivialities while missing the point. The purpose isn't to craft vague statements true in many contexts, nor to assign a value to the act of making an individual 50% guess devoid of any context, where one might as well say "I don't know" instead. That's not what is happening here.
The larger goal into which all of these items must fit, the framework for the activity, is to run a calibration game using guesses, and if your 50% guesses are right or wrong at a rate different from 50% across all such statements... then that tells you something you want to know.
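For concreteness, here is a minimal sketch (Python, with made-up outcomes, not anyone's actual predictions) of the check being described: tally the 50% guesses and compare the hit rate to the drift you'd expect from chance alone at that sample size.

```python
import math

# Hypothetical 50% predictions: True means the statement came out the way it was phrased.
outcomes_50 = [True, False, True, True, False, True, True, False, True, True]

n = len(outcomes_50)
hits = sum(outcomes_50)
rate = hits / n

# Rough yardstick: the standard error of a fair coin over n trials.
se = math.sqrt(0.25 / n)

print(f"{hits}/{n} correct ({rate:.0%}); chance alone gives 50% +/- {se:.0%}")
# If the observed rate sits several standard errors away from 50%, the way you
# phrase and lean on 50% guesses carries information you can learn from.
```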
>I have switched medical records systems
Curious to know what made this ambiguous
I just forgot to italicize it, it didn't happen.
What about the other two marked ambiguous? They also sounded concrete enough that it's surprising that they are not scored. Such predictions seem worth commenting on.
"53. At least 7 days my house is orange or worse on PurpleAir.com because of fires: 80%"
"106. I have a queue of fewer than ten extra posts: 70%"
Also, I think that there are errors in the aggregation. I count 26 questions at 20/80, but you score 20/23, which suggests that not only did you not count the medical records system question and orange/purple, but you also didn't count a question marked resolved.
I’m confused about 33-36:
“33. Major rationalist org leaves Bay Area: 60%
34. MIRI relocates to Washington State: 20%
35. MIRI relocates to New England: 20%
36. MIRI relocates somewhere else: 20%”
So P(major rationalist org leaves Bay Area) = P(MIRI relocates to WA or NE) + P(MIRI relocates somewhere else) = 0.6. So doesn’t that imply that the probability of a major rationalist org besides MIRI leaving is 0%, since all of the probability mass is concentrated in MIRI?
Or less than 10%, since I'm binning these by 10%s, or MIRI is most likely to leave so in any world where other orgs leave MIRI has left too. But I agree I probably messed up there.
Are you ever going to tell us what oroxylum is supposed to do and/or actually does?
Mild stimulant, I think of it as the supplement version of Ritalin. See https://en.wikipedia.org/wiki/Oroxylin_A
Why did you go for Oroxylum over other nootropics that are more popular and have similar purported effects? Phenylpiracetam, bromantane, modafinil (and similar non-prescription finils), ALCAR and nicotine gum/patches are a few off the top of my head that are considerably more common. Did you try several before ending up with Oroxylum? Or did the pharmacological properties of Oroxylum look particularly promising? Sorry for being so inquisitive, but I'm really interested since this is a relatively unknown product and you chose it over a number of other options.
None of the things you mentioned are dopamine reuptake inhibitors. There are not many OTC supplements that fall in that category. It seems Rhodiola rosea does to some extent, but it also has a smattering of other effects. Modafinil does, but its primary effect is to work on orexin to boost wakefulness, with a really long half-life of around 12-15 hours. A lot of nootropics are advertised/talked about as "legal Ritalin" but work through other mechanisms. So if you're looking for "legal Ritalin" this may be your best choice. The BDNF-boosting effect seems like a nice extra here too, which may help with depression in addition to helping with memory consolidation. Note that whether Ritalin is a good idea is highly individual-specific, and proper dosage is critical so you fall on the right spot on the Yerkes-Dodson curve.
Well, being a dopamine reuptake inhibitor isn't essential to being a stimulant. Nicotine is likely a more effective stimulant than Oroxylum (obviously without having tried it) despite only impacting dopamine indirectly. Bromantane is a dopaminergic drug, despite not being a DRI. Caffeine is the stereotypical stimulant, and also only has an indirect effect on dopamine.
Yeah, there are other dopaminergic drugs, but I don't know of many OTC nootropics that selectively increase dopamine (although I haven't looked extensively). I have heard of people responding differently to Ritalin than Adderall (which increases dopamine release), but that's just anecdotal - not sure what the mechanism would be. Nicotine isn't really a pure stimulant, BTW - it has some stimulant properties but also some relaxing properties. Nicotine seems to modulate a lot of neurotransmitters, restoring things that are out of range to in range.
I'm just saying that any definition of stimulant which excludes the #1 stereotypical stimulant, caffeine, isn't much of a definition. :) Ritalin isn't exclusively a dopamine reuptake inhibitor either, its strong effect on norepinephrine reuptake probably causes the heavy anxiety/motivation increasing effects compared to amphetamine (especially dextroamphetamine).
I'm curious too. This is the first I've heard of Oroxylum. Replying to "subscribe" to this comment
Would you be willing to mention which brand you think is good?
I've heard good things about NootropicsDepot in general (although they tend to be more expensive).
you can grow your own if you're in USDA zone 10 or 11
49 58 "We have a debate every year over whether 50% predictions are meaningful in this paradigm; feel free to continue it." Scott, this is why you are so wonderful and make me happy when I read your stuff. keepin it real
Let's assume you make three 70% predictions for next year: 1. Aliens will not invade Earth next year. 2. Greuther Fuerth (a soccer team currently last in the table) will not win the Bundesliga. 3. Bitcoin will cost more than 10 million USD. If BTC doesn't go to the moon and aliens don't come to Earth, your calibration will show that you got 66.7% right in the 70% bracket. But in fact all three of your predictions would be terrible: 70% on two near-certain things and one near-impossible one. Another way predictions can be bad but well calibrated is if they are meaningless, for example: 1. Kanye West will win the US presidency: 0-10%. 2. I will play "heads or tails" and get tails: 50%. Can 50% predictions be good and valuable? Of course they can: if you know there is a 50% chance Bitcoin will cost more than $200,000, that is very valuable knowledge and a valuable prediction. If you think there is a 50% chance a weak team will win against a huge favourite, you can also win money on that. If you are not pursuing money but public intellectual reputation, the mechanics are still the same: for a prediction to be valuable it has to be matched against both reality and the opinions of other people. So here is how to do this yearly prediction thing much better:
1. Make a list of things to predict, like the one here ("Biden approval rating (as per 538) is greater than fifty percent", etc.), but without probabilities.
2. Then ask the audience of your blog, in the survey, to give their probabilities for those things.
3. Then write your own predictions.
4. At the end of the year, for every difference between your prediction and the average audience prediction, assume a virtual bet with odds at the midpoint of the two predictions. For example, if you think something is 70% and the audience thinks it's 50%, then with the middle of 60% the odds for your bet are 1.8.
5. Calculate the ROI/winrate of your predictions against the blog audience.
Isn't your prescription... exactly what happened at the end of this post...? Not against his readers, but he compared his success against Zvi and the market, which seems like exactly what you want (and with a scoring system that avoids a lot of the complexities you describe).
Also the idea that calibration is gameable by putting 70% on two sure things and one low probability thing is true, but only if Scott's goal is "show off how well calibrated I am". I, and I suspect most readers, have enough faith to believe his goal is to compete with himself and improve, not to fake success. As long as he (or anyone making these predictions) is being honest about their beliefs, the gameability shouldn't be an issue. Or at least it should be pretty unlikely, it'd come down to dumb luck to have 30% of your 70% predictions be extremely unlikely and 70% to be extremely likely.
Not exactly. He's comparing himself there to an expert forecaster and an entire market of experts, that you have good reason to think are going to be more accurate than totally uninformative picks. What you want to do is compare your predictive success to a base rate. The base rate isn't other experts. It's an uninformed prediction. You see this often with machine learning model scoring, where accuracy rates are compared to a model that selects randomly or without training.
In this case, something like "relations between country X and country Y will be worse than in each of the last five years" would have an uninformed base rate of 1 in 6. "Team X will win the Super Bowl" is 1 in 32. "President X has approval over 50%" should probably just be 50/50 if you assume, with a complete lack of information, that approval is uniformly distributed, but you might skew it to match some historical average of presidential approval ratings, which is probably not 50%.
But the basic point is to compare your performance to a predictor that isn't really trying, not to compare yourself to other experts. You can still do that, but their scores should also be similarly adjusted.
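A minimal sketch of that kind of baseline comparison (Python, with hypothetical probabilities and outcomes, not real scores): score the forecaster with the Brier score and compare against an uninformed predictor that just uses the base rate.

```python
# Hypothetical data: (forecaster's probability, uninformed base rate, outcome).
predictions = [
    (0.70, 1 / 32, True),   # "Team X wins the Super Bowl"
    (0.20, 1 / 6,  False),  # "Relations worse than each of the last five years"
    (0.60, 0.50,   True),   # "President's approval is over 50%"
]

def brier(prob: float, happened: bool) -> float:
    """Squared error of a probabilistic forecast; lower is better."""
    return (prob - (1.0 if happened else 0.0)) ** 2

forecaster = sum(brier(p, y) for p, _, y in predictions) / len(predictions)
baseline = sum(brier(b, y) for _, b, y in predictions) / len(predictions)

print(f"forecaster Brier: {forecaster:.3f}, uninformed baseline: {baseline:.3f}")
# The forecaster only demonstrates skill if their score beats the not-really-trying baseline.
```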
Scott, you’re a rationalist, use iso dates! Otherwise none of your UK readers believe that you graded your predictions before 3/1/2022
It sounds like you were able to decipher the dates.
I mean, yeah, we're able to work it out with enough effort. Doesn't mean it's a bad idea to reduce the amount of effort required (and head off some actual misunderstandings at the same time).
Works for French readers too. Pedantic point: the ISO standard is very complex and you can't access it without paying. However, you can use RFC 3339, which is a simple subset of it. Ideally, use YYYY-MM-DD. The hyphens and the year coming first will signal to people that it's the ISO/RFC format.
Actually, it's not true that the standard is complex and you can't access it without paying. The whole point of a standard is to get everyone to use it, which requires that they know what it is, which they will not if it is complex and you have to pay to find out what it is. Therefore the "ISO standard" is not actually a standard.
Similarly, the actual C programming language standard is the supposed "draft" that can be downloaded for free, not the "official" version that you have to pay for.
Of course, despite this, everyone should use YYYY-MM-DD format for dates. Writing a date like 10/11/12 is ridiculous.
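As a tiny illustration (Python's standard library, with a made-up date), here is the same date rendered both ways; only the first form is unambiguous and sorts correctly as plain text.

```python
from datetime import date

d = date(2022, 1, 3)

print(d.isoformat())           # 2022-01-03 -> year first, unambiguous everywhere
print(d.strftime("%m/%d/%Y"))  # 01/03/2022 -> a UK reader will happily read this as 1 March
```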
You can very easily use the C programming language standard (the actual one, not the draft) without paying for it, or even reading it. You just need to use a compiler that adheres to it. (And ideally one that gives you decent messages when you write non-standard code so you can fix the problem more easily.)
You rely hundreds of times a day on many other standards which are neither simple nor free to access, from your power plugs to the buildings you enter to the roads you use. Simply plugging a phone charger into a wall and your phone relies on many dozens of standards, some of which are pretty darn complex.
And yes, everybody should use YYYY-MM-DD, regardless of arguments about what a standard is or isn't, for a whole host of good reasons.
A standard needs to be known by those who use it. For power plugs, those would be the people who make plugs, and the people who make outlets, who may be willing to pay to read the standard. The rest of us just need to know how to use these plugs, not the exact distance between the prongs.
But everyone reads and writes dates.
And every C programmer needs to know how the language works. C is not a language where "it worked when I tried it with gcc 9.3 on an ARM v7 processor" is any sort of guarantee that it will work with another compiler on another processor. And if a compiler writer noticed that the "draft" and "official" standards differed in some meaningful way, they would be well advised to implement the version in the "draft", since that's what the programmers read.
> But everyone reads and writes dates.
Everyone reads and writes dates; not everyone reads and writes ISO dates. The reason I mentioned RFC 3339 is that it is a standard you can easily adhere to. Specifically, YYYY-MM-DD is a subset of both ISO 8601 and RFC 3339, meaning that you're compatible with the "standard as in regulation" way to do dates, and you're also accessible to most people in the world.
Not everybody needs to or should be writing full ISO timestamps. The YYYY-MM-DD part of the ISO standard is just fine for the majority of situations, and you don't need to know the rest of the standard to do that.
If you're using "it worked" as opposed to turning on and reading the warnings from your compiler, you're so likely to make an error you miss when you think you're following the standard that the standard will make little difference. And no, compiler writers should not implement errors in the draft that have been fixed in the actual standard.
There are numerous aspects of C that you need to know but which you will not find out about from paying attention to compiler warnings. To pick two examples from many - a variable declared as "char" may be either signed or unsigned, depending on the implementation, and overflow in an arithmetic operation on signed integers is "undefined behaviour" (anything can happen, often without a warning), whereas overflow in operations on unsigned integers is well-defined (modular arithmetic).
And finally, obligatory Xkcd: https://xkcd.com/927/
(This is in a separate comment because I have the feeling that any comments with URLs are held for spam prevention, and perhaps even lost from time to time.)
That is why I only talked about "the ISO standard" and not "the standard". It is a fact that the ISO standard for dates (ISO 8601) is complex, and it is a fact that you (or someone) needs to pay to access it.
> Similarly, the actual C programming language standard is the supposed "draft" that can be downloaded for free, not the "official" version that you have to pay for.
I don't think this is true. If you sell a "standard-compliant" compiler, you have to buy the standard; you can't rely on the drafts. I don't pay for the standards, so I'm not aware of any differences from the drafts. I hope there aren't any, since lots of people rely on them. But that's just how I, someone who doesn't use C for business, do things. If we used C at work, you can be sure I'd ask my employer to pay for the standard.
When it comes to things over which you have some control, like personal projects, how much overlap is there between predicting and planning? For example, 96, 98, 100, and 103-105: when you were making those predictions, were you setting priorities for the year?
Also, did you feel more pressure to follow through on things you rated as more likely to do, and less pressure on things that you didn't really think you were gonna get to anyway?
Scott, are there any predictions you suspect had their outcome changed due to you making them?
- ✅ and ❌ unicode symbols exist and I assume Substack supports them. Why not use those instead of hard-to-read bold/italics?
- And how is #106 unresolved? Is it ambiguous what it means to have a draft? (A document containing just a title? A half-written post?)
Personal predictions are meaningless. I am about to cross the street. Prediction: I will have crossed the street by tomorrow. 50% predictions are also meaningless. I was ridiculed here for astrology: I have published far fewer timestamped predictions on world affairs, but all but one have turned out to be correct. What do you base your predictions on?
Also, astrologers are expected to be 100% correct every time, all the time, and since they are not, the whole discipline is dismissed. Amazing how many here got seriously involved with these predictions without ever questioning the epistemology...
Not at all. I would be interested in seeing you log your predictions the way Scott is doing here.
I agree that Scott should probably avoid predicting things that are in his control, or, if they're personally interesting to him, he should score them separately.
If you have a web page where you document your predictions I'd be interested in seeing it.
personal predictions are not useful as evidence for/against someone’s prowess as a forecaster. but they can be useful in finding out how well you know yourself. of course, if you cheat, you get less out of it, but why would you cheat if you’re doing it for your own benefit anyway?
Yeah a lot of the criticisms, both for personal predictions and for 50% predictions, only apply if you assume Scott is acting in bad faith. As long as he's being honest with his predictions, and is primarily trying to learn rather than trying to prove something, both are totally fine.
I think the problem with Scott's approach is that he lumps local predictions (personal, work, blog, community) together with non-local predictions (politics, econ/tech, COVID-19). The one set of predictions tells him how well he knows and can predict things around him. The other set tells him how good of a handle he has on news-like stuff. One he is intimately involved in, the other he's mostly a spectator.
There's a failure mode that lumps together things that aren't related and makes faulty conclusions based off of hidden assumptions. I think putting a number on a thing (X% probability of Y happening) makes that failure mode easier to fall into, because it hides your assumptions. You see two numbers, both probabilities, and assume you can lump them together since they now have the same format. But a prediction about the date of your future wedding is fundamentally different from a prediction about the price of Bitcoin. It's not a question of whether you're predicting in good faith, but about whether you've accidentally fooled yourself by hiding your assumptions from yourself.
That will then lead you to calibrate based on faulty assumptions. Garbage-in/garbage-out, as it were. Here, Scott has ~30% predictions that are public/non-local, versus ~70% that are personal/local. I'm not sure he has enough predictions in either sub-set to do the separate calibrations well, making the exercise difficult to draw conclusions from. I would certainly not be confident in this kind of calibration exercise.
You're making a huge leap here from the obviously true [[these two styles of predictions involve non-identical skill sets and could have very different underlying calibration levels]] to [[these two styles of predictions are so fundamentally different that they don't share any skills]]. I agree that it's totally plausible to be well calibrated on one and badly calibrated on the other, but that doesn't make bucketing them together useless... "garbage-in/garbage-out" is way too harsh; it's something much more akin to "imperfect inputs, be careful what you do with your outputs".
And the obvious solution to this is to also check calibration on the buckets separately. I suspect Scott does this. But then you also say that's useless because he doesn't predict enough things? You'll always be able to slice the buckets more, e.g. "you can't lump politics predictions with medical predictions!" or "you can't lump predictions about your own wedding with predictions about your friends' relationships!". It seems like "this exercise is useless if you slice things in ways I don't like" and "this exercise is useless if you have buckets that are smaller than I like" will make this perpetually impossible.
But I agree with a scaled back version of your claim, if Scott hasn't bothered also looking at his calibration on the buckets separately, that seems like a useful exercise.
You're right that I'm a bit too overzealous in my reply. I'll take the correction.
We can test the hypothesis "non-local predictions are fundamentally different from local predictions" by breaking down the predictions into the two buckets. I already predicted that the N is probably not large enough to make any definite conclusions from them, and looking at the numbers (~30 predictions vs. ~70 predictions) that's probably true. And although we could probably sum up across multiple years, that's more time than I want to invest. I'm willing to break down one year of predictions, so let's call this a starting point:
Non-local predictions (politics, econ/tech, COVID-19)
(predicted || actual)
50% || 40%
60% || 29%
70% || 86%
80% || 57%
90% || 100%
95% || 100%
99% || 100%
Local predictions (community, personal, work, blog)
50% || 57%
60% || 47%
70% || 71%
80% || 100%
90% || 89%
95% || 100%
It looks like Scott is much better calibrated at predicting personal events than at predicting non-local events. He has a lot of imprecision in his ability to forecast non-local events that appears to be masked by the much larger number of local events he predicts (>2x more).
Again, I'm not claiming he's doing this intentionally to try and make his results look more calibrated. I think he's honestly trying to improve his ability to assign probability to future events. I'm just saying that it's easy to fool yourself with false assumptions, and that assigning a probability estimate to a thing can introduce false assumptions without you realizing what you're doing.
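To make this kind of bucketed check concrete, here is a minimal sketch (Python, with hypothetical predictions rather than Scott's actual ones) of splitting by category before computing the per-bucket calibration, instead of lumping everything together.

```python
from collections import defaultdict

# Hypothetical predictions: (category, stated probability, whether it happened).
predictions = [
    ("non-local", 0.90, True), ("non-local", 0.60, False), ("non-local", 0.80, True),
    ("local",     0.70, True), ("local",     0.70, True),  ("local",     0.50, False),
]

# Group outcomes by (category, confidence bucket).
buckets = defaultdict(list)
for category, prob, happened in predictions:
    buckets[(category, prob)].append(happened)

# Print a "predicted || actual" line per category and bucket.
for (category, prob), outcomes in sorted(buckets.items()):
    observed = sum(outcomes) / len(outcomes)
    print(f"{category:9s} predicted {prob:.0%} || actual {observed:.0%} (n={len(outcomes)})")
```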
You're right that it's possible to create different categories ad infinitum, but that's attributing to me an intent to discredit Scott's approach in a way that's uncharitable and not what I was trying to do. I'm not interested in an infinite sub-division of predictions. For me, the idea of tracking the calibration matters because of the predictions themselves. I care about the predictions and understanding my uncertainty, not about whether I can get numbers on a graph to line up with a low variance from an expected trendline. In that sense, it's perfectly valid to sub-divide when lumping predictions together hides my ability to understand predictions and uncertainty. Sub-dividing is necessary to answer questions like:
"Do I have a realistic assessment of whether my goals for next year are achievable?"
"Am I overconfident in my understanding of the things I read about in the news?"
"Do I understand the relationships around me as well as I think I do?"
Those are all separate questions. And making a prediction like, [friend has gotten a job] does a lot to answer one question and nothing to answer another. Lumping 100 predictions together can easily bury the signal with a bunch of unrelated noise.
If the question is, "Will there be a major conflict between the US and Russia?" my interest in being well-calibrated on predicting that kind of question is fundamentally different than for "[friend moves back to Indiana]". Those two predictions are obviously unrelated. Yet Scott lumped those two questions together and applied a mathematical formula to them. I did not. I questioned whether it was wise. And, as you pointed out, went on to declare with equal zeal that I have no confidence in that approach.
What are the redacted predictions!
This link is not related to the article, and so should not be advertised in this space. Scott JUST released guidelines for this.
Scott, considering you've been doing this a while, have you graded the accuracy of your predictions over time by the thematic baskets? It would be interesting to know whether you're notably better (or even significantly better) at predicting things about your friends than about the economy, or about meta than about yourself; presumably this would highlight strengths and weaknesses that might help improve future predictions?
I look forward to this year's prediction of the likelihood of you remaining married with voyeuristic glee! Although possibly, assuming your wife reads the predictions, giving remaining married a high probability would in fact make it more likely to happen?
I would argue that all of the conflict ones should be graded true. Russia has over a hundred thousand troops openly on the border, which is quite the intimidation tactic regardless of shots fired; same with China's flights at the edge of Taiwan's air defense zone. And in 2021 Israel saw the largest number of rocket attacks in at least several years, if not a decade-plus, though that has been quieter recently after the retaliation strikes.
Possibly only ambiguous on a couple, but I don’t think any are outright false.
I think large numbers of troops (smaller earlier in the month) around Ukraine is still less of a flare-up than 2014, with the annexation of Crimea and open warfare in the Donbass. Likewise, there doesn't seem anything exceptional about the 2021 upswing in violence between Israel and the Palestinian militants compared to previous years (certainly Israel has been more engaged in the past).
Taiwan doesn't seem as clear cut, but I guess Scott's assessment criteria were for escalation of actions rather than just increased levels of the same actions as before.
It's getting there but it's not quite flared up yet, certainly hadn't by 1/1. I am 80% that there will be a larger flare up than 2014 by 12/31/22
"Yang is New York mayor: 80%"
So do I take from this that you expect Andrew Yang to be mayor of New York at some undetermined date in the future? Because this time round I didn't rate his chances at all (and this seems to have been borne out), and judging by the bitter complaining about racist political cartoons, I don't think he'll run again, or at least not soon.
https://www.nbcnews.com/news/asian-america/new-york-daily-news-changes-drawing-after-backlash-over-andrew-n1268695
In fact because the NYC mayor's term runs to the end of the year, that was a terrible prediction: Even if Yang had won, it most likely would have been false.
9. Major flare-up (significantly worse than anything in past 5 years) in Russia/Ukraine war: 20%
Do you not think there is already a significant flare-up in the Russia/Ukraine conflict?
https://en.wikipedia.org/wiki/2021%E2%80%932022_Russo-Ukrainian_crisis#:~:text=US%20intelligence%20assessment%20on%20the,number%20could%20increase%20to%20175%2C000.
I would call 100000 soldiers at my border a major flare-up.
But would you call that "significantly worse" than an invasion and loss of territory?
No, but I would call it worse than a referendum causing a border shift.
(Yes, I realize that voter turnout percentage is likely fudged and that there was Russian military presence, but polling for years showed that a referendum would likely favor Crimea joining Russia.)
For curiosity's sake, in which way did #84, "I have switched medical records systems", resolve ambiguously? Mid-transition? Changed to a later version with a significant redesign but the same vendor? Still using the old system but grumbling more and doing some of the work on paper? Or what?
I also thought "On the Natural Faculties" was the most well written! Way to go!
Scott, maybe you shouldn't include predictions on something you can personally influence to a significant degree in your set. Or at least you should categorize them and judge their reliability in a totally different category, apart from others. Because this lends itself to "hacking", and a smart guy like you can "hack" his life enough that he gets precisely the mix of results he needs.
Look, 50% predictions are fine, the only question is: what do you learn if you are miscalibrated at 50%? It depends on how you determine which way to word the prediction ("X will happen" vs "X won't happen").
- If you determine it by flipping a coin, you learn nothing. Or maybe that your coin is biased.
- If you determine it by asking a friend, you learn something about your friend.
- If you phrase it positively rather than negatively, you learn that "things" are more likely to happen than you think. This is pretty vague.
IMO the obvious solution is to treat it as 50+epsilon: if you HAD to pick, would you pick X or not X? Then if you get 90% right, you learn that you are underconfident, just like every other bucket.
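A minimal sketch of that "50 + epsilon" / normalization idea (Python, hypothetical predictions): flip any prediction stated below 50% so every entry expresses the side you'd pick, then check calibration as usual.

```python
# Hypothetical predictions as (stated probability, whether the statement as phrased came true).
raw = [(0.50, True), (0.30, False), (0.80, True), (0.10, True), (0.50, False)]

folded = []
for prob, happened in raw:
    if prob < 0.5:
        # Restate "X at 30%" as "not-X at 70%", flipping the outcome to match.
        folded.append((1 - prob, not happened))
    else:
        # Treat 50% as 50 + epsilon: keep the side the statement was phrased in.
        folded.append((prob, happened))

hits = sum(h for _, h in folded)
print(f"{hits}/{len(folded)} of the folded predictions came true")
# Every folded prediction is now >= 50%, so "got 90% of them right" cleanly reads
# as underconfidence, just like any other bucket.
```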
I've advocated binning inverse predictions before for lists from prediction markets, to improve readability in those lists and avoid similar things getting sorted far away just because of phrasing.
For reading lists of pending predictions I think it's really helpful.
For calibration I have a small worry that it could obscure some biases. If you are underconfident at 80% and overconfident at 20% (you hedge towards coinflips), those would wash. Even if you are just underconfident at 80%, a well calibrated 20% will still dilute that signal.
However, binning might be necessary. Maybe you just don't have enough predictions in a bin to analyze otherwise.
In which case... eh, bin, probably fine? You just need to do occasional spot checks to ensure there's no weird underconfidence or overconfidence above or below the 50% mark. (You could normalize your initial predictions so that they are only above 50% in the first place, dodging this whole issue from the beginning.)
If you find your bins are consistently too thin, I wonder if you could just group a few years together, at the risk of obscuring any changes in prediction style year to year.
PS - I'm waiting for the reply "yes, this was all discussed and resolved back in the comment threads years ago..." I'm at maybe 70% that Scott and all the smart people here have probably been down this road before. If so, sorry for retreading.
Since you make predictions with finite resolution, you could avoid the 50% issue by picking ranges instead of points; e.g., [50%, 60%) vs [40%, 50%).
I comment separately to note that 50% predictions are absolutely meaningful, especially when "binned" together.