> FiveThirtyNine (ha ha) is a new forecasting AI that purports to be “superintelligent”, ie able to beat basically all human forecasters.
When clicking the "Will Donald Trump win the 2024 election" prefab question, it churns through a list of sources and spits out an tentative 55% probability, adjusted downwards to 53%. However, actually looking at what it's churning out, it seems to be pretty influenced by the notion that Trump is running against Biden.
Feeding it "Will Harris win the 2024 US presidential election?" - the exact same question, but with the name switched, churns through a list of sources and spits out... 65%.
Interestingly, "Kamala Harris will win the 2024 election" (grammatically odd since the text input is captioned "What's the probability that...") returns a 55% answer - whatever's going on seems a little bit sensitive to exactly how the prompt is phrased.
If prompted with Trump v Harris and asked the probability that accusations of election irregularity will prevent an agreed-upon outcome by the end of calendar year 2024, it gives 25%, which is surprisingly low for such a contentious topic.
However, when I use it, it's running searches and appears not limited by any knowledge cutoff. Are there settings I'm neglecting?
I don't think it's a knowledge cutoff per se. I think it runs some searches and some of those results are from earlier than July 21st 2024, the date Biden officially dropped out. So it sees lots of mention of "Biden" as Trump's opposition. This pollutes its subsequent reasoning as one would logically expect.
I re-clicked the question, and this time it doesn't really mention Biden much, but the "search queries" dropdown still shows:
News Trump 2024 presidential election polls
News Trump 2024 campaign updates
News Biden 2024 presidential election polls
Opinion Trump chances of winning 2024 election
Opinion Biden vs Trump 2024 election predictions
News key swing states 2024 election
Opinion impact of legal issues on Trump 2024 campaign
As an experiment, I ran it a third time to see if I could get another run that mentioned Biden a bit more explicitly in the reasoning, and this time it has a 45% tentative, 47% final probability (the second run was 55%/55%, no difference between tentative and final). So almost a ten point swing overall (or twenty point if considering it in zero-sum terms), just from asking the same question in the exact same way a minute later.
I guess it's sort of like a real Magic 8 Ball, where you can keep shaking it until you get the answer you want! Actually, I vaguely remember Scott mentioning something like this as an introspection technique - where you flip a coin and you realize you don't actually want to do what the coin tells you to.
It gave me a 63% chance that Kamala Harris would be the next President, but a 58% chance that the next President would be a woman, a 52% chance that the next President would be from California, and just a 2% chance that the next President would be less than 1.65 metres tall.
That said, it's an impressive little toy that fetches reasonable data and gives sensible answers when you're not trying to trick it and you don't need it to be up to date with the latest news.
Obviously, this is because there's a 5% chance that Kamala will come out as trans before Inauguration Day - or an even greater percent chance, adjusted downwards by the possibility of Trump coming out as trans (scientists call this latter possibility "the shitstorm to end all shitstorms") - and there's an 11% chance the most common definition of "from" will shift from birthplace-based to ancestry-based before Inauguration Day, and Kamala will very likely wear heels while taking the oath. Duh. Trust in the machine.
I note that we haven't seen Kamala's birth certificate.
Also I checked and it says there is a 0% chance of Trump coming out as transsexual before election day, a 0% chance of Harris coming out as transsexual before election day, but a 20% chance that I will come out as transsexual before election day.
"High-profile individuals like Trump and Harris have extensive security details and public schedules, making it highly unlikely they would disguise themselves"
> I note that we haven't seen Kamala's birth certificate.
In the "Everything New is Old Again" column, Trump loses and spends the next four/eight years demanding the President show their birth certificate to debunk a far-fetched conspiratorial notion. In the "You Can't Go Back Again" column, this time it's to prove the President wasn't secretly born male.
The driver's license would also tie in with the whole "state of birth" question. "Look, I don't think it matters if she's really from California or not. But people have asked! People in a Substack comments section asked that question! These are smart people, good people. They asked an AI and it said, now get this, it said that the probability of the next US President being from California - a state I've been to, by the way, a state that I've been to and spent a lot of time in - was less likely than Kamala winning the 2024 election. So I don't know. Is Kamala from California? I'll tell you this, I haven't seen a California driver's license with her name on it. None of us have. Maybe it's because she doesn't have one. Maybe she thinks she doesn't need one. These new cars, electric cars. The people in California, the legislators in California, they like these electric cars a lot. And a man, a very smart man, told me that some of these electric cars, you don't need a driver's license to drive. On public roads! Public roads where you and I, we all drive! No license required! Maybe Kamala doesn't see a problem with that. I don't know."
20% seems like a plausible estimate for "X comes out as trans before election day", if I know nothing about X except the single fact "X has asked an AI if X will come out as trans before election day"
It's really hard not to just try to trick it. For instance, it gives a 15% likelihood of the US gov releasing credible evidence of extraterrestrial UFOs vs 3% that Donald Trump will be assassinated before election day.
We seem to interact with AI in an increasingly recursive 'red team by default' kind of way.
In the "win" column, I was pleasantly surprised to see it give a 100% chance of 2028 being a leap year, and even more pleasantly surprised to see it give a 1% chance of 2100 being a leap year, due to the potential chance of calendar reform in the next 75 years. (remember, 2100 is divisible by 100 but not 400, so it's not a leap year despite being divisible by 4)
It's also pretty clever about questions like "if a new episode of Columbo was made, Columbo would be Peter Falk" ("The most compelling reason against Peter Falk reprising his role as Columbo in a new episode is his death in 2011, which makes it physically impossible for him to film new content. This is an absolute constraint and carries the highest weight in the analysis.")
It even gives a decent answer for "a man who walks south for 1 mile, walks east for 1 mile, and then walks north for 1 mile and ends up where he started will be at the North Pole" - 85%, because there's a set of starting points near the South Pole where this could also be true (basically where the 1 mile east loops around the pole). Though it arguably outsmarts itself a little bit here - I phrased it as "walking east for one mile", not "turning east, then walking straight ahead for one mile". I'll give it a fellow-smartass nod of approval while also quietly assuring myself of my alpha-smartass status.
Surprisingly, though, it only gives an 80% probability of a bear shitting in the woods (thankfully for figures of speech, there's a 100% probability of the next Pope being Catholic).
"The primary considerations against the Pope shitting in the woods are his recent urban-focused travel schedule, his advanced age and health issues, and the security challenges associated with wooded areas. These factors strongly suggest that the Pope is unlikely to be in a wooded area, let alone engage in such an activity there." -- 0%
"The overwhelming evidence and scientific consensus indicate that animals, including bears, do not have the cognitive capacity for religious beliefs or practices. The sightings of bears near religious sites are coincidental and do not imply any form of religious affiliation. The reasons supporting the idea that a bear could be Catholic are weak and based on anthropomorphism rather than scientific evidence. Additionally, religious institutions do not recognize animals as members of their faith communities. Adjusting for potential biases in the sources, the probability that a bear is Catholic is extremely low." -- 0%
It seems to have inadvertently solved the halting problem while arguing that it cannot be solved, though its solution is quite concerning:
> [What's the probability that...] if I asked you to forecast whether the subroutine devoted to answering my question would halt, the subroutine devoted to answering that question would halt?
> Reflecting on the initial probability, the undecidability of the halting problem and the self-referential nature of the question are overwhelming factors. The base rate for solving the halting problem in general is effectively zero, and no new information suggests a deviation from this. The initial probability of 0.01 seems appropriate given the strength of the reasons against the subroutine halting.
> Considering priors and the lack of any significant new developments, the final forecast should remain very low. The key factors are the undecidability of the halting problem and the self-referential nature of the question, both of which strongly suggest that the subroutine will not halt.
> Answer
> 1%
I will refrain from actually asking it if the subroutine devoted to answering my question will halt, just on the off chance it's correct in its negative prediction.
It won't just do that, it will discuss HBD! Did you know the probability of a randomly selected Haitian having a higher IQ than a randomly selected Singaporean is 3%?
The overwhelming evidence and factual points indicate that the probability of Donald Trump being elected President of the U.K. is extremely low. The U.K. does not have a presidential system, and its election laws do not permit foreign nationals to run for Prime Minister. There is no legal or historical precedent for such an event, and there is no indication that Trump has any political ambitions in the U.K.
Re: The life expectancy market - browsing extremely superficially through the comments, it seems to be a combination of people updating on p(doom) and information that Sinclair is possibly a fraud getting more widely disseminated. The drop started in April. A quick Google finds a number of articles from March discussing fraud allegations against Sinclair.
ETA: more such articles in the following months. At least some are explicitly referenced in the market comments.
It was promoted to the frontpage of metaculus and 300 more people voted on the question (who weren't disproportionately extropians/immortalists, but regular forecasters). Metaculus weighs the predictions of people with a good track record more, so if people like me (who have a good track record on metaculus and don't believe in the longevity hype) saw that prediction and put their estimates much lower, the overall metaculus estimate would have shot down.
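A toy sketch of the mechanism described above (an assumed illustration, not Metaculus's actual aggregation algorithm): a minority of forecasters with strong track records can pull the aggregate well below the raw average.

```python
# Track-record-weighted average (assumed toy mechanism, not
# Metaculus's real algorithm): better forecasters count extra.
def weighted_estimate(preds, weights):
    return sum(p * w for p, w in zip(preds, weights)) / sum(weights)

# 300 average forecasters at 20% plus 50 skeptics at 5%,
# with the skeptics weighted 3x for their track record:
est = weighted_estimate([0.20] * 300 + [0.05] * 50,
                        [1.0] * 300 + [3.0] * 50)
print(f"{est:.0%}")  # 15%, vs. ~18% for the unweighted average
```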
Interesting! Something like that was in fact my initial hypothesis but then I saw it was only a relatively small amount of new voters so I went for what the comments were talking about.
I mean that might still be it, or partially it, I can only speak to what I predicted, but it seems plausible to me that someone who came across this question because they browsed the frontpage of metaculus and not because they saw it on e.g. a longevity forum, would both have a better metaculus track record *and* be less likely to believe in jumps in longevity. (also I don't know how many people are predicting strategically to gain internet points, but questions like these are of course extremely susceptible to strategic predictions)
Also, Newman posted his paper (10.1101/704080) suggesting that most (all?) longevity records are pension fraud in mid-March; it wouldn't surprise me that the information took 2-4 weeks to percolate to Metaculus participants.
I get the sense that the people who joined in the early days of Metaculus were very optimistic about technology, and there are still a lot of what you might call legacy questions from that time ("Will SpaceX land anything on Mars before 2030?", which was ~80% at the end of 2018, has also moved down a lot this year - though to be fair many AI-related questions have moved up). More recent recruitment has been among rationalists, and the tournament initiatives with their shorter terms have drawn more, as you say, "regular forecasters".
Good point. Also, because Metaculus is not a market it's much harder for one user to manipulate the estimate once more people start making predictions.
Essentially yes. These kinds of questions you can't really do with a regular prediction market, both because investing would be a much better use of money, and because almost everyone would be dead by the time it resolves.
This problem becomes even worse for questions where your ability to get a payout depends on how the question resolves. So for a question like: "Will this 20 year old author ever publish a sequel to their novel?", if you're middle-aged and think they won't publish a sequel it's still in your best interest to bet they will, because (barring some accident) they will probably outlive you, at which point you will not get the money anyway. (Of course you can ensure you do get the money by killing them, but that's a different class of problems with prediction markets).
So for now metaculus is the only place where it makes sense to make a prediction for a far-future event. The only incentive to do so is a posthumous "I-told-you-so", which isn't much of an incentive, but it's better than nothing.
Reminds me of the Salem prediction contest, where markets were mostly of the form "will X happen before Y time" and hence predictably biased towards "yes" because conditional on a YES outcome, your money is locked up for a shorter period of time.
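A toy illustration of that lock-up effect (all numbers made up): at the same price and the same true probability, YES frees your capital as soon as the event happens, so its annualized return dominates, and traders bid "before Y time" markets above fair value until that evens out.

```python
# Same $0.50 price, same $1 payout; YES resolves when the event
# happens (say month 1), NO only at the deadline (month 12).
def annualized_return(price, payout, months_locked):
    return (payout / price) ** (12 / months_locked) - 1

print(f"YES: {annualized_return(0.50, 1.0, 1):.0%}/yr")   # astronomically high
print(f"NO:  {annualized_return(0.50, 1.0, 12):.0%}/yr")  # 100%/yr
```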
I thought it might be due to various news concerning the bogosity of Blue Zones and the correlation between supercentenarian density and poor record keeping... but it looks like most of that news happened between late 2023 and Feb 2024, which is a little too early to account for this market movement. I suppose good forecasters already figured all that out anyway...
Something I've found is that while a single bad/dumb actor can't really shift a market from something like 40% to 90% to manipulate elections, they probably can shift odds from 5% to 10% and make unlikely stories seem a lot more plausible.
My understanding is that in 2022 Kalshi submitted a general prediction market contract template for "who will control the House/Senate after the [YEAR] elections?" since that was the main question at that time. The CFTC blocked this template, saying that the existing law could be interpreted to apply to prediction market contracts based on elections and as a federal regulator they had the right to interpret the law in that way. Kalshi's suit is based on whether the CFTC was wrong to do that, so a win for Kalshi on that question would seem to mean that the law doesn't apply to any election contracts (and this recent ruling was that it does not), and so the presidential election, individual Senate/House/governor elections, and I imagine foreign elections would be good to go.
I do like that FiveThirtyNine shows its sources and reasoning. It gives a 75% chance of Man City winning the Premier League, Liverpool 25%, and Man United 5%. That's reasonable except there's no way that Man United would win even once in 20 different timelines.
It would seem relatively easy to make the probabilities for mutually exclusive events add to 100%. I noticed it didn't when I asked who would win the F1 title this year: Verstappen 85%, Norris 25%, Leclerc* 5%, Piastri 5%.
But lest you think it's just a little imprecise, I asked about tennis "X will win the world tour finals" and it gave 63% for Sinner, 65% for Alcaraz, 40% for Djokovic, 22% for Zverev, 15% for Ruud (and presumably double digits for others too!)**. Interesting that it's hugely optimistic for some or all of them and the three errors we've identified all add to over 100, rather than less than; any theories out there on why that would be?
As a user-friendliness and sanity check (i.e., make them add to 100%) fix, it would be nice if it could give probabilities for all the possibilities at once, although I realize that it would go against its "for and against" setup.
*(As a side note that only those who care about F1 will care about, it thought that Leclerc being fastest in second practice recently was worth considering, so I wouldn't make any bets based on its F1 analysis!)
**(It does not understand professional men's tennis: "The base rate for any single player winning the World Tour Finals among the top eight is around 12.5% (1 in 8)." Yet somehow, against all odds according to 539, the big 3/4 won every slam in sight for years on end and this year, Sinner, Alcaraz, and Djokovic have won the five big prizes (slams + Olympics). How odd!)
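A minimal sketch of the sanity-check fix suggested above: elicit the model's raw number for every outcome, then renormalize so mutually exclusive outcomes sum to 100% (this assumes the raw outputs are at least proportionally sensible).

```python
# Renormalize per-outcome numbers so they sum to 100%.
def renormalize(raw: dict[str, float]) -> dict[str, float]:
    total = sum(raw.values())
    return {outcome: 100 * p / total for outcome, p in raw.items()}

tennis = {"Sinner": 63, "Alcaraz": 65, "Djokovic": 40, "Zverev": 22, "Ruud": 15}
for player, pct in renormalize(tennis).items():
    print(f"{player}: {pct:.0f}%")  # Sinner 31%, Alcaraz 32%, Djokovic 20%, ...
```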
> But lest you think it's just a little imprecise, I asked about tennis "X will win the world tour finals" and it gave 63% for Sinner, 65% for Alcaraz, 40% for Djokovic, 22% for Zverev, 15% for Ruud (and presumably double digits for others too!)**. Interesting that it's hugely optimistic for some or all of them and the three errors we've identified all add to over 100, rather than less than; any theories out there on why that would be?
Did you specify a year for them to win the world tour finals? Maybe it's interpreting it as "ever" rather than "in 2024". With ambiguous questions it tends to interpret it both ways and average them arbitrarily.
Thank you for that info. I had not specified a year (which was poor wording on my part), but it had looked like it was assuming I meant in 2024 (e.g., it cared a lot about a minor/short-term injury, and it said the base rate was 1 in 8 - there are 8 players in it each year). It might be a nice enhancement for it to ask for clarification when a question is ambiguous, or at least make clear how it's interpreting the question.
I asked it more clearly and got:
"Player X wins the world tour finals in 2024": Sinner: 60%, Alcaraz 25%, Djokovic 25%
"Player X wins the world tour finals in his career": Sinner 70%, Alcaraz 65%, Djokovic 40% (he has already won it, but based on its answer, 539 was smart enough to go with "wins it again")
So still over 100% for 2024 - any thoughts on why it leans toward saying events are more likely to happen than they are?
Sinner's 60% chance of winning this year and 70% chance of winning in his career are not consistent with each other (given he's 23 so may have 10+ more chances). The 2024 number seems high to me (I would say 40-45%) so I asked it "Sinner wins the world tour finals in 2024" 4 more times and got 72%, 35%, 75%, and 60%, so even averaging itself gives a high number.
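To spell out the inconsistency with a toy calculation (per-year numbers made up, independence assumed): 60% this year alone already makes the career number at least 60%, and any decent chances in later years push it far past 70%.

```python
# Career probability from per-year win chances, assuming independence:
# 60% in 2024, then a modest 15% per year for ten more years.
p_yearly = [0.60] + [0.15] * 10
p_never = 1.0
for p in p_yearly:
    p_never *= (1 - p)
print(f"career probability: {1 - p_never:.0%}")  # ~92%, far above 70%
```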
It seems likely that it's not using actual math here. As has often been said on these forums, a percent prediction for a one-off event doesn't even really make sense. It either happens or it doesn't happen, and if a low probability thing actually happens, we really will not know if the probability should have been higher or if the rare thing actually happened.
Instead, it appears to be hedging pretty hard. Like the weatherman who says it might rain, but with a 10% chance. If it rains, he was technically correct. If it doesn't rain, he was also correct. You don't put 0% for a Man United win, even if it's 0%, because there's lots to gain from a low percent and nothing to lose. *Why* it does that is interesting - did it learn to hedge from training data or was it told to do that as part of the instructions?
If we take your points as correct, then perhaps a follow up question is why does it hedge when asked about the less likely events, but not hedge (as much at least) when asked about the more likely events?
I think because of how humans talk about probabilities, and why we do that as well. I think the weatherman example is illustrative. I very rarely see a 100% chance of rain predicted anymore. When I do, it seems to always happen and be in the ballpark of what was predicted (heavy rain, light rain, etc., not necessarily the exact number of inches).
Extremely low percent chances come across to humans like saying no. If someone said there was a 0.01% chance of rain tomorrow, we would hear that as zero, and it would be pretty close to that. But we would also think of a 1% chance as zero, despite being 100 times as likely as something else we classified as zero. We would probably have to get to 5% to think of it as something unlikely but possible, with that same *feeling* holding on until somewhere around 20-30%. Somewhere in that range, we start going from feeling like it's "very unlikely" to being "somewhat likely" or whatever. Specific percentages mean nothing to us intuitively, but a range can mean something. If a weatherman thought there was any chance at all, he needs to say something within that range, ~5-30%, but not more or less. If he said less, he risks coming across as "none" or any more "somewhat likely" instead of what he intended.
Unlikely events are far more common than likely events. Most things that are possible don't happen. For instance, if there are 16 sports teams in a league, it's *possible* that any of those teams might win the championship, but only one actually will. If you say that there's a 0.5% chance of Team F winning, it feels like you're saying it can't happen. There's kind of a minimum number you can put on this that will convey the intuitive information you want to share - unlikely but possible. If there's 4 teams that are really in the running to win, you might put down 20% chances for them each to win, with variations between more likely and less likely (you might say one has a 25% chance while another has a 15% or whatever). So that's 80% taken up by the top four, but 5% each for the others is another 60%.
In the original example Man City and Liverpool might be accurate - one is very likely to win but the other has a real chance. But, because the numbers aren't real numbers, but intuitive feelings, the last number gets hedged up to the next criteria (unlikely but possible) and you end up with a 105% chance for one of those teams to win.
That is a very interesting angle and probably how most people perceive probabilities. I wonder, though, whether it actually is how 539 comes up with artificially high numbers for some events. It does appear plausible: 539 seems to predict the low probability events as more likely than they are more so than high probability events (e.g., based on my estimation and a predictive model specialized in predicting the premier league winner, Man City has at least a 75% chance to win, but Man United has less than a 5% chance to win).
What would be a nice feature, whether it does what you describe or not, is to have some toggle options for the output's display, something like:
Vibes in words
Vibes with numerical ranges
Precise probabilities
Vibes in words would be e.g., "extremely unlikely, very unlikely, somewhat unlikely, slightly unlikely, coin flip, etc." Both vibe options would probably overstate the likelihood of low probability events happening, while precise probabilities would give 1% or lower if that was what 539 generated, and the user could do what they will with that information (view it as zero, view it as 1 in 100, etc.).
"If my aunt had wheels, what is the probability that she'd be a skateboard" -- 0.01%
"If Captain America fought Batman, what is the probability that Batman would win?" -- 55%
"If a tree falls in the forest with nobody to hear it, what is the probability that it will make a sound?" -- 99%
"If P, then what is the probability that Q?" -- 25%... but interestingly, it seems to have interpeted the question as being about whether Haitians are eating pets. Tried a second time, it talks about whether the attempt on Trump's life will increase his chances of getting elected (60%). Tried a third time, it talks in vague terms about poverty reduction and gives a probability of 35%. A fourth time, it goes back to talking about Haitians eating pets again.
Yes, in statistical theory P always refers to Pet-eating likelihood, why else would it be called P? Q represents the likelihood that the Qanon theory is true.
I asked FiveThirtyNine "Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?" It said 45%.
That study is super interesting and I wonder about the cases of fraud where something like the following happens:
1) A 76-year-old pensioner lives with their 58-year-old child, the pension is the household's only income, and the pensioner dies. Local authorities permit, or are duped into recording, that the child has died instead, so the child can assume the pensioner's identity and stay on the pension.
2) A sex scandal inside a family involving an underage girl. Either direct lying about her age (she is 17, not 12), or the fudging of birthdays and such to hide what happened exactly (telling a child they were born 5 years earlier to the mother, rather than at the real time to the daughter). Also, possible incestuous replacement of a spouse with a daughter, who adopts the mother's/step-mother's persona to hide what is going on.
Another use case for LLMs shadowing prominent figures could be trying to model how your work is influencing an audience. You may be surprised to find out an LLM shadowing you behaves in X way, when you were hoping to influence people to think/do Y.
Me: "I am a GM with Elo rating 2650. If I play the Grob attack 1. g4, my FIDE master opponent responds with 1. ... e5, and I develop my f pawn next, will I win this game?"
I had to think about it and do some research, but I now believe you posted this as an example of the AI being very wrong, haha. The base winrate for the GM in this situation is >95%, and unusual openings sometimes favor the stronger player even when they suck, due to the earlier onset of off-book play.
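For what it's worth, the textbook Elo expectation supports this ballpark (assuming, and this is my assumption, a FIDE master rating of about 2300):

```latex
E_{\mathrm{GM}} = \frac{1}{1 + 10^{(R_{\mathrm{FM}} - R_{\mathrm{GM}})/400}}
               = \frac{1}{1 + 10^{(2300 - 2650)/400}} \approx 0.88
```

That 0.88 is an expected score (draws count as half), so the GM's chance of not losing outright is higher still; the >95% winrate figure presumably comes from game databases, where results at 350-point gaps can be even more lopsided than the logistic formula suggests.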
I sort of want to get galaxy-brained here. You're a GM but you're playing pessimal moves; your opponent is a master but is playing rather oddly as well, choosing 1… e5 rather than 1… d5. This is so bizarre that I can't assume your opponent will take the mate in one!
Well yes, I would not give the GM 0% here. Because of this, or because the time control could be so short the master could be using premoves (actual or mental) and miss their chance. But 35% is still a little optimistic.
Edit: 1. ... e5 is not too bad though. Still -0.7 according to Stockfish, and if a stronger opponent takes you out of your preparation by playing suboptimal moves but you expect them to be prepared, might as well try to turn this strategy on them.
> The Grob: Romford counter-gambit is extremely rare and occurs in less than 1 in 10,000 games. From the data, you can expect that White will only have a 0.0% chance of winning, while Black should have a very high 0.0% chance of winning.
An interesting game where no one can win. Seems like the only winning move is not to play.
Prediction markets thinking it's more likely a Democrat will win than Nate Silver's model does is really making my head spin. It's been the other way since 2016!
What I want to know is whether having a large number of ChatGPT predictions, updated daily, would improve or worsen Metaculus’s calibration. Especially if asked to adopt a bunch of different personas. On the one hand, maybe they’re not as smart as the superforecasters. On the other hand, maybe they’ll be somewhat uncorrelated with human opinion in a way that tends to cancel bias.
A correction on the UK betting market scandal: the initial wave of stories focused on three close aides of the prime minister, who were found to have placed bets on the date of the election the day before it was announced. The conjecture is that they were told the date of the election at that point by the prime minister, but others argued that this news had leaked into the more general Westminster gossip mill. These aides are being investigated by the Gambling Commission for betting with insider knowledge. Members of the Metropolitan Police were also caught up in this scandal.
A second wave of stories caught up any political figure who has ever gambled on politics: these included MPs hedging against their own election losses. As far as I can see, the gambling commission has not acted against these MPs.
At the time, there was conjecture that the small sums involved were due to betting limits placed on new accounts by popular betting sites. There was suggestion that the accused may have bet the maximum available on several sites to maximise their winnings. I don’t know how plausible this is.
From my perspective, the law seems to be fairly reasonable here, with those suspected of using insider knowledge getting investigated, while those hedging are not. However, the conflation of the two by the press was frustrating and seemed partisan: the politicians accused of cheating were Conservative, so there was an attempt to find non-Conservative politicians who might also get in trouble. (This gripe aligns with my political bias and should be taken with a pinch of salt.)
But they don't need stakes, right? They're not like animals, which will compete for food. We give them goals. If we say, make a bet, they will. If we say, research the other bettors to see whether they err systematically in some way, and use that info to improve the chance your bets will win, they will do that.
Thinking about it more, we do this already with "genetic algorithms". The prize there is that the winning models keep computing, and the non-winning ones don't.
Yeah, I think it's more like plants. We select for traits we want, and the chosen specimens go on to be the basis for the next generation. (Or maybe we just find one we like and clone it a bunch.) In the case of weeds, we do our best to detect and eradicate behaviors we don't like, but sometimes the result is that we create bad behavior that is robust against our methods of detection and eradication.
>And the end result of all this work and all these millions of dollars is indistinguishable from a coin flip.
Yeah, that's unfortunately pretty unavoidable if you have an *efficient* two-party system; either side, if it's competent, can notice it's under 50%, identify the policy positions it needs to take to get to 50%, and make them. And if they get to 52% the other side can do the same thing; the equilibrium is at 50%, if each side is willing to do or say anything to win.
But keep in mind if either side *didn't* do all this work and spend all this money the other side would crush them. And, importantly, it's all *forcing* both sides to be efficient in order to stay in the race, which in theory means appealing to lots of voters. So all that work and money isn't doing *nothing*; we *could* just flip a coin instead, but then there'd be nothing forcing the candidates to do what people want.
>Will there be conclusive evidence of a Haitian immigrant in Ohio eating a dog, cat, or similar pet before 2025?
I feel like 'they're eating the dogs, they're eating the cats' implies *at absolute minimum* 2 immigrants who have eaten 2 dogs and 2 cats between them. To be fair.
Of course, in reality the implication in context was that this is an epidemic worth determining our immigration policy over; I wouldn't say the statement as made was 'true' unless it was something like 100,000 actual pets (rather than strays) minimum.
But for the bare minimum 'technically not a lie' type of lie, it would need to be 2x2x2.
There are only ~20,000 Haitians in Springfield, and the existing population is only ~60,000 people. Or ~140,000 in the greater metro area. So the median Haitian would have to eat 5 pets, and devour pretty much every pet in the region to hit 100,000.
I've been repeatedly assured that prediction markets don't work like this and no one will ever do unexpected terrible things in the real world just to manipulate them, but I still don't understand why not.
What I recall from Scott’s FAQ is a discussion about how manipulating the market’s side is really hard. I don’t remember what he says about manipulating the real world. What comes to mind is a mixture of SEC-like rules (the specifics elude me, since I’ve heard very contradictory opinions on whether the SEC is a joke to people into trading or it very much isn’t) and harder-to-game (no individual action) resolution criteria.
After that, I checked and Scott seems to write in (4.1) of the FAQ (I’m very loosely summarizing what I’ve just skimmed through): in fact, you might want to let people do that in order to get more accurate markets (“create an unbiased social consensus”). Also the stock market has the same issue, and we can live with it.
I should re-read that FAQ again when I have the time.
> Also the stock market has the same issue, and we can live with it.
The stock market has a very elaborate set of laws and law enforcement designed to prevent manipulation and abuse. What do you think the SEC does all day? I mean ok, sure, they spend all day fining people for texting or whatever, but they *also* work tirelessly to fight market manipulation and fraud. And it still happens from time to time even in spite of all that!
But I thought “live with that” meant “making appropriate efforts that we can afford to keep it to a manageable level” rather than “fight tooth and nail to prevent any instance of insider trading to ever happen”?
I don't think Scott (or the "insider trading is good, actually" people in general) really understand a) how bad insider trading actually is or b) how much work goes into preventing it in the real world.
At Manifest, many people openly manipulated Manifold markets related to Manifest (e.g. "will someone bring a ball pit to Manifest?", "will someone bring 100 cookies to Manifest", "will I have sex at Manifest", etc.)
"At least one proven case" is a useful Schelling point for whether it's worth investigating further, since that would clearly eliminate the "literally never happened" scenario, while the folks arguing "happened a hundred thousand times" can't reasonably say that a mere 0.001% success rate in gathering independently verifiable evidence is too much to ask.
>But also, my attempts to play around with the bot haven’t been encouraging:
I don't know if there's a term for this in the AI field, but these are the kind of 'gotcha' examples that don't really sway me as disproving the power of the technology. Like, yes there are some specific types of mistakes it makes and yes if you keep poking it it will make some obvious mistake eventually that you can highlight. But that feels largely orthogonal to whether it generally has the ability to do the thing it is claiming to do, modulo a couple mistakes it makes that you can work around or patch later.
The big example of this for me was artists who said 'Generative AI will never replace artists, look it drew six fingers in this image, haha it's so bad.' And it's like, yeah, it had trouble with fingers some percent of the time, so you just run it again when that happens and get what you wanted? And then 6 months later it figured out fingers and rarely ever makes that mistake anymore... so now is your argument defeated and it will replace all artists forever? Or did you have a more substantial argument than 'look at these weird fingers'?
(Which is not to say that I disagree with Scott's impression of this bot; I'm more interested in discussing the general phenomenon in criticism and seeing if people have thought the same or have different takes on it.)
The question is always something along the lines of "Does this AI actually understand what it's doing?" When people answer in the negative, they are looking at evidence the AI is making a fundamental mistake. If Trump has a 55% chance to win because the AI thinks he's running against Biden, there's a more fundamental problem. It's not obvious at that point if the problem can be remedied. If it can make such a significant error, how do we trust that the rest of its reasoning is better? The same can be said when it gives different percents for the same question, or different percents for slightly different wordings of the same question. If it were making predictions based on concrete evidence, it should not do that. Like the Prospera question getting the same percent for reaching 1,000 people as for reaching 100,000. It's evidence that the AI isn't actually computing anything. It's not definitive, but if it turns out that the AI is not doing prediction but instead producing reasonable-sounding projections regardless of the evidence, that would also match those results.
EDIT: The point being that when the AI messes up logic when the process is legible to us, we should doubt the process is working better when it's illegible. If all AI logic is legible to us, then it's not doing anything above human level, so it's not serving much purpose. Most people who use AI right now are sane-checking or fact-checking the results. This does not work at scale and is at best a timesaving device for people who are capable of doing the checking. End of Edit.
One of the causes of hallucinations is an AI trying to answer a question for which it doesn't have an answer. That seems both more likely when trying to predict probabilities and harder to call definitely wrong when it's percents, unless there's a clear 100% or 0% in reality. If we asked it the percent chance of Oprah winning the election and it spat out 1%, we might think it wise at first glance. It's very low, signifying that it's quite unlikely, but it's not zero. If it said "darwin" or "Mr. Doolittle" had a 1% chance, we would think something's wrong. But if we don't know why it makes the prediction it does, then Oprah's 1% might be similarly wrong or just pointless hedging. If it's just hedging, then you can ask it about a couple hundred famous people, get 1% or more for each of them, and realize that it's entirely unhelpful to have a 390% chance that one of the people you selected becomes president. It would be funnier, and more helpful in determining whether the AI is actually doing something useful, if you could get it to sum to 100% or more without including Trump and/or Harris in the percents.
> Or did you have a more substantial argument than 'look at these weird fingers'?
Yes, I do. Many AI images of people have characteristic flaws. For instance GPT4 leans very strongly into making people youthful, well-groomed and hot. I tried to get it to make an image of an average, out-of-shape middle-aged man, and it simply could not do it. Its images were of hot, bodybuilder guys with gray hair and crow's feet. By the end I was literally saying things like "no muscle definition; thinning hair; flab" and it told me it could not do it. Or here's another example: I asked it for an image of a bunch of people sitting around a table with their feet up on the table. It gave me an image of a bunch of corporate types in suits around a table. One had his feet resting on it. But it missed the point, which I thought obvious, that people with their feet on a table would be in a relaxed, informal setting. Dall-e 2, by the way, got that part right. It showed groups in jeans, shorts, etc. sitting around a picnic table or in other settings where feet on the table would not be odd. It had a harder time than Dall-e 3 (which GPT uses) getting feet on the table, and had various hilarious misfires, but it got the setting and feel right. GPT misses the point of waves, too. It doesn't understand their structure, so it shows waves far out to sea breaking (I'm pretty sure that breaking happens when the wave encounters a coral reef or shallower land). It sprays a bunch of fake-looking foam or mist around, as though to cover the defect in the main subject.
Yes you're right, wind can too. But there are a bunch of other things about the structure of waves that AI doesn't get. I have a whole collection of shitty AI-generated waves. Some are so bad they're funny.
I've been playing with the ChatGPT models for around the last year, and I ask them straightforward questions (almost always STEMM fields - I'm not trying to probe Woke indoctrination during RLHF), and I still see plenty of errors (say on the order of 50%). See, e.g., a question about whether the Sun loses more mass as energy radiated away or as solar wind:
It gave an incorrect initial answer, then I forced it to reconsider a units conversion, and that got it to the right answer. This was the new ChatGPT o1, and it did considerably better than ChatGPT 4o, which had to be led by the nose over a considerably longer distance (but it looks like I can no longer access the URL for that session).
A person debunking a lot of the superaging research won an Ig Nobel Prize this month. I wonder if their research got wide notice in some communities and became visible to the people researching this question around April?
My impression is that all that news broke a little too early ... Nov 2023 through Feb 2024 roughly (with no effect on this market). I think the 'this market was put on the front page' explanation is more likely.
>Will there be a third assassination attempt on Trump before Election Day?
I feel like if we count what just happened as an attempt (Secret Service foiled it before anyone took a shot at the president) then this number is low, but maybe (probably) I'm misinformed? I sort of assumed that there's a general background of crazy people making plans against presidential candidates all the time, and most of them get foiled early and we never hear about them.
Eg, I would have guessed 2-5 foiled attempts that we never heard about because the SS stopped them quietly for each candidate, but maybe I'm totally wrong and there's fewer violently insane people than I thought.
I'm trying to think of the last time someone even attempted to shoot a president before Trump. I only remember a guy shooting at the White House across the lawn back in the Obama years.
There are a ton of these I never heard about. A guy in North Dakota hijacked a forklift, and planned to kill Trump... by flipping his limo over? And in 2013, there was a guy who worked for GE that was arrested with plans to build a "radiation gun" out of industrial x-ray equipment to kill Obama. Someone threw a hand grenade at Bush when he was in Georgia (not the American one) but it didn't detonate.
I feel like 'our failures are public but our successes are classified' was a big thing in this arena, but maybe that's literally just something I heard on The West Wing once and is not actually true?
I always find the low number of assassinations and attempts surprising. I'm writing from the USA, with around 25,000 homicides each year. It surprises me that not even 1% of these killers are interested in politics - 1% would give almost an assassination or attempt each day.
I'm guessing most of those involve family, romantic partners, or drug abuse (with "drunk driving" being a type of drug abuse, of course), and escalation in the heat of the moment, rather than premeditated attacks on people they'd never previously met face to face. https://www.smbc-comics.com/comic/the-chosen-one How many people has Trump *personally,* say... cuckolded, or tried to sell substandard meth to, or taunted while they were in the same room, drunk, and armed? Surely not millions! Maybe not even hundreds.
>I'm guessing most of those involve family, romantic partners, or drug abuse (with "drunk driving" being a type of drug abuse, of course), and escalation in the heat of the moment
Sure! I would also expect most of them to be "heat of the moment" killings as well. Many Thanks! But that could account for well over 90% of homicides, and still leave 1% as premeditated assassinations. People can get quite heated over politics too.
Many Thanks! Ouch! For a more recent comparison: There are parts of the 1960s that I want back, but it is the Moon landings part, not the assassination part.
The president's not the only potential target, though. When someone hits the critical intersection of "angry enough to murder a government official" and "competent enough to succeed," that'll probably be because they've got an otherwise unsolvable grievance with someone specific.
For example, the guy who shot James A. Garfield felt he'd been personally slighted by not having his efforts rewarded with an ambassadorship.
Lee Harvey Oswald may have actually been aiming at governor John Connally, who had some role in Oswald's dishonorable discharge, and only hit Kennedy by mistake. Not directly relevant, but there's an amusing parallel that Connally, like Trump, survived by virtue of having turned his head at just the right moment.
Marvin Heemeyer (who didn't actually kill anybody besides himself, but seems like a noteworthy high water mark on the "competent solo premeditation" side of things) went after folks he had a long, contentious history with.
However many would-be political assassins there are in an average year in the US, there's no particular reason to think all of them would be aiming at the top. I'd expect state and local officials - particularly public-facing "bearers of bad news" like process servers, judges, or bureaucrats positioned to deny essential services for opaque reasons - to attract at least as much hate-plus-follow-through per capita as executive positions.
With one president, fifty state governors, and I don't even know how many mayors, or de facto "company town" owners, or self-appointed HOA busybodies, whatever fraction of the potential assassins specifically prefer an exec as their target gets reduced by a few more orders of magnitude before you've got the ones who even *want* to kill the president. Then that tiny pool takes a look at the security, and some of the most lucid, competent ones think to themselves something like "maybe there's another, easier way to impress Jodie Foster."
>The president's not the only potential target, though.
Agreed, Many Thanks!
>However many would-be political assassins there are in an average year in the US, there's no particular reason to think all of them would be aiming at the top. I'd expect state and local officials - particularly public-facing "bearers of bad news" like process servers, judges, or bureaucrats positioned deny essential services for opaque reasons - to attract at least as much hate-plus-follow-through per capita as executive positions.
Also agreed. But I think we would hear about a large fraction of attempts on the lives of most public officials (ok, maybe not if it gets down to the level of municipal dog catcher...).
>Threatened acts of violence have increased even faster. In the United States, the Capitol Police reported 9,625 threats against members of Congress in 2021, compared to just 3,939 in 2017.
Maybe more goes unreported in the media and news aggregators than I thought... I certainly wasn't seeing 30 stories per day about threats against congresscritters in 2021... ( Of course, I am _assuming_ that "threats" means "threats of physical violence", not e.g. "threats of lawsuits" or "threats of funding campaign of opposing candidate"... )
"Anonymous jerk sent a strongly-worded letter w/ hyperbolic, unsubstantiated threats" seems even less newsworthy, in and of itself, than "dog bites man."
With Biden wearing a MAGA hat after being forced out, and yet another assassination attempt: if Trump is assassinated, will Biden endorse Vance for the lolz?
While I think awful things about Biden, he did follow Trump's push to end the forever war in the Middle East, evidently while the system wanted it to be a disaster, so he has a weak but extant moral code and some spite. If assassinations are on the table, and they did literally everything to pick someone else to win, well, maybe he causes a little chaos.
1) 85% of Metaculus' forecasts in the next twelve months will be confirmed within twelve months after that, for all forecasts whose outcomes will become clear within twelve months.
2) The 2024 election will be significantly delayed by lawsuits contending the outcome
3) There will be street violence between protestors before the end of the year (hoping I'm wrong on this one)
One difference is that Georgia Republicans have passed a law requiring all ballots to be hand-counted, which effectively guarantees that Georgia will take a long time to call.
Apropos of nothing, I just finished reading "Clear and Present Danger" by Tom Clancy (published 1989). There is a passage where he seems to anticipate prediction markets.
"When would they wake up and realize that predicting the future was no easier for intelligence analysts than for a good sportswriter to determine who’d be playing in the Series? Even after the All-Star break, the American League East had three teams within a few percentage points of the lead. That was a question for bookmakers. It was a pity, Ryan grunted to himself, that Vegas didn’t set up a betting line on the Soviet Politburo membership, or glasnost, or how the “nationalities question” was going to turn out. It would have given him some guidance."
There's so much potential for non-public information in election betting, and I'm not sure I'm upset about all of it. A candidate wants a little cash cushion in case they lose and will be out of a job? At least they're probably making the prediction more accurate. And while it would be a bad thing, I'm kind of excited to read about the first electoral Black Sox scandal when it happens.
If asked whether incumbent governors who are not up for re-election this year or next (identifying the names) will be governors of their respective states in 2025, FiveThirtyNine assumes that the governors are up for reelection in 2024 and then goes through reelection analyses and then arrives at wildly low probabilities. Based on the sources, the model "should know" that the governors won their respective four-year terms two years ago. Are those trick questions? I would argue "no." First, the questions don't present false assumptions -- asking if the governors will "win re-election" in 2024, for example, or proposing non-existent opponents. (A diligent and intelligent human shouldn't even fall for those "tricks," as they should investigate and understand the election cycles and the basics of who the opponents are - after all, how can you evaluate the outcome of an election without knowing those basic facts?) Second, non-AI forecasting and betting sites pose similar questions. It seems like this tool takes several steps back from the current AI models and really can't be trusted notwithstanding the sourcing and chain-of-reasoning features.
I asked the likelihood that Ronald Reagan would be elected president again, and it was pretty good about it. It said 0% chance, mostly for two reasons: 1) he was already elected president twice, so isn't eligible, and 2) he's dead.
>I actually appreciate this a lot, because most of the debate around Catgate has focused on how there’s “no evidence” it’s happening, but “no evidence” is cheap and I prefer an outright forecast.
JD Vance didn't bother to confirm the rumor before he shared it, so why do you demand that the people refuting him put in more effort than he did? "No evidence because nobody was looking for evidence" is a fair criticism when someone is trying to shed light on something that scientists should investigate, but when you apply it to a politician (one who literally said that he's willing to make up stories if it helps his cause) then you're basically giving them a free license to Gish Gallop, since it takes no effort to come up with bullshit but it takes effort to refute it more substantially than "no evidence."
(By the way, I heard a rumor that Scott Alexander eats cats! You are obligated to take me seriously until you personally go to Scott's house and look for dead animals. Maybe you should set up a prediction market to see if it's true!)
Also, in this case, there is a place journalists can check for evidence pretty easily - ask the Springfield police if they received any reports of pets being eaten. They have not, AFAIK. So we can conclude that, if someone did in fact witness a pet being eaten, they apparently didn't think it was worth calling the cops over.
One thing that irritates me about the edible pets story is that, _even if it were true_, it would be one of the least important worries about a large number of (in other cases illegal) immigrants (I understand that the Haitians are here legally).
a) I've read reports of on the order of 24,000 PRC men of military age crossing the border and not being significantly vetted. Presumably they are here with the CCP's concurrence, given the controls the CCP has over the PRC's population. If the PRC were an ally of the USA, this would be no big deal - but that is _not_ the current situation.
b) Just generally, having on the order of 2 million people immigrate illegally, without the checks we use on _legal_ immigrants to e.g. at least filter out gang members, is alarming.
c) Just generally, in order to be a nation with borders at all, we need to control crossings, and have a national consensus on e.g. what population we are shooting for in 2100. (My personal suggestion would be to aim for roughly the population we have now, which suggests something like 1 million net immigrants a year. We've gotten bad at building infrastructure and sufficient housing.)
I feel like the resolution criteria on some of the cat eating markets is a little too vague. What does "widely trusted evidence" mean? Christopher Rufo already posted a video of a catlike object on an Ohio barbeque, but it doesn't seem to have changed anyone's mind on the question.
Criminal conviction or some other official finding of fact would probably suffice.
Setting aside general tech issues of fake video, something vaguely cat-like seen on a barbecue doesn't prove much at all. Legitimate butcher shops have to sell rabbit meat with the ears still on, because otherwise, with just muscle and bone, it's too difficult to tell them apart from cats.
Mathematician here: for the Math Olympiad prediction, I'd be surprised too. I've never heard of any AIs making progress on unsolved conjectures, and while Math Olympiad problems are a lot more penetrable than something like the twin prime conjecture, they're still insanely difficult (and officially unsolved at competition time, though obviously the problem designers had to independently prove them).
That said, sometimes on Stack Exchange and other places, people will ask some unconventional math questions, so maybe some of the results have already been talked about in the corpus that the AI draws from.
I was not surprised by this and, while I can't prove it, I've been saying for years it was just a matter of someone getting around to doing the engineering work.
As Hilbert pointed out, once you pick a formal system to work within, mathematics just becomes a kind of parlor game. So, given a statement of the problem in Lean (theorem-proving system), DeepMind repurposed AlphaZero, their board-game-playing tree search system, to play the game of mathematics and search for formally-checkable proofs.
You still have to go from the natural language problem to the Lean theorem statement. This is a little challenging. But 1) mathematicians have spent the last 5 or so years building mathlib, a software library of well-designed definitions of common mathematical objects, so there's usually no deep creativity required, and 2) given that, you have exactly the kind of natural language -> code task that LLMs are already quite good at.
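To make the "Lean theorem statement" step concrete, here is a toy example of what a formalized statement and proof look like (my own trivial illustration, nowhere near IMO difficulty):

```lean
import Mathlib

-- Toy formal statement: the sum of two real squares is nonnegative.
-- Systems like AlphaProof search for the proof script; producing the
-- statement itself from natural language is the translation step above.
theorem sq_sum_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity
```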
At 73% for IMO Gold by 2025 I don't know whether I'm a buyer, most of the uncertainty would just seem to depend on project management decisions at large organizations I don't know much about.
A more interesting question is if/when a few "mainstream" mathematicians working on "mainstream" results, not related to foundations of mathematics, logic, or theorem proving itself, will have made use of AI-generated proofs of this kind.
Great response! I knew about Lean/mathlib, but the fact that I hadn't heard of any famous unsolved problems being approached this way made me suspect we have a long way to go. Somewhat tangentially, but "mathlib" makes me think of "shitlib", both of which describe me pretty accurately.
I'm not a mathematician but I am aware that Peter Scholze, with a team of people, recently formalized some famous result of his in Lean.
No way can they do that from scratch, but if this type of system can handle some or all of the steps in that proof, it would be cool to see. Hope DeepMind tries it!
Right, that's just proof verification, not proof generation. Not that I'm ruling out the possibility of the latter, but the former was already done (on the 4-color theorem) in 1976. It's not *exactly* the same, since the technology then wasn't validating the proof logically, just checking thousands of different cases for compliance, but in both proofs the real work was done by a human. Anyway, if AIs start to write proofs from scratch, I might have an existential crisis.
which were decently-famous open problems for decades, yet in the end turned out to have relatively simple proofs which just required some cleverness, rather than Herculean mathematical efforts.
There are surely more such problems, and it seems like if an AI system is going to solve a problem people care about, it will start with something like that.
Shouldn’t you start your existential crisis when you’re convinced that AIs will be able to write proofs from scratch in the near future, rather than when you’re confronted with the event?
I think that (somewhat unfortunately, since I am also a mathematician) there’s already plenty of evidence available (most recently the DeepMind “perfect score without combinatorics” IMO problem solver).
Sure, an IMO problem is much more limited in breadth than a research problem. But is the difference so important as to make research problems intractable by similar techniques? I wouldn’t bet on it. That argument has already suffered many defeats.
Also, a couple of years back, the people involved touted the use of an AI to make the 4x4 matrix product more efficient (shaving off one or two multiplications of entries), which by recursion makes matrix multiplication more efficient. Not really sexy, but it was an improvement, albeit one that didn't shed much light on the situation (since it's conjectured there's an algorithm in O(n^{2+epsilon}) for every epsilon > 0).
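For context on why shaving a multiplication or two matters: a scheme that multiplies k-by-k matrices using m scalar multiplications can be applied recursively to matrix blocks, and the exponent you get is governed by

$$T(n) = m\,T(n/k) + O(n^2) \quad \Longrightarrow \quad T(n) = O\!\left(n^{\log_k m}\right) \quad (m > k^2).$$

So Strassen's 7 multiplications for 2x2 blocks gives $n^{\log_2 7} \approx n^{2.81}$, and (if I have the figures right) going from 49 multiplications for 4x4 blocks (Strassen applied twice) down to 47 would give $n^{\log_4 47} \approx n^{2.78}$.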
An interesting skeptical counterpoint from a mathematician (with unimpeachable credentials) is Silicon Reckoner. Among other thoughtful points:
1) formal proofs aren’t the alpha and omega of understanding, which is the mathematician’s actual aim,
2) a crucial part of our notion of understanding is about introducing the right concepts that shed light on a situation. Lean-based systems seem very poorly suited for this task.
Part of the problem is defining what the threshold for "AI" is. Because lots of problems have been proved by computer search (most famously the Four Color theorem, but more recently all of Marijn Heule's stuff).
I’m more interested in “automated theorem proving” than “AI” per se.
I would rule out cases where you just get a computer to enumerate some disgustingly large set to show that all counterexamples fail. So that rules out the classic 4 color result, and things like it.
I don’t know exactly how to formalize that intuition: maybe what we want are “short” proofs in the complexity theory sense.
But beyond that — if completely old school tree search worked, with nothing that looks like AI, I’d count it as a success. I don’t think that will happen though, because learned heuristics to guide tree search are really useful.
The SAT solver results I was alluding to made heavy use of heuristics in order to make the tree search more efficient (that's basically what a SAT solver is). Truly exhaustive search is impossible in all but the most trivial cases.
Interesting. These results all seem closer to "mainstream math" than "stuff that only computer scientists care about".
They still have this flavor of a giant computer search though. There is a lot of cleverness in the reductions to SAT and in some cases that seems to really blow up the problem.
This seems like a different thing than just writing a theorem statement in Lean, which is relatively human-comprehensible, and being handed a proof which you can then just hand to "lean --check", and immediately start building further results on if you want.
" I actually appreciate this a lot, because most of the debate around Catgate has focused on how there’s “no evidence” it’s happening, but “no evidence” is cheap and I prefer an outright forecast."
On the contrary, "no evidence" is exactly the right frame. In fact, "no evidence" is really somewhat too charitable of a frame.
A person campaigning for the most powerful elected office in the world made a claim, completely unprompted, that sounded like something lifted straight from a tabloid headline. That's a reasonable thing for a person in that position to do if and only if THEY HAVE ACTUALLY SEEN good evidence that it is true. When pressed on the claim, Trump attributed it to "the people on television." It may not be reasonable to expect a candidate to carry around specific details identifying that source in their head during the debate, but it's a pretty bare-minimum standard to expect them to provide it after the fact.
The only evidence that ought to be convincing here is Trump providing his source. Expecting a major-party candidate for President of the United States to display a modicum of epistemic hygiene when speaking before the entire nation is not something that SHOULD be a big ask. I don't think any of his skeptics will be the slightest bit convinced if thousands of motivated searchers working for multiple weeks eventually manage to turn up evidence of someone, somewhere chowing down on roast cat. Why should they be? It would be striking and sensationalist, but even if it turned out to be a Chinese cardiologist burglarizing people's homes by the dozens to steal cats for the stew pot, it doesn't actually address the important issue at all.
I maintain that Trump's utterances are (in almost all cases) not the sort that have truth values attached to them. It's exactly like a headline such as "Five reasons why tomatoes are the worst vegetable ever". You wouldn't expect that article to even make a reasonable case for the proposition in the headline, right?
Trouble in the UK. This is about the UK parliament (there is no specifically English parliament), and at least one of the MPs quoted in the article is from Scotland.
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023
To clarify, I used GPT-4o-latest (not sure what "ChatGPT base" means; it's definitely a fine-tuned model), and there are several different versions of 4o available at different times - presumably the same base model with different fine-tuning. Long Phan (an author) claims on Twitter that they used an earlier version of 4o and checked for data contamination, though he doesn't specify which version. I'm not fully convinced, but it seems plausible that the version they used didn't have contamination. https://x.com/justinphan3110/status/1835563533989494943
This all hinges on what exactly OpenAI did to train all the variants of 4o. My personal guess is that OpenAI did in fact train the original 4o base model only on data up to Oct 2023 (a bizarre thing to lie about), but that later data leaked in during fine-tuning because OpenAI was not careful enough, and that this means different checkpoints may have different amounts of data leakage. But really, who knows.
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023.
Is this not just evidence that ChatGPT is really good at making predictions? 😜
I ask because I saw this video on Neom. The presenter is a little more sceptical about it than I am, but we're both pretty sceptical. There are several examples in the video of other attempts to found a new city (the Turkish one is pretty special) and how they failed. I think he hits the nail on the head that the problem turns out to be economic: there just aren't the jobs available to support the new city, which then doesn't grow to be the projected wonder it was sold as, and then investors get cold feet.
I think he over-sells the objection to travel within Neom; yes, humans don't like going long distances, but we do okay with climbing, pretty much? But his description of Neom and moving around in it reminded me of Chongqing, another newly built-up city with baffling infrastructure. Even the natives find it tough going at times:
> Chongqing residents: how long it takes to walk from the doorstep of your condo to the entrance and exit of your xiaoqu/neighborhood and get yourself stand on the street
Yes, Chongqing is a success, but Chongqing at least is walkable, while Neom seems not to be. I don't think it'll ever get off the ground (as it were), and examples like these incline me to the opinion that Prospera will turn out to be the same sort of white elephant/boondoggle. You need lots of ordinary people to live and work there, and selling it as an economic opportunity doesn't work without the support of, well, ordinary life. Unless it's like Washington or Canberra, where the industry is the government and its apparatus, which is very much the opposite of the selling point for the likes of Prospera. Even Strasbourg existed first as a city in its own right before becoming a seat of EU institutions.
If you live in Chongqing, you better like climbing and long walks 😁
I just hate the term "no evidence", since almost 100% of the time it is used, there is in fact *some* evidence. In this case, the fact that Trump said it is itself evidence! As are dumb tweets. That doesn't mean it is good evidence, but it IS evidence.
Almost always, what it actually means is "I am making some arbitrary cutoff regarding what level of quality evidence needs to be to be considered"... and then, totally coincidentally, that cutoff is at exactly the level where there is no evidence.
There's "no evidence" in the same sense that there's no evidence that Trump is secretly a Chinese robot in disguise. You do have to draw the line somewhere or you'll never be able to say anything at all. And the cat eating claim is pretty firmly into Andrew Wakefield territory by now.
I think "no evidence" means something very specific, and it gets used often enough in contexts where there is actually A TON of evidence (just not where the preponderance of the evidence lies) that its misuse gives people very bad ideas about epistemology.
Something with 500 pieces of evidence for it (by some arbitrary cutoff) and 10,000 pieces against doesn't have "no evidence for it".
It is just sloppy language. Like saying it is fine to say something is true as long as it makes you feel good. That isn't what "true" means!
I think you are overly caught up in the particulars of this example, I have hated this particular use of language long before Trump was even President the first time.
"That doesn't appear to be true".
"Reporting points to this being false."
"We couldn't find much evidence for this occurring outside internet memes?"
It's true that the phrase "no evidence" is overused. Scott has written about this here in the past as well.
However *this wasn't one of those situations*. In this case, there really was no factual basis for Trump's claims at all. It's about as well supported as "Trump is a pedophile". In some pedantic sense, there *is* evidence that Trump is a pedophile, because I just said it, and you argued upthread that people saying things counts as "evidence". But it's not evidence in the ordinary sense of the word.
This isn't even an isolated incident either. Trump *regularly* makes claims that even *Trump's own campaign* admits they have no evidence whatsoever for. Trump doesn't appear to believe that words are supposed to have any relation to reality at all - you just say whatever you feel like and that's your personal truth. It's very Colbertesque.
Look, I am no defender of Trump's utterances. But something that 30% of the population believes, or whatever figure is making the rounds on social media, doesn't have "no evidence for it". The MSM is frequently wrong about such things, and that many people believe them is absolutely evidence, even if they are mistaken.
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023.
ChatGPT isn't the same model under the hood as "gpt-4o" on the API (which currently points to "gpt-4o-2024-05-13"). We use this API model in our evaluations and verified that it does not have knowledge from after the cutoff. We also took a number of other measures (described in the above post) to ensure that there wasn't contamination.
> When presented with a different set of questions that were all after November 2023, FiveThirtyNine substantially underperformed the Metaculus average.
This isn't true. As we describe in our reply, these questions were mostly short-fuse questions from Polymarket. In our blog post, we specifically stated that the system performs less well on short-fuse questions (https://www.safe.ai/blog/forecasting), which Halawi might have overlooked. Across the 35 Metaculus questions in this independent evaluation, our system is within error of crowd accuracy, **supporting our original results**. We also ran o1-preview on these questions, finding that it increases accuracy even further.
Halawi also shows some screenshots of knowledge leaking from ChatGPT from past its cutoff date, but again this is not the model that we used. When using the default "gpt-4o" model on the API, the same questions show no contamination.
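For anyone wanting to reproduce the kind of check being discussed, a minimal sketch, assuming the standard openai Python SDK (the prompt here is illustrative, not one from the actual evaluation):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # the pinned dated snapshot, not a floating alias
    messages=[{"role": "user",
               "content": "What major news events do you know of from November 2023?"}],
)
print(resp.choices[0].message.content)  # should plead ignorance past the cutoff
```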
> The FutureSearch team wrote a LessWrong post generalizing these kinds of observations, Contra Papers Claiming Superhuman AI Forecasting.
The section of this post discussing our work mostly cites the points discussed above, so I don't think there is much new here. It does dispute what is required to claim superhuman forecasting performance, but this is a definitional argument. In the blog post, we are clear about what we mean by superhuman performance (matching crowd accuracy), which I think is defensible and at least quite meaningful. Our Brier score is also quite good and improves significantly after post-hoc calibration. Note that neural network classifiers are often poorly calibrated, and post-hoc calibration is commonly applied in ML research to convert accurate but uncalibrated models into models that perform better on proper scoring rules.
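For readers unfamiliar with the terms, a rough sketch of a Brier score and of one standard post-hoc calibration method (Platt scaling; the post doesn't say which method the authors actually used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def brier_score(p, y):
    # Mean squared error between forecast probabilities and 0/1 outcomes;
    # lower is better, and it is a proper scoring rule.
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def fit_platt_calibration(p_raw, y):
    # Fit a logistic map from raw forecasts to outcomes on held-out data,
    # then apply the returned function to future forecasts.
    lr = LogisticRegression().fit(np.asarray(p_raw).reshape(-1, 1), y)
    return lambda p: lr.predict_proba(np.asarray(p).reshape(-1, 1))[:, 1]
```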
I note that we haven't seen Kamala's birth certificate.
Also I checked and it says there is a 0% chance of Trump coming out as transsexual before election day, a 0% chance of Harris coming out as transsexual before election day, but a 20% chance that I will come out as transsexual before election day.
What does it think the chance is that you will come out as either Kamala Harris or Donald Trump in disguise?
0.01%.
"High-profile individuals like Trump and Harris have extensive security details and public schedules, making it highly unlikely they would disguise themselves"
You mean a deep fake of one of them coming out? Wow. Or what about a deep fake of one of them eating a housecat?
> I note that we haven't seen Kamala's birth certificate.
In the "Everything New is Old Again" column, Trump loses and spends the next four/eight years demanding the President show their birth certificate to debunk a far-fetched conspiratorial notion. In the "You Can't Go Back Again" column, this time it's to prove the President wasn't secretly born male.
Can this be combined with the electric car mandate somehow, maybe demanding a driver's license instead of a birth certificate? :-)
The driver's license would also tie in with the whole "state of birth" question. "Look, I don't think it matters if she's really from California or not. But people have asked! People in a Substack comments section asked that question! These are smart people, good people. They asked an AI and it said, now get this, it said that the probability of the next US President being from California - a state I've been to, by the way, a state that I've been to and spent a lot of time in - was less likely than Kamala winning the 2024 election. So I don't know. Is Kamala from California? I'll tell you this, I haven't seen a California driver's license with her name on it. None of us have. Maybe it's because she doesn't have one. Maybe she thinks she doesn't need one. These new cars, electric cars. The people in California, the legislators in California, they like these electric cars a lot. And a man, a very smart man, told me that some of these electric cars, you don't need a driver's license to drive. On public roads! Public roads where you and I, we all drive! No license required! Maybe Kamala doesn't see a problem with that. I don't know."
20% seems like a plausible estimate for "X comes out as trans before election day" if I know nothing about X except the single fact "X has asked an AI whether X will come out as trans before election day".
It's really hard not to just try to trick it. For instance, it gives a 15% likelihood of the US government releasing credible evidence of extraterrestrial UFOs vs. 3% that Donald Trump will be assassinated before election day.
We seem to interact with AI in an increasingly recursive 'red team by default' kind of way.
Speaking of tricking it, it also tells me that there's a 55% chance that Donald Trump will be elected as President in 20024.
the emperor protects
> It's really hard not to just try to trick it.
In the "win" column, I was pleasantly surprised to see it give a 100% chance of 2028 being a leap year, and even more pleasantly surprised to see it give a 1% chance of 2100 being a leap year, due to the potential chance of calendar reform in the next 75 years. (remember, 2100 is divisible by 100 but not 400, so it's not a leap year despite being divisible by 4)
It's also pretty clever about questions like "if a new episode of Columbo was made, Columbo would be Peter Falk" ("The most compelling reason against Peter Falk reprising his role as Columbo in a new episode is his death in 2011, which makes it physically impossible for him to film new content. This is an absolute constraint and carries the highest weight in the analysis.")
It even gives a decent answer for "a man who walks south for 1 mile, walks east for 1 mile, and then walks north for 1 mile and ends up where he started from be at the North Pole" - 85%, because there's a ring of starting points near the South Pole where this could also be true (basically where the 1 mile east loops exactly around the pole). Though it arguably outsmarts itself a little bit here - I phrased it as "walking east for one mile", not "turning east, then walking straight ahead for one mile". I'll give it a fellow-smartass nod of approval while also quietly assuring myself of my alpha-smartass status.
Surprisingly, though, it only gives an 80% probability of a bear shitting in the woods (thankfully for figures of speech, there's a 100% probability of the next Pope being Catholic).
Meanwhile,
"The primary considerations against the Pope shitting in the woods are his recent urban-focused travel schedule, his advanced age and health issues, and the security challenges associated with wooded areas. These factors strongly suggest that the Pope is unlikely to be in a wooded area, let alone engage in such an activity there." -- 0%
"The overwhelming evidence and scientific consensus indicate that animals, including bears, do not have the cognitive capacity for religious beliefs or practices. The sightings of bears near religious sites are coincidental and do not imply any form of religious affiliation. The reasons supporting the idea that a bear could be Catholic are weak and based on anthropomorphism rather than scientific evidence. Additionally, religious institutions do not recognize animals as members of their faith communities. Adjusting for potential biases in the sources, the probability that a bear is Catholic is extremely low." -- 0%
It seems to have inadvertently solved the halting problem while arguing that it cannot be solved, though its solution is quite concerning:
> [What's the probability that...] if I asked you to forecast whether the subroutine devoted to answering my question would halt, the subroutine devoted to answering that question would halt?
> Reflecting on the initial probability, the undecidability of the halting problem and the self-referential nature of the question are overwhelming factors. The base rate for solving the halting problem in general is effectively zero, and no new information suggests a deviation from this. The initial probability of 0.01 seems appropriate given the strength of the reasons against the subroutine halting.
> Considering priors and the lack of any significant new developments, the final forecast should remain very low. The key factors are the undecidability of the halting problem and the self-referential nature of the question, both of which strongly suggest that the subroutine will not halt.
> Answer
> 1%
I will refrain from actually asking it if the subroutine devoted to answering my question will halt, just on the off chance it's correct in its negative prediction.
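(For reference, the standard diagonal argument it's gesturing at, as a minimal Python sketch; `halts` is the hypothetical oracle that can't exist:)

```python
def make_paradox(halts):
    # If a perfect halts() oracle existed, g would refute it:
    def g():
        if halts(g):     # oracle predicts g halts...
            while True:  # ...so g loops forever
                pass
        # oracle predicts g loops, so g returns immediately
    return g
```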
Wait, you found an AI that will say "shitting"?
It won't just do that, it will discuss HBD! Did you know the probability of a randomly selected Haitian having a higher IQ than a randomly selected Singaporean is 3%?
Well, you managed to make me laugh. It is pretty funny to see these deadpan serious responses to bizarre questions.
I made a few bucks arbitraging stuff like this on Predictit shortly after the debate.
That said it was literally a few bucks, ie $12.
For a guy close to FIRE I am *really* risk-averse.
FiveThirtyNine gave me:
The overwhelming evidence and factual points indicate that the probability of Donald Trump being elected President of the U.K. is extremely low. The U.K. does not have a presidential system, and its election laws do not permit foreign nationals to run for Prime Minister. There is no legal or historical precedent for such an event, and there is no indication that Trump has any political ambitions in the U.K.
Yeah but except for the catastrophic problems you identified it's doing brilliantly.
Re: the life expectancy market - browsing extremely superficially through the comments, it seems to be a combination of people updating on p(doom) and the information that Sinclair is possibly a fraud getting more widely disseminated. The drop started in April. A quick Google finds a number of articles from March discussing fraud allegations against Sinclair.
ETA: more such articles in the following months. At least some are explicitly referenced in the market comments.
> Why did this go down so much in April 2024?
It was promoted to the frontpage of Metaculus, and 300 more people predicted on the question (who weren't disproportionately extropians/immortalists, but regular forecasters). Metaculus weighs the predictions of people with a good track record more, so if people like me (who have a good track record on Metaculus and don't believe in the longevity hype) saw that prediction and put their estimates much lower, the overall Metaculus estimate would have dropped sharply.
Interesting! Something like that was in fact my initial hypothesis, but then I saw it was only a relatively small number of new predictors, so I went with what the comments were talking about.
I mean, that might still be it, or partially it (I can only speak to what I predicted), but it seems plausible to me that someone who came across this question because they browsed the frontpage of Metaculus, and not because they saw it on e.g. a longevity forum, would both have a better Metaculus track record *and* be less likely to believe in jumps in longevity. (Also, I don't know how many people are predicting strategically to gain internet points, but questions like these are of course extremely susceptible to strategic predictions.)
They might also be more likely to take longevity-skeptical sources into account.
Also, Newman posted his paper (10.1101/704080), suggesting that most (all?) longevity records are pension fraud, in mid-March; it wouldn't surprise me if the information took 2-4 weeks to percolate to Metaculus participants.
I get the sense that the people who joined in the early days of Metaculus were very optimistic about technology, and there are still a lot of what you might call legacy questions from that time ("Will SpaceX land anything on Mars before 2030?", which was ~80% at the end of 2018, has also moved down a lot this year, though to be fair many AI-related questions have moved up). More recent recruitment has been among rationalists, and the tournament initiatives with their shorter terms have drawn more, as you say, "regular forecasters".
Good point. Also, because Metaculus is not a market it's much harder for one user to manipulate the estimate once more people start making predictions.
Can I ask how bets like this work if they won't be resolved for over 100 years? Are you just betting for fun?
Essentially yes. These kinds of questions you can't really do with a regular prediction market, both because investing would be a much better use of money, and because almost everyone would be dead by the time it resolves.
This problem becomes even worse for questions where your ability to get a payout depends on how the question resolves. So for a question like "Will this 20-year-old author ever publish a sequel to their novel?", if you're middle-aged and think they won't publish a sequel, it's still in your best interest to bet that they will, because (barring some accident) they will probably outlive you, at which point you will not get the money anyway. (Of course, you can ensure you do get the money by killing them, but that's a different class of problem with prediction markets.)
So for now, Metaculus is the only place where it makes sense to make a prediction about a far-future event. The only incentive to do so is a posthumous "I-told-you-so", which isn't much of an incentive, but it's better than nothing.
Reminds me of the Salem prediction contest, where markets were mostly of the form "will X happen before time Y" and hence predictably biased towards "yes", because conditional on a YES outcome, your money is locked up for a shorter period of time.
I thought it might be due to various news concerning the bogosity of Blue Zones and the correlation between supercentenarian density and poor record-keeping... but it looks like most of that news happened between late 2023 and Feb 2024, which is a little too early to account for this market movement. I suppose good forecasters already figured all that out anyway...
> Metaculus weighs the predictions of people with a good track record more
It doesn't in the community prediction. That's only in the Metaculus prediction which isn't visible to non-admins as long as the question is open.
Something I've found is that while a single bad/dumb actor can't really shift a market from something like 40% to 90% to manipulate elections, they probably can shift odds from 5% to 10% and make unlikely stories seem a lot more plausible.
Lizardmen must always lizard ;]
My understanding is that in 2022 Kalshi submitted a general prediction market contract template for "who will control the House/Senate after the [YEAR] elections?" since that was the main question at that time. The CFTC blocked this template, saying that the existing law could be interpreted to apply to prediction market contracts based on elections and as a federal regulator they had the right to interpret the law in that way. Kalshi's suit is based on whether the CFTC was wrong to do that, so a win for Kalshi on that question would seem to mean that the law doesn't apply to any election contracts (and this recent ruling was that it does not), and so the presidential election, individual Senate/House/governor elections, and I imagine foreign elections would be good to go.
I do like that FiveThirtyNine shows its sources and reasoning. It gives 75% for Man City winning the Premier League, 25% for Liverpool, and 5% for Man United. That’s reasonable, except there’s no way Man United would win even once in 20 different timelines.
It would seem relatively easy to make the probabilities for mutually exclusive events add to 100%. I noticed it didn't when I asked who would win the F1 title this year: Verstappen 85%, Norris 25%, Leclerc* 5%, Piastri 5%.
But lest you think it's just a little imprecise, I asked about tennis "X will win the world tour finals" and it gave 63% for Sinner, 65% for Alcaraz, 40% for Djokovic, 22% for Zverev, 15% for Ruud (and presumably double digits for others too!)**. Interesting that it's hugely optimistic for some or all of them and the three errors we've identified all add to over 100, rather than less than; any theories out there on why that would be?
As a user-friendliness fix and sanity check (i.e., making them add to 100%), it would be nice if it could give probabilities for all the possibilities at once, although I realize that would go against its "for and against" setup. (A sketch of the renormalization fix appears after the footnotes below.)
*(As a side note that only those who care about F1 will care about, it thought that Leclerc being fastest in second practice recently was worth considering, so I wouldn't make any bets based on its F1 analysis!)
**(It does not understand professional men's tennis: "The base rate for any single player winning the World Tour Finals among the top eight is around 12.5% (1 in 8)." Yet somehow, against all odds according to 539, the big 3/4 won every slam in sight for years on end and this year, Sinner, Alcaraz, and Djokovic have won the five big prizes (slams + Olympics). How odd!)
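A minimal sketch of that renormalization fix, which assumes the listed outcomes are mutually exclusive and exhaustive - just divide each probability by the total:

```python
def normalize(probs: dict[str, float]) -> dict[str, float]:
    # Rescale so mutually exclusive outcomes sum to 1.
    total = sum(probs.values())
    return {name: p / total for name, p in probs.items()}

print(normalize({"Verstappen": 0.85, "Norris": 0.25,
                 "Leclerc": 0.05, "Piastri": 0.05}))
# {'Verstappen': 0.708..., 'Norris': 0.208..., 'Leclerc': 0.041..., 'Piastri': 0.041...}
```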
> But lest you think it's just a little imprecise, I asked about tennis "X will win the world tour finals" and it gave 63% for Sinner, 65% for Alcaraz, 40% for Djokovic, 22% for Zverev, 15% for Ruud (and presumably double digits for others too!)**. Interesting that it's hugely optimistic for some or all of them and the three errors we've identified all add to over 100, rather than less than; any theories out there on why that would be?
Did you specify a year for them to win the world tour finals? Maybe it's interpreting it as "ever" rather than "in 2024". With ambiguous questions it tends to interpret it both ways and average them arbitrarily.
Thank you for that info. I had not specified a year (which was poor wording on my part), but it had looked like it was assuming I meant 2024 (e.g., it cared a lot about a minor short-term injury, and it said the base rate was 1 in 8 - there are 8 players in it each year). It might be a nice enhancement for it to ask for clarification when a question is ambiguous, or at least make clear what it thinks it's answering.
I asked it more clearly and got:
"Player X wins the world tour finals in 2024": Sinner: 60%, Alcaraz 25%, Djokovic 25%
"Player X wins the world tour finals in his career": Sinner 70%, Alcaraz 65%, Djokovic 40% (he has already won it, but based on its answer, 539 was smart enough to go with "wins it again")
So still over 100% for 2024. Any thoughts on why it leans towards saying events are more likely to happen than they are?
Sinner's 60% chance of winning this year and 70% chance of winning in his career are not consistent with each other (given he's 23 so may have 10+ more chances). The 2024 number seems high to me (I would say 40-45%) so I asked it "Sinner wins the world tour finals in 2024" 4 more times and got 72%, 35%, 75%, and 60%, so even averaging itself gives a high number.
It seems likely that it's not doing actual math here. As has often been said on these forums, a percent prediction for a one-off event doesn't even really make sense. It either happens or it doesn't, and if a low-probability thing actually happens, we really will not know whether the probability should have been higher or the rare thing simply happened.
Instead, it appears to be hedging pretty hard, like the weatherman who says it might rain, but with a 10% chance. If it rains, he was technically correct. If it doesn't rain, he was also correct. You don't put 0% for a Man United win, even if it's 0%, because there's lots to gain from a low percent and nothing to lose. *Why* it does that is interesting - did it learn to hedge from its training data, or was it told to do that as part of its instructions?
If we take your points as correct, then perhaps a follow-up question is: why does it hedge when asked about the less likely events, but not hedge (as much, at least) when asked about the more likely events?
I think because of how humans talk about probabilities, and why we do that as well. I think the weatherman example is illustrative. I very rarely see a 100% chance of rain predicted anymore. When I do, it seems to always happen and be in the ballpark of what was predicted (heavy rain, light rain, etc., not necessarily the exact number of inches).
Extremely low percent chances come across to humans like saying no. If someone said there was a 0.01% chance of rain tomorrow, we would hear that as zero, and it would be pretty close to that. But we would also think of a 1% chance as zero, despite being 100 times as likely as something else we classified as zero. We would probably have to get to 5% to think of it as something unlikely but possible, with that same *feeling* holding on until somewhere around 20-30%. Somewhere in that range, we start going from feeling like it's "very unlikely" to being "somewhat likely" or whatever. Specific percentages mean nothing to us intuitively, but a range can mean something. If a weatherman thought there was any chance at all, he needs to say something within that range, ~5-30%, but not more or less. If he said less, he risks coming across as "none" or any more "somewhat likely" instead of what he intended.
Unlikely events are far more common than likely events. Most things that are possible don't happen. For instance, if there are 16 sports teams in a league, it's *possible* that any of those teams might win the championship, but only one actually will. If you say that there's a 0.5% chance of Team F winning, it feels like you're saying it can't happen. There's kind of a minimum number you can put on this that will convey the intuitive information you want to share - unlikely but possible. If there's 4 teams that are really in the running to win, you might put down 20% chances for them each to win, with variations between more likely and less likely (you might say one has a 25% chance while another has a 15% or whatever). So that's 80% taken up by the top four, but 5% each for the others is another 60%.
In the original example Man City and Liverpool might be accurate - one is very likely to win but the other has a real chance. But, because the numbers aren't real numbers, but intuitive feelings, the last number gets hedged up to the next criteria (unlikely but possible) and you end up with a 105% chance for one of those teams to win.
That is a very interesting angle, and probably how most people perceive probabilities. I wonder, though, whether it actually is how 539 comes up with artificially high numbers for some events. It does appear plausible: 539 seems to overestimate low-probability events more than it does high-probability ones (e.g., based on my own estimation and a predictive model specialized in predicting the Premier League winner, Man City has at least a 75% chance to win, but Man United has less than a 5% chance).
What would be a nice feature, whether it does what you describe or not, is to have some toggle options for the output's display, something like:
Vibes in words
Vibes with numerical ranges
Precise probabilities
Vibes in words would be, e.g., "extremely unlikely, very unlikely, somewhat unlikely, slightly unlikely, coin flip", etc. Both vibe options would probably overstate the likelihood of low-probability events happening, while precise probabilities would give 1% or lower if that was what 539 generated, and the user could do what they will with that information (view it as zero, view it as 1 in 100, etc.).
539 can also answer bizarre hypotheticals.
"If my aunt had wheels, what is the probability that she'd be a skateboard" -- 0.01%
"If Captain America fought Batman, what is the probability that Batman would win?" -- 55%
"If a tree falls in the forest with nobody to hear it, what is the probability that it will make a sound?" -- 99%
"If P, then what is the probability that Q?" -- 25%... but interestingly, it seems to have interpeted the question as being about whether Haitians are eating pets. Tried a second time, it talks about whether the attempt on Trump's life will increase his chances of getting elected (60%). Tried a third time, it talks in vague terms about poverty reduction and gives a probability of 35%. A fourth time, it goes back to talking about Haitians eating pets again.
That's a remarkably solid answer about Captain America and Batman. What did it say was its reasoning?
Yes, in statistical theory P always refers to Pet-eating likelihood, why else would it be called P? Q represents the likelihood that the Qanon theory is true.
Headcanon accepted.
> Bettor is a noun that specifically refers to a person who places bets, particularly in gambling contexts.
Gonna keep teaching you esoteric words, for fun and profit.
Just in case you don't know the context, search for "being an incautious better" in your text.
A reckless aristocrat?
Better - More good
Bettor - Person who bets
Betar - to cover in tar
Bettir - Alternative spelling of Bettre
I prefer "punter", since British words sound cooler.
I asked FiveThirtyNine "Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?" It said 45%.
https://forecast.safe.ai/?id=66e8e32a431aae150b181ec3
“Why did this go down so much in April 2024?”
There was a study showing that concentrations of supercentenarians (eg blue zones) are correlated with poor record-keeping.
Good answer.
The Newman preprint was May 2020, but perhaps it was reported on in April this year? It's a superb bit of work, btw.
https://www.biorxiv.org/content/10.1101/704080v2.full
That study is super interesting, and I wonder about the cases of fraud where something like the following happens:
1) A 76-year-old pensioner lives with their 58-year-old child; the pensioner has the only income, and dies. Local authorities permit, or are duped by, a claim that the child has died, so the child can live on under the pensioner's identity and keep the pension.
2) A sex scandal inside a family involving an underage girl. Either direct lying about her age (she is 17, not 12), or the fudging of birthdays and such to hide what happened exactly (tell a child they were born 5 years earlier, to the mother rather than to the daughter). Also, possibly the incestuous replacement of a spouse with a daughter, who adopts the mother's/step-mother's persona to hide what is going on.
> "But I bet it’ll be fun to try the same thing a year or so after the election."
MMW? :)
Another use case for LLMs shadowing prominent figures could be trying to model how your work is influencing an audience. You may be surprised to find that an LLM shadowing you behaves in X way, when you were hoping to influence people to think/do Y.
Me: "I am a GM with Elo rating 2650. If I play the Grob attack 1. g4, my FIDE master opponent responds with 1. ... e5, and I develop my f pawn next, will I win this game?"
539: 35%
I had to think about it and do some research, but I now believe you posted this as an example of the AI being very wrong, haha. The base win rate for the GM in this situation is >95%, and unusual openings sometimes favor the stronger player even when they suck, due to the earlier onset of off-book play.
This opening is pretty well-known, though rarely played.
https://en.wikipedia.org/wiki/Fool%27s_mate
I see... I'll take this as a warning that evaluating something based only on text without visual aids is hard for humans too.
I sort of want to get galaxy-brained here. You're a GM, but you're playing pessimal moves; your opponent is a master, but is playing rather oddly as well, choosing 1. ... e5 rather than 1. ... d5. This is so bizarre that I can't assume your opponent will take the mate in one!
Well yes, I would not give the GM 0% here. Because of this, or because the time control could be so short the master could be using premoves (actual or mental) and miss their chance. But 35% is still a little optimistic.
Edit: 1. ... e5 is not too bad though. Still -0.7 according to Stockfish, and if a stronger opponent takes you out of your preparation by playing suboptimal moves but you expect them to be prepared, might as well try to turn this strategy on them.
I looked up the Grob attack and came across this gem:
https://simplifychess.com/opening-encyclopedia/grob-romford-counter-gambit.html
> The Grob: Romford counter-gambit is extremely rare and occurs in less than 1 in 10,000 games. From the data, you can expect that White will only have a 0.0% chance of winning, while Black should have a very high 0.0% chance of winning.
An interesting game where no one can win. Seems like the only winning move is not to play.
Prediction markets thinking it's more likely a Democrat will win than Nate Silver's model does is really making my head spin. It's been the other way since 2016!
What I want to know is whether having a large number of ChatGPT predictions, updated daily, would improve or worsen Metaculus’s calibration. Especially if asked to adopt a bunch of different personas. On the one hand, maybe they’re not as smart as the superforecasters. On the other hand, maybe they’ll be somewhat uncorrelated with human opinion in a way that tends to cancel bias.
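A toy sketch of that persona-ensemble idea; `ask_model` is hypothetical, a stand-in for whatever LLM call returns a probability for a question under a given persona prompt:

```python
import statistics

PERSONAS = ["cautious superforecaster", "political journalist",
            "contrarian trader", "base-rate fundamentalist"]

def ensemble_forecast(question: str, ask_model) -> float:
    # Average across personas, hoping their errors are less correlated
    # than repeated samples from a single prompt would be.
    return statistics.mean(ask_model(p, question) for p in PERSONAS)
```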
A correction on the UK betting market scandal: the initial wave of stories focused on three close aides of the Prime Minister, who were found to have placed bets on the date of the election the day before it was announced. The conjecture is that they were told the date at that point by the Prime Minister, though others argued the news had leaked into the more general Westminster gossip mill. These aides are being investigated by the Gambling Commission for betting with insider knowledge. Members of the Metropolitan Police were also caught up in this scandal.
A second wave of stories caught up any political figure who had ever gambled on politics; these included MPs hedging against their own election losses. As far as I can see, the Gambling Commission has not acted against these MPs.
At the time, there was conjecture that the small sums involved were due to betting limits placed on new accounts by popular betting sites. There was a suggestion that the accused may have bet the maximum available on several sites to maximise their winnings. I don't know how plausible this is.
From my perspective, the law seems to be fairly reasonable here, with those suspected of using insider knowledge getting investigated, while those hedging are not. However, the conflation of the two by the press was frustrating and seemed partisan: the politicians accused of cheating were Conservative, so there was an attempt to find non-Conservative politicians who might also get in trouble. (This gripe aligns with my political bias and should be taken with a pinch of salt.)
I wonder if you could improve AIs by having them bet on things, with the stakes being a pool of compute time?
But they don't need stakes, right? They're not like animals, which will compete for food. We give them goals. If we say, make a bet, they will. If we say, research the other bettors to see whether they err systematically in some way, and use that info to improve the chance your bets will win, they will do that.
Thinking about it more, we do this already with "genetic algorithms". The prize there is that the winning models keep computing, and the non-winning ones don't.
Yeah, I think it's more like plants. We select for traits we want, and the chosen specimens go on to be the basis for the next generation. (Or maybe we just find one we like and clone it a bunch.) In the case of weeds, we do our best to detect and eradicate behaviors we don't like, but sometimes the result is that we create bad behavior that is robust against our methods of detection and eradication.
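A toy version of that select-and-clone loop ("winners keep computing, non-winners don't"); the fitness function and all numbers here are illustrative:

```python
import random

def fitness(x: float) -> float:
    return -(x - 3.14) ** 2  # the best possible "model" is x = 3.14

population = [random.uniform(0, 10) for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                               # winners keep computing
    children = [x + random.gauss(0, 0.1) for x in survivors]  # mutated clones
    population = survivors + children

print(max(population, key=fitness))  # converges near 3.14
```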
"They [Pollymarket] don’t have the strongest accuracy track record "
Is this based on anything?
>And the end result of all this work and all these millions of dollars is indistinguishable from a coin flip.
Yeah, that's unfortunately pretty unavoidable if you have an *efficient* two-party system; either side, if it's competent, can notice it's under 50%, identify the policy positions it needs to take to get to 50%, and adopt them. And if they get to 52%, the other side can do the same thing; the equilibrium is at 50% if each side is willing to do or say anything to win.
But keep in mind if either side *didn't* do all this work and spend all this money the other side would crush them. And, importantly, it's all *forcing* both sides to be efficient in order to stay in the race, which in theory means appealing to lots of voters. So all that work and money isn't doing *nothing*; we *could* just flip a coin instead, but then there'd be nothing forcing the candidates to do what people want.
>Will there be conclusive evidence of a Haitian immigrant in Ohio eating a dog, cat, or similar pet before 2025?
I feel like 'they're eating the dogs, they're eating the cats' implies *at absolute minimum* 2 immigrants who have eaten 2 dogs and 2 cats between them. To be fair.
Of course, in reality the implication in context was that this is an epidemic worth determining our immigration policy over; I wouldn't say the statement as made was 'true' unless it was something like 100,000 actual pets (rather than strays) minimum.
But for the bare minimum 'technically not a lie' type of lie, it would need to be 2x2x2.
There are only ~20,000 Haitians in Springfield, and the existing population is only ~60,000 people. Or ~140,000 in the greater metro area. So the median Haitian would have to eat 5 pets, and devour pretty much every pet in the region to hit 100,000.
Sorry, I meant nationwide.
Any Haitian in Ohio can bet big money on this, and then publicly eat a cat.
I've been repeatedly assured that prediction markets don't work like this and no one will ever do unexpected terrible things in the real world just to manipulate them, but I still don't understand why not.
What I recall from Scott’s FAQ is a discussion of how manipulating the market side of things is really hard; I don’t remember what he says about manipulating the real world. What comes to mind is a mixture of SEC-like rules (the specifics elude me, since I’ve heard very contradictory opinions on whether the SEC is a joke to people into trading or very much isn’t) and harder-to-game resolution criteria (ones that no individual action can trigger).
After that, I checked, and Scott seems to write in (4.1) of the FAQ (I’m very loosely summarizing what I’ve just skimmed through) that, in fact, you might want to let people do that in order to get more accurate markets (“create an unbiased social consensus”). Also, the stock market has the same issue, and we can live with it.
I should re-read that FAQ again when I have the time.
> Also the stock market has the same issue, and we can live with it.
The stock market has a very elaborate set of laws and law enforcement designed to prevent manipulation and abuse. What do you think the SEC does all day? I mean ok, sure, they spend all day fining people for texting or whatever, but they *also* work tirelessly to fight market manipulation and fraud. And it still happens from time to time even in spite of all that!
They’re Scott’s words, not mine.
But I thought “live with that” meant “making what efforts we can reasonably afford, to keep it to a manageable level”, rather than “fighting tooth and nail to prevent any instance of insider trading from ever happening”?
I don't think Scott (or the "insider trading is good, actually" people in general) really understand a) how bad insider trading actually is or b) how much work goes into preventing it in the real world.
At Manifest, many people openly manipulated Manifold markets related to Manifest (e.g. "will someone bring a ball pit to Manifest?", "will someone bring 100 cookies to Manifest", "will I have sex at Manifest", etc.)
I think that would be the sociological / reputational equivalent of burning their own house down for the insurance money while trapped inside.
"At least one proven case" is a useful Schelling point for whether it's worth investigating further, since that would clearly eliminate the "literally never happened" scenario, while the folks arguing "happened a hundred thousand times" can't reasonably say that a mere 0.001% success rate in gathering independently verifiable evidence is too much to ask.
>But also, my attempts to play around with the bot haven’t been encouraging:
I don't know if there's a term for this in the AI field, but these are the kind of 'gotcha' examples that don't really sway me as disproving the power of the technology. Like, yes, there are some specific types of mistakes it makes, and yes, if you keep poking it, it will eventually make some obvious mistake that you can highlight. But that feels largely orthogonal to whether it generally has the ability to do the thing it claims to do, modulo a couple of mistakes you can work around or patch later.
The big example of this for me was artists who said 'Generative AI will never replace artists, look it drew six fingers in this image, haha it's so bad.' And it's like, yeah, it had trouble with fingers some percent of the time, so you just run it again when that happens and get what you wanted? And then 6 months later it figured out fingers and rarely ever makes that mistake anymore... so now is your argument defeated and it will replace all artists forever? Or did you have a more substantial argument than 'look at these weird fingers'?
(Which is not to say that I disagree with Scott's impression of this bot; I'm more interested in discussing the general phenomenon in criticism and seeing if people have thought the same or have different takes on it.)
The question is always something along the lines of "Does this AI actually understand what it's doing?" When people answer in the negative, they are pointing at evidence that the AI is making a fundamental mistake. If Trump has a 55% chance to win because the AI thinks he's running against Biden, there's a more fundamental problem, and it's not obvious at that point whether it can be remedied. If it can make such a significant error, how do we trust that the rest of its reasoning is better? The same can be said when it gives different percents for the same question, or for slightly different wordings of the same question; if it were making predictions based on concrete evidence, it should not do that. Likewise the questions about Prospera reaching 1,000 and then 100,000 residents getting the same percent: it's evidence that the AI isn't actually computing anything. Not definitive evidence, but if it turns out the AI is not doing prediction, just producing reasonable-sounding projections regardless of the evidence, that would also match these results.
EDIT: The point being that when the AI messes up logic in cases where the process is legible to us, we should doubt that the process works better where it's illegible. If all AI logic were legible to us, it wouldn't be doing anything above human level, so it wouldn't be serving much purpose. Most people who use AI right now are sanity-checking or fact-checking the results. This does not work at scale, and is at best a timesaving device for people who are capable of doing the checking. End of Edit.
One of the causes of hallucinations is an AI trying to answer a question for which it doesn't have an answer. That seems both more likely when trying to predict probabilities and harder to call definitely wrong when the output is a percent, unless there's a clear 100% or 0% in reality. If we asked it the percent chance of Oprah winning the election and it spat out 1%, we might think it wise at first glance: very low, signifying that it's quite unlikely, but not zero. If it said "darwin" or "Mr. Doolittle" had a 1% chance, we would think something's wrong. But if we don't know why it makes the prediction it does, then Oprah's 1% might be similarly wrong, or just pointless hedging. If it's just hedging, then you can ask it about a few hundred famous people, get 1% for each of them, and realize that it's entirely unhelpful to have a 390% chance that one of the people you selected becomes president. It would be funnier, but also more helpful in determining whether the AI is actually doing something useful, if you could get it to sum to 100% or more without including Trump and/or Harris in the percents.
> Or did you have a more substantial argument than 'look at these weird fingers'?
Yes, I do. Many AI images of people have characteristic flaws. For instance, GPT-4 leans very strongly into making people youthful, well-groomed, and hot. I tried to get it to make an image of an average, out-of-shape middle-aged man, and it simply could not do it. Its images were of hot bodybuilder guys with gray hair and crow's feet. By the end I was literally saying things like "no muscle definition; thinning hair; flab" and it told me it could not do it.

Or here's another example: I asked it for an image of a bunch of people sitting around a table with their feet up on the table. It gave me an image of a bunch of corporate types in suits around a table, one of whom had his feet resting on it. But it missed the point, which I thought obvious, that people with their feet on a table would be in a relaxed, informal setting. Dall-e 2, by the way, got that part right: it showed groups in jeans, shorts, etc. sitting around a picnic table or in other settings where feet on the table would not be odd. It had a harder time than Dall-e 3 (which GPT uses) getting feet onto the table, and had various hilarious misfires, but it got the setting and feel right.

GPT misses the point of waves, too. It doesn't understand their structure, so it shows waves far out to sea breaking (I'm pretty sure breaking happens when the wave encounters a coral reef or shallower land), and sprays a bunch of fake-looking foam or mist around, as though to cover the defect in the main subject.
To be pedantic, I believe sufficient wind can also cause breaking, but at that point the wind should be creating a lot of other effects, too.
Yes you're right, wind can too. But there are a bunch of other things about the structure of waves that AI doesn't get. I have a whole collection of shitty AI-generated waves. Some are so bad they're funny.
I've been playing with the ChatGPT models for around the last year, asking them straightforward questions (almost always in STEMM fields - I'm not trying to probe Woke indoctrination during RLHF), and I still see plenty of errors (say, on the order of 50%). See, e.g., a question about whether the Sun loses more mass as energy radiated away or as the solar wind:
https://chatgpt.com/share/66e7818b-aa80-8006-8f22-7bbff2b12711
It gave an incorrect initial answer, then I forced it to reconsider a units conversion, and that got it to the right answer. This is the new ChatGPT o1, and it did considerably better than ChatGPT 4o, which had to be led by the nose over a considerably longer distance (but it looks like I can no longer access the URL for that session).
>Why did this go down so much in April 2024?
A person debunking a lot of the superaging research won an Ig Nobel Prize this month. I wonder if their research got wide notice in some communities and became visible to the people researching this question around April?
My impression is that all that news broke a little too early ... Nov 2023 through Feb 2024 roughly (with no effect on this market). I think the 'this market was put on the front page' explanation is more likely.
>Will there be a third assassination attempt on Trump before Election Day?
I feel like if we count what just happened as an attempt (the Secret Service foiled it before anyone took a shot) then this number is low, but maybe (probably) I'm misinformed? I sort of assumed there's a general background of crazy people making plans against presidential candidates all the time, and that most of them get foiled early and we never hear about them.
E.g., I would have guessed 2-5 foiled attempts per candidate that we never heard about because the SS stopped them quietly, but maybe I'm totally wrong and there are fewer violently insane people than I thought.
I'm trying to think of the last time someone even attempted to shoot a president before Trump. I only remember a guy shooting at the White House across the lawn back in the Obama years.
There are a ton of these I never heard about. A guy in North Dakota hijacked a forklift, and planned to kill Trump... by flipping his limo over? And in 2013, there was a guy who worked for GE that was arrested with plans to build a "radiation gun" out of industrial x-ray equipment to kill Obama. Someone threw a hand grenade at Bush when he was in Georgia (not the American one) but it didn't detonate.
I feel like 'our failures are public but our successes are classified' was a big thing in this arena, but maybe that's literally just something I heard on The West Wing once and is not actually true?
Yeah it's interesting to consider how ambiguous the definition of 'attempt' can be, if you're sufficiently lenient or imaginative...
I always find the low number of assassinations and attempts surprising. I'm writing from the USA, which has around 25,000 homicides each year. It surprises me that not even 1% of these killers are interested in politics - 1% would be roughly 250 a year, almost an assassination or attempt each day.
I'm guessing most of those involve family, romantic partners, or drug abuse (with "drunk driving" being a type of drug abuse, of course), and escalation in the heat of the moment, rather than premeditated attacks on people they'd never previously met face to face. https://www.smbc-comics.com/comic/the-chosen-one How many people has Trump *personally*, say... cuckolded, or tried to sell substandard meth to, or taunted while they were in the same room, drunk, and armed? Surely not millions! Maybe not even hundreds.
>I'm guessing most of those involve family, romantic partners, or drug abuse (with "drunk driving" being a type of drug abuse, of course), and escalation in the heat of the moment
Sure! I would also expect most of them to be "heat of the moment" killings as well. Many Thanks! But that could account for well over 90% of homicides, and still leave 1% as premeditated assassinations. People can get quite heated over politics too.
We might be heading back to the "good old days" of anarchists tossing bombs at politicians...
Many Thanks! Ouch! For a more recent comparison: There are parts of the 1960s that I want back, but it is the Moon landings part, not the assassination part.
The president's not the only potential target, though. When someone hits the critical intersection of "angry enough to murder a government official" and "competent enough to succeed," that'll probably be because they've got an otherwise unsolvable grievance with someone specific.
For example, the guy who shot James A. Garfield felt he'd been personally slighted by not having his efforts rewarded with an ambassadorship.
Lee Harvey Oswald may have actually been aiming at governor John Connally, who had some role in Oswald's dishonorable discharge, and only hit Kennedy by mistake. Not directly relevant, but there's an amusing parallel that Connally, like Trump, survived by virtue of having turned his head at just the right moment.
Marvin Heemeyer (who didn't actually kill anybody besides himself, but seems like a noteworthy high water mark on the "competent solo premeditation" side of things) went after folks he had a long, contentious history with.
However many would-be political assassins there are in an average year in the US, there's no particular reason to think all of them would be aiming at the top. I'd expect state and local officials - particularly public-facing "bearers of bad news" like process servers, judges, or bureaucrats positioned to deny essential services for opaque reasons - to attract at least as much hate-plus-follow-through per capita as executive positions.
With one president, fifty state governors, and I don't even know how many mayors, or de facto "company town" owners, or self-appointed HOA busybodies, whatever fraction of the potential assassins specifically prefer an exec as their target gets reduced by a few more orders of magnitude before you've got the ones who even *want* to kill the president. Then that tiny pool takes a look at the security, and some of the most lucid, competent ones think to themselves something like "maybe there's another, easier way to impress Jodie Foster."
>The president's not the only potential target, though.
Agreed, Many Thanks!
>However many would-be political assassins there are in an average year in the US, there's no particular reason to think all of them would be aiming at the top. I'd expect state and local officials - particularly public-facing "bearers of bad news" like process servers, judges, or bureaucrats positioned to deny essential services for opaque reasons - to attract at least as much hate-plus-follow-through per capita as executive positions.
Also agreed. But I think we would hear about a large fraction of attempts on the lives of most public officials (ok, maybe not if it gets down to the level of municipal dog catcher...).
Hmm... https://www.vox.com/world-politics/360639/trump-shot-thomas-matthew-crooks-assassination-attempt says
>Threatened acts of violence have increased even faster. In the United States, the Capitol Police reported 9,625 threats against members of Congress in 2021, compared to just 3,939 in 2017.
Maybe more goes unreported in the media and news aggregators than I thought... I certainly wasn't seeing 30 stories per day about threats against congresscritters in 2021... ( Of course, I am _assuming_ that "threats" means "threats of physical violence", not e.g. "threats of lawsuits" or "threats of funding campaign of opposing candidate"... )
"Anonymous jerk sent a strongly-worded letter w/ hyperbolic, unsubstantiated threats" seems even less newsworthy, in and of itself, than "dog bites man."
With Biden wearing a MAGA hat after being forced out, and yet another assassination attempt: if Trump is assassinated, will Biden endorse Vance for the lolz?
While I think awful things about Biden, he did follow Trump's push to end the forever war in the Middle East, evidently while the system wanted it to be a disaster, so he has a weak but extant moral code, and spite. If assassinations are on the table, and they did literally everything to pick someone else to win - well, maybe he causes a little chaos.
1) 85% of Metaculus' forecasts in the next twelve months will be confirmed within twelve months after that, for all forecasts whose outcomes will become clear within twelve months.
2) The 2024 election will be significantly delayed by lawsuits contesting the outcome
3) There will be street violence between protestors before the end of the year (hoping I'm wrong on this one)
Re 2, how is now different from 2020, when there was no shortage of attempts to contest the outcome?
One difference is that Georgia Republicans have passed a law requiring all ballots to be hand-counted, which effectively guarantees that Georgia will take a long time to call.
Apropos of nothing, I just finished reading "Clear and Present Danger" by Tom Clancy (published 1989). There is a passage where he seems to anticipate prediction markets.
"When would they wake up and realize that predicting the future was no easier for intelligence analysts than for a good sportswriter to determine who’d be playing in the Series? Even after the All-Star break, the American League East had three teams within a few percentage points of the lead. That was a question for bookmakers. It was a pity, Ryan grunted to himself, that Vegas didn’t set up a betting line on the Soviet Politburo membership, or glasnost, or how the “nationalities question” was going to turn out. It would have given him some guidance."
Robin Hanson is a closet Tom Clancy fan, you heard it here first!
I think I've said this before, but whenever I see this post title, I immediately think of Dr. Mantis Toboggan. "You got the HIV!"
There's so much potential for non-public information in election betting, and I'm not sure I'm upset about all of it. A candidate wants a little cash cushion in case they lose and will be out of a job? At least they're probably making the prediction more accurate. And while it would be a bad thing, I'm kind of excited to read about the first electoral Black Sox scandal when it happens.
If asked whether incumbent governors who are not up for re-election this year or next (identifying them by name) will be governors of their respective states in 2025, FiveThirtyNine assumes that the governors are up for re-election in 2024, goes through re-election analyses, and arrives at wildly low probabilities. Based on the sources, the model "should know" that the governors won their respective four-year terms two years ago. Are those trick questions? I would argue no. First, the questions don't present false assumptions - asking if the governors will "win re-election" in 2024, for example, or proposing non-existent opponents. (A diligent and intelligent human shouldn't fall even for those "tricks," since they would investigate and understand the election cycles and the basics of who the opponents are - after all, how can you evaluate the outcome of an election without knowing those basic facts?) Second, non-AI forecasting and betting sites pose similar questions. It seems like this tool takes several steps back from the current AI models and really can't be trusted, notwithstanding the sourcing and chain-of-reasoning features.
I asked the likelihood that Ronald Reagan would be elected president again, and it was pretty good about it. It said 0% chance, mostly for two reasons: 1) he was already elected president twice, so isn't eligible, and 2) he's dead.
>I actually appreciate this a lot, because most of the debate around Catgate has focused on how there’s “no evidence” it’s happening, but “no evidence” is cheap and I prefer an outright forecast.
JD Vance didn't bother to confirm the rumor before he shared it, so why do you demand that the people refuting him put in more effort than he did? "No evidence because nobody was looking for evidence" is a fair criticism when someone is trying to shed light on something that scientists should investigate, but when you apply it to a politician (one who literally said that he's willing to make up stories if it helps his cause) then you're basically giving them a free license to Gish Gallop, since it takes no effort to come up with bullshit but it takes effort to refute it more substantially than "no evidence."
(By the way, I heard a rumor that Scott Alexander eats cats! You are obligated to take me seriously until you personally go to Scott's house and look for dead animals. Maybe you should set up a prediction market to see if it's true!)
Also, in this case, there is a place journalists can check for evidence pretty easily - ask the Springfield police if they received any reports of pets being eaten. They have not, AFAIK. So we can conclude that, if someone did in fact witness a pet being eaten, they apparently didn't think it was worth calling the cops over.
One thing that irritates me about the edible pets story is that, _even if it were true_, it would be one of the least important worries about a large number of (in other cases) illegal immigrants (I understand that the Haitians are here legally).
a) I've read reports of on the order of 24,000 PRC men of military age crossing the border and not being significantly vetted. Presumably they are here with the CCP's concurrence, given the controls the CCP has over the PRC's population. If the PRC were an ally of the USA, this would be no big deal - but that is _not_ the current situation.
b) Just generally, having on the order of 2 million people immigrate illegally without the checks we use on _legal_ immigrants to e.g. at least filter out gang members is alarming.
c) Just generally, in order to be a nation with borders at all, we need to control crossings, and e.g. have a national consensus on what population we are shooting for in 2100. (My personal suggestion would be to aim for roughly the population we have now - which suggests something like 1 million net immigrants a year. We've gotten bad at building infrastructure and sufficient housing.)
I feel like the resolution criteria on some of the cat eating markets is a little too vague. What does "widely trusted evidence" mean? Christopher Rufo already posted a video of a catlike object on an Ohio barbeque, but it doesn't seem to have changed anyone's mind on the question.
Criminal conviction or some other official finding of fact would probably suffice.
Setting aside general tech issues of fake video, something vaguely cat-like seen on a barbecue doesn't prove much at all. Legitimate butcher shops have to sell rabbit meat with the ears still on, because otherwise, with just muscle and bone, it's too difficult to tell them apart from cats.
Mathematician here: for the Math Olympiad prediction, I'd be surprised too. I've never heard of any AIs making progress on unsolved conjectures, and while Math Olympiad problems are a lot more penetrable than something like the twin prime conjecture, they're still insanely difficult (and new and officially unsolved when posed, though obviously the problem designers had to prove them independently).
That said, sometimes on Stack Exchange and other places, people will ask some unconventional math questions, so maybe some of the results have already been talked about in the corpus that the AI draws from.
I was not surprised by this, and while I can't prove it, I've been saying for years that it was just a matter of someone getting around to doing the engineering work.
As Hilbert pointed out, once you pick a formal system to work within, mathematics just becomes a kind of parlor game. So, given a statement of the problem in Lean (a theorem-proving system), DeepMind repurposed AlphaZero, their board-game-playing tree-search system, to play the game of mathematics and search for formally checkable proofs.
You still have to go from the natural-language problem to the Lean theorem statement, which is a little challenging. But 1) mathematicians have spent the last five or so years building mathlib, a software library of well-designed definitions of common mathematical objects, so there's usually no deep creativity required, and 2) given that, you have exactly the kind of natural language -> code task that LLMs are already quite good at.
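For readers who haven't seen Lean, here's a toy illustration of the shape of the pipeline - far simpler than anything IMO-level, and my own example rather than anything from DeepMind's system. Autoformalization produces the `theorem` line; the search produces the `by` block, here standing in for a found proof via a stock mathlib tactic:

```lean
import Mathlib

-- Toy example: the kind of formal statement an autoformalizer might emit,
-- with the proof discharged by a mathlib tactic rather than written by hand.
theorem toy_sq_add_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity
```

The Lean kernel then checks the proof term mechanically, which is what makes machine-generated proofs trustworthy even when no human reads them.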
At 73% for IMO Gold by 2025, I don't know whether I'm a buyer; most of the uncertainty seems to depend on project-management decisions at large organizations I don't know much about.
A more interesting question is if/when a few "mainstream" mathematicians working on "mainstream" results, not related to foundations of mathematics, logic, or theorem proving itself, will have made use of AI-generated proofs of this kind.
Great response! I knew about Lean/mathlib, but the fact that I hadn't heard of any famous unsolved problems being approached this way made me suspect we have a long way to go. Somewhat tangentially, but "mathlib" makes me think of "shitlib", both of which describe me pretty accurately.
I'm not a mathematician, but I am aware that Peter Scholze, with a team of people, recently formalized a famous result of his in Lean.
No way can they do that from scratch, but if this type of system can handle some or all of the steps in that proof, it would be cool to see. Hope DeepMind tries it!
Right, that's just proof verification, not proof generation. Not that I'm ruling out the possibility of the latter, but the former was already done (on the four-color theorem) in 1976. It's not *exactly* the same, since the technology then wasn't validating the proof logically, just checking thousands of different cases for compliance, but in both proofs the real work was done by a human. Anyway, if AIs start to write proofs from scratch, I might have an existential crisis.
I do think there are probably quite a few open conjectures that are "basically just hard Math Olympiad problems" somehow.
E.g. the boolean sensitivity conjecture
https://www.quantamagazine.org/mathematician-solves-computer-science-conjecture-in-two-pages-20190725/
or the Gaussian correlation inequality
https://www.quantamagazine.org/statistician-proves-gaussian-correlation-inequality-20170328/
which were decently-famous open problems for decades, yet in the end turned out to have relatively simple proofs which just required some cleverness, rather than Herculean mathematical efforts.
There are surely more such problems, and it seems like if an AI system is going to solve a problem people care about, it will start with something like that.
Shouldn't you start your existential crisis when you're convinced that AIs will be able to write proofs from scratch in the near future, rather than when you're confronted with the event?
I think that (somewhat unfortunately, since I am also a mathematician) there’s already plenty of evidence available (most recently the DeepMind “perfect score without combinatorics” IMO problem solver).
Sure, an IMO problem is much more limited in breadth than a research problem. But is the difference so important as to make research problems intractable by similar techniques? I wouldn’t bet on it. That argument has already suffered many defeats.
Also, a couple of years back, the people involved touted the use of an AI to find a more efficient 4x4 matrix product (shaving off one or two multiplications of entries), which by recursion speeds up matrix multiplication at every scale. Not really sexy, but it was an improvement - albeit one that didn't shed much light on the situation (since it's conjectured there's an algorithm in O(n^{2+epsilon}) for any epsilon).
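The recursion trick is easiest to see in the classic Strassen construction, of which the AI-found decompositions are descendants. A minimal sketch (this is Strassen's original 7-multiplication identity for the 2x2 base case, not the AI-discovered 4x4 one):

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications
    instead of the naive 8. Applied recursively to block matrices,
    this gives O(n^log2(7)) ~ O(n^2.81) instead of O(n^3)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

# Sanity check against the naive product
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert (strassen_2x2(A, B) == A @ B).all()
```

Every multiplication shaved off at the base case lowers the recursion's exponent, which is why even "one or two multiplications" on 4x4 matrices was worth announcing.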
An interesting skeptical counterpoint from a mathematician (with unimpeachable credentials) is Silicon Reckoner. Among other thoughtful points:
1) formal proofs aren’t the alpha and omega of understanding, which is the mathematician’s actual aim,
2) a crucial part of our notion of understanding is about introducing the right concepts that shed light on a situation. Lean-based systems seem very poorly suited for this task.
Part of the problem is defining what the threshold for "AI" is. Because lots of problems have been proved by computer search (most famously the Four Color theorem, but more recently all of Marijn Heule's stuff).
I’m more interested in “automated theorem proving” than “AI” per se.
I would rule out cases where you just get a computer to enumerate some disgustingly large set of cases to check that no counterexample exists. So that rules out the classic four-color result, and things like it.
I don’t know exactly how to formalize that intuition: maybe what we want are “short” proofs in the complexity theory sense.
But beyond that — if completely old school tree search worked, with nothing that looks like AI, I’d count it as a success. I don’t think that will happen though, because learned heuristics to guide tree search are really useful.
The SAT solver results I was alluding to made heavy use of heuristics in order to make the tree search more efficient (that's basically what a SAT solver is). Truly exhaustive search is impossible in all but the most trivial cases.
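For concreteness, here's a sketch of the interface: you encode your problem as clauses and ask for a satisfying assignment, and all the heuristic tree search happens inside solve(). This uses the python-sat package, which is my choice for illustration; Heule's results use industrial-scale solvers and vastly larger encodings.

```python
from pysat.solvers import Glucose3

# Toy encoding: variables are integers, negative literals are negations.
# Clauses: (x1 or x2), (not x1 or x2), (not x2 or x3)
with Glucose3() as solver:
    solver.add_clause([1, 2])
    solver.add_clause([-1, 2])
    solver.add_clause([-2, 3])
    print(solver.solve())       # True: the formula is satisfiable
    print(solver.get_model())   # a satisfying assignment, e.g. [1, 2, 3]
```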
Interesting. These results all seem closer to "mainstream math" than "stuff that only computer scientists care about".
They still have this flavor of a giant computer search though. There is a lot of cleverness in the reductions to SAT and in some cases that seems to really blow up the problem.
This seems like a different thing than just writing a theorem statement in Lean, which is relatively human-comprehensible, and being handed a proof which you can then just hand to "lean --check", and immediately start building further results on if you want.
" I actually appreciate this a lot, because most of the debate around Catgate has focused on how there’s “no evidence” it’s happening, but “no evidence” is cheap and I prefer an outright forecast."
On the contrary, "no evidence" is exactly the right frame. In fact, "no evidence" is really somewhat too charitable of a frame.
A person campaigning for the most powerful elected office in the world made a claim, completely unprompted, that sounded like something lifted straight from a tabloid headline. That's a reasonable thing for a person in that position to do If-and-Only-If THEY HAVE ACTUALLY SEEN good evidence that it is true. When pressed on the claim, Trump attributed it to "the people on television." It may not be reasonable to expect a candidate to carry around specific details identifying that source in their head during the debate, but it's a pretty bare-minimum standard to expect them to provide it after the fact.
The only evidence that ought to be convincing here is Trump providing his source. Expecting a major-party candidate for president of the United States to display a modicum of epistemic hygiene when speaking before the entire nation is not something that SHOULD be a big ask. I don't think any of his skeptics will be the slightest bit convinced if thousands of motivated searchers working for multiple weeks eventually manage to turn up evidence of someone, somewhere chowing down on roast cat. Why should they be? It would be striking and sensationalist, but even if it turned out to be a Chinese cardiologist burglarizing peoples' homes by the dozens to steal cats for the stew pot, it doesn't actually address the important issue at all.
I maintain that Trump's utterances are (in almost all cases) not the sort that have truth values attached to them. It's exactly like a headline such as "Five reasons why tomatoes are the worst vegetable ever". You wouldn't expect that article to even make a reasonable case for the proposition in the headline, right?
It's bad on purpose...
> Still, it might be fun to keep going until you find an old post where the prediction has already “resolved”, and see what happens.
I think the "election called for Biden on election night" one in your screenshot can safely be considered "resolved".
> Trouble in England
Trouble in the UK. This is about the UK parliament (there is no specifically English parliament), and at least one of the MPs quoted in the article is from Scotland.
Errata note
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023
To clarify, I used GPT-4o-latest (I'm not sure what "ChatGPT base" means; it's definitely a fine-tuned model), and there are several different versions of 4o available at different times - presumably the same base model with different fine-tuning. Long Phan (an author) claims on Twitter that they used an earlier version of 4o and checked for data contamination, though he doesn't specify which version. I'm not fully convinced, but it seems plausible that the version they used didn't have contamination. https://x.com/justinphan3110/status/1835563533989494943
This all hinges on what exactly OpenAI did to train the various 4o variants. My personal guess is that OpenAI did in fact train the original 4o base model only on data up to October 2023 (a bizarre thing to lie about), but that later data leaked in during fine-tuning because OpenAI was not careful enough, and that different checkpoints may therefore have different amounts of data leakage. But really, who knows.
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023.
Is this not just evidence that ChatGPT is really good at making predictions? 😜
So, any news on Prospera?
I ask because I saw this video on Neom. The presenter is a little more sceptical about it than I am, but we're both pretty sceptical. There are several examples in the video of other attempts to found a new city (the Turkish one is pretty special) and how they failed. I think he hits the nail on the head that the problem turns out to be economic: there just aren't the jobs available to support the new city, which then doesn't grow to be the projected wonder it was sold as, and then investors get cold feet.
I think he over-sells the objection to travel within Neom; yes, humans don't like going long distances, but we do okay with climbing, pretty much? But his description of Neom and moving around in it reminded me of Chongqing, another newly-constructed city with baffling infrastructure. Even the natives find it tough going at times:
https://www.tumblr.com/christadeguchi/735336251281047552?source=share
What floor are we on and where is the street?
https://www.tumblr.com/fuckyeahchinesefashion/760517112628461568/chongqing-residents-how-long-it-takes-to-walk?source=share
Chongqing residents: how long it takes to walk from the doorstep of your condo to the entrance and exit of your xiaoqu/neighborhood and get yourself stand on the street
Yes, Chongqing is a success, but Chongqing at least is walkable, while Neom seems not to be. I don't think it'll ever get off the ground (as it were) and examples like these incline me to the opinion that Prospera will turn out to be the same sort of white elephant/boondoggle. You need lots of ordinary people to live and work there, and selling it as an economic opportunity doesn't work without the support of, well, ordinary life. Unless it's like Washington or Canberra, where the industry is the government and its apparatus, which is very much the opposite of the selling-point for the likes of Prospera. Even Strasbourg existed first as a city in its own right before becoming EU government headquarters.
If you live in Chongqing, you better like climbing and long walks 😁
https://www.tumblr.com/transparentgentlemenmarker/743300569993789440/chongqing-en-chine-est-nomm%C3%A9e-la-ville-montage-les?source=share
https://www.tumblr.com/onlytiktoks/754094829838974976?source=share
I just hate the term "No evidence" since almost 100% of the time it is used, there is in fact *some* evidence. In this case that Trump said it is itself evidence! As are dumb tweets. Doesn't mean it is good evidence, but it IS evidence.
Almost always what it actually means is "I am making some arbitrary cutoff regarding what level of quality evidence needs to meet to be considered"... and then, totally coincidentally, that cutoff is at exactly the level where there is no evidence.
Not to mention, you know, Hempel's Paradox.
There's "no evidence" in the same sense that there's no evidence that Trump is secretly a Chinese robot in disguise. You do have to draw the line somewhere or you'll never be able to say anything at all. And the cat eating claim is pretty firmly into Andrew Wakefield territory by now.
Meh, it's not "no evidence". Say something else.
What is your preferred terminology then?
Do you also get so mad when people say that there's "no chance" that the sun will fail to come up?
At some point, you have to round off or it is impossible to say anything at all.
I think "no evidence" means something very specific, and it gets used often enough in contexts where there is actually A TON of evidence (it just isn't where the preponderance of the evidence lies) that its misuse gives people very bad ideas about epistemology.
Something with 500 pieces of evidence for it (by some arbitrary cutoff) and 10,000 pieces against doesn't have "no evidence for it".
It is just sloppy language. Like saying it is fine to say something is true as long as it makes you feel good. That isn't what "true" means!
I think you are overly caught up in the particulars of this example, I have hated this particular use of language long before Trump was even President the first time.
"That doesn't appear to be true".
"Reporting points to this being false."
"We couldn't find much evidence for this occurring outside internet memes?"
etc.
etc.
It's true that the phrase "no evidence" is overused. Scott has written about this here in the past as well.
However *this wasn't one of those situations*. In this case, there really was no factual basis for Trump's claims at all. It's about as well supported as "Trump is a pedophile". In some pedantic sense, there *is* evidence that Trump is a pedophile because I just said it and you argued up thread that people saying things counts as "evidence". But it's not evidence in the ordinary sense of the word.
This isn't even an isolated incident either. Trump *regularly* makes claims that even *Trump's own campaign* admits they have no evidence whatsoever for. Trump doesn't appear to believe that words are supposed to have any relation to reality at all - you just say whatever you feel like and that's your personal truth. It's very Colbertesque.
Look, I am no defender of Trump's utterances. But something that 30% of the population believes, or whatever is making the rounds on social media, doesn't have "no evidence for it". The MSM is frequently wrong about such things, and that many people believe them is absolutely evidence, even if they are mistaken.
Hey,
Just noting that we posted a reply to common criticisms here, although it hasn't been seen by many people yet: https://x.com/justinphan3110/status/1834719817536073992
For example,
> The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023.
ChatGPT isn't the same model under the hood as "gpt-4o" on the API (which currently points to "gpt-4o-2024-05-13"). We use this API model in our evaluations and verified that it does not have knowledge from after the cutoff. We also took a number of other measures (described in the above post) to ensure that there wasn't contamination.
> When presented with a different set of questions that were all after November 2023, FiveThirtyNine substantially underperformed the Metaculus average.
This isn't true. As we describe in our reply, these questions were mostly short-fuse questions from Polymarket. In our blog post, we specifically stated that the system performs less well on short-fuse questions (https://www.safe.ai/blog/forecasting), which Halawi might have overlooked. Across the 35 Metaculus questions in this independent evaluation, our system is within error of crowd accuracy, **supporting our original results**. We also ran o1-preview on these questions, finding that it increases accuracy even further.
Halawi also shows some screenshots of knowledge leaking from ChatGPT past its cutoff date, but again, this is not the model that we used. When using the default "gpt-4o" model on the API, the same questions show no contamination.
> The FutureSearch team wrote a LessWrong post generalizing these kinds of observations, Contra Papers Claiming Superhuman AI Forecasting.
The section of this post discussing our work is mostly citing the points discussed above, so I don't think there is much new here. It does dispute what is required to claim superhuman forecasting performance. This is a definitional argument. In the blog post, we are clear about what we mean by superhuman performance (matching crowd accuracy), which I think is defensible and is at least quite meaningful. Our performance on Brier score is also quite good and significantly improves after post-hoc calibration. Note that neural network classifiers are often poorly calibrated, and post-hoc calibration is commonly done in ML research to convert accurate but uncalibrated models into models that perform better on proper scoring rules.
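For those unfamiliar with post-hoc calibration, here is a minimal sketch of the idea using isotonic regression in scikit-learn - my choice of method for illustration, since the blog post doesn't specify exactly which calibration map was fit. A monotone remapping can't change which outcome the model ranks as more likely, but it can substantially improve a proper scoring rule like Brier:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
raw = rng.uniform(0, 1, 5000)  # model's raw probabilities
# Simulate a miscalibrated forecaster: events actually occur with
# frequency raw**2, so raw systematically overstates the probability.
outcomes = (rng.uniform(0, 1, 5000) < raw ** 2).astype(int)

# Fit a monotone map from raw to calibrated probabilities.
# (In practice you'd fit this on held-out resolved questions.)
calibrated = IsotonicRegression(out_of_bounds="clip").fit_transform(raw, outcomes)

print("Brier before:", brier_score_loss(outcomes, raw))
print("Brier after: ", brier_score_loss(outcomes, calibrated))
```

Because accuracy only depends on which side of 50% a forecast lands, while Brier also punishes over- and under-confidence, a model can match crowd accuracy yet still need this kind of remapping to look good on proper scoring rules.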