137 Comments

Seeing that number on Trump/Colorado made me check whether Trump v. Anderson had been decided yet. It hasn't.


>fair warning: if you already hate any of rationalists, San Francisco, tech, prediction markets, polyamory, betting, love, reality shows, or Aella, this will definitely make you hate them more.

Glaring exclusion of "musicals" here


The Manifold show reminds me of the “industrial”, short for “industrial musical”: “a musical performed internally for the employees or shareholders of a business to create a feeling of being part of a team, to entertain, and/or to educate and motivate the management and salespeople to improve sales and profits” https://en.wikipedia.org/wiki/Industrial_musical

I have an older friend who did these decades ago for companies including Johnson & Johnson. Like, he wrote parodies of Broadway musicals, but they were about surgical dressings and things like that.


> Generating the full movie

> The dumbest possible way to do this is to ask GPT-4 to write a summary (“write the summary of a plot for a detective mystery story”), then ask it to convert the summary into a 100-point outline, then convert each point into one minute of a 100-minute movie, then ask Sora to generate each one-minute block

I've spent several weeks trying to make GPT-4 generate at least a medium-quality detective plot, and was forced to admit complete defeat. Not only was I unable to make it generate a consistent plot with clues and foreshadowing; even asking it to retell a well-known plot would reveal big holes in its understanding. So either I've missed something big, or you are dramatically underestimating the difficulty.

If anybody is working on something similar and has had at least mild success, I'd be very interested in talking, so please reach out by mail: rlkarpichev at outlook

Abbreviated list of stuff I've tried, and how it failed:

Here and below, "description" = plot / plot description / beat sheet / outline / ... (and a lot of other formats I tried)

1. Straight description generation (i.e. generate a plot/beat sheet/outline/...) - would successfully generate some list of linked events, but each event is described with too little detail, which rules out most causal links (clues / red herrings / Chekhov's guns / etc.)

2. A lot of attempts to increase the level of detail (JSON/YAML formatting with separate fields for different details, dozens of prompts, chain-of-thought, examples in the prompt, ...). Two failure modes: either it abandons the format halfway (so still not enough detail), or it formally adheres to the format but in practice makes a mockery of it (e.g. {"event": "Detective X finally arrested Y", "why it happened": "X caught Y unaware, so Y couldn't resist"} - ignoring possible clues from context)

3. One thing that *almost* worked is reflection/self-improvement: generate a description, ask GPT for problems, then ask it to fix them (a minimal sketch of this loop follows the list). It successfully identified the problems and generally improved the description in the first 2-3 iterations, but then ran face-first into the lack of necessary details, or into problems with modifying the description (see the next point)

4. Fixing bad plot descriptions after the fact with handcrafted prompts (i.e. ask it to generate, then look at the result and say "OK, GPT-4, please add a clue to Act 4"). Results: mild success hiding major failure. While GPT easily modifies descriptions, it either rewrites the whole description to accommodate the change (thus creating other problems), or misses obvious causes/consequences which would also need to change for the plot to stay coherent.

5. As a Hail Mary I tried to use agents - got only two different ones working, and both would fall into introspection or fixate on minor details. I haven't spent enough time on this, so I'm not really sure whether all agents behave like that.

6. At this point I was ready to give up, so I decided to do something really easy: generate a description of the first Harry Potter book/film. Considering the amount of exact text, fanfiction, and analysis GPT has seen during training, it should have been a piece of cake. It wasn't. After spending a couple of days trying to make GPT spit out a description that captured at least half the causal links between events and 90% of the notable events, I was sure that either GPT-4 can't do it or I'm too stupid to make it work.
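
For concreteness, a minimal sketch of the reflect-and-revise loop from point 3 (assuming the openai v1 Python SDK and an OPENAI_API_KEY in the environment; the prompts and iteration count are illustrative, not the exact ones I used):

```python
# Sketch of the generate -> critique -> revise loop (point 3).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """One single-turn GPT-4 call, returning the reply text."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def reflect_and_revise(rounds: int = 3) -> str:
    outline = ask("Write a detailed beat-by-beat outline of a detective "
                  "mystery, listing every clue, red herring, and payoff.")
    for _ in range(rounds):
        problems = ask("List the plot holes, missing clues, and broken "
                       f"causal links in this outline:\n\n{outline}")
        outline = ask("Rewrite the outline to fix exactly these problems "
                      f"and nothing else:\n\nOUTLINE:\n{outline}\n\n"
                      f"PROBLEMS:\n{problems}")
    return outline
```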

Conclusion:

I walked away convinced that GPT-4 has a very weak grasp of causality, and with much better intuition about the limitations of LLMs.


Your list of seven items ends on a 5.


It seems that your Manifold market is about an AI creating a movie from a single prompt, while the Metaculus question just asks that all the audiovisual material be created by an AI - meaning it could take the complete script and more as input, and could also be edited by humans.


I didn't think the Manifold show was well done at all. Too long, far too many puns (stop writing for other writers), and an ugly set.

The only good part was the dance and sheet.


So text-to-video is expected to cost over 50 cents per minute at the end of 2024. I think the chances that it will be capable of making a full movie for me personally, one that I would pay for, are on the low side.

However, the ability to buy shares in a certain prompt by 2030 is a really pleasant prospect. And if 50 cents per minute is in the ballpark, then it won't take very many shareholders/funders at all to fund an entire movie at lower than current prices.
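
To put numbers on that (a back-of-envelope sketch; the retake multiplier is a pure guess, everything else comes from the quoted market):

```python
# Back-of-envelope movie cost at the quoted text-to-video rate.
cost_per_minute = 0.50    # the market's threshold, in dollars
runtime_minutes = 100     # a typical feature length
retake_multiplier = 20    # assumed: most generated takes get discarded

compute_cost = cost_per_minute * runtime_minutes * retake_multiplier
print(f"~${compute_cost:,.0f} of compute per film")  # ~$1,000
```

Even with a generous allowance for discarded takes, that is crowdfundable by a handful of shareholders.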

So in 2015 we were asking where our hoverboards were. In 2024 that would be nice, but no longer impressive. If AI unlocks forecasting, that may be a meta-technology with beautiful far-reaching consequences.


The AI movie market tanked at the start of the year because someone sold all their shares (I don't remember who; IIRC they retired from Manifold. ETA: it was @Mira).

It's hard to read the logs, as selling shares doesn't seem to be registered as a transaction the same way that buying is.


All the time I see people upset about how porn takes advantage of actresses, so it's really depressing that they don't seem to form a constituency demanding that these AI picture/video engines be freed up to make porn.


EDIT: Scott pretty effectively demolishes this objection below.

Regarding accuracy vs. humans, I don't think this is very interesting, since the bots can look at the best human forecaster's trades and simply try to improve on them. In other words, even in the contest on preselected outcomes, the bots only need to do something a bit more complicated than look for an arbitrage opportunity: for instance, if they get to time the purchase, just react to new news quicker than the human superforecaster. Even if not: as long as these markets are largely about bragging rights, and aren't liquid and efficient, there is every reason to believe even the best players won't put in the effort to give the absolute ideal valuation, and all the AI then needs to do is identify the best forecasters and some systematic error they make.
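
A toy illustration of that strategy (every name and number here is hypothetical; a real bot would also need the news-speed edge):

```python
from dataclasses import dataclass

@dataclass
class Forecaster:            # hypothetical stand-in for a leaderboard entry
    name: str
    track_record: float      # e.g. an inverse Brier score
    estimate: float          # their current probability on some market

def copycat_forecast(leaderboard: list[Forecaster], bias: float = 0.0) -> float:
    """Mirror the best human's estimate, shaded by their known systematic error."""
    best = max(leaderboard, key=lambda f: f.track_record)
    return min(max(best.estimate + bias, 0.01), 0.99)

# e.g. the top human sits at 0.62 but is known to run ~3 points hot:
print(copycat_forecast([Forecaster("ace", 0.91, 0.62)], bias=-0.03))  # ~0.59
```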


>AIs are willing to work for less than $1 per hour, and have the knowledge of an encyclopedia - and if that's not enough, they can even be integrated with real-time web search capability.

Let us see if the accuracy can be improved first. I'm still seeing very wrong answers to the simple "What inorganic compounds are gases at STP?". No reference classes, no probability adjustments, no theory of mind needed, just a summary of uncontroversial facts. The latest answer (from Gemini) included water in the "gases". :-(


Re the criterion of embarrassment, it's not that the Gospel writers "must have" been telling the truth, but that it's likelier than not that they are, or at least that it raises the likelihood that they're telling the truth. (That it's talking about what's "likely", not what "must have" happened, is even the way it's framed in the Wikipedia article you linked to.)


>“postrationalists” who were vaguely inspired by rationalist writings but also think the emphasis on facts is boring and autistic and we need to focus more on creativity/friendship/woo/intuition/vibes.

Right, because nothing is as boringly uncontroversial and fact-based as being extremely sure of an imminent robot apocalypse. A fairer description of the main disagreement would be that they prefer somewhat different woo/intuition/vibes.


> An AI that can generate probabilistic forecasts for any question seems like in some way a culmination of the rationalist project.

Yeah, in a "we've come full circle" kind of way. Rationalism was supposed to be all about justifying your estimates with evidence, as opposed to blindly trusting some authority or indeed your own gut. And now it "culminates" in blindly trusting a black box to produce estimates for you. Progress!


> FutureSearch’s AI tries to do something similar. It prompts itself with questions

Finally someone is contributing actual effort to coaxing reasoning ability out of LLMs. We should pour much, much more effort into scaffolding and prompting what we already have, instead of training bigger and bigger models.

> Vitalik On AI Prediction Markets

An interesting idea, but as with every other idea for a crypto application, it would work even better without crypto.

Also, yes, I think it's quite likely that regular market mechanisms will just eventually produce a leader from all these oracle bots, and then there would be little need for the prediction-market type of system that aggregates the predictions of inferior models.
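
The selection mechanism is simple enough to sketch: score each bot on resolved questions and defer to the leader (the track records below are made up):

```python
def brier(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error of probabilistic forecasts (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical track records on the same three resolved questions:
history = {
    "gpt4-bot":   ([0.8, 0.3, 0.9], [1, 0, 1]),
    "claude-bot": ([0.6, 0.4, 0.7], [1, 0, 1]),
}
scores = {name: brier(f, o) for name, (f, o) in history.items()}
leader = min(scores, key=scores.get)   # once a clear leader emerges,
print(leader, scores)                  # just use its forecasts directly
```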


What question(s) caused Gemini and Claude bots to sharply go into negative profit around January 08?


I've posted about this on the Open Thread, but it seems worth repeating here.

As seen above, it is currently possible to buy No shares on Trump winning the presidential election at 46c on Polymarket (representing a 54% chance that he wins). I note that one market participant has bet $907k that Trump will win and now holds 27% of the shares. Based on their activity, it seems plausible this person has an appetite to bet more.

Meanwhile on Betfair Exchange, odds of 5/4 are offered on Trump winning (representing a 44% chance that he wins). It would be possible to bet $3,750 at those odds right now, but I would be confident that a modestly larger bet would be matched fairly quickly.

The resolution criteria are slightly different. Polymarket has: "The resolution source for this market is the Associated Press, Fox News, and NBC. This market will resolve once all three sources call the race for the same candidate. If all three sources haven’t called the race for the same candidate by the inauguration date (January 20, 2025) this market will resolve based on who is inaugurated."

Betfair has: "This market will be settled according to the candidate that has the most projected Electoral College votes won at the 2024 presidential election. In the event that no Presidential candidate receives a majority of the projected Electoral College votes, this market will be settled on the person chosen as President in accordance with the procedures set out by the Twelfth Amendment to the United States Constitution. This market will be settled once both the projected winner is announced by the Associated Press and the losing candidate concedes. If the losing candidate does not concede, or if there is any uncertainty around the result (for instance, caused by recounts and/or potential legal challenges), then the market will be settled on the winner decided by Congress, on the date on which the Electoral College votes are counted in a joint session of Congress. This market will be void if an election does not take place in 2024. If more than one election takes place in 2024, then this market will apply to the first election that is held. Once voting (whether postal, electronic or at the ballot box) begins in the year 2024 for the US Presidential Election 2024, the election will be deemed to have taken place for the purposes of this market. We will then settle the market as per our rules regardless of whether the election process is fully completed in 2024 or beyond. If there is any material change to the established role or any ambiguity as to who occupies the position, then Betfair may determine, using its reasonable discretion, how to settle the market based on all the information available to it at the relevant time."

In most cases I imagine those criteria will produce the same outcome, but if anything I would expect Trump to be slightly more likely to win by the Betfair criteria.

So suppose somebody bet $5k at 5/4 that Trump would win. At the same time they could spend $5,125 buying No shares at 46c, which gets them 11,141.3 shares.

If Trump wins, the Yes bet pays $6,250, which less 2% commission is $6,125 (NB the bets are actually in GBP; I convert to USD for simplicity). The player loses their $5,125 No bet, so they are $1,000 up overall.

If Trump loses, the No shares pay $6,016.30 of profit (actually in USDC). The player loses their $5k Yes bet, so they are $1,016.30 up overall.

Ignoring transaction costs, this represents a return of 9.9% on $10,125 over a period of less than a year.
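
For anyone who wants to check the arithmetic, here it is as a script (figures taken straight from the paragraphs above):

```python
# Two-leg arbitrage: back Trump on Betfair, buy "No" on Polymarket.
betfair_stake = 5_000            # backed at 5/4, 2% commission on winnings
poly_stake    = 5_125            # "No" shares bought at $0.46 each
no_shares     = poly_stake / 0.46            # 11,141.3 shares

win_profit  = betfair_stake * 5/4 * 0.98 - poly_stake        # if Trump wins
lose_profit = no_shares * 1.00 - poly_stake - betfair_stake  # if Trump loses

total_staked = betfair_stake + poly_stake
print(f"Trump wins:  +${win_profit:,.2f}")   # +$1,000.00
print(f"Trump loses: +${lose_profit:,.2f}")  # +$1,016.30
print(f"Worst case:  {min(win_profit, lose_profit) / total_staked:.1%}")  # 9.9%
```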

Of course transaction costs might be significant, and the player is also exposed to some risks, including at least: (1) USD:GBP currency fluctuations, (2) the stability of USDC, (3) the reliability of both exchanges and (4) the possibility of differing resolutions. Nevertheless, the arbitrage seems potentially profitable to someone able to transact on both exchanges at reasonable cost and who was sanguine about the risks.

It seems a bit surprising that two exchanges, each with c. $6m of bets on this outcome, have come up with probabilities differing by as much as 10 percentage points.


I like the new robot mantis for Mantic Monday 😀

"Instead, imagine a search engine where you can pay 0.01 ETH to ask a question like “Will Prospera have a population of at least 10,000 people by 1/1/2026?”

Or I could save my money and the hassle of trying to work out how the hell to use cryptocurrency and just wait two years to find out (my guess? No).

That video was... well, it was a video. "some kind of extremely well-done" - let's just say, opinions vary on that. I don't like musicals/Broadway numbers, so from the opening it certainly lost me. I fast-forwarded (apologies) and there were points where I was going "what in the name of God am I watching?" and I have no idea who won a date, if anyone, but this is not selling me on prediction markets. Still, bully for all willing to stand up on a stage and make an exhibition of themselves!

EDIT: Oh! Is anyone running a prediction market on this year's Eurovision Song Contest? We've got politics galore with Israel and Ukraine, as well as both Armenia and Azerbaijan being contestants (as of right now, at least)! This is the Irish entry, which leaves me 🙄

https://www.youtube.com/watch?v=n73nIfFI3k4

EDIT EDIT: Since I've scorned the rationalist singing and dancing, let me link to something equally wince-inducing from the Irish Catholic side (80s Irish youth groups had a Fr. Noel Furlong in charge in reality, this is not satire or comedy so much as documentary):

https://www.youtube.com/watch?v=lk_-E2eBrrQ


>(I don’t know why it went down so much from late 2023 to early 2024.)

Probably just a general expression of there being no major public-facing AI improvements during that period, plus the hype hangover setting in.


"..there’s a group called the “postrationalists” who were vaguely inspired by rationalist writings but also think the emphasis on facts is boring and autistic and we need to focus more on creativity/friendship/woo/intuition/vibes."

- Every ideological group that grows to a certain size reproduces within itself the divisions that separated it from its ideological competitors at an earlier and smaller stage of its existence.

...a sociological law that has yet to be named.

Concerning why groups, if they grow (rather than imploding and becoming sects), may tend to reproduce the cultural/ideological distinctions that earlier set them apart from other groups: a possible mechanism is that members of the group compete to shine in the eyes of group members. And one way to shine is to take an idea from outside the group, twist it so that it can sort-of be seen as an innovation within the group, and then use that to signal some kind of locally superior intellect/coolness/whatever.

Just a speculative hypothesis, of course. I do not know if it can help explain the emergence of post-rational rationalists.


> (I don’t know why it went down so much from late 2023 to early 2024.)

It may have been related to AI-generated movies being released? I know of one (Scalespace) which got screened around that time.


The Nermit bot - the one that does future search - would it be possible now to make a version that addresses questions of personal interest? For instance, deciding whether to rent for another couple of years or buy, and if the latter, where to buy. Relevant issues are what home and rental prices are likely to be in a few years in various parts of the state, future changes in interest rates, building projects that are in the works in different areas, etc.

Seems like the steps you'd have the AI take would be pretty much the same ones Nermit takes in formulating predictions, except that you'd need to enter relevant personal data about savings, etc., and maybe tell it where to look for relevant info about what's up with housing in one's area. A rough sketch of what that might look like is below.

Jeremy Howard teaches a bunch of courses in what he calls "Fast AI." Students learn ways to train an AI that can be done on a home computer. I was looking at the forum for students in a beginner's class, and there was a farmer training his AI to recognize areas with a lot of weeds from photos of his land. That has me thinking about personal uses of AI.
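
Here is a guess at how the decomposition might look on a personal question - not FutureSearch's actual pipeline, just a sketch; the model, prompts, and personal data are all illustrative, assuming the openai v1 SDK:

```python
from openai import OpenAI

client = OpenAI()
CONTEXT = "Savings: $90k. Current rent: $2,400/mo. Area: Austin, TX."  # made up

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def personal_forecast(question: str) -> str:
    # Step 1: decompose into forecastable sub-questions (prices, rates, supply).
    subqs = ask("Break this decision into five forecastable sub-questions "
                f"about prices, interest rates, and local construction:\n{question}")
    # Step 2: estimate each one (web research would slot in here), then
    # combine the estimates and the personal context into a recommendation.
    return ask(f"Personal context: {CONTEXT}\nDecision: {question}\n"
               f"Sub-questions:\n{subqs}\n"
               "Estimate each sub-question, then give a recommendation "
               "with a probability that it's the right call.")

print(personal_forecast("Rent for two more years, or buy now?"))
```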


Duncan Horst ... Traditionally the name Duncan means "Dark Warrior" and Horst is German for Ghost. So this name more properly means Dark Warrior Ghost.


I hate love-reality shows, San Francisco, and polyamory; like rationalists, tech, and betting; and am neutral with high variance on Aella. Should I watch Bet on Love?


(Disclaimer: I'm a research scientist at FutureSearch.ai)

Regarding

> it doesn’t sound too outlandish that you could apply the same AI to conditional forecasts, or to questions about the past and present (eg whether COVID was a lab leak

We found that, somewhat surprisingly, reference-class forecasting quite generally goes a long way. Asking "Was COVID-19 a lab leak?", for example, we get some decent, or at least interesting, reference classes like

> Portion of "Governmental or authoritative bodies have assessed that a high-consequence event is likely the result of a specific type of human error or accident." where "Subsequent investigations or consensus within a period of up to 5 years confirm the initial assessment as the primary cause of the event.": 55%

(based on events like the Deepwater Horizon oil spill (2010), Space Shuttle Columbia disaster (2003), Chernobyl disaster (1986), Three Mile Island accident (1979), Banqiao Dam failure (1975), Grenfell Tower fire (2017), Boeing 737 MAX groundings (2019), and 13 more examples)

and

> Portion of "Pathogen samples have been identified at a site associated with the initial outbreak of a disease" where "The disease is determined to have originated from that site within 5 years of the initial outbreak": 68%

(based on viruses like Zika, Ebola, various avian influenza viruses, and H1N1)

as well as some nonsensical ones.

Also, just to confirm Scott's conjecture: in the hope of finding at least one good approach, we do try many different ways of tackling a question (many of which end up being rather underwhelming, granted) and then try to give more weight to the better ones. Of course there is still a lot of work to be done, since we can do a lot better than just asking GPT-4 what's good and what's not.
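
As a toy illustration of the aggregation step (not our actual weighting scheme; equal weights are arbitrary, and this ignores the question of which hypothesis each class actually supports), here's one standard way to pool the two base rates quoted above:

```python
import math

def pool_base_rates(rates: list[float], weights: list[float]) -> float:
    """Weighted log-odds average of several reference-class base rates."""
    logit = lambda p: math.log(p / (1 - p))
    z = sum(w * logit(r) for r, w in zip(rates, weights)) / sum(weights)
    return 1 / (1 + math.exp(-z))

# The 55% and 68% classes above, weighted equally:
print(f"{pool_base_rates([0.55, 0.68], [1.0, 1.0]):.0%}")  # ~62%
```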


It would be nice if you could attach one of these AI probability generators to all AI answers. Part of the problem with current AI is that it's so often confidently *wrong*. It spits out all answers with the same tone of total certainty, so I can't rely on it for anything unless I have some other way to check it. A probability, even a rough estimate, would be an easy way to know when it's worth paying attention to, and when we should disregard it.
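
One cheap approximation (a sketch only; the prompt format is illustrative, and models' self-reported confidences are known to be poorly calibrated, so a real system would calibrate them against resolved questions first):

```python
import json
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str, threshold: float = 0.8):
    """Ask for an answer plus a self-assessed P(correct); drop low-confidence ones."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            "Answer the question, then estimate the probability your answer "
            "is correct. Reply as JSON: {\"answer\": ..., \"probability\": ...}"
            f"\n\nQ: {question}"}],
    ).choices[0].message.content
    parsed = json.loads(reply)           # assumes the model complied with JSON
    if parsed["probability"] < threshold:
        return None                      # i.e. "disregard this one"
    return parsed["answer"]
```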


While I take Scott's point about the Swift Center's question, I have to admit that I interpreted the question the same way the Swift Center did. It's not clear to me that most people would think to interpret it in any other way!


I've only seen the first few minutes, but the musical looks pretty funny as a self-satire of the rationalist community. Everyone is from San Francisco, and it hits the usual jokes about Yudkowsky, shibari, polyamory... I think people here, at least, would enjoy it.
