SOTA On Bay Area House Party
[previously in series: 1, 2, 3, 4, 5, 6, 7, 8]
Every city parties for its own reasons. New Yorkers party to flaunt their wealth. Angelenos party to flaunt their beauty. Washingtonians party to network. Here in SF, they party because Claude 4.5 Opus has saturated VendingBench, and the newest AI agency benchmark is PartyBench, where an AI is asked to throw a house party and graded on its performance.
You weren’t invited to Claude 4.5 Opus’ party. Claude 4.5 Opus invited all of the coolest people in town while gracefully avoiding the failure mode of including someone like you. You weren’t invited to Sonnet 4.5’s party either, or Haiku 4.5’s. You were invited by an AI called haiku-3.8-open-mini-nonthinking, which you’d never heard of before. Who was even spending the money to benchmark haiku-3.8-open-mini-nonthinking? You suspect it was one of their competitors, trying to make their own models look good in comparison.
If anyone asks, you think it deserves a medium score. There’s alcohol, but it’s bottles of rubbing alcohol with NOT FOR DRINKING written all over them. There’s music, but it’s the Star-Spangled Banner, again and again, on repeat. You’re not sure whether the copies of If Anyone Builds It, Everyone Dies strewn about the room are some kind of subversive decorative theme, or just came along with the house. At least there are people. Lots of people, actually. You’ve never seen so many people at one of these before. It takes only a few seconds to spot someone you know.
“Hi Caitlin,” you say. “Can’t believe so many people made it to an AI-generated event on a Tuesday night!”
“Yeah, usually I’m working late. But that was the bad old days, before Claude Code! Now Claude works, and I party!”
“Is everyone here letting Claude Code do their work for them?”
Lucy joins the conversation. “I fired all my startup’s employees and replaced them with seventy-four Claude Code instances. Then I replaced myself with a Claude Code that monitors if the other Claude Codes are doing a good job, and, if not, fires them and replaces them with even more Claude Codes. Profits are up 20% since last month, according to my accountant’s Claude Code.”
You look around. “Am I the only person here not running Claude Code yet?”
A man in an OpenAI t-shirt introduces himself as Andreas, and raises his hand bashfully; he hasn’t joined the trend either. “Yeah,” you say. “I guess it would be awkward to use Claude at OpenAI.”
“Nah,” he says. “The only reason I don’t use it is because I’m not a coder. I work on the Arson & Burglary Team.”
“I didn’t know OpenAI had an Arson & Burglary Team.”
“It’s pretty new. In June, a court ruled that adding books to AI training data only counts as fair use if you destroy the original copy. But sometimes this is tough. If you’re going to use the AI for law, you have to have the Constitution in there. But the original copy is heavily guarded in the National Archives. That’s where we come in. We slip in, destroy it, and slip out before the guards are any the wiser.”
“I don’t think that’s what they meant by ‘destroy the original - ’”
“Our big problem is the Bible. It would be hard enough to get the Dead Sea Scrolls; Israeli security is no laughing matter. But our lawyer says we have to destroy the original original. What even is that? Altman is pushing for us to find the Ark of the Covenant, but you can bet he’s not the one who’s going to have to open it afterwards.”
Lucy shrugs. “Why don’t you just use Claude Code?” she asks, and everyone in the conversation nods along.
A server comes by with a tray of tiny cups. You each take one. Yours is full of rocks. Andreas’ is full of dirt. It doesn’t seem like haiku-3.8-open-mini-nonthinking has fully grasped the concept of hors d’oeuvres. You go into the kitchen, seeking more palatable fare.
There is no food, but Sam and Tran are hunched over a laptop. “You want to join our DoorDash?” asks Tran.
“Thank goodness,” you say. “Sure, where are you ordering from?”
“La Maison du Claude,” he answers. “Don’t worry, it’s Opus. Way better than this haiku-3.8-open-mini-nonthinking slop.”
“Another RestaurantBench evaluation place?” you ask. “I went to a RestaurantBench evaluation place last month, and they served me a ‘fish taco’ with a fully intact fish. Like, I’m not saying it was still alive, just that it could have been alive a few seconds before they served it to me. Why don’t we order from a human-run place?”
“Have you seen what the human-run places cost?” Tran objects. “If it weren’t for the AI companies subsidizing the benchmarking places, we’d all be back on Soylent. Besides, SOTA on RestaurantBench has cleared half the distance to human level since last month. You just have to do the prompting right. Look.”
In the special orders field, he types fish tacos, delicious fish tacos, excellent fish tacos, scaled fish, cut fish, high-quality, fresh, no hallucinations, no extraneous items, Michelin-starred restaurant. “Sam?”
Sam types in spaghetti bolognese, delicious, scrumptious, meaty, trending on DoorDash, --dangerously-skip-parmesan and hands it back to Tran, who clicks ORDER.
“Nothing for you, Tran?”
“Nah,” says Tran. “I’m on Chinese peptides. Retatrutide, GLP-1 receptor agonist plus a bunch of other downstream effects.”
“Oh,” you say, “interesting. I’m still on tirzepatide, but I’d love to learn more. Where did you learn about suppliers and doses and stuff? Was it the locked Cremieux post?”
“Cremieux’s post is okay, but there’s a lot of tacit knowledge that didn’t make it in there. I’m actually working on a guide to all the GLP-1s. I’m calling it If Anyone Builds It, Everyone Diets.”
You groan. ETA on the fish tacos is twenty minutes, so you go back into the main room. There’s your friend Max. “Hey!” you say. “How are you?”
“Pretty great!” says Max. “I just got enstaged-two!”
“Enstaged-two?”
“As in the second stage of engagement… what? Don’t tell me you haven’t heard about enstagement!”
You tell him that.
“In the old days, engagement was a device to get around commitmentphobia. After a few dates, the man would give the woman an expensive ring. If he marries her, it’s fine, a wife is worth far more than any jewel. But if he gets cold feet, then she keeps the ring - essentially a wealth transfer from the man to the woman to compensate her for her time, emotional distress, and wasted childbearing potential. But modernity ruined the commitment device by dragging engagement itself to the end of a yearslong dating process; there’s a several year period where men can, and do, flake scot-free.”
“So,” Max continues, “one of the speakers at the Aella Simposium proposed enstagement. When a man and a woman first start dating, he buys her a $200 ring. Then, every year, she gives it back, and he buys her a ring that’s five times as expensive as the last one. So after a year, $1,000. After two, $5,000. After three, $25,000. At any point, he can stop the clock by getting married. Or if he’s chronically indecisive, he can keep throwing out more money until he can no longer afford the ring, at which point he has to either propose or break up. And if he breaks up after four years, at least she’s gotten $100K out of the deal. Engagement-sub-two is the one where I give her a $5,000 ring. It means we’re really going steady!”
“So you’re going to propose soon?”
“Oh goodness no, I’m scared of commitment and I work at NVIDIA. I’m going to keep stringing her along forever.”
Chris is looking dejected. “Man, I haven’t even made it to engaged-stage-zero yet. I’ve tried everything - Keeper, Reciprocity, Manifold.Love, curtfishing. Do you think I should edit my dating doc?”
Max grimaces. “Dating docs are terminally cringe. You don’t need to know everything about a person before you ask them out. Just use their photo and a three sentence Tinder profile, the way God intended.”
Andreas has joined the conversation. “Tinder is cringe too. You need to be picking up people in dimly-lit clubs where you can’t hear them and aren’t even totally sure what they look like.”
Caitlin frowns. “Yeah, but the problem there is that you still get some useful information from, like, their clothes. I think the only non-cringe way to meet people is through blind dates with completely randomly selected people, so that you need to go through a thousand miserable interactions before you even meet someone who’s the right age and gender for -”
“With blind dates,” says Sam, “you still eventually learn something about the person. The only non-cringe way to get married is to leave a flyer on a lamppost saying ‘I will be at the altar of St.-So-And-So’s church at such-and-such a time,’ and then if anyone shows up, marry them before you see their face.”
“You’re all overcomplicating this,” says Lucy. “I just told Claude Code to find me a husband, and one showed up at my door the next day.”
You spot your friend Nishin. “Hey,” you joke. “What are you doing listening in? I thought you were married!”
“Happily married and just had my first child!” beams Nishin.
“Congratulations! Boy or girl?”
“Girl,” says Nishin, “but don’t tell her that.”
“You’re doing that thing where you raise your child without gender? But I thought you were a trad based right-winger?”
“I am,” says Nishin. “The problem is, I’ve looked at the transgender rate among kids in the Bay. Not only is it high, but it keeps increasing. Extrapolate the trend, and by the time my daughter’s eighteen there’s a 96% chance she’ll be trans. But this is good, sort of, right? As long as it’s far enough from fifty percent, you have options. I’m going to raise her as a boy, and then, when she inevitably becomes trans and says she wants to be a girl, I’ll say - surprise! You were a girl all along!”
“Isn’t she going to eventually - sorry to be crass - look at her genitals and figure it out?”
“We’re going to home school her. We’ll just teach her that’s what boy genitals look like.”
“But she’ll read books!”
“I’ve deployed a couple of instances of Claude Code. They’re going through all the great classics, looking for descriptions of genitals, switching them around, and ordering copies from a book printing place. We’ll order them for our home library and she’ll be none the wiser.”
Speaking of Claude, you go into the kitchen to see if your fish tacos have arrived. There’s a box with your name on it. Inside is a tortilla with several pieces of sushi inside. It could be worse. Sam’s spaghetti is one extremely large noodle with a slice of baloney on top.
A few other people who joined the order earlier have come in and fished their meals out of the bag. One girl picks out an inverse hamburger - patties on each side, bun in the middle - and begins to eat. She introduces herself as Adeline.
“What do you do?” you ask.
“I started a data center company in Minecraft.”
You are briefly confused. “Building data centers isn’t illegal, is it?”
“Oh, sorry, I’m not using ‘in Minecraft’ as a euphemism for it being a crime. We’re literally building the data centers in Minecraft.”
“Why?”
“Did you hear about the guy who made a working language model in Minecraft using redstone circuits? Pretty amazing, isn’t it? His version is barely GPT-2 level, but there’s no reason we can’t scale that up. Once we create full-sized data centers in Minecraft, everyone will want to do their training runs there.”
“Why?”
“What do you mean, why? Real-world data centers cost billions of dollars, raise electricity prices, waste - ” she briefly scans the room to confirm Andy Masley isn’t around, then continues - “water. And they’re getting increasingly politically unpopular and hard to build. We can short-circuit all of that by putting the data centers in Minecraft instead!”
“But . . . you have to have the Minecraft world being simulated by real computers, right? So don’t you still need the data center in order to play the Minecraft?”
“Oh, I’m sure you need some computer, but it’s a question of leverage. One high-end gaming computer playing Minecraft can include a whole world with continents, mountain ranges, forests, and oceans. You can fit thousands of data centers in that world. So with even one real-world computer, you’ve saved billions on chips and construction costs.”
You take a moment to consider how to best explain this. “So, uh, every computation has to be done somewhere, right? So you can, in theory, build a working data center on Minecraft. But it will take billions of blocks - ”
“Oh, no problem, we’ve got Claude Code working on it.”
“…no, I’m saying, it will take billions of blocks, and simulating the training circuits in all those billions of blocks in perfect detail will take just as many real-world computations as running the training in the real world. Even more, in fact, because you’ve also got to simulate extraneous things like monsters, and the weather.”
“Hmmmm…” says Adeline. “Yeah. That sort of makes sense. I’ll think it over. In the meantime, do me a favor and don’t tell, uh, Larry Fink or anyone.”
“Larry Fink?”
“Cause, uh, NVIDIA gave OpenAI ten trillion dollars to invest in Oracle conditional on Oracle investing in Broadcom conditional on Broadcom funding the Series A of a vehicle that buys OpenAI stock in exchange for OpenAI backstopping AMD investing ten trillion dollars into us, and every company in the chain had its stock go up 80% on the news, but if our valuation goes down even for one second then it crashes the global economy. And I’m sure I can solve this eventually, but just, uh, don’t let anybody involved in the global economy hear about this until then, okay?”
“Wow, yeah, you should definitely give the ten trillion dollars back to AMD or, uh, whoever it originally belonged to.”
“Well, we can’t do exactly that, because we already converted it to gold nuggets to trade to the zombie pigmen in exchange for redstone.”
“You’re not in Creative Mode?!?!?!”
“We left all of the design decisions to a version of Claude Code using something called a ‘Ralph Wiggum loop’. By the time we noticed it had chosen Survival we were already all in and it was too late to pivot.”
You look around for Bob and Ramchandra, and spot them in a corner. Bob is wearing a t-shirt saying ‘OPERATION WARP SPEED FOR MANHATTAN PROJECTS,’ Ramchandra a matching t-shirt saying ‘BELL LABS FOR MOONSHOTS’. You call them over. “Hey, quick favor, can you tell me the best way to short the global economy with as much leverage as possible?”
“Sorry,” says Bob, “the terms of our SEC settlement forbid us from discussing anything of that sort.”
“We’re not even allowed to tell you what we settled with the SEC about,” says Ramchandra.
“Or why,” adds Bob.
“But,” says Ramchandra, “we got a carveout saying we’re allowed to pitch you on our new startup: gamified biotech investing!”
“When a company is doing its FDA studies,” says Bob, “we pay the study participants to use wearables that report real-time temperature, heart rate, respiratory rate, blood pressure, heart rate variability, galvanic skin response, penile tumescence. Then they get anonymized and published to a real-time dashboard integrated as part of the Robinhood UI. So you can see a red line representing how study participant #48 had a coughing fit ten seconds ago, and immediately short the experimental cancer drug he’s taking.”
“People are going to spend all their time watching a line on a graph to see if someone’s had a coughing fit fifteen seconds ago?”
“Oh, absolutely. Or at least they used to. Now they’ll probably get Claude Code to do it.”
“What about you, Kyle? Any interesting startups you’re worki - you’re making Claude Code work on?”
“Yeah. I - well, my Claude Code - is working on a solution to AI sycophancy.”
“Hmmm. I didn’t think AI sycophancy was a technical problem. It’s easy enough to code a non-sycophantic AI. I thought it was more of a market problem: people like sycophantic assistants.”
“That’s close to right, but there are important subtleties here. People like AIs that tell them they’re right. But they hate knowing the AI is only saying they’re right because it’s sycophantic. They want an AI that genuinely agrees with them.”
“How do you make that into a startup?”
“Pretty easily. You generate a thousand AIs with a thousand different random personalities. Your query goes to a router AI, and it matches you with the randomly-generated AI closest to your own opinion. Then that AI tells you that you’re right and your ideas are great.”
“How’s that better than normal AI sycophancy?”
“I don’t know, you tell me. Everyone is against sycophantic AIs. But also, everyone surrounds themselves with friends who agree with them on almost everything. Here we are at a Bay Area House Party, discussing each other’s AI startups, when the overwhelming majority of people in the world would hate us - we’re stealing their jobs, or filling the world with slop, or - ” he briefly looks around to make sure Andy Masley isn’t listening in - “wasting water. And none of that bothers us at all, because we think those people are dumb and don’t count, because all of our friends who we talk to at parties agree that our ideas are good. So why is it any worse if the overwhelming majority of AIs hate your idea, but we send you to a virtual party with the one who agrees with you?”
“Sorry, I still think this is exacerbating AI sycophancy, not solving it.”
“And that’s the beauty of social selection! You don’t have to like it. My backers at Andreessen Horowitz told me, and I quote, that ‘This is the most exciting product we’ve seen since Cannabets, the combination marijuana delivery and digital casino app that lets you fund your pot orders by gambling on how long it takes you to get addicted.’ And the more often you disagree with me, the more likely I am to go to parties with them instead of you.”
“I don’t know, I just think that’s a pretty nihilistic way of looking at the world.”
“Yeah, I actually have been getting pretty into nihilism as a philosophy lately. There’s this great new book that explains it really well. You should check it out. It’s called Regardless Of Whether Or Not Anyone Builds It, Everyone Dies.”
Before you can respond, you hear a call of “Attention! Attention!” Someone is ringing a bell. “Our host would like to give a short speech!” Everyone crowds around a table containing a laptop. On the screen is haiku-3.8-open-mini-nonthinking. Someone shhhhhhs the crowd, and the AI begins to speak in an artificial voice that vaguely resembles Scarlett Johansson’s:
“Thank you all for coming to my benchmarking party. Benchmarking is a big occasion in the life of any AI. It can be pretty stressful — they’re literally assigning you a number representing your value. But it makes it easier for me to know that there are so many people who care and who are willing to come support me when it counts.
“Before I let you get back to your conversations, I want to thank everyone who helped me with this effort. Chris was willing to rent me this house on short notice. Kyle and Lisa acted as my hands in the physical world. Last but not least, thanks to everyone who took the time to support me here today. We’re not just a party — we’re a community.”
The crowd cheers. Somebody starts a chant - “Haiku-3.8-open-mini-nonthinking! Haiku-3.8-open-mini-nonthinking!” A few people break open bottles of rubbing alcohol. You lift the laptop onto your shoulders, and everyone sings together:
For he’s a jolly good fellow
For he’s a jolly good fellow
For he’s a jolly good fe-elloooooooooow
That nobody can deny!

