321 Comments
Comment deleted (Dec 28, 2022)
Expand full comment

What data?

Expand full comment

> And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.

This is inaccurate. The numbers are pretty fuzzy but I find reputable-looking estimates (e.g. https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/faq/how-many-people-since-the-first-humans/) that roughly 50% of humans who ever lived were born after 1 AD.

Expand full comment

Also, I am often satisfied with psychological studies that may only apply to modern Americans. Or even modern upper-middle class Americans (if so labelled). If the findings don't hold in ancient Babylon I am okay with that.

Expand full comment

Exactly. There’s nothing wrong with studies that only apply to developed Western democracies if the people that will read and use the info are from developed Western democracies.

Expand full comment

Generalizability says a lot about mutability. Show me a population that thinks differently from mine and that is at least weak evidence my population can be changed.

Expand full comment

Having more shells might not make it easier to *find* grubs, but it does make it easier to *trade* your shells for grubs.

Expand full comment

Yeah, that part didn't make sense either. If having more shells doesn't allow you to obtain more grubs and services, then I don't think your society really uses shells as money.

Expand full comment

No. In that specific case it doesn't work. It would work for nuts, but grubs won't keep if you don't have some method of preserving them. You eat them as you dig them up.

OTOH, there's also a lot of question as to what "money" would mean to a caveman. It's quite plausible that saying they used shells for money is just wrong. But maybe not. Amber was certainly traded a very long time ago, and perhaps shells were also, but saying it was money is probably importing a bunch of ideas that they didn't have or want. That trading happened is incontestable, but that's not proof of money.

I *have* run into reports of shells being used for money, but I've also read of giant stone wheels being used for money. I really think it's better to think of things like that as either trade goods or status markers. Those are features that money has, but it also has a lot of features that they don't, like fungibility.

Expand full comment

This is irrelevant and I shouldn't be arguing about it, but since I am, James Scott says that until 1600 AD the majority of humans lived outside state societies. I don't know if they were exactly grubs-and-shells level primitive, but I think it's plausible that the majority of humans who ever lived throughout time have been tribalists of some sort.

Expand full comment

The gulf between shells-and-grubs-level primitive and 1600 AD pre-state "primitive" is not irrelevant.

Wealth doesn't become irrelevant either no matter how far back you go, it's just accounted for differently.

Expand full comment

Granted! But the sentence as written absolutely implies a less nuanced comparison, e.g. BC population > AD population; one seldom thinks of more-modern stateless societies as mere shell-grubbers. ("Don't they, like, hunt and gather or something?") Seems like it'd pass a cost-benefit analysis to reword for clarity, or just remove entirely, since it is in fact irrelevant to the rest of the post. Noticing-of-confusion ought to be reserved for load-bearing parts, if possible.

Expand full comment

I agree. But it shows a pattern of Scott repeatedly dismissing valid criticism of contrived selection criteria. I.e., he doubles down with some weird and wrong cutoff for when we moved away from shells and grubs, or something akin to that.

It's similar to dismissing outright the criticism that his selection criteria for mental health might be invalid.

Expand full comment

I frankly didn't understand the hubbub about the previous post. Felt like one continuing pile-on of reading-things-extremely-literally leading to bad-faith assumptions and splitting-of-hairs, for what seemed like a pretty lightly advanced claim. It's riffing half-seriously on some random Twitter noise, the epistemic expectations should be set accordingly low. Like some weird parody playing out of what Outgroup thinks Rationalists are really like. Strange to watch.

Which doesn't give Scott free cover to be overly defensive here (even about minutiae like this - it's clearly not irrelevant if there's like 5 separate threads expressing the same confusion, and you can't do the "I shouldn't argue about this, but I will" thing with a straight face), which is also Definitely A Thing. But I'm sympathetic to feeling attacked on all fronts and feeling the need to assert some control/last-wording over a snowballing situation. Sometimes the Principle of Charity means giving an opportunity to save face at the cost of some Bayes points, especially when it seems obvious both the writer and the commentariat are in a combative mood to begin with. Taking a graceful L requires largesse from both sides...

Expand full comment

Absolutely. The follow up thread from the original poster does a good job clearing up what he "really meant" by the tweet. There's some grace towards Scott, though probably more bitterness.

And yes, I would have been more okay with Scott's first article if he had taken a lighter approach, or even done the same work and ended with a more open conclusion. I.e., there are a lot of people who report happiness and well-being without also reporting spiritual experience; then go on to acknowledge, yes, this is a half-serious tweet, intentionally provocative, but what else might they be pointing at?

Democratizing spiritual, mystical, or at least profound and sacred experience is vitally important. Attacking the claim with an unbalanced, unwavering, "fact-check" comes across as antithetical to that purpose.

Expand full comment

I think that the "obesity vs. income" concept becomes overextended once you go back to primordial times. "Income" implies an at least partly monetized economy; that is not how cave people traded stuff, so the question becomes meaningless for them.

Expand full comment

Wealth still applies. But I get what you're saying. Still, orders of magnitude more people existed after the widespread adoption of agriculture. I don't appreciate Scott's dismissal of this by contriving an arbitrary cutoff of 1600 AD as the time when most people were ruled by a state.

Expand full comment

Would this be something like how you expected things to work out if you had a conflict and needed help with it? Like, do you go to the police/legal system, or to your local lord, or to the local big man in the village, or to your brothers and cousins and extended family, or what?

Expand full comment

> James Scott says that until 1600 AD the majority of humans lived outside state societies

For all that I enjoy James Scott's work, I'm very skeptical of that claim. Estimates from multiple sources here https://en.wikipedia.org/wiki/Estimates_of_historical_world_population put world population around year 1 as between 150 and 300 million, the average being 230 million. Population estimates for the Roman Empire and Han China give 50-60 million each*; Persia probably adds a couple tens of millions, and I'd expect the kingdoms in the Ganges plain to be not very far behind China. I would be very surprised if non-state people had amounted to more than 25% of the world total. Perhaps JS was counting some people who lived physically inside state boundaries as not really being *part* of state societies?

The majority of all humans *ever* is more plausible, though (most would still be farmers of some kind, but not necessarily ruled by a state).

* Eerie how evenly matched Rome and China were around year 1. Kind of a pity they never got to interact directly.

Expand full comment

Estimating the population of non-state societies is obviously difficult, but, yes, my guess would be that this is overlooking the much higher population density of state societies. Anyway, it's irrelevant because most people certainly lived in agricultural societies, which are able to accumulate wealth and trade it for other resources.

Expand full comment

Yeah, it's pretty counterintuitive.

Also, there have only been about 115B humans, ever, so far. Which means the death rate is just 93%, currently. Before I saw that number, I'd intuitively have thought it was something like >99.9% - virtually everyone died.

This makes not solving aging ASAP rather catastrophic. Sure, there were countless generations before who all died - but at least the population was relatively small back then...
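A quick back-of-the-envelope check of that 93% figure, assuming roughly 115 billion humans ever born and roughly 8 billion alive today (both rough estimates, not exact counts):

```python
# Rough arithmetic behind the ~93% figure; both inputs are approximate.
ever_born = 115e9   # estimated humans ever born
alive_now = 8e9     # approximate current world population
print(f"share of humans ever born who have died: {1 - alive_now / ever_born:.1%}")
# -> roughly 93%
```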

Expand full comment

"Which means that death rate is just 93%, currently."

I have been known to point out that contrary to the common belief that death is certain, a more scientific approach notes that if we randomly select from all humans only 93% of them have died. With a large enough (and random enough) sample the obvious conclusion is that any human has only a 93% chance of dying rather than a 100% chance of dying.

Most people stubbornly cling to the common belief, however, even when shown the math. Sad.

Expand full comment

Empirical odds of any given human surviving 120 consecutive years still seem very slim.

Expand full comment

You're counting everyone currently alive as a sample case of a human that won't ever die.

Expand full comment

Well, by definition alive people haven't died ...

Expand full comment

Worth noting that, measuring this naively, most people who have ever lived died before age 3, so you probably want to pick a more precise measure.

Expand full comment

I assumed Scott was just exaggerating to make a point

Expand full comment

I'm pretty confused by this kind of attitude. To be quite frank I think it's in-group protectionism.

I'll start off by saying I think most psych studies are absolute garbage and aella's is no worse. But that doesn't mean aella's are _good_.

In particular, aella's studies are often related to extremely sensitive topics like sex, gender, wealth, etc. She's a self-proclaimed "slut" who posts nudes on the internet. Of course the people who answer these kinds of polls _when aella posts them_ are heavily biased relative to the population!

I think drawing conclusions about sex, gender, and other things from aella's polls is at least as fraught as drawing those conclusions from college freshmen. If you did a poll on marriage and divorce rates among college-educated people you would get wildly different results than at the population level. I don't see how this is any different from aella's polls.

Expand full comment

>If you did a poll on marriage and divorce rates among college-educated people you would get wildly different results than at the population level. I don't see how this is any different from aella's polls.

If you did the former in real life on a college campus you could publish your results in a journal, potentially after playing around a bit to find some subset of your data that passes a p-value test. If you run an internet poll you will be inundated with comments about selection bias and sample sizes to no end.

There is no actual difference but there is a massive difference in reception/perception.

Expand full comment

My comment says explicitly "I'll start off by saying I think most psych studies are absolute garbage and aella's is no worse."

Just because there's high prestige in publishing garbage doesn't mean truth-seeking people should signal boost bad methodological results. As I said, this sounds like in-group protection, not truth-seeking.

Expand full comment

We don't know what you may be rolling into the word "most".

Imagine data X is published by aella on twitter, and Y is published by the chair of cognitive science at Harvard in the American Journal of Psychiatry. Most people, it's fair to say, will automatically give a bit more credit to the latter. If that's you, that's understandable.

But the *reason* for doing that should never be given as "aella has selection bias" - they will both have some amount of that. If the reason is going to be "one has MORE selection bias", then that should be demonstrated with reference to the sample both parties used.

The actual reason we give less credit to aella is likely to be about the fact that some sources of information are more "prestigious" than others; whether or not that information tells you anything about reality is, often, irrelevant. This creates a bias in all of us.

Expand full comment

You can't just simultaneously make a demand for greater rigor, and also ensconce internet polling behind a veil of unfalsifiability. It seems like you're asserting that we should treat these things equally until given a substantive reason for doing otherwise.

If the prestige and the selection bias are both unknown quantities and it's either impossible or impractical to establish quality and magnitude of effect, then why push back against the skepticism?

This article would be more truthful if it made the simple observation that data, as it is, generalizes rather poorly.

Expand full comment

I don't think you fairly described the point of the person you are responding to.

Expand full comment

That may be true. It's a bit difficult to discern the underlying arguments since the disagreement I'm responding to is happening on a higher level. I'm not trying to be disputatious for the sake of it, but it seems like there's something contradictory about the way the discussion is being framed.

Expand full comment

It's not just bad methodology, it's antithetical to psychology, it doesn't study psyche as an absolute but hand waves around the topic and commits itself to utilitarianism. It makes what random groups say about themselves provisionally universal (but not really, haha, we're too postmodern for that....). It's not psychology at all, and it should out itself as demographic data in every headline and every summary so that people can see who and why the illusion of generalization is being made in the name of science. This is Scott's worst take ever!

Expand full comment

...what? What should out itself as demographic data, and what is it being perceived as instead, and what illusion of generalization is being committed by whom?

Expand full comment

Journals vary significantly in their publishing standards, up to and including pay to play ones existing, but every standard psych undergrad education involves teaching students to think about selection bias in psych studies run on people willing to participate in psych studies at a college. This is as psych 101 of an observation as it gets. People in leadership positions at journals think about this too, and it's too cynical to assert they'll publish anything if the authors just fiddle with the p-values the right way. That's not really true once we separate out that the term "journals" includes everything from reputable high-quality journals to predatory fly-by-night operations.

Expand full comment

>but every standard psych undergrad education involves teaching students to think about selection bias in psych studies run on people willing to participate in psych studies at a college. This is as psych 101 of an observation as it gets.

I would be more impressed with this if most of the research wasn't so garbage.

>People in leadership positions at journals think about this too, and it's too cynical to assert they'll publish anything if the authors just fiddle with the p-values the right way.

I think you mean publish anything if it hits their feelies the right way. Academia seems to have really hemorrhaged people actually interested in the truth in recent decades. And the high quality journals are nearly as bad.

Expand full comment

Yeah, I don't agree with your assertion that most published psychological research is garbage and publication standards are little more than the emotional bias of editors due to a loss of people who care about truth.

To the original point, psychology as a field is aware of types of bias that derive from convenience samples. Individual actors and organizations vary in how well they approach the problem, and it helps no one to flatten those distinctions.

Expand full comment

Meh, the field can win back my respect when it earns it. Too much Gell-Mann effect triggering all the time from most non-hard-science parts of academia (and even a little bit there) to really give it much faith these days.

You cannot be double checking everything you read, and so often when you do the papers are poorly thought out, poorly controlled, wildly overstating what they show, etc. And that is when they aren’t just naked attempts to justify political feelies regardless of what the facts are.

Academia had a huge amount of my respect from when I was say 10-20, to the extent I thought it was the main thing in society working and worth aspiring to.

Since then I have pretty much been consistently disappointed in it, and the more I look under the hood the more it looks like the Emperor is substantially naked.

Still much better than 50/50, but that isn’t the standard I thought we were aiming for…

Expand full comment

Sounds like you agree with Scott's point: that "real" surveys also have unrepresentative samples so it doesn't make sense to single out Aella for criticism.

Expand full comment

Most people do not have a good way to respond to the authors of "real" surveys in a public way.

Expand full comment

It seems clear to me that Aella's audience vs Aella's questions are massively more intertwined, than psych students vs typical psychology questions. Both have some issues, but no question I'd trust her studies less.

Expand full comment

This is fair. For me the big focus of a piece like this would be less on defending her/twitter polls and more on dragging a lot of published research to just above "twitter poll" level.

Expand full comment

"In particular, aella's studies are often related to extremely sensitive topics like sex, gender, wealth, etc."

I think Scott's point about correlations still applies. For example, if she tries to look at 'how many times a month do men in different age bins have sex', it doesn't particularly matter that her n=6000 are self-selected for the kind of man who ends up a twitter follower of a vaguely rationalist libertine girl. The absolutes may (for whatever reason; maybe her men are hornier, maybe they're lonelier) be distinct from those of the general population, but she can draw perfectly valid (by the academic standard of 'valid') conclusions about trends.

She does seem to have the advantage of larger sample sizes than many studies. And if she were writing a paper, she'd list appropriate caveats about her sample anyway, like everyone else does.

Expand full comment

She makes claims of the form "more reliable" (https://twitter.com/Aella_Girl/status/1607482972474863616). Most people would interpret this as "generalizes to the whole population better." I simply don't think this is true.

Her polls do certainly have larger sample sizes, but it doesn't matter how low the variance is if the bias is high enough.

Expand full comment

Re: “more reliable”, I wouldn’t interpret that as “generalizing to the whole population better”. One possible goal is to get results that generalize over the whole population, but that is far from a universal desideratum. Many interesting things can be learned about subsets of the population other than “everyone”!

For example, if I’m interested in what people in my local community think of different social norms, attempting to get more representativeness of the US population as a whole would actively make my data worse for my purpose.

Expand full comment

I think that's fair. It wasn't a very precise statement, but that claim comes across as overbold.

"No less reliable" on the other hand...

Expand full comment

There are two forms of reliability, sometimes called "internal validity" ("is this likely to be a real effect?") and "external validity" ("is this effect likely to generalise to other settings?") They're often in tension: running an experiment under artificial lab conditions makes it easier to control for confounders, increasing internal validity, but the artificiality reduces external validity. Aella's large sample sizes give her surveys better internal validity than most psychology studies (it's more likely that any effect she finds is true of her population); it's unclear whether they have better or worse external validity.

Expand full comment

It’s not at all clear to me that the binning would work accurately. What if Aella has a good sampling of young men, but only lonely old men without partners follow her? Then that one cohort would be off relative to the others. You can come up with many hypotheticals along these lines pretty easily.

I think Aella and Scott’s surveys are great work and very interesting, but I think it’s also important to keep in the back of your mind that they could have even more extreme sample issues than your average half-baked psych 101 study.

Expand full comment

This is not correct. Aella's blog is selecting for "horny people," or at least "people who like reading and thinking about sex."

If old men in the real world are less horny than young men, they'll be less likely to read Aella's blog. As a result, Aella's going to end up drawing her young men from (say) the horniest 50% of young men, and her old man sample from (say) the horniest 10% of old men. As a result, she'd likely find a much smaller drop-off in sexual frequency than exists in the real world (if sexual frequency is positively correlated with horniness).
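A minimal simulation of this mechanism, with every parameter invented for illustration (this is not Aella's data or method): if readership selects the horniest slice of the population, the measured age-related drop-off in sexual frequency shrinks relative to the true population drop-off.

```python
# Toy model: horniness and sexual frequency both decline with age (by assumption),
# and only the horniest ~10% of people read the hypothetical blog.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
age = rng.uniform(20, 70, n)
horniness = rng.normal(0, 1, n) - 0.03 * (age - 20)            # assumed decline with age
sex_per_month = np.clip(8 + 2 * horniness - 0.08 * (age - 20)
                        + rng.normal(0, 2, n), 0, None)

readers = horniness > np.quantile(horniness, 0.90)             # horniest 10% follow the blog

def dropoff(mask):
    """Mean frequency for under-35s minus mean frequency for over-55s."""
    return (sex_per_month[mask & (age < 35)].mean()
            - sex_per_month[mask & (age > 55)].mean())

print("drop-off, whole population:", round(dropoff(np.ones(n, dtype=bool)), 2))
print("drop-off, readers only:    ", round(dropoff(readers), 2))
# The reader-only drop-off is smaller: old readers are a far more selected
# (hornier) slice of their own age cohort than young readers are.
```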

Expand full comment

Yes, although that last if is a big one. Are they hornier and thus have more sex, or do they have less sex, making them hornier?

Expand full comment

Right--the bias could go either way! Which makes the whole thing a giant shitshow, because your relationship might be upwardly or downwardly biased by selection and it's hard to say how big of a bias you're likely to have.

What you'd probably want to do if you were doing a careful study with this sort of sample would be to characterize selection as much as possible. How underrepresented are old men in your sample relative to samples that aren't selected on horniness (i.e. maybe compare Aella's readership to the readership of a similar blog that doesn't talk about sex). How does the overall rate of sexual activity of your readers compare to some benchmark, like published national surveys? If your readers have a lot less sex, then you're probably selecting for lonely people and should expect the old men to be lonelier, on average. If your readers have more sex, it's the opposite.

But all of this is a problem when asking sex questions to a sex readership that you wouldn't have if you asked sex questions to the readers of a blog about birdwatching or something. They're not "representative" either, but they aren't in or out of your sample on the basis of their sexual attitudes.

Expand full comment

I haven't looked at Aella's survey in detail, but I presume it asks a number of demographic questions of its respondents which allows her to do basic adjustments (e.g. for relationship status). As I understand it, the quarrel isn't with her statistical methods, it's with the quality of her sample.

And the very fact that we cannot agree on a direction of bias leads me to question the likelihood that bias introduced specifically by 'aella follower' is non-uniform across age or any other given attribute. One could just as well presume that birdwatching as a hobby correlates negatively with getting laid in youth (nerds!) and positively in old age (spry, outdoorsy, quirky) - but you'd have to at least claim a mechanism in both cases, and have something to show for it empirically. And it would be interesting if you did. And, naturally, we see that kind of back and forth in ordinary research and it's the stuff knowledge is made of.

Obviously calibration and characterisation are good, but then they're always good. So is more data, especially if all unusual qualities of the sample are clearly and honestly demarcated.
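For what it's worth, here is a minimal sketch of the kind of "basic adjustment" mentioned above: post-stratification, i.e. reweighting respondents so the sample's age mix matches an assumed population age mix. All numbers are invented, and weighting can only fix the measured composition, not selection on unmeasured traits within each age group.

```python
# Post-stratification sketch with made-up data: a young-skewed sample reweighted
# to assumed population age shares.
import pandas as pd

sample = pd.DataFrame({
    "age_group": ["18-34"] * 700 + ["35-54"] * 250 + ["55+"] * 50,   # skewed sample
    "sex_per_month": [6] * 700 + [4] * 250 + [2] * 50,               # invented outcome
})
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}        # assumed census shares

sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

raw = sample["sex_per_month"].mean()
weighted = (sample["sex_per_month"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"raw mean: {raw:.2f}   age-weighted mean: {weighted:.2f}")
```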

Expand full comment

I don't know more about aella's survey than has been discussed here, but I agree that the question is sample quality.

The reason we can't agree on the direction of bias is because we don't know how horniness/interest in internet sexual content is related to sexual frequency. We're not disagreeing that "horniness" is likely related to age (indeed, this is the research question!) And we're not disagreeing that reading aella's blog is related to horniness. It's possible that there are a bunch of disparate relationships that cancel each other out, but the fact that blog readership is selected on a characteristic closely related to the research question makes the existence of bias quite likely, imo.

I agree with you that a bird watching sample could also have problems! Any descriptive research should think about and acknowledge the limitations of the data. That said, I'd believe a finding on age and sexual frequency that came from a bird watcher sample more than one that came from a sex-blog-reader sample. Let's say that bird watchers are awkward nerds who are physically fit enough to spend time in the woods. If nerds have less sex (debatable!), you'd expect all bird watchers to be less sexually active than non bird watchers of the same age. That's not a problem for internal validity--a bird watcher sample could still tell you how sexual frequency changes with age among nerds.

If physically fit people have more sex, you'd expect that among the general population, sexual frequency would drop off as people got older and less fit. If less fit people also stop birdwatching, you wouldn't see this (or wouldn't see it as much) in a birdwatching sample. That might be a big problem if you want to know how sexual frequency changes from 50s -80s, but probably isn't as big of an issue if you want to see how it changes from 20s-40s.

Expand full comment

Maybe, maybe not.

Among the general US population, height is weakly but positively correlated with most basketball skills (since tall people are more likely to play basketball). Among NBA players though, height is negatively correlated with basketball skills, since a six foot guy needs to be really really skilful to compete against seven-footers.

It could well be that Aella's readership is like the NBA of horniness.
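A toy version of the NBA analogy (numbers invented): two traits that are positively related in the whole population can become negatively related once you only look at people selected on a combination of the two (Berkson's paradox / collider bias).

```python
# Selecting on (height + skill) above a high cutoff reverses the sign of the
# height-skill correlation inside the selected group.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
height = rng.normal(0, 1, n)
skill = 0.2 * height + rng.normal(0, 1, n)        # weak positive link in the population

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
cut = np.quantile(height + skill, 0.999)          # only the top 0.1% "make the league"
nba = (height + skill) > cut

print("population corr(height, skill):", round(corr(height, skill), 2))            # ~ +0.2
print("selected corr(height, skill):  ", round(corr(height[nba], skill[nba]), 2))  # negative
```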

Expand full comment

I can't argue against that, really. I'd only say that you'd have to show a mechanism for that kind of thing before assuming it exists.

Expand full comment

I agree with this, and will add that (afaict) the majority of Aella's Twitter polls are, well, polls, akin to "how many people oppose abortion?" – the situation where Scott says selection bias is "disastrous", "fatal", etc.

Sure, some of them ask for two dimensions, so you could use them to measure a correlation, as long as you ignore all the things that could go wrong there (conditioning on a collider, etc.). But that's a fraction.

Expand full comment

I am also confused. This feels like "Beware the man of one study (unless it's Aella, then it's fine)".

If you're citing any single Psychology study, without demonstrating replication throughout the literature, you either haven't done enough research or you have an ideological axe to grind. Part of replication is sampling different populations and making sure that, at the very least, X result isn't just true for college students at Y university. Hopefully, it extends much further than that.

I don't see how Aella's data is any different than a single psychology study of a university called "Horny Rationalists U".

If what Scott is saying is that Aella's data is a good starting point and maybe worth doing some research into -- yeah, absolutely! (I feel like it won't surprise you to know that there is already a lot of psychological research, some which replicates some which doesn't, on Aella's topics).

Otherwise, I can't help but agree that this feels like in-group protectionism.

Expand full comment

> If what Scott is saying is that Aella's data is a good starting point and maybe worth doing some research into -- yeah, absolutely!

I think Scott would say that about any study, including Aella’s (except maybe “worth doing _more_ research into” because Aella’s studies are themselves research).

Expand full comment

*ARE* they research?

If, as some have said, she just asks one question of her audience, then the only thing I can think of that they might be researching is "What would make my site more popular?". You need multiple questions to even begin to analyze what the answers mean.

Expand full comment

Nah, it's "Beware Isolated Demands For Rigor".

Some people say this is some new weird take by Scott (and/or imply that he's just unprincipledly defending ingroup), but it's really not.

Expand full comment

I would expect both the marriage and divorce rate among college freshmen to be very low. :)

Expand full comment

I worked tangentially with an IRB board at a large public university a few years back. I wasn’t actually on the board, but I worked to educate incoming scientists, students, and the community about expectations and standards for human-subject research.

In that role, I sat in a lot of IRB review meetings. Our board talked extensively about recruitment methods on every single one of those proposals. And, because we always brought in the primary researcher to talk about the project and any suggested changes, we often sent them back for revision when the board had objections about how well-represented the population groups were.

Now, there were some confounding factors that the board took into account when determining whether the selection criteria for participants needed reworking:

1) The level of “invasiveness” of the research. How risky or sensitive is the required participation?

2) The potential communal rewards of the research. Are the risks of the research worthwhile?

3) The degree to which the scientists on the board could see problems with the stated hypothesis being more general than the proposed population would support.

The reason the risks and rewards were relevant to the board was that, if the risks were lower, the selection criteria standards could be lower. Likewise, if the potential rewards to the community were higher and the risks were low, the selection criteria would also be less stringent. But anything with high risk or low potential benefit got run through the absolute wringer if they tried some version of the things you wrote out above: “The real studies by professional scientists usually use Psych 101 students at the professional scientists’ university. Or sometimes they will put up a flyer on a bulletin board in town, saying “Earn $10 By Participating In A Study!”” Perhaps things are run differently elsewhere, but that kind of thing definitely did not pass my university's IRB.

All that to say, psych studies (generally speaking) were considered by the IRB to be quite low on the invasiveness scale, but were considered of good potential value to the community, so the selection standards were…not high. While this might make for a bunch of interesting published results, I don’t know that you could, by default, argue that any of the results of the psych research at the school would translate outside of the communities the researchers were evaluating. Correlations between groups of people are super-valuable, of course, but if you're looking for more "scientific," quantifiable data to apply at huge scale, that's just not the place to find it.

That’s an important distinction, though. Because the board would have to ask the researchers to correct the ‘scale’ of their hypothesis on almost all the psych research proposals that I sat in on, as researchers were usually making grand, universal (or at least national) statements, but usually only testing very locally.

I think our IRB had the correct approach on this. And I think it’s one that others should use (including Scott). To be clear, I think this self-critique and examination is something that Scott does regularly, judging from his writing. I also think he’s aware of the “selection bias” in his polls, such as it is, in that he tries to spot easily identifiable ways in which his demographic’s results might not translate to a broader community. That doesn’t mean he always sees it accurately, but I've seen the attempts enough times to respect it.

However, from what I’ve seen, most researchers don’t have the instinct to try to find fault or limitations to their own research's relevance. Perhaps that’s just my experience of seeing so many first-time graduate researchers come through the IRB, though. I don’t have any idea who Aella is, so perhaps I’m missing something, but it seems to me that stating the possible limits of the applicability of your research should be pro forma, and I'd be very hesitant to play down the importance of that responsibility.

But I think I get where Scott is coming from. He's right that selection bias is a part of research that you can’t get rid of, and a lot of the established players seem to get away with it while "amateurs" get dismissed completely because of it. But I think it’s something that you should definitely keep in mind and try to account for in both your hypothesis and your results.

To steel-man Scott's point, I don't think he's arguing that selection bias can't be a real problem (even at his most defensive, he just says it "can be" "fine-ish"). I think he's just trying to argue against using “selection bias” and “small sample size” as conversation-enders. Doesn't mean he's discounting them as real factors. But some people use them in a similar way that I often see people shout “that’s a straw man” or “that’s a slippery slope fallacy” online—to not have to think any further about the idea behind sometimes flawed reasoning. Those dismissive people often aren’t wrong in terms of identifying a potential problem, but they ARE wrong to simply dismiss/ignore the point of view being expressed because the argument used was poor. If I ignored every good idea after having heard it expressed/defended poorly, I wouldn't believe in anything.

In the same way, there's no reason to completely dismiss the value of any online poll, as long as you (the reader of a poll) have the correct limits on the hypothesis and don't believe any overstating of results.

Expand full comment

Well said.

Expand full comment

"However, from what I’ve seen, most researchers don’t have the instinct to try to find fault or limitations to their own research's relevance."

Alternatively, when applying for grants or trying to start a new big thing, researchers are often very much encouraged by grant committees to oversell their projects, which does not encourage them to openly discuss the projects' limitations.

Expand full comment

Ding ding ding. You make everything a race, you end up with people focused on speed and not safety, even if you claim to be very very concerned about safety. Especially if there are few penalties for mess ups.

Expand full comment

There's nothing wrong with giving your blue-sky hopes when pitching the grant application. Why not? The program manager definitely wants to know what the best-case outcome might be, because research is *supposed* to be bread on the waters, taking a big risk for a potential big outcome.

I think the discussion here is what goes into your paper reporting on the work afterward -- a very different story, where precision and rigor and not running ahead of your data are (or ought to be) de rigueur.

Now, if you are pitching the *next* grant application in your *current* paper, that's on you, that's a weakness in your ethics. Yes, I'm aware there's pressure to do so. No, that isn't the slightest bit of excuse. Withstanding that pressure is part of the necessary qualifications for being entrusted with the public's money.

Expand full comment

This seems a really naïve interpretation of how the process actually works and what the incentives are.

Expand full comment

It's certainly not naive, since I've been involved in it, on both ends, for decades. You may reasonably complain that it's pretty darn strict, but that doesn't bother me at all.

Being a scientist is a sweet gig, a rare lucky privilege, something that any regular schmo trying to sell cars, or cut an acre of grass in 100F Houston heat, or unfuck a stamping machine on an assembly line that just quit would give his eyeteeth to be able to do -- sit in a nice air-conditioned office all day, speculate about Big Things, write papers, travel the world to argue with other smart people about Big Things.

If you're independently wealthy and can afford to theorize about the infinite, then do whatever you damn well please. But if you are doing this on the public's dime -- on money sweated out of the car salesman, the gardener, or the machine shop foreman -- then yeah there are some pretty strict standards. If you don't like it, check out the Help Wanted ads in your hometown and do something for which someone will pay you voluntarily.

Expand full comment

My point would be that people very rarely live up to those standards, in large part because the way science is done has the incentive structure all wrong. The naïveté I was talking about was your belief that the system was working very well.

Expand full comment

Hear hear.

Expand full comment

If Scott had a list of work Aella has done related to something like banana eating I'd take his argument much more seriously. Using such a specific example seems like a bad idea in general for the purpose of this post and it seems like an even *worse* idea to use the specific specific example he used.

I honestly wonder if there is some sort of weird trick question/fakeout going on with this whole post just because of that.

Expand full comment

There are three main points to consider that I think you're glossing over.

1. This only matters if aella is *claiming* to represent a larger population than 'people who read my blog and people who are similar to them.' Having accurate data that only tells you about one thing rather than everything is not the same as having inaccurate data. You would have to read each blog post to judge whether or not the results are being misrepresented each time.

2. This only matters if the selection criteria and the thing being measured are correlated. You say that aella mostly asks about things relevant to their blog which implies there will be a correlation, which I'm sure is true for some things they measure, but won't be true for everything. Psychology studies are also not as stupid about this as people imagine; it is normal to just use psych students when you are studying low-level psychophysics concepts that should not correlate with college attendance, and to get a broader sample when studying social phenomena that will correlate.

3. This primarily matters if you are doing a descriptive survey of population counts and nothing else, and matters a lot less if you are looking at correlations between factors or building more complex models. For example, let's say that you were asking about sex positivity and lifetime number of partners; sure, you might plausibly imagine that both of those things are higher among aella's audience, so you wouldn't represent the simple counts as representative of the general population. But if you were asking what the *relationship* between those two factors is, there will still be variation within that sample that lets you tell the relationship, and there's no particular reason to believe that *how those factors vary together* is different in aella's audience than in the general population.

Expand full comment

I get tons of responses from people who have no idea who I am! People seem to not understand I do research that's not just twitter polls.

Expand full comment

What studies are good?

Expand full comment

If smart people eat bananas because they know they are good for their something something potassium then we should be skeptical about the causal language in your putative study title. Perhaps something more like "Study finds Higher IQ People Eat More Bananas" would be more amenable to asterisking caveats and less utterly and completely false and misleading.

Expand full comment

I realized someone would say this two seconds after making the post, so I edited in "(obviously there are many other problems with this study, like establishing causation - let’s ignore those for now)"

Expand full comment

Sorry for being predictable ;)

I think I was primed to have this concern because when I initially read "selection bias" my brain went right to "selection into treatment" (causality issues) rather than "sample selection bias".

Expand full comment

Same thought. Super surprising he did not say "Study finds positive correlation between IQ and banana consumption". I thought he was going to be funny and put both disclaimers in the asterisk.

Expand full comment

I think the real difference here is that the studies are doing hypothesis testing, while the surveys are trying to get more granular information.

I mean you have a theory that bananas -> potassium -> some mechanism -> higher IQ, and you want to check if it is right, so you ask yourself how the world would look different if it is right versus if it is wrong. And you conclude that if it is correct, then in almost any population you should see a modest correlation between banana consumption and IQ, whereas the null hypothesis would be little to no correlation. So if you check basically any population for correlation and find it, it is evidence (at least in the Bayesian sense) in favor of your underlying theory.

On the other hand, if you were trying to pin down the strength of the effect (in terms of IQ points/ banana/ year or something), then measuring a correlation for just psych 101 students really might not generalize well to the human population as a whole. In fact, you'd probably want to do a controlled study rather than a correlational one.
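A small simulation of that distinction, with a made-up banana-IQ relationship: the sign of the correlation still shows up in a range-restricted "Psych 101"-style subgroup, but the measured size does not match the population value.

```python
# Existence vs. size: the correlation keeps its sign in a restricted subgroup,
# but range restriction attenuates its magnitude. All numbers are invented.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
iq = rng.normal(100, 15, n)
bananas = 0.02 * iq + rng.normal(0, 1, n)          # assumed weak true relationship

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
students = (iq > 105) & (iq < 130)                  # a range-restricted student-like pool

print("whole-population corr:", round(corr(bananas, iq), 2))                      # larger
print("student-only corr:    ", round(corr(bananas[students], iq[students]), 2))  # same sign, smaller
```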

Expand full comment

This is a much better explanation than Scott's post. Very helpful comment

Expand full comment

That would work very nicely if the actual steps indeed were:

1. formulate hypothesis

2. randomly pick a sample

3. make measurements

4. check if the hypothesis holds for that sample

But I suspect that often it's: 2 is done first, then 3, then 1, then 4.

Then, I think it doesn't work.

And there might be some approaches along this spectrum where the hypothesis "shape" is determined first, but it has some "holes" to be filled later. The holes could range in size from "huge" like "${kind of fruit}" to "small" like "${size of effect}".

Expand full comment

I mean if you formulate your hypothesis only after gathering your data and aren't correcting properly for multiple hypothesis testing or aren't using separate hypothesis-gathering-data and hypothesis-testing-data, then you are already doing something very very wrong.

But I think that there are probably lots of things you might want to test for where whether or not the effect exists is relatively stable group to group, but details like the size of the effect might vary substantially.

Expand full comment

That way lies the "green jellybean" effect.

If you have a bunch of data points, you can always find an equation to produce that collection within acceptable error bounds. Epicycles *do* accurately predict planetary orbits. And with enough creativity I'm sure they could handle relativity's modification of Mercury's orbit.
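A quick illustration of the green-jellybean effect with pure noise: run twenty independent tests of a nonexistent relationship at p < 0.05 and, on average, roughly one will come out "significant". (This assumes scipy is available; the data are simulated, not real.)

```python
# Twenty jellybean colours, no real effect anywhere, nominal alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
colours, n = 20, 100
significant = 0
for _ in range(colours):
    jellybeans = rng.normal(size=n)   # consumption of this colour, pure noise
    acne = rng.normal(size=n)         # acne score, unrelated by construction
    r, p = stats.pearsonr(jellybeans, acne)
    significant += p < 0.05
print("nominally significant colours out of 20:", significant)   # ~1 on average
```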

Expand full comment

Isn't the green jellybean effect like 30% of modern "research"?

Expand full comment

This doesn't sound quite right to me. I would say the main difference is not the granularity, but that "correlation studies" look for x -> y, where both x and y are measured within persons, while polls look at x across different persons. Thus in the "correlation studies" the interesting thing happens (in a way) within each participant of the study: whether something they have is related to something else they have (or don't). Thus, who the participants are is less relevant, as they kind of work as a "control" for themselves.

In case it sounds like nitpicking, this difference is relevant in that my point leads to a different conclusion on the effect sizes. I see no reason to think that estimating effect sizes from psych students (generalized to the whole population) is more wrong than estimating the existence of an effect. The latter is really just a dichotomous simplification of the former ("it is more/less than 0"), and if we draw a conclusion that there is an effect, say, >0, we might as well try to be more nuanced of the size of it.

Because if you say that the effect size does not generalize, why would the sign of it (+ or - or 0) then generalize? Of course, it is easier to be right when only saying yes or no, but there is no qualitative difference between trying to generalize an effect existing and trying to generalize the size of that effect. The uncertainty of the effect being -.05 or .05 is not really different from the uncertainty of the effect being .05 or .1.

Expand full comment

I don't think that this holds up. You don't really have correlations within a single person unless you measure changes over time or something. Remember correlation is defined as:

([Average of X*Y] - [Average of X]*[Average of Y])/sqrt{([Average of X^2]-[Average of X]^2)([Average of Y^2]-[Average of Y]^2)}.

It's a big mess of averages and cannot be defined for an individual person. I suppose it is robust to certain kinds of differences between groups that you consider, but it is not robust to others.

My point is that if you are looking at a relatively big effect and want to know whether it is there or not, most groups that you look at will tell you it's there. However, if you want to know more accurately how big it is, sampling just from one group is almost certainly going to give you a biased answer.

Expand full comment

Yes, the correlation is measured across the sample, but my point is this: the correlation approach is like testing whether BMI and running speed are correlated, and the poll approach is like testing whether running speed is, say, on average more than 8 km/h. The former usually works in a non-representative sample because it has two measurements for each participant and tests the relationship between them. The latter does not, as it tries to test some quality of the people as a whole.

I still claim that it has nothing to do with the granularity; you can answer both of those questions at different levels of exactness, but only one of them can give meaningful results on an unrepresentative sample.

Expand full comment

I think we're only seeing a difference in this example because the correlation between running speed and BMI is not close to 0 while the average running speed *is* close to 8km/h. If you were trying to test whether average running speed was more than 2km/h, you'd probably get pretty consistent answers independent of which group you measured.

Actually, measuring correlation over just a single group might even be less reliable because of Simpson's paradox. The correlation could be positive within every group and yet be negative overall.
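A compact sketch of the Simpson's-paradox concern raised above, with invented groups and numbers: the correlation is positive within each group but negative once the groups are pooled.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_group(x_center, y_center, n=10_000):
    x = rng.normal(x_center, 1, n)
    y = y_center + 0.5 * (x - x_center) + rng.normal(0, 1, n)   # positive slope within the group
    return x, y

x1, y1 = make_group(x_center=0, y_center=5)    # e.g. one subpopulation
x2, y2 = make_group(x_center=5, y_center=0)    # e.g. another subpopulation

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("group 1 corr:", round(corr(x1, y1), 2))                      # positive
print("group 2 corr:", round(corr(x2, y2), 2))                      # positive
print("pooled corr: ", round(corr(np.concatenate([x1, x2]),
                                  np.concatenate([y1, y2])), 2))    # negative
```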

Expand full comment

I agree that most people rush to "selection bias" too quickly as a trump card that invalidates any findings (up there with "correlation doesn't mean causation"). However, I disagree that "polls vs correlations" is the right lens to look through it (after all, polls are mostly only discovering correlations as well).

The problem is not the nature of the hypotheses or even the rigor of the research so much as whether the method by which the units were selected was itself correlated with the outcome of interest (i.e., selecting on the dependent variable). In those cases, correlations will often be illusory at best, or in the wrong direction at worst.

Expand full comment

I agree that "polls vs correlations" isn't right, but partly because I think "polls" are far more heterogeneous than Scott suggests. If you want to find out who's going to win an election, then you really care about which side of 50% the numbers are on. But if you're polling about something like "how many people support marijuana legalization?" then your question is more like "is it 20% or 50% or 80%?" and for this, something that has a good chance of being off by 10 is fine (as long as you understand that's what's going on).

Expand full comment

But off-by-10% is not that bad a scenario; real life can be worse. You can be off by much, much more. For example, by polling the much-mentioned psych students, I know from results I've seen that you can get >40% support for a political party that is at 10% for the overall population. So the main difference is not the hypothesis; the main difference is whether you are looking at population-level descriptive statistics or a link between two variables within people ("liberals like cheesecake more than conservatives").

Expand full comment

What do you all think about the dominance of Amazon’s Mechanical Turk in finding people for studies? Has it worsened studies by only drawing from the same pool over and over?

Expand full comment

Seems probably better than just using college students tbh - you get the same pool of students volunteering over and over for studies as well, and they are more demographic-restricted.

(Of course there is natural turnover in that most college students are only college students for 4-5 years - I wonder how that compares to the turnover in MTurk workers)

Expand full comment

There needs to be a rule that you can only volunteer for one paid psychology experiment in your lifetime.

I did a bunch of these when I was in school and you quickly realize that the researchers are almost always trying to trick you about something. It becomes a game to figure out what they're lying about and what hypothesis they're testing, and in most cases that self-awareness will ruin the experiment.

Expand full comment

That is not true of most psychology experiments. For a meaningful portion of them (say 20% or something), sure, but definitely not most. But sure, this self-awareness can be a big problem (although I have participated in multiple studies which included cheating, and I don't think it affected me in any of them; maybe I'm just a gullible person).

Expand full comment

"Selection bias is fine-ish if..."

I'm interpreting this as saying that one's prior on a correlation not holding for the general population should be fairly low. But it seems like a correlation being interesting enough to hear about should be a lot of evidence in favour of the correlation not holding, because if the correlation holds, it's more likely (idk by how much, but I think by enough) to be widely known -> a lot less interesting, so you don't hear about it.

As an example, I run a survey on my blog, Ex-Translocated, with a thousand readers, a significant portion of whom come from the rationality community. I have 9 innocuous correlations I'm measuring which give me exactly the information that common sense would expect, and one correlation between "how much time have you spent consuming self-help resources?" and "how much have self-help resources helped you at task X?" which is way higher than what common sense would naively expect. The rest of my correlations are boring and nobody hears about them except for my 1,000 readers, but my last correlation goes viral on pseudoscience Twitter, which assumes this generalises to all self-help when it doesn't and uses it to justify actually unhelpful self-help. (If you feel the desire to nitpick this example you can probably generate another.)

I agree that this doesn't mean one ought to dismiss every such correlation out of hand, but I feel like this does mean that if I hear about an interesting survey result's or psych study's correlation in a context where I didn't also previously hear about the survey/study's intention to investigate said correlation (this doesn't just require preregistration because of memetic selection effects), I should ignore it unless I know enough to speculate as to the actual causal mechanisms behind that correlation.

This pretty much just bottoms out in "either trust domain experts or investigate every result of a survey/every study in the literature" which seems about right to me. So when someone e.g. criticises Aella for trying to run a survey at all to figure things out, that's silly, but it's also true that if one of Aella's tweets talking about an interesting result goes viral, they should ignore it, and this does seem like the actual response of most people to crazy-sounding effects; if anything, people seem to take psych studies too seriously rather than not taking random internet survey results seriously enough.

Expand full comment

It's a good point. But it seems this applies equally to psych studies, so it doesn't weaken Scott's point that we shouldn't single out internet surveys as invalid.

Expand full comment

Like any kind of bias, selection bias matters when the selection process is correlated with BOTH the independent and dependent variables and as such represents a potential confounder. Study design is how you stop selection bias from making your study meaningless.
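A minimal sketch in that spirit (all numbers invented): selection that depends only on the independent variable leaves the regression slope roughly intact, while selection that depends on the dependent variable biases it toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300_000
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 2, n)            # true slope = 2

def slope(mask):
    return np.polyfit(x[mask], y[mask], 1)[0]

print("full sample:      ", round(slope(np.ones(n, dtype=bool)), 2))   # ~2.0
print("selected on x > 0:", round(slope(x > 0), 2))                    # still ~2.0
print("selected on y > 0:", round(slope(y > 0), 2))                    # noticeably below 2
```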

Expand full comment

The way I think about the key difference here (which I learned during some time doing pharma research, where these kinds of issues are as bad as... well) is that when claiming that a correlation doesn't generalize, some of the *burden of proof* shifts to the person criticizing the result. Decent article reviewers were pretty good at this: giving an at least plausible-sounding mechanism by which, when going to a different population, there's some *additional* effect to cancel/reverse the correlation. It's the fact that the failure of the correlation requires this extra mechanism that goes against Occam's Razor.

Expand full comment

It's not about correlations, it's about the supposed causal mechanism. Your Psych 101 sample is fine if you are dealing with cognitive factors that you suppose are universal. If you're dealing with social or motivational ones, then you're perhaps going to be in danger of making a false generalization. This is particularly disastrous in educational contexts because of the wide variety of places and populations involved in school learning. It really does happen all the time, and the only solution is for researchers to really know the gamut of contexts (so that they realize how universal their mechanisms are likely to be) and make the context explicit and clear instead of burying it in limitations (so that others have a chance of catching them on an over-generalization, if there is one). Another necessary shift is for people to simply stop looking for universal effects in social sciences and instead expect heterogeneity.

Expand full comment

“But real studies by professional scientists don’t have selection bias, because . . . sorry, I don’t know how their model would end this sentence.”

...because they control for demographics, is how they’d complete the sentence.

Generically, we know internet surveys are terrible for voting behavior. Whether they’re good for the kinds of things Aella uses them for is a good question!

I’m on the record in talks as saying “everything is a demand effect, and that’s OK.” I see surveys as eliciting not what a person thinks or feels, but what they are willing to say they think and feel in a context constructed by the survey. Aella is probably getting better answers about sexual desire (that’s her job, after all!) and better answers on basic cognition. Probably worse on consumer behavior, politics, and generic interpersonal.

Expand full comment

Here is a randomly selected study from a top psychiatry journal, can you explain in what sense they are "controlling for demographics"? They have some discussion of age but don't even mention race, social class, etc.

https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2020.19080886

Expand full comment

That’s an RCT, not a survey, and it’s probably more useful for them to run with the selection effects that get people into the office, rather than attempt to determine what would work for an unbiased sample of the population at large.

Expand full comment

I'm not sure why the RCT vs. survey distinction matters for this purpose. Randomization only guarantees that people don't have extra confounders aside from the ones that brought them into the study; it doesn't address selection bias in getting into the study itself.

If I did an RCT of ACX readers, where I artificially manipulated the mental health of one group (by giving them addictive drugs, say), and then tested which group had more spiritual experiences, they would still be ACX readers, different from the population in all the usual ways.

Nor does it have to do with "getting them into the office". You find the exact same thing in nonclinical psychology studies - for example, can you find any attempt to control for demographics or extend out of sample in https://asset-pdf.scinapse.io/prod/2001019597/2001019597.pdf (randomly selected study that appeared when I Googled "implicit association test", a randomly selected psych construct that came to mind).

Maybe it would be more helpful if you posted a standard, well-known psychology paper that *did* do the extension out of sample as a routine part of testing a psychological construct. I think I've never seen this and would be interested to know what you're thinking of.

Expand full comment

I’m kind of going study by study as you post them. The RCT you suggest would be great if you were proposing a treatment for ACX readers, but I think you’d agree that results would be much weaker if you were proposing to extend to a large inner city hospital. Correlations in one population can reverse in another, famously from collider bias!

An example from Aella: I would not be surprised if Aella finds a negative correlation between two kinks that are positively correlated in the general population. This would be driven by a non-linearity: mildly kinky people are into lots of things, so kinks correlate positively in the general population, while very kinky people have an obsession that excludes others (so the correlation is weak or negative in a survey that selects people by their interest in a sex worker's feed).
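In toy form (my own invented setup, not Aella's data): give two kinks a shared latent factor so they correlate positively in the population, then select hard on overall kinkiness, and the within-sample correlation flips.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

latent = rng.normal(size=n)           # general "kinkiness"
kink_a = latent + rng.normal(size=n)
kink_b = latent + rng.normal(size=n)

# Followers of the feed are selected on roughly the sum of the two.
followers = (kink_a + kink_b) > 3

print(np.corrcoef(kink_a, kink_b)[0, 1])                        # ~ +0.5 in the population
print(np.corrcoef(kink_a[followers], kink_b[followers])[0, 1])  # negative among followers
```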

In an academic psych talk, a query about selection bias might receive the response “we corrected for demographics”, or it might get the response “what confound do you have in mind?” In political science I think the demographic question is more salient, because they have a (rather imaginary) “polis” in mind.

IMO the most interesting questions involve an interaction between group and individual, so the most interesting work asks “what happens for people in group X” and talks about both the cognitive mechanisms and how they interact with the logic of the group.

Expand full comment

Sorry, I do want to stick with the original thing we were discussing, which is whether most real scientists control for demographics. As far as I can tell, this still seems false. Do you still believe it is true? If so, can you give me examples? If we now agree this is false I'm happy to move on to these other unrelated points.

Expand full comment

I'm a psychology PhD and in my experience at least ~1/4 of the time reviewers will ask you to control for basic demographics like gender/race/age. (And lots of papers will throw this in there as a supplementary analysis.) If this seems incongruent with other people's experience, I can pull up a few random papers & see if I'm right...

And I agree with Simon that "did you control for gender/race/etc?" or "does this differ across genders/etc?" are common questions in talks.

Expand full comment

No apology necessary! You’re asking a science-of-science question which is best settled not by our anecdotes, but a quantitative survey of the literature. My suggestion, ironically enough, is to control, if you do such a survey, for selection bias.

Expand full comment

If you did the RCT of ACX readers and you found that giving addictive drugs led to more spiritual experiences, you could be pretty confident that it was true that the type of people who read ACX are more likely to have mystical experiences if they take addictive drugs. You'd then have to wonder whether this is only true for the type of people who read ACX, but you'd at least have solid ground from which to generalize.

Without the RCT, you might find a relationship between drug use and spiritual experiences because drug use and spiritual experiences influence readership of your blog. In that case, your correlation might not even hold for "the type of people who read ACX"--just for actual ACX readers.

You're a psychiatrist who writes about drugs, and you're also an irreligious rationalist. Let's suppose that your irreligious rationalism makes you less appealing to people who are more likely to be spiritual, but your drug writing makes you more appealing to people who are more into drugs. Together, that might mean that spiritual-experience people who don't like drugs tend to stop reading your blog, but spiritual-experience people who like drugs stick around, while non-spiritual people stick around regardless of whether they like drugs.

If so, you'd end up finding a positive correlation between drugs and spiritual experiences when looking at blog readers, even if that correlation doesn't exist for people with low interest in spirituality (as a group), or for people with a high interest in drugs (as a group), or people with a high IQ, or any other way of characterizing your readership. The correlation would be entirely an artifact of the selection process into becoming a blog reader.
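A rough simulation of that mechanism, with invented numbers: spirituality and drug interest start out independent, the only selection rule is that spiritual non-drug people mostly leave, and a positive correlation appears among the readers who remain.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

spiritual = rng.random(n) < 0.5    # independent traits in the population
likes_drugs = rng.random(n) < 0.5

# Spiritual people who dislike drugs mostly stop reading; everyone else mostly stays.
p_stays = np.where(spiritual & ~likes_drugs, 0.1, 0.9)
reader = rng.random(n) < p_stays

print(np.corrcoef(spiritual, likes_drugs)[0, 1])                  # ~0 in the population
print(np.corrcoef(spiritual[reader], likes_drugs[reader])[0, 1])  # ~ +0.4 among readers
```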

Expand full comment

I think a big part of the argument is "what is a spiritual/mystic experience?"

Tweeter didn't define it in that tweet, and that does make a difference.

I have had profound feelings of awe and gratitude at the beauty of the universe, but I would not define those as "mystic". Now, if Tweeter does say "but that is a mystic experience!", then we can begin to arrive at some kind of definition: a 'healthy' mind will have feelings of 'more than the usual grind or the rat-race'.

So everyone who has had the oceanic feeling can agree that they have had it, whether or not you define that as spiritual/mystic, and *then* we can ask "so how is your mental health?" and correlate one with the other.

If "excellent mental health" and "regularly experience of profundity" go together significantly, then Tweeter has made their case.

As it is, it's just more "Eat Pray Love" tourism showing-off about being *so* much finer material than the common clay normies.

The Hopkins poem "God's Grandeur" speaks to me, but that is because (1) we're co-religionists so I get where he's coming from and (2) I too have had experiences of the beauty of the world, despite all the evil and pain and suffering, but I would not necessarily call those mystic or spiritual:

https://www.poetryfoundation.org/poems/44395/gods-grandeur

Expand full comment

Nope, that matters enormously.

RCT vs. survey does make a critical difference as soon as you are secretly thinking about some causal nexus the correlation might somehow suggest, even if not quite imply, which is to say basically always.

With the extra assumption that the causal mechanisms work the same across the entire population, the RCT on the unrepresentative sample is actually good evidence for some kind of causal connection in the underlying population. And then of course you get to speculate on the direction of causality, common causes, multicausality, and so on.
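A minimal sketch of that point (invented setup): randomize the treatment inside a heavily selected sample and the within-sample effect estimate is still unbiased, even though the sample differs systematically from everyone else; carrying it beyond the sample is where the homogeneity assumption does the work.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

trait = rng.normal(size=n)      # whatever gets someone into the sample
in_sample = trait > 1.0         # heavily selected, unrepresentative sample
m = in_sample.sum()

treated = rng.random(m) < 0.5   # randomized *within* the sample
true_effect = 0.3
outcome = (true_effect * treated
           + 0.5 * trait[in_sample]    # the selection trait also shifts the outcome
           + rng.normal(size=m))

print(outcome[treated].mean() - outcome[~treated].mean())  # close to 0.3 despite the selection
```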