DARE has a positive effect size for keeping kids off drugs? I realize this isn't the point of the article, but I thought I remembered it as being negative (and significant).

These standardized effect sizes are pretty useless things to report.

The context for an effect size is the subject matter. Normalizing in the way you've done smooths over the variances of the populations as well as the size of the effect. Those things matter a TON when thinking about whether an effect is worth caring about!

Measures like Cohen's d can be useful for making comparisons within a specific context, but as you cross contextual boundaries (which is the stated purpose of your exercise here), they become actively detrimental by acting as a substitute for thinking about the actual subject matter.
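To make this concrete, here is a minimal stdlib sketch (with made-up numbers) of what Cohen's d actually computes, and how it erases the raw scale of the outcome: two hypothetical datasets whose raw stakes differ by a factor of 100 produce the identical d.

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Two hypothetical outcomes: one measured in single points, one in hundreds.
# The raw difference differs by a factor of 100, but d is identical (~0.63).
small_scale = cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
large_scale = cohens_d([200, 300, 400, 500, 600], [100, 200, 300, 400, 500])
```

Whether a given raw difference matters is exactly the subject-matter question that the standardization hides.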

After 20 years of disuse, I have lost all the statistical knowledge and ability I once had in college. However, this post reminds me of a Twitter post of Taleb's from last year stating, "Perhaps the most misunderstood idea in the 'empirical' sciences is that a correlation of 90% is closer to 0 than 1." https://twitter.com/nntaleb/status/1553688258738995200?s=20. I did not really understand the post then, and I can't say I really understand it now, but I have a vague intuition that this blog post is related to the concept Taleb was referencing?

The (one?) way to put a statistic in context is to show it being used in a cost-benefit analysis, assuming the effect can be shown to be causal. A "small" effect size might be important in deciding whether to deploy a vaccine. A large effect size might not be important in choosing your pet's food.

Out of curiosity: I suspect that northwestern European countries are a key reason why the correlation between temperature and latitude isn't higher. The British Isles, the Netherlands, Denmark, etc. are quite high latitude, but not particularly cold. Almost all of Europe is noticeably warmer than the equivalent latitudes in Asia and North America, and Europe packs a lot of data points (a lot of small-to-medium countries) into a small space.

>the correlation between which college majors have higher IQ, and which college majors have skewed gender balance

I looked at your link on this, and they admit, in a note at the bottom, that their source didn't actually look at IQ scores, but rather "pre-2011 GRE scores" (I'm not sure of the significance of the year). Their source then did some kind of statistical translation to get approximate IQ scores from that. Anyway, this means that r value doesn't actually relate IQ to gender skew of majors.

I think my favourite new piece of information that I'll take away from this is that brain size correlates with IQ about as much as parental social class correlates with children's grades.

The latter is considered common wisdom, the former is considered phrenological claptrap, but they're both equally real.

Why not just write the correlation values directly? When writing about something like GDP growth, people usually feel no need to write in similes... Or maybe this is what you're saying?

The real issue with effect sizes is that they depend so much on error in measurement.

Switching a study to a more reliable (test retest) measurement would substantially increase the effect size, without changing the real size of the effect at all.

So, this makes it even easier to cheat. Just find a field where the effects are expected to be small, but where the measurement happens to be good.
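The reliability effect described above has a classical closed form (Spearman's attenuation): measurement noise inflates the observed standard deviation, so a standardized effect shrinks by the square root of the instrument's reliability. A sketch with illustrative numbers:

```python
def attenuated_d(true_d, reliability):
    """Observed Cohen's d when measurement noise inflates the SD.

    reliability = true-score variance / (true-score + error) variance,
    i.e. a test-retest reliability in [0, 1].
    """
    return true_d * reliability ** 0.5

# The same true effect of d = 0.5 looks very different on two instruments:
sloppy = attenuated_d(0.5, 0.5)    # ~0.35 on an unreliable measure
careful = attenuated_d(0.5, 0.9)   # ~0.47 on a reliable one
```

So a field that happens to have reliable instruments reports "bigger" standardized effects for the same underlying reality, which is the cheating opportunity in question.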

The specific example of human height differences is a psychologically weird one, because we keep our eyes close to the tops of our bodies and assign high salience to looking slightly up or down. A 5'6" person barely comes up to the chin of a 6'2" person, making them feel *dramatically* shorter, rather than about 90% as tall like a 66-inch-long table vs. a 74-incher.

People are constantly meeting men and women and noticing which is taller, so I would use male vs. female height if I wanted to draw attention to pairwise comparisons (i.e. the probability that a random person from the SSRI group is feeling better than a random placebo taker); this seems to have been the intention and can be a good way to look at what Cohen's d is measuring. But it's not necessarily clarifying to talk about the absolute difference ("just a couple of inches") when these particular inches are perceived so differently from inches of table or even inches of inseam.
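The pairwise reading has a closed form under the usual assumptions (both groups normal with equal SDs): P(random A beats random B) = Φ(d/√2), sometimes called the common-language effect size. A stdlib sketch:

```python
from statistics import NormalDist

def prob_superiority(d):
    """P(a random member of group A outscores a random member of group B),
    assuming both groups are normal with equal SDs and Cohen's d = d."""
    return NormalDist().cdf(d / 2 ** 0.5)

# d = 2.0 (roughly adult male vs. female height): ~92% of random
# man-woman pairs have the man taller.
# d = 0.3 (a typical drug effect): only ~58% of random treatment-placebo
# pairs favor the treated person.
```

The gap between 92% and 58% is one way to feel how different "a couple of standard deviations" is from "a fraction of one."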

> I sort of agree, but I can see some use for these. We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret.

So _how_ does the layman interpret this? If he takes the effect size as “calibrated relative to the variation I experience in my own moods”, he’s just wrong. If he understands enough to take it as “calibrated relative to population variation on some measurement scale”, AND he also has some idea what this population variance is and what the measurement scale means, he’s not a layman (neither as a statistician nor as a psychiatrist).

It seems to me that presenting effect size to a layman is not a problem that always has an easy answer. And often the best answer you can give (if any) will be domain dependent and may even (horror of horrors) require you to educate your layman a bit first. It's not going to come from pressing a generic button applied to a table of numbers. Even if the output has a label saying 'effect size' suggesting it answers your problem. Even if you can't think of anything else. Maybe that's just the way of things.

Since it's become my thing, I suppose I'll just put it out there: IQ is a test score, not a thing that can "determine" other things. It's an effect, not a cause.

I know it's conventional to discuss correlations and R² in terms of "explaining x% of y effect," but I hate that convention, because tautologies aren't explanations, and statistics by themselves don't tell you about endogeneity. You have to use those hated CONCEPTS instead of nice clean numbers. Ugh! What am I, a filthy English major?

As a sympathetic social scientist and media theorist, I feel that your intellectual honesty and the community's fundamental optimism are approaching the absurd. My conclusion is the same as your statistician friends', from the opposite direction.

The intellectual move you're making here is indeed *the kind of thing* that must be done in order to have the kind of discussions -- here defined as exchanges of typed text and hyperlinks in general pseudonymous web forums -- that we want to have. The issue is illustrated by dwelling on the phrase "put statistics in context."

"Context" here means the typed text (or perhaps spoken language) that is the medium of communication. This medium is linear, logical, progressive -- and extremely old.

"Statistics" are a *radically* new media technology. 99% of all statistics were calculated in the past 10 years (speculatively). The methodology used to produce these statistics is both highly varied and rapidly changing. I wanted this sentence to be about what statistics "mean," but I can't think of a sentence that could accomplish this: it's hard to even define "statistics"!

This post asks the question: "Given that we are committed to combining the media technology of typed paragraphs (with hyperlinks) with statistics, how can we best do this?" Again: noble! Honest! But my conclusion is that the conclusion reached in this post is an indictment of the premise, not of the logic.

To rephrase the problem information-theoretically: How much information is necessary to put a statistic in context? This formulation emphasizes that "in context" implies a binary -- and if we're going to impose a binary filter at the end of our knowledge communication, why bother with these continuous statistics?

The full inversion of the logic of the post, then, is the question: how much information can possibly be conveyed in a phrase like "about the same as the degree to which the number of spots on a dog affects its friskiness”? And it seems like the answer is: not enough!

But men and women differ by much more than a couple of inches (= 5 cm). According to Swedish statistics, the difference is 14 cm (180 cm vs 166 cm). If the difference were only 5 cm, we'd see a lot more overlap.
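The overlap intuition can be made exact: for two equal-SD normal curves whose means differ by d standard deviations, the shared area under the two densities is 2Φ(−d/2). Taking SD ≈ 7 cm for adult height as an assumption, the Swedish 14 cm gap is d ≈ 2, while a 5 cm gap would be d ≈ 0.7:

```python
from statistics import NormalDist

def overlap_coefficient(d):
    """Shared area under two equal-SD normal densities separated by d SDs."""
    return 2 * NormalDist().cdf(-abs(d) / 2)

# d = 2.0 (14 cm gap / ~7 cm SD): ~32% overlap.
# d = 0.7 (a 5 cm gap at the same SD): ~73% overlap -- "a lot more", as stated.
```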

IMO, the problem comes from trying to reduce everything to a single number (or small set of numbers). A picture is worth a thousand words. Show a graph of the distribution of grades under teaching method A and under teaching method B (on the same graph). Eyeballing that picture will immediately tell you how much better one method is than another (if at all).

Galton introduced the correlation coefficient as recently as 1888, a couple of centuries after Newton's big book. I've often been struck by how long it took humanity to get interested in statistics compared to the harder subject of physics.

It's not as if people didn't have any data to work with in the past: the Bible mentions at least three censuses and even has a "Book of Numbers." But the urge to nerd out with data seems to be fairly recent: e.g., William Playfair invented most of the main types of statistical graphs in the late 18th Century.

An interesting question is whether humans just innately aren't that interested in or adept at statistical thinking. Or is this more a socially constructed problem which could improve over time?

Even though I’m generally aware that human attributes come normally distributed, I don’t have a general intuition for the standard deviation of each. For example, how much smarter is a 1 in 100 person than the average person, compared to how much more attractive a 1 in 100 person is compared to the average. Any thoughts on this?
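One way to build that intuition with the stdlib: convert a rarity into standard deviations, then multiply by whatever SD the trait uses. (The trait SDs in the comment below are illustrative assumptions, not measured values.)

```python
from statistics import NormalDist

def rarity_to_sds(one_in_n):
    """How many SDs above the mean the top 1-in-n cutoff sits."""
    return NormalDist().inv_cdf(1 - 1 / one_in_n)

# 1-in-100 is about +2.33 SD.  On an IQ scale (mean 100, SD 15) that's
# roughly 135; for adult male height (assuming mean ~175 cm, SD ~7 cm),
# roughly 191 cm.  Same rarity, very different-feeling gaps.
```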

Similar to Mantic Monday, it would useful to get a Fact Friday or something like that with a list similar to the one at the end of this post. It would help develop everyone’s intuition.

Median effect size in psychology papers is r = 0.36, but r = 0.16 if the study is pre-registered. Either we need to change public understanding of what a good correlation is, or we need to admit that academics are too interested in things that don't matter.

Another potentially misleading thing is what you control for. Like if you say the "correlation between life outcomes and education is low" that sounds very significant until you add "controlling for income", as income mediates most of the effect you're interested in. (That's a made up example so may not reflect the real stats).

Idea: for each quantity of interest, familiarize some people with the nature of the quantity, and have them all come up with a proposal for what the smallest difference is that they would consider meaningful. Call this a "Small Meaningful Difference".

Then instead of reporting effect sizes in terms of variance-standardized numbers, report them in terms of SMDs.

Scott, correlation does not equal identity. Just because students who are good at reading are also good at math, it doesn't follow that there is a "general" intelligence. There could be multiple intelligences that correlate.

I have a hard time understanding how you can be so bought-into the g factor when you yourself are 1-in-a-million talented at writing while worse than most of your peers at math. How does a "general" intelligence explain this?

Just let go of general intelligence and simply say that different talents correlate very strongly (without being literally the same thing as each other).

"And the second effect might sound immense, but it is only r = 0.64." This result appears to be limited to the USA - where New Orleans is a 29 degrees north and Duluth is 46 degrees north. So there is a real range restriction issue.

I smell a rat in your countries near the equator v country hotness correlation. Not nearly high enough.

Chasing the reference leads to:

9. Nearness to the equator and daily temperature in the U.S.A. (National Oceanic and Atmospheric Administration, 1999; data reflect the average of the daily correlations for latitude with maximum temperature and latitude with minimum temperature across 187 U.S. recording stations for the time period from January 1, 1970, to December 31, 1996).

Which is not quite the same thing! That's stations on a continental landmass, and it doesn't include the extremes of 'on the equator' or 'near the poles'.

Any figures for variance need to consider how many other things are allowed to vary. In a world where everyone was raised in a 100% identical environment, IQ would explain 100% of the variance; in a world where everyone had identical IQ, IQ would explain none of the variance.

Explaining variance almost never explains what you really want to know.

Expert tutoring was about 0.8 d, but it looks like AI tutoring is about the same, and this study is from 2011. So GPT-4 tutoring will surely be better, and we basically don't need teachers much anymore for most stuff. Humans need not apply.

"We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret."

Is there really no useful equivalent of mean/median/mode? No way we can just indicate "What we call depression isn't fundamentally just one thing, and people differ in ways we can't currently predict in advance, even though symptoms cluster. So, responses are highly variable, and most people won't respond, but if this helps you at all, it will help about this much"? Do our institutions not have the ability in principle to report and interpret an ordered triplet of numbers?

By the time I was in 5th grade my math tests asked questions like, "Which measure of central tendency best describes this data?" And then the data would be a list like "1,1,1,1,1,2,50" or "1,2,5,6,7,7,8,8,12." I realize that real-world data and its use are far more complicated, and there's value in having a common metric across many data sets, but still, this is the best we've got?

>The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Am I the only one that experiences this as a wrenching, mind-jarring typo?

I don't even know if it's grammatically incorrect technically, but it feels unpleasant to read.

The infinite boundary between mathematics and psychology.

On the correlation of 0.59 between IQ and grades: this actually says little about the effect of IQ on grades, since grades are highly subject to the bias inherent in the method of assessment. If the assessment method is heavily recall-based, then the correlation with IQ would necessarily be moderate. But if the assessment method is heavily logic-based or critical-thinking-oriented, then the correlation would be very high. Hence, I would expect the IQ/grade correlation to be very high in STEM subjects, moderate in the arts, and low in the social sciences.

I'm not sure it matters, but nothing is sourced, so it's all anecdotal. It's like astrology from Parade magazine, or Harpers' fractured anecdotes. AI could have written it.

> The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Scott, this has been commented on, but I'm curious what you even mean here. Reading scores and math scores are both continuous variables that have a total order. It is obvious what correlation is in this case, just Pearson's coefficient.

But what is a correlation between an unordered nominal variable like "college major" and continuous variables like IQ and gender balance? There is a good discussion of this topic here: https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable#124618. There are possible approaches, but it's not clear how you would meaningfully compare these in terms of predictive strength to something more straightforward like Pearson's coefficient.
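One standard option from that discussion is the correlation ratio η: the square root of the share of outcome variance that lies between the groups. A minimal sketch (with toy data; as noted above, η isn't directly comparable to Pearson's r as a measure of predictive strength):

```python
def correlation_ratio(groups):
    """eta for a nominal grouping of a continuous outcome:
    sqrt(between-group sum of squares / total sum of squares)."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return (ss_between / ss_total) ** 0.5

# Toy "major vs. score" data: identical group means give eta = 0;
# fully separated groups give eta = 1.
```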

One general strategy in comparing models in machine learning is to always have a baseline (that you're trying to beat on some performance benchmark) and, depending on the problem, an "oracle", which represents a meaningful upper bound (typically human performance, or a model that cheats by having access to additional information that would not realistically be available to a model used in the real world). The key is to put predictive performance into meaningful context by comparing with end points in the same problem space, rather than cherry-picking performance on other data sets to compare with (which clearly can be useful in some contexts, but is unhelpful if the two distributions are completely unrelated, e.g., SSRI effect sizes vs. male-female height differences).

In the case of the depression example, laypeople won't typically understand "2 points on the HAM-D," but will understand "15% of the difference between the means of the healthy/typical and depressed populations." Generally, making comparisons within the same problem space in this way is also constructive: it naturally points at "here's where we are vs. where we were vs. where we want to go" in a way that comparing totally unrelated distributions won't.

I think scatterplots are a good way to get a feel for what a correlation is really telling you. Here's a little graphic showing a bunch of correlations and scatterplots of the data that produced them: https://imgur.com/JuDikNA

Should these be reported without a confidence interval or standard deviation? For a lot of the more controversial things, what people usually doubt isn't whether the effect is positive but whether the effect is reliable.

Finally, a chance to share a funny anecdote that involves myself!

So I took the SAT twice, because I applied to college early in my junior year of high school but failed to get in, and I had to retake the SAT the next year for the "real" college applications.

I used that one sketchy online table for converting one's SAT score(s?) to IQ score.

By taking the SAT twice, I """improved my IQ""" by about 10 points in 1 year.

I would have thought that comparing men and women's height would be the most familiar and thus best-all-around comparison, but on further thought I see three problems:

- Up until the end of high school, the sex difference in height isn't stable, so it can be confusing for, say, 10th graders taking stats.

- Among adults, women often wear heels in public, so many people might underestimate the gap.

- Lots of people these days will consider mentioning the difference to be sexist.

Are there any other go-to comparisons that don't have problems like these?

"Children tutored individually learn more than in a classroom" I thought the consensus was that while Bloom did find d = 2.0 in his original studies, later studies with larger sample size and better methodology got closer to d = 0.5 - 0.8 for tutoring, which is still amazing for an educational intervention but not quite as extreme?

A confounding variable here is that Bloom was not just measuring "human tutoring" but "human tutoring with the Mastery Learning method". Hattie, which you also linked, has "Mastery Learning" at d = 0.57 even without the one-on-one tutoring part.

A good starting point for this is https://nintil.com/bloom-sigma/ which also links to VanLehn (Ed Psy 2011) https://www.public.asu.edu/%7Ekvanlehn/Stringent/PDF/EffectivenessOfTutoring_Vanlehn.pdf which discusses the two sigma effect on p. 210. Key quote: "At any rate, the 1.95 effect sizes of both the Anania study and first Evens and Michael study [on which Bloom's analysis is based] were much higher than any other study of human tutoring versus no tutoring. The next highest effect size was 0.82. In short, it seems that human tutoring is not usually 2 sigmas more effective than classroom instruction, as the six studies presented by Bloom (1984) invited us to believe. Instead, it is closer to the mean effect size found here, 0.79. This is still a large effect size, of course."

And " ... Bloom’s 2 sigma article now appears to be a demonstration of the power of mastery learning rather than human tutoring ..."

Another educational point to beware of from the Hattie analysis linked in the post: "Whole-Language Instruction" has a positive d = 0.06. As "Sold a Story" explains, the Whole-Language people came up with a method, did studies, found statistical significance (I will be charitable and assume no p-hacking) and a positive effect size, and then went around promoting this as the Scientific Way (TM) to help disadvantaged kids to read.

The problem is that Phonics, which the Whole-Language people explicitly were out to replace, has d = 0.70.

Standardised effect sizes may be hard to derive meaning from in general, but "Phonics >>> Whole Language" for kids who don't pick up reading "naturally" seems to be established beyond reasonable doubt, and effect sizes are part of the evidence for that.

I get sent Bloom's two sigma paper approximately once a month. Unsurprisingly, it fails to replicate, although the literature is a mess. https://nintil.com/bloom-sigma/

Can you add some effect sizes for things we are pretty sure deterministically almost always work, to give an idea for how big the scale is supposed to go? Like, effect size for getting shot in the head to be lethal, or for a fresh-out-of-the-factory battery to cause an electric current in a circuit.

A little late to the party, but this 2011 metareview finds that the effect size of individual tutoring over classroom learning is closer to 0.8 than 2.0, and that Bloom's famous 2-sigma studies were outliers likely caused by factors other than the effect of one-on-one tutoring.

Political liberalism vs. concern about COVID: 0.33 surprised me. Then I realised how being really shouty on social media can create a false impression around entire groups.

Let's say you have made a new teaching app, and you try testing it in two ways. Method A: you have the students use the app and test them immediately (while they are still in the psych lab). Method B: you test the participants by looking at the improvement in their exam scores at the end of the semester. Even with zero drop-outs, and even if the immediate test and the exam were identical, you would expect method B to have a much lower effect size. When comparing effect sizes in education I often run into this problem: if, instead of two methods of studying the same thing, these were two separate studies, I'd face the question of whether to go with teaching intervention 1, which uses method A and has an effect size of 0.5, or teaching intervention 2, which uses method B and has an effect size of 0.2. From this data I have almost no usable information about which one to pick.

StrongerByScience released a post that's an instance of this - the researchers said "tiny effect size of using creatine for muscle growth," but the research really says "makes your workouts 1/3 more effective," which is pretty huge! https://www.strongerbyscience.com/creatine-effect-size/

## Attempts To Put Statistics In Context, Put Into Context

This was a fun one

One of my favorite short reads of the year 😊 Thanks as always.

Are spots on dogs correlated with friskiness?

I know spots on foxes are correlated with gentleness, and spots on horses anti-correlated with gentleness.

There's a broken link in the "Children tutored individually learn more than in a classroom: 2.0" line (it looks like a typo; the intended link seems to be https://en.m.wikipedia.org/wiki/Bloom%27s_2_sigma_problem )

Possibly dumb question, but I was of the impression that effect size was between -1 and 1?

Is there anyone who could give me a quick reminder of how to interpret, or how to calculate, an effect size that goes up to 2.0?
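To answer the question above: it's correlation coefficients that are bounded by [-1, 1]. A standardized effect size like Cohen's d is the difference in group means divided by the pooled standard deviation, so it can be arbitrarily large. A minimal sketch (the two groups are made-up numbers for illustration):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard
    deviation. Unlike a correlation coefficient, it is not bounded by 1."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

group_a = [10, 11, 12, 13, 14]  # mean 12
group_b = [13, 14, 15, 16, 17]  # mean 15
print(cohens_d(group_b, group_a))  # about 1.9: nothing caps it at 1
```

d = 2.0 just means "the group means are two pooled standard deviations apart."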

Galton introduced the correlation coefficient as recently as 1888, a couple of centuries after Newton's big book. I've often been struck by how long it took humanity to get interested in statistics compared to the harder subject of physics.

It's not as if people didn't have any data to work with in the past: the Bible mentions at least three censuses and even has a "Book of Numbers." But the urge to nerd out with data seems to be fairly recent: e.g., William Playfair invented most of the main types of statistical graphs in the late 18th Century.

An interesting question is whether humans just innately aren't that interested in or adept at statistical thinking. Or is this more a socially constructed problem which could improve over time?

Even though I’m generally aware that human attributes come normally distributed, I don’t have a general intuition for the standard deviation of each. For example, how much smarter is a 1-in-100 person than the average person, compared to how much more attractive a 1-in-100 person is than the average? Any thoughts on this?
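One partial answer: on any normally distributed trait, "1 in 100" is the same number of standard deviations above the mean (about 2.33); what differs between traits is only the size of the SD in natural units. A quick sketch (the IQ SD of 15 is the conventional one; the height SD of ~3 inches is an illustrative guess):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal
z_top_1pct = nd.inv_cdf(0.99)  # ~2.33 SDs above the mean

# Translate into familiar units
print(f"IQ (SD 15): a 1-in-100 person is ~{15 * z_top_1pct:.0f} points above average")
print(f"Height (SD ~3 in): a 1-in-100 person is ~{3 * z_top_1pct:.1f} inches above average")
```

For attractiveness there's no agreed-on natural unit, which is exactly why the intuition is hard to build.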

i feel like “5 points on the HAM-D” would be easier to interpret

Similar to Mantic Monday, it would be useful to get a Fact Friday or something like that with a list similar to the one at the end of this post. It would help develop everyone’s intuition.

Median effect size in psychology papers is r = 0.36, but r = 0.16 if the study is pre-registered. Either we need to change the public understanding of what a good correlation is, or we need to admit that academics are too interested in things that don't matter.

https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00813/full

Another potentially misleading thing is what you control for. Like if you say the "correlation between life outcomes and education is low" that sounds very significant until you add "controlling for income", as income mediates most of the effect you're interested in. (That's a made up example so may not reflect the real stats).

Idea: for each quantity of interest, familiarize some people with the nature of the quantity, and have them all come up with a proposal for what the smallest difference is that they would consider meaningful. Call this a "Small Meaningful Difference".

Then instead of reporting effect sizes in terms of variance-standardized numbers, report them in terms of SMDs.

My education is in engineering and physics. I have zero idea what these numbers mean. Maybe if you showed me some graphs that would be better.

Scott, correlation does not equal identity. Just because students who are good at reading are also good at math, it doesn't follow that there is a "general" intelligence. There could be multiple intelligences that correlate.

I have a hard time understanding how you can be so bought-into the g factor when you yourself are 1-in-a-million talented at writing while worse than most of your peers at math. How does a "general" intelligence explain this?

Just let go of general intelligence and simply say that different talents correlate very strongly (without being literally the same thing as each other).

"And the second effect might sound immense, but it is only r = 0.64." This result appears to be limited to the USA, where New Orleans is at 29 degrees north and Duluth is at 46 degrees north. So there is a real range-restriction issue.

I smell a rat in your countries near the equator v country hotness correlation. Not nearly high enough.

Chasing the reference leads to:

9. Nearness to the equator and daily temperature in the U.S.A. (National Oceanic and Atmospheric Administration, 1999; data reflect the average of the daily correlations for latitude with maximum temperature and latitude with minimum temperature across 187 U.S. recording stations for the time period from January 1, 1970, to December 31, 1996).

Which is not quite the same thing! That's states on a continental landmass, and it doesn't include the extremes of 'on the equator' or 'near the poles'.

Any figures for variance need to consider how many other things are allowed to vary. In a world where everyone was raised in a 100% identical environment, IQ would explain 100% of the variance; in a world where everyone had identical IQ, IQ would explain none of the variance.

Explaining variance almost never explains what you really want to know.

> only as big, in terms of effect size, as the average height difference between men and women - just a couple of inches

The average height difference between men and women is more like five inches; with a standard deviation of about 3 inches, that's d ≈ 1.6.

>Children tutored individually learn more than in a classroom: 2.0

This is not actually true. Peer tutoring (students teach each other) is about d = 0.4.

https://www.sciencedirect.com/science/article/pii/S2405844019361511

Expert tutoring was about 0.8 d, but looks like AI tutoring is about the same, and this study is from 2011. So using GPT4 tutoring will surely be better, and we basically don't need teachers much anymore for most stuff. Humans need not apply.

https://www.tandfonline.com/doi/abs/10.1080/00461520.2011.611369

"We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret."

Is there really no useful equivalent of mean/median/mode? No way we can just indicate "What we call depression isn't fundamentally just one thing, and people differ in ways we can't currently predict in advance, even though symptoms cluster. So, responses are highly variable, and most people won't respond, but if this helps you at all, it will help about this much"? Do our institutions not have the ability in principle to report and interpret an ordered triplet of numbers?
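There is at least one existing translation along these lines: the "common language effect size" (probability of superiority), which converts a Cohen's d into the chance that a randomly chosen treated person does better than a randomly chosen control. A minimal sketch, assuming two normal distributions with equal SDs:

```python
from statistics import NormalDist

def prob_superiority(d):
    """Common-language effect size: probability that a randomly chosen
    member of the treated group scores higher than a randomly chosen
    control, given two equal-SD normal distributions d apart.
    Formula: Phi(d / sqrt(2))."""
    return NormalDist().cdf(d / 2 ** 0.5)

for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: P(treated beats control) = {prob_superiority(d):.0%}")
```

"A randomly chosen treated patient has a 64% chance of doing better than a randomly chosen untreated one" is arguably closer to the ordered-triplet idea than raw HAM-D points.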

By the time I was in 5th grade, my math tests had questions like, "Which measure of central tendency best describes this data?" followed by a list like "1,1,1,1,1,2,50" or "1,2,5,6,7,7,8,8,12." I realize that real-world data and its use are far more complicated, and there's value in having a common metric across many data sets, but still, this is the best we've got?

The only context most anybody needs for anything statistical or involving studies is 'how much does this agree with what I already think'.

>The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Am I the only one that experiences this as a wrenching, mind-jarring typo?

I don't even know if it's grammatically incorrect technically, but it feels unpleasant to read.

The infinite boundary between mathematics and psychology.

On the correlation of 0.59 between IQ and grade, this actually says nothing about the effect of IQ on grades since grade is a variable that is highly subject to the bias inherent in the method of assessing it. If the assessment method is heavily recall-based, then the correlation with IQ would necessarily be moderate. But if the assessment method is heavily logic-based or critical-thinking-oriented, then the correlation would be very high. Hence, I would expect IQ/Grade correlation to be very high in STEM subjects, moderate in Arts, and low in social sciences.

I'm not sure it matters, but nothing is sourced, so it's all anecdotal. It's like astrology from Parade magazine, or Harpers' fractured anecdotes. AI could have written it.

> The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Scott, this has been commented on, but I'm curious what you even mean here. Reading scores and math scores are both continuous variables with a total order. It is obvious what correlation means in this case: just Pearson's coefficient.

But what is a correlation between an unordered nominal variable like "college major" and continuous variables like IQ and gender balance? There is a good discussion of this topic here: https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable#124618. There are possible approaches, but it's not clear how you would meaningfully compare these in terms of predictive strength to something more straightforward like Pearson's coefficient.
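For what it's worth, one standard answer from that StackExchange discussion is the correlation ratio (eta): the square root of the share of total variance explained by group membership. A minimal sketch with made-up IQ scores grouped by three hypothetical majors:

```python
import statistics

def correlation_ratio(groups):
    """Eta: association between a nominal variable (group membership,
    e.g. college major) and a continuous one (e.g. IQ). Computed as
    sqrt(between-group sum of squares / total sum of squares)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return (ss_between / ss_total) ** 0.5

majors = [[100, 105, 110], [115, 120, 125], [95, 100, 105]]  # hypothetical
print(correlation_ratio(majors))  # about 0.90
```

But the commenter's point stands: eta is not a Pearson r, so placing "correlation with college major" on the same scale as "reading vs. math scores" is already a category stretch.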

One general strategy in comparing models in machine learning is to always have a baseline (that you're trying to beat on some performance benchmark) and, depending on the problem, an "oracle", which represents a meaningful upper bound (typically human performance or a model that somehow cheats by having access to additional information that would not be realistically available to a model used in the real world). The key is to put predictive performance into meaningful context by comparing with end points in the same problem space rather than cherry picking performance on other data sets to compare with (which clearly can be useful in some contexts, but unhelpful if the two distributions are completely unrelated, e.g., SSRI effect sizes vs. male-female height differences). I think in the case of the example given of depression, laypeople won't typically understand "2 points on the HAM-D," but will understand "15% of the difference between the means of the healthy/typical and depressed populations". Generally, this approach of making comparisons within the same problem space in this way is also constructive: it naturally points at "here's where we are vs. where we were vs. where we want to go" in a way that comparing totally unrelated distributions won't.

I think scatterplots are a good way to get a feel for what a correlation is really telling you. Here's a little graphic showing a bunch of correlations and scatterplots of data that produced them.:https://imgur.com/JuDikNA

It's from here: https://www.westga.edu/academics/research/vrc/assets/docs/scatterplots_and_correlation_notes.pdf

Should these be reported without a confidence interval or standard deviation? For a lot of the more controversial things, what people usually doubt isn't whether the effect is positive but whether the effect is reliable.

Finally, a chance to share a funny anecdote that involves myself!

So I took the SAT twice, because I applied to college early, in my junior year of high school, but failed to get in. And I had to retake the SAT the next year for the "real" college applications.

I used that one sketchy online table for converting one's SAT score(s?) to IQ score.

By taking the SAT twice, I """improved my IQ""" by about 10 points in 1 year.

Great, super helpful. Thanks.

I would have thought that comparing men and women's height would be the most familiar and thus best-all-around comparison, but on further thought I see three problems:

- Up until the end of high school, the sex difference in height isn't stable, so it can be confusing for, say, 10th graders taking stats.

- Among adults, women often wear heels in public, so many people might underestimate the gap.

- Lots of people these days will consider mentioning the difference to be sexist.

Are there any other go-to comparisons that don't have problems like these?

"Children tutored individually learn more than in a classroom" I thought the consensus was that while Bloom did find d = 2.0 in his original studies, later studies with larger sample size and better methodology got closer to d = 0.5 - 0.8 for tutoring, which is still amazing for an educational intervention but not quite as extreme?

A confounding variable here is that Bloom was not just measuring "human tutoring" but "human tutoring with the Mastery Learning method". Hattie, which you also linked, has "Mastery Learning" at d = 0.57 even without the one-on-one tutoring part.

A good starting point for this is https://nintil.com/bloom-sigma/ which also links to VanLehn (Ed Psy 2011) https://www.public.asu.edu/%7Ekvanlehn/Stringent/PDF/EffectivenessOfTutoring_Vanlehn.pdf that discusses the two sigma effect from p. 210. Key quote: "At any rate, the 1.95 effect sizes of both the Anania study and first Evens and Michael study [on which Bloom's analysis is based] were much higher than any other study of human tutoring versus no tutoring. The next highest effect size was 0.82. In short, it seems that human tutoring is not usually 2 sigmas more effective than classroom instruction, as the six studies presented by Bloom (1984) invited us to believe. Instead, it is closer to the mean effect size found here, 0.79. This is still a large effect size, of course."

And " ... Bloom’s 2 sigma article now appears to be a demonstration of the power of mastery learning rather than human tutoring ..."

Another educational point to beware of from the Hattie analysis linked in the post: "Whole-Language Instruction" has a positive d = 0.06. As "Sold a Story" explains, the Whole-Language people came up with a method, did studies, found statistical significance (I will be charitable and assume no p-hacking) and a positive effect size, and then went around promoting this as the Scientific Way (TM) to help disadvantaged kids to read.

The problem is that Phonics, which the Whole-Language people explicitly were out to replace, has d = 0.70.

Standardised effect sizes may be hard to derive meaning from in general, but "Phonics >>> Whole Language" for kids who don't pick up reading "naturally" seems to be established beyond reasonable doubt, and effect sizes are part of the evidence for that.

I get sent Bloom's two sigma paper approximately once a month. Unsurprisingly, it fails to replicate, although the literature is a mess. https://nintil.com/bloom-sigma/

Can you add some effect sizes for things we are pretty sure deterministically almost always work, to give an idea for how big the scale is supposed to go? Like, effect size for getting shot in the head to be lethal, or for a fresh-out-of-the-factory battery to cause an electric current in a circuit.

A little late to the party but this 2011 metareview finds that the effect size of individual tutoring over classroom learning is closer to .8 than 2.0, and that Bloom’s famous 2-sigma studies were outliers likely caused by factors other than the effect of one-on-one tutoring

https://www.public.asu.edu/~kvanlehn/Stringent/PDF/EffectivenessOfTutoring_Vanlehn.pdf

Political liberalism vs. concern about COVID: 0.33 surprised me. Then I realised how being really shouty on social media can create a false impression around entire groups.

Let's say you have made a new teaching app, and you try testing it in two ways. Method A: you get the students to use the app and test them immediately (while they are still in the psych lab). Method B: you test the participants by looking at the improvement of their exam scores at the end of the semester. Even if you got zero drop-outs and the immediate test and the exam were identical, you would expect method B to show a much lower effect size. When comparing effect sizes in education I often run into this problem: if, rather than two methods of studying the same thing, it's two different studies, I face the question of whether to go with teaching intervention 1, which uses method A and has an effect size of 0.5, or teaching intervention 2, which uses method B and has an effect size of 0.2. From this data I have almost no usable information about which one to pick.

That said, I find them a very useful metric.

StrongerByScience released a post that's an instance of this - the researchers said "tiny effect size of using creatine for muscle growth," but the research really says "makes your workouts 1/3 more effective," which is pretty huge! https://www.strongerbyscience.com/creatine-effect-size/