Thanks, that was very informative and cleared up a lot of things for me about the SSRI debate.

IIRC it’s very common for depressed patients to have to try several medications before they find one that works? If you had 5 drugs that each “cured” 20% of patients, running a study of each against a placebo would show little impact. That would be the case even if cycling a patient through all five would cure everyone.
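A toy simulation (all numbers invented) of exactly this scenario, with five drugs that each fully cure a disjoint 20% of patients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup (made-up numbers): each patient responds fully
# to exactly one of 5 drugs, and to none of the others.
responder_drug = rng.integers(0, 5, size=n)

# Symptom scores on a HAM-D-like scale: depressed baseline ~N(25, 6);
# a "cure" drops a responder into the healthy range, ~N(5, 3).
baseline = rng.normal(25, 6, size=n)

def treated_scores(drug):
    """Post-treatment scores if every patient in the arm received `drug`."""
    cured = responder_drug == drug
    return np.where(cured, rng.normal(5, 3, size=n), baseline)

placebo = baseline  # assume no spontaneous improvement, for simplicity
drug0 = treated_scores(0)

# Cohen's d for one drug vs placebo: diluted by the 80% of non-responders.
pooled_sd = np.sqrt((placebo.std(ddof=1) ** 2 + drug0.std(ddof=1) ** 2) / 2)
d_single = (placebo.mean() - drug0.mean()) / pooled_sd

# Cycling each patient through all 5 drugs cures essentially everyone.
best = np.min([treated_scores(k) for k in range(5)], axis=0)
print(f"single-drug d: {d_single:.2f}")
print(f"fraction near healthy after cycling: {np.mean(best < 12):.0%}")
```

The per-responder effect is enormous, but averaging over the 80% of non-responders in any single-drug trial dilutes it to a modest-looking number, even though cycling through all five cures nearly everyone.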

2 posts in one day? Looks like inflation is finally over.

I think the stronger way to put this research is that a 0.30 effect size is consistent with curing roughly 30% of patients.

I've long believed that the easiest way to trick people is to post a number without context.

"Dihydrogen monoxide is LIFE-CHANGING! It has an effect size of .30!"

"Dihydrogen monoxide is WORTHLESS! It has an effect size of only .30!"

...to an uneducated layperson, both of those statements look pretty persuasive!

(It reminds me of the Fukushima disaster, when the public suddenly had to make sense of a bunch of unfamiliar nuclear physics jargon. I remember a friend saying, "if you'd told me last week that bottled water had a high becquerel rating, I would have thought it was a good thing.")

This is a big problem with studies of infertility. If you have a big pile of people with infertility/miscarriages and you suspect you have a big mix of underlying, poorly understood problems, how good are you going to be at detecting something that makes a big difference for a medium subset of that grab bag?

I strongly suspect that progesterone support in the luteal phase and beyond might make a difference for women whose main problem is low progesterone. But I also expect that low progesterone is an indicator for other problems which aren't fixed by treating the progesterone. (I went on progesterone injections for a week or so for a baby who turned out to be ectopic).

So how do you check whether you're making a difference to enough women to recommend the pills or shots when the numbers are low and you don't know why?

Great post, and useful for clarifying what effect size really entails; it combines well with other discussions of diet effects.

This is excellent, thank you.

ACX wrote: "Ibuprofen (“Advil”, “Motrin”) has effect sizes from about 0.20 (for surgical pain) to 0.42 (for arthritis)."

I'm a pretty healthy retired person. The only "medication" I take regularly is a daily multi-vitamin.

I want to point out that ibuprofen works like a Magic Bullet whenever I need an analgesic. I haven't had any pain in the last 20 years that ibuprofen didn't relieve (and quickly). So I'm very surprised by ibuprofen's low effect size.

Maybe this is a problem of the tails coming apart. The tests for depression are measuring a variety of correlates. The baseline for each of the correlates may be higher than we as a society would prefer to admit. If a medication brings down one or two of the correlated variables to baseline, that still doesn't result in a depression cure according to a test that measures other things also.

I think it is astonishing that millions of people have taken these drugs, that are supposed to affect something as significant and obvious as mood, and there is a debate about whether they have any effect at all. So my inclination is to suspect they do indeed have little effect. (This is also "supported"/prejudiced by personal experience/anecdote.)

The other thing that I draw from this is that statistics is hard, and very few people know how to do it properly.

Can someone clarify what effect size metric they are using? Apologies if I’m missing it; I didn’t see it in a quick skim of the linked blog post.

I used to have a similar objection when I'd see stats like, "ambien reduces sleep latency by an average of 15 minutes".

And, it was like... who are you measuring that on?

Normal people fall asleep within 10-20 minutes, so it couldn't possibly make a healthy person fall asleep faster than that.

And everyone else is going to have a wide range of responses. My natural sleep latency varies between 2 and 6 hours. On ambien, it went down to 10-20 minutes. That's a pretty big difference in terms of, say, being able to hold down a job.

If only 10% of people react that way to the drug, it might still be highly useful for some, regardless of what the average effect is.

The value of a drug can depend highly on the response at the tail of the distribution.

That's how I used to think about this stuff, anyways.

Years later, I eventually became dependent on sleeping pills. When I stopped taking them, I went through a horrific withdrawal experience that has lasted 18 months, so far (lunesta is what ended up harming me, not ambien).

I've since gotten involved in recovery communities and spoken to hundreds of people who've been badly injured by psychiatric drugs. Mostly by benzos, but I've met some who've had terrible withdrawal from antidepressants.

That makes me equally skeptical of any studies that might have been done on withdrawal.

I mean, I'm not sure that drug makers do much research on getting people off of drugs, in general. But, to the extent that they do, they're probably going to make a study that says something like, "the average person has mild discontinuation symptoms that last 4 weeks".

That might even be true, on average, but the worst case outcome may be years of suffering, with a possibility of permanent damage.

Both the benefits and risks of drugs are probably seen best in the tail of the distributions, and trying to explain anything about drugs in terms of mean effect sizes can be misleading.

I don't believe the effect size of 0.11 for antihistamines and sneezing. I notice that it was published in 1998, in the pre-fexofenadine age (though other excellent second-generation antihistamines existed). I want effect sizes for Allegra on seasonal allergy symptoms, stat.

I'm a statistician who's done a fair bit of professional work on test design and effects estimation (in a context different from clinical trials). I don't think I've ever found it very useful to normalize the size of an effect to its standard deviation when trying to understand the importance of that effect post hoc. You can use things like signal-to-noise to plan tests, but afterwards, you mostly care about two things:

1. Did the factor (or in this case, intervention/medication) affect the outcome I care about?

2. Do I expect to see that same effect in the future? (In this case, if I give that medication to other folks in the future)

You can round statistical significance up to the second one, but I read this post as being about the first thing. There is no coherent, generalizable definition of a meaningful effect size: an effect size matters if it matters. Expertise and context are very important here. Describing an effect as "large" or "small" within the context of a particular ailment or class of drugs might make sense, but I think I agree with Cuijpers et al. when they say "Statistical outcomes cannot be equated with clinical relevance."

Going a little deeper on that Cuijpers et al (the 0.5 people Scott cited above), they actually seem to suggest that 0.24 is the more clinically-relevant effect size. Their approach of identifying a clinically-relevant difference and suggesting that as a cutoff for whether the intervention is useful is consistent with what I've recommended in the past and sounds extremely similar to what Scott is suggesting when he says, "I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience."

Last point: When you're talking heterogeneous effects (as in the Danish team's second simulation), these notions pretty much go out the window. The average effect of a drug shouldn't be what we measure it on or how we think about prescribing it. We should probably instead think about the likelihood that the drug will have a clinically-significant effect on a given patient, or something along those lines. That's not something that will be captured by looking at the 'effect size' in cases where the drug affects some people completely differently than others.

Absolute effect sizes, not standardized effect sizes, are what you're interested in. We need much more discussion of how big the effect is in those terms (and the necessary associated discussion of how to come up with a rough estimate of how much absolute change on these questionnaires is clinically significant).

Patients want to know how much less depressed a medication will make them. They don't care about whether the drug will improve them by more than some fraction of the standard deviation of the spread of depression scores in the population!

How much of this is just a side-effect of trying to collapse a whole distribution of patients into a single "effect size" score? I'm not a statistician, but it seems plausible that something that works *really well* but only for a few people, and something that kinda works for most people, could conceivably end up with the same score, even though those are very different in practical use. Isn't what we're really looking for something more like the effect-magnitude-vs-patients-affected curve like in the first two charts here? Boiling that down into a single number not only glosses over a lot of information, it makes it far easier to demand "make the number higher" without knowing whether that is even mathematically possible.
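A toy calculation along these lines (all numbers invented; "effect size" here is just mean improvement divided by an assumed population sd, a simplification of Cohen's d) shows how two very different drugs can share one score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
baseline_sd = 6.0  # assumed sd of symptom scores in the study population

# Hypothetical per-patient improvements over placebo, in scale points:
drug_a = np.full(n, 2.0)                            # helps everyone a little
drug_b = np.where(rng.random(n) < 0.2, 10.0, 0.0)   # helps 20% a lot

for name, imp in (("A", drug_a), ("B", drug_b)):
    d = imp.mean() / baseline_sd        # standardized mean difference
    big = np.mean(imp >= 10)            # fraction with a large response
    print(f"drug {name}: effect size {d:.2f}, large responders {big:.0%}")
```

Both drugs land at roughly d = 0.33, but one has zero large responders and the other has 20%, which is exactly the information the single number throws away.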

Antidepressants clearly help some people some. Recently read a study that was touted as evidence that antidepressants keep on working even after a couple years, though I actually found the results pretty grim: People who'd been on any of 4 antidepressants and felt well and ready to stop were randomly assigned either to really tapering off the drug or to what one might call placebo tapering off: they continued taking their usual dose. (All subjects took identical generic-looking pills.). Outcome measure was how many in each group relapsed. At end of study, 52 weeks out, 56% of those who had tapered off had had depression relapses. (See! That stuff takes a licking and keeps on kicking!). BUT 39% of the people who had not tapered off had also relapsed (oh . . . ). Study is here: https://www.nejm.org/doi/full/10.1056/NEJMoa2106356

Also there's a kind of harm that SSRI's do as they're currently prescribed that I don't see talked about much. Here's a story about an actual patient, aged mid-20's, who came to me for her first meeting and gave the following account: In high school she was a gay kid not out to anybody. She was lonesome and miserable. Went to her MD and described how unhappy she was, but did not disclose her secret. She was put on an SSRI, and believes that it helped at the time.

Seven years have passed since then, and she came out years ago and has a long-term girlfriend. But she feels like there's something wrong in their relationship because she has no interest in sex. She also weighs 40 or so pounds more than she did in high school. She isn't exactly depressed, but feels blah and unmotivated.

She has tried to come off her antidepressant, but when she does she feels terrible within a few days, with flu-like symptoms, irritability, insomnia and head zaps, so she has concluded that she must need the drug. Her health care providers have all routinely continued her SSRI prescription all these years. No one has asked whether she would like to try coming off. No one has told her that the SSRI she is taking is known to have a bad discontinuation syndrome (flu-like symptoms, irritability, insomnia and head zaps) and that if she is going to try stopping the drug she must decrease the dose very slowly.

About the possibility that maybe each antidepressant is one of those drugs that cures certain people 100%. I believe there was a massive study done where each patient could try up to 3 or 4 antidepressants, with doctors instructed to change antidepressants after a certain period of time if a patient's improvement didn't meet a certain criterion. In fact I'm *sure* something like that was done, maybe at NIMH, but I simply cannot remember any details. Anybody here know something about that study?

I can think of two factors which lead to such ambiguous results.

1. Is depression an illness or a symptom set caused by multiple illnesses and/or bad habits?

Imagine going to the doctor and being diagnosed with "runny nose disease." Having a runny nose is not psychosomatic or imaginary. It is visible to others and even measurable. But any doctor who diagnosed a patient with "runny nose disease" would probably lose his license.

But imagine if we looked for drugs to cure "runny nose disease". An anti-allergy drug would indeed reduce runny nose for those whose nose runs due to allergens. Likewise, avoiding cats would be helpful for a significant number of people. But neither therapy would help against a sinus infection. Conversely, antibiotics would help sinus infections but would do nothing for allergies to cats or pollen.

2. Even given a correct targeted cause, a drug used shotgun style would have mixed effects.

For example, I do not suffer from clinical depression, but I can get depressed if seriously deprived of sleep. Measures to increase serotonin -- such as eating high carb/low protein before bed -- can improve sleep quality in many people, including myself. But I rather like having more protein earlier in the day so I can be awake.

Serotonin boosters can thus be beneficial for sleep, but also depressing if given at times when one wants to be awake. Who wants to be sleepy all the time?

OK, the previous paragraph was speculative, but I can say from experience that I find tryptophan supplements to be useful at times, but also quite tricky. I can take a standard half gram pill before bed and get wonderful sleep -- for a night or two. Then I get bad side effects. A quarter gram dose works better, but even then I wouldn't want to take it regularly.

So, given my personal experience with semi-natural serotonin boosters, I could see where SSRIs could both make depressed people happier AND make them want to shoot up a school or something. Dosage and timing are key.

And some people could be depressed for reasons other than insufficient serotonin at key times.


And all this adds up to why I sometimes put more credence in Amazon reviews than hard science. If a bunch of people really like a drug, supplement, diet, etc. then it means the product in question works for SOMEBODY. Standardized scientific testing procedures can average out signals, both for and against.

> I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience.

I know this was not you specifically, but we've come a long way from "shut up and multiply", eh?

I think part of the difficulty here is that the language of effect size is trying to put all clinical metrics* on the same footing (by dividing by the s.d.). This is a useful metric/way to think about the world in general. However, this abstraction obscures at least two salient characteristics that someone might care about:

1. How large are the differences between people** in absolute terms.

2. How important are such differences to people.

Consider a drug that increases lifespan by 6 months, with minimal side effects. I imagine this is the type of drug that many people care a lot about. But this is "only" an effect size of 0.1 (US mean lifespan 78 years, sd 5 years).
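The arithmetic, spelled out (using the assumed US figures above):

```python
# Effect size of the hypothetical lifespan drug: a 6-month (0.5-year)
# mean gain, standardized by an assumed 5-year sd of lifespan.
gain_years = 0.5
lifespan_sd = 5.0
effect_size = gain_years / lifespan_sd
print(effect_size)  # 0.1
```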

While not as extreme as lifespan, I imagine both mood and sleep quality to be characteristics with a) relatively high individual variation and b) quite important to people. So relatively small effect sizes have large clinical and quality-of-life significance.

In contrast, a drug that treats relatively minor ailments with very small individual variation may need a much higher ES to be worth doing.

Much of the problem here is plausibly due to a lack of an actually comparable common metric (like DALYs), instead using an artificial metric (effect sizes) that gives the *illusion* of cross-comparability. I imagine that this is an area that public health economists may be able to impart useful tips to clinicians.

*and intervention analysis metrics in general

**I was originally thinking of the question in terms of general population distributions, but I suppose what you care about more is the population under study.

One molecule may cure only a fraction of patients, but there are many molecules available, so psychiatrists try one after another until remission. This sequential process has been studied in the STAR*D trial.

Results: "With persistent and vigorous treatment, most patients will enter remission: about 33% after one step, 50% after two steps, 60% after three steps, and 70% after four steps (assuming patients stay in treatment)."

So the whole issue is to find the right molecule for each patient. Clinical trials that test one molecule on random patients cannot capture this.
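A quick back-of-envelope on those cumulative figures, deriving the implied remission rate at each step among patients who had not yet remitted:

```python
# Cumulative remission fractions quoted from STAR*D (rounded figures).
cumulative = [0.33, 0.50, 0.60, 0.70]

prev = 0.0
step_rates = []
for cum in cumulative:
    # Per-step rate, conditional on not having remitted at earlier steps.
    step_rates.append((cum - prev) / (1 - prev))
    prev = cum

print([f"{r:.0%}" for r in step_rates])
```

So each successive molecule still helps roughly a fifth to a quarter of the remaining non-remitters, consistent with the "find the right molecule for each patient" reading.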

Source: Gaynes, B. N., Rush, A. J., Trivedi, M. H., Wisniewski, S. R., Spencer, D., & Fava, M. (2008). The STAR*D study: Treating depression in the real world. Cleveland Clinic Journal of Medicine, 75(1), 57-66. https://doi.org/10.3949/ccjm.75.1.57

Warden, D., Rush, A. J., Trivedi, M. H., Fava, M., & Wisniewski, S. R. (2007). The STAR*D project results: A comprehensive review of findings. Current Psychiatry Reports, 9(6), 449-459. https://doi.org/10.1007/s11920-007-0061-3

I'm confused how we land on the underlying measurement tools for these things. The HAM-D test you describe seems (from your description, at any rate) very poorly calibrated for this application, like using my bathroom scale as a kitchen scale then wondering why the bread doesn't come out right. Same with the comment above noting that normal people take ~15 minutes to fall asleep making them a horrible reference for Ambien.

Am I missing something? I would have thought basically everyone's incentives aligned toward using tests likely to actually find something, but then what drives the selection of measurement tools for these types of studies? Is it just much, much harder to create good medical diagnostic tools (even for something like insomnia or post-surgical pain) than I'm imagining?

There are two major omissions in this post, as far as I can see: depressed people in the control arms of these studies may not be responding to placebo at all (they spontaneously improve, which is a small but relevant difference), and the drugs listed in the chart as having low effect sizes are well known to practicing physicians to be clinically useless, but persist in practice secondary to commercial interests and institutional capture.

Depressive episodes are self-limiting in a large majority of patients, meaning that even absent a placebo intervention, in 6-8 months they will start feeling better. This is at least what my (non-American) textbooks said about the issue when I read them for licensing. This is different from a placebo effect in that it makes for an even less productive uphill battle for any intervention trying to treat it, since you don't get to profit from the placebo effect to a meaningful extent in either the intervention or the control group. There is no real reason to expect a benefit in a large number of patients receiving antidepressants or placebo to begin with; at most they may benefit from having their episode shortened by a few weeks, which is difficult to prove statistically and not something the trials assess in a meaningful way. There was never a study, afaik, comparing placebo vs. open-label doing nothing in depression; were it run, I'd wager it would return a null. This is a largely unexplored problem in the literature, one that is consistently used to inflate antidepressant activity: they are a bit better than placebo, and placebo effects in depression are large. But are they really? The literature can't tell, as far as I have read it.

On the latter point, there has been endless debate about the efficacy of blood pressure meds, statins and any other tertiary prevention program for exactly that reason - the effect sizes are marginal, the NNT is enormous and the data are largely there to conclude there has been no real benefit to the deluge of prescriptions that started around the 1990s. Yet the guidelines keep expanding indications without good data - it has long been speculated that this is due to commercial influence on the committees and on medical research as a whole. So any argument that basically makes the point "marginally effective cardiovascular drug A has a similar effect size to antidepressants, so they must work" ignores what is basically a consensus amongst the more perceptive GPs - that neither has any discernible effect in actual practice and both are sold due to aggressive marketing and distorted perceptions of risk and reward by the professional community. So the findings of the second study aren't all that surprising, since antidepressants end up right where they belong, next to the other largely ineffective medications. It's just that this proves Kirsch's point, not his critics'...

I don't find the high drop-out rate exculpatory.

The side effects of SSRIs are very similar to the physical symptoms of anxiety, and I believe they're being given wholly inappropriately to people with anxiety. When my mother was in outpatient treatment for paranoid delusions caused by sleep deprivation, which was in turn caused by anxiety, she was given SSRIs. The single most common side effect of SSRIs is sleep problems, so of course they were wholly inappropriate, worsened her mental state, and she had to be hospitalised. It wasn't until she was given benzos that she was able to sleep for several nights in a row and the problem resolved itself.

When I was suffering from sleep problems caused by anxiety, I was also prescribed them. I suffered from sleeplessness, hot flashes, and night sweats, and they made my heart race. The side effects are quite severe and definitely worsened my mental state, though fortunately I don't suffer from the same psychiatric sleep problem my mother does, so the consequences weren't as bad. I was given a benzo-like drug (Zopiclone) which I used sparingly (knowing the risk of addiction) and which was very helpful in getting me through the worst of it. No thanks to the SSRIs!

The best antidepressants are the oldest ones — the MAO Inhibitors Parnate and Nardil.

Seems like a more intuitive way to quantify effect size would be something like "cure rate". Like take P to be the outcome score for the placebo group, X to be the outcome score for the treatment group, and H to be the outcome score for healthy people. The cure rate could be defined as (X-P)/(H-P).

Two limiting cases for intuitively interpreting it:

* If the drug effectiveness varies from person to person but it is always either 100% effective or 0% effective, then this number gives the % of people it worked on.

* If the drug effectiveness is consistent across people, then this number gives the fraction of the way they were moved toward the healthy group.
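A minimal sketch of this metric, with made-up HAM-D-style group means (lower = less depressed; the negative gaps in numerator and denominator cancel, so the direction of the scale doesn't matter):

```python
# "Cure rate" as proposed above: the fraction of the placebo-to-healthy
# gap that the treatment closes. All numbers are hypothetical.
def cure_rate(placebo, treated, healthy):
    """Fraction of the placebo-to-healthy gap closed by the treatment."""
    return (treated - placebo) / (healthy - placebo)

print(cure_rate(placebo=17.0, treated=15.0, healthy=3.0))
```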

Weirdly I have run into anti-ibuprofen people. Mainly in Europe where there is a strange bias against taking medication of any kind. They’ll tell you it doesn’t actually work much or at all, it’s just placebo, and there are dangerous side effects.

Fantastic post. The part that resonated with me most was how effect sizes take the mean effect on a population rather than looking at subpopulations where the effect is large and meaningful once it's undiluted by the non-responders. A classic peril of cohort studies.


Somewhat unrelatedly, I would love if you would cover (or perhaps you already have?) people who respond immediately to SSRIs. I feel the effects of very low doses (5mg fluoxetine) the next day. I assumed it was placebo at first but since then I’ve learned that responding to increased serotonin bioavailability may be a thing in PMDD, which I qualify for. I take 5mg a night during my luteal phase and it really helps! It’s kind of a new treatment that not all psychiatrists know about, and I would love to know more about it.

The thing is, you only need to think about studies and effect sizes and stuff for things which don't work all that well. If we pick some slightly unusual gourmet mushrooms and cook and eat them with a mate who goes into convulsions, we don't dismiss the effect as an uncontrolled retrospective observational study with n=1 and no ethics clearance, we stop eating the mushrooms. Conversely opium and antibiotics were adopted by the medical profession before well designed studies were invented, because they obviously work.

This was a really interesting article, thanks for posting!

Recently, I was asked if I wanted to pay £1500 for "pharmacogenetic testing" (I think I got the term right). I looked it up because I was skeptical, and I found an anti-psychiatry Reddit thread [1] arguing that its effect size was small. This confirmed what I thought, and I decided not to pay. This article has made me think again.

[1] - https://www.reddit.com/r/Antipsychiatry/comments/w103f9/has_anyone_done_the_pharmacogenetics_gene_test_to/

Has SSC/ACX ever done a deep dive into Robin Hanson's argument that much of healthcare overall is wasteful? See, for example, the Oregon Medicaid Experiment.

With regard to the chart by Leucht, I was wondering what measure of effect size they use to get values above one. Certainly not Pearson's r. SMD presumably denotes standardized mean difference, which, unlike r, is not bounded by one.

I think that instead of having a fixed cut-off at a given effect size, one will obviously want to do a cost-benefit analysis over likely treatment options.

I would imagine that the effect size of the "medication" of food to avoid starvation is close to r=1 over placebo (e.g. sawdust). This would highly recommend diets of caviar, french fries, radioactive onions, or what is commonly considered a healthy balanced diet as a method to prevent starvation. However, these treatments vary very much by side effects and costs.

For depression, the sad truth is that we do not have a magic bullet. There is no medication which can simply set HAM-D the way one can adjust blood pressure or thyroxine levels. If such a medication existed, one would use it instead of the stuff with small effect sizes. I think effect sizes are useful to compare different medications (for example, I would want a patent-covered medication costing big bucks to have a significantly bigger effect size to be worth it) and to measure where in the "prescribing effective treatment" vs "grasping at straws" spectrum we are.


What about antibiotics for infections? From my understanding, viruses generally do not respond to them, while bacteria do. That would mean that, according to the NICE criteria, antibiotics would not be indicated to treat a life-threatening infection which is caused by viruses 60% of the time (at least until the type of pathogen is confirmed). I am not an MD, but my impression is that this is not how it works.

As someone who has participated (as a patient) in antidepressant clinical trials, I will say that part of the issue is that depression presents differently in different people, it is a complex multifactorial condition, and subjectively, from my experience, scales like HAM-D basically suck at measuring depression.

If the drug clears up one factor that really bothers the patient (e.g. anhedonia, lethargy, etc) and has a large effect on their quality of life, the patient will feel like it's helping even if that effect is swamped by all the things it didn't help that are measured by the scale.

That's as opposed to e.g. metformin, near the top of the effect size chart presented, which is being scored against an objective single factor (blood glucose).

Great post! I practically agree with everything here.

Some additional considerations:

1) This is a great paper on effect sizes in psychological research (mostly correlations), where they look at well-understood benchmarks or concrete consequences, and conclude "... an effect-size r of .20 indicates a medium effect that is of some explanatory and practical use even in the short run and therefore even more important, and an effect-size r of .30 indicates a large effect that is potentially powerful in both the short and the long run. A very large effect size (r = .40 or greater) in the context of psychological research is likely to be a gross overestimate that will rarely be found in a large sample or in a replication."


2) Estimates of what constitutes a "clinically relevant" or "clinically significant" effect in depression are all over the place. This 2014 paper by Pim Cuijpers et al uses a somewhat unconventional approach but arrives at a tentative clinical relevance cutoff of SMD = 0.24. I think it's a good illustration that we can pick a variety of thresholds using different methods and it's not clear why any one threshold should be privileged.


3) Effect size expressed as Cohen's d is an abstract and uncontextualized statistic, which makes practical relevance quite unclear. In the case of antidepressants (and psychotherapy), however, the problem goes beyond a mere reliance on Cohen's d. A Cohen's d of 0.3 corresponds to a 2 point difference on HAM-D, and critics would say "Cohen's d of 0.3 may or may not be clinically relevant, but surely a 2 point difference on HAM-D, a scale that goes from 0-52, doesn't mean anything."
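The conversion behind that correspondence, with the HAM-D sd of roughly 6-7 points treated as an assumption (trial sds vary):

```python
# Cohen's d times the outcome sd gives the difference in raw scale points.
# The sd of ~6.7 HAM-D points is an assumed value, not from any one trial.
d = 0.3
hamd_sd = 6.7
print(round(d * hamd_sd, 1))  # about 2 points on the 0-52 scale
```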

The problem with this, in my view, is that

i) a reliance on *average effect* obscures meaningful heterogeneity in response


ii) we conflate a 2 point change in HAM-D from baseline with a 2 point HAM-D difference from placebo, and our intuitions regarding the former do not carry over well to the latter. A 2 point change in HAM-D may mean very little, but a 2 point difference from placebo could mean a lot (especially since antidepressant effects and placebo/expectancy effects are not summative)

iii) A 2 point change in HAM-D, depending on what exactly that change is, might still be quite meaningful. If the depressed mood item goes from a 4 (pervasive depressed mood) to a 2 (depressed mood spontaneously reported verbally), and nothing else changes, that 2 point change may very well be quite significant for the patient.

You've alluded to i) and ii) in your post as well, but I just wanted to make explicit that this goes beyond Cohen's d to actual differences on rating scales. Overall, I think the whole antidepressant efficacy controversy is a lesson in how a near-exclusive reliance on research statistics with arbitrary cut-offs can mislead us with regards to clinical significance of a phenomenon.

What is d? That is, what is the standard deviation?

The obvious thing to do is use the population standard deviation on HAM-D. The drawback is that you might worry about systematic differences in how the test is administered, so you want to use only data from your study. But then you aren't studying the population. You are studying a sample with a large standard deviation (eg, if you cure half of the people, the sample afterwards is diverse), which could inflate the denominator and suppress d. Is this how the standard deviation is defined?

The distribution of HAM-D on the general population is bimodal with d=5 between normal people and depressed people. You only need to cure 10% of them to achieve the d=0.5 criterion, or 20% to achieve the d=1 criterion. (Here the standard deviation is not for the population, but of the normal mode.)

But what if you include mildly depressed people? If they are in the normal part of the distribution, then HAM-D registers them as not depressed. Why did you label them depressed? Not because of HAM-D. If it cannot detect their illness, it cannot detect their cure.
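The bimodal arithmetic above can be sketched directly (the toy numbers are the comment's, not real HAM-D data): if a fraction of depressed patients jumps to the normal mode, the mean shifts by that fraction times the 5-SD separation between modes.

```python
# Toy arithmetic from the comment (assumed numbers, not real HAM-D data):
# the normal and depressed modes sit 5 normal-mode standard deviations apart.
SEPARATION = 5.0

def effect_size(cured_fraction):
    # Mean shift when this fraction of depressed patients jumps to the
    # normal mode, expressed in units of the normal mode's SD.
    return cured_fraction * SEPARATION

print(effect_size(0.10))  # 0.5 -> clears a d = 0.5 criterion
print(effect_size(0.20))  # 1.0 -> clears a d = 1 criterion
```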


Typo: “even a drug that significantly improves 100% of patients improve”


And almost no one takes into account the fact that repeated administrations of the same drug evoke learned responses like classical conditioning. I wrote about this several years ago and should probably come back to it to see what's happened in the literature since 2017.



You should probably mention explicitly what effect size metric is being used. I'm pretty sure I know, based on familiarity with the topic, but it's not obvious.


Cures for depression as mentioned in the Danish simulation study may not be just a theoretical construct. Psychedelics are showing a lot of promise.





Someone (Sarah Constantin?) did a lot with encouraging depressed people to keep records of their moods because depressed people are bad at noticing whether their moods improve. Does this play into efforts to evaluate anti-depressants?


"the average height difference between men and women - just a couple of inches"

Only if "a couple" = "five", which I normally would not say. Supposedly average male height is 5'9" and average female is 5'4" for the whole of the USA.

So probably best to treat it as a hypothetical saying.


There _is_ an anti-ibuprofen lobby, composed of those unfortunate individuals who thought that ibuprofen was the sort of safe medication you could use whenever you felt pain -- even though many of these people suffered from chronic pain. So they kept dosing themselves with ibuprofen, often with the approval of their doctors, who in many cases thought that any improvement their patients reported was due to the placebo effect -- but so what? It's not, they thought, as if the ibuprofen could hurt their patients. This turns out to be untrue, and their patients ended up with stomach ulcers, and liver and kidney damage. So be careful when taking the stuff.


The "drug that completely cures a fraction of its users" reminds me of Tim Leary's early finding that interventions tend to make 1/3 better, 1/3 worse, and leave 1/3 the same.



I’ve long been curious what you think of Whittaker’s “Anatomy of an Epidemic”. Been a while since I read it, but the gist of it was that people have life stresses that lead to depression, like death of a relative or loss of a job or what have you; in the old days, for most people time and emotional support would be the great healer, but now they get antidepressants that might make that healing time more tolerable — but then can’t kick the med and so are treated as somebody who is still depressed.

I found it pretty plausible even though I also feel like SSRIs is what got me through my parents’ dementia and death. But I had a terrible time kicking the drug, with two relapses, finally succeeding only by tapering it slowly over eighteen months. (Toward the end a non-psychiatrist was puzzled by the minuscule dose I reported, from cutting pills into fractions and taking one only every few days; he snorted and said, “You might as well be smelling them.” And maybe I was overdoing the gradualness of my taper, but it worked for me.)

I don’t think Whitaker was going full-Thomas-Szasz and denying that depression is a medical thing, but rather arguing that depression is vastly over-diagnosed (because/and therefore) SSRIs are vastly overprescribed. If lots more people are getting diagnosed than should be, might that help explain the feebleness of these results?

If you’ve read Whitaker and found it bosh, this would also be interesting to hear.


I feel like this is one of the most important things I've ever read. It makes me trust psychiatry a lot more.


How does this interact with "number needed to treat"? That seems like a better metric for effectiveness in this case. (Of course, even with an NNT of 100, for that 1 person out of 100 it could be life-changing.)
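NNT follows directly from response rates; with hypothetical rates chosen purely for illustration (not from any specific antidepressant trial):

```python
# Hypothetical response rates, for illustration only (not from a specific trial).
def nnt(p_treatment, p_control):
    # Number needed to treat: patients treated per one extra responder.
    return 1.0 / (p_treatment - p_control)

# e.g. 50% respond on drug vs 35% on placebo:
print(round(nnt(0.50, 0.35)))  # 7: treat ~7 patients for one additional responder
```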


Thanks for this very informative post. I think it highlights the risks of allowing experts to issue binding guidance on clinical practice. We still have a lot of people (including some doctors) who think NICE is the standard of excellence we should try to implement.


“Individual variation is the norm”

Lisa Feldman Barrett


Suppose there was a doctor patient conversation that went something like this:

(After trying an SSRI)

Patient: I could feel a strong effect after 4 hours.

Doctor: SSRIs don't work that way. Some drugs have an effect that rapid, but SSRIs don't. How would you feel if I said that was placebo effect?

Patient: Yes, it probably was. I am well aware of the arguments that the effects of SSRIs are mostly - possibly entirely - down to the placebo effect.

Doctor: I am really very worried by that


Question: why would the doctor be worried?

An observation: some people get a fast reaction from SSRIs. Current medical orthodoxy seems to be that this is entirely placebo effect - which kind of implies that the therapeutic benefit is almost entirely placebo in these cases.

Maybe we should be taking sugar pills :-)


How well do most of these studies adjust for the fact that there are likely huge pharmacogenomic differences in how these drugs are metabolized and therefore in the actual drug concentrations ("actual dosages") between individuals? Do they test everyone? If not, isn't this a really big source of potential confusion?


This article mostly made me think that the claim "this drug has a meaningless effect size" is almost always going to be true. You have a number and you're calling it the "effect size", but the term doesn't mean anything and you don't know why you're doing it at all.

Two things leap out at me from this article. I can't be sure that they are really problems, but they make me queasy.

1. The amount of improvement measured for our hypothetical patients is based on the HAM-D scale. The points on this scale do not have any associated meaning. If you changed the HAM-D instrument to report different numbers, in a fully deterministic way -- the questions would be the same, the answer options would be the same, only the reported numbers would be different -- it appears to me that the measured effect size in the new "rescaled" HAM-D would be different from the measured effect size in the same set of surveys scored under the original HAM-D. Imagine that all scores below 10 are doubled, while all scores above 10 have ten added to them. (Old 7 becomes new 14. Old 24 becomes new 34. Why would we do that? Why not? This isn't much different from adding ten questions that are highly duplicative of existing questions.)

But the actual improvement we care about is in the different survey answers, not the numbers we arbitrarily assign to them. If our methodology gives us different effect sizes for the same set of answers based on nothing but the *labels* that we give to those answers, that's telling us that we have no actual way of assigning meaning to the effect size we measure.

2. In principle, the standard deviation is a statistical construct that is equally well defined regardless of what your probability distribution looks like.

But the conclusions you'd want to draw from a measurement of standard deviation are radically different depending on what your probability distribution looks like.

I tend to suspect that general rules about what effect sizes should count are formulated with a normal distribution in mind. I don't think they will translate well to other distributions. Here, on our scale from 0 to 52 in which, by definition, a large majority of people score in the range 0-7, we have a very odd distribution. It is certainly not normal (or close to being normal); it's also unlikely to be similar enough to the score distribution of any other instrument that a lesson from one could be usefully applied to the other.
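The relabeling thought experiment in point 1 can be checked numerically with made-up scores (mine, purely illustrative). Using the usual pooled within-group SD as the denominator, the same set of survey answers yields a noticeably different Cohen's d after the comment's rescaling:

```python
import math
import statistics

def relabel(score):
    # The comment's rescaling: double scores below 10, add ten to the rest.
    # Same answers, different numbers.
    return score * 2 if score < 10 else score + 10

def cohens_d(a, b):
    # Pooled within-group SD as denominator (equal group sizes assumed).
    pooled = math.sqrt((statistics.pstdev(a) ** 2 + statistics.pstdev(b) ** 2) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Made-up end-of-trial scores for two small groups (illustration only).
placebo = [10, 20, 30]
drug = [4, 5, 6]

d_original = cohens_d(placebo, drug)
d_relabeled = cohens_d([relabel(s) for s in placebo], [relabel(s) for s in drug])
print(round(d_original, 2), round(d_relabeled, 2))  # ~2.59 vs ~3.40
```

Same questionnaires, same answers, different "effect size" - the number depends on the labels.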


When all the Kirsch stuff came out I looked at the data myself, but have not since. But what I noticed then was 1) for people with SEVERE depression it was clear that fluoxetine (at the time the most prescribed) worked extremely well. Kirsch would continually omit this fact. 2) the HAM-D seems faulty in that it is too wide a net. It could be argued that people who would be categorized as "mild" or even "moderate" may not actually have major depressive disorder. If the measure is faulty, all results coming from that measure are "off." Let me add that veterinarians prescribe these drugs for animals (all the time, for a wide range of animals) for a reason. They work. If they work on animals, that eliminates arguments about the placebo effect.


I don't think we should care about effect size at all other than for assessing:

1) Does the benefit outweigh the risk (large risks should only be taken if there are large benefits)?

2) Is it greater than placebo? (Because then there is a clearly lower risk treatment for more benefit)

Especially if the benefit of a treatment is *additive* with other treatments, then many small improvements could be a big improvement for a patient. This is why I'm annoyed when things are dismissed for "only" improving a patient's pain score by 1 point on a 1-10 scale. If you can stack 3 of those it's a huge QOL improvement.


Could there be a "garbage in, garbage out" effect, wherein effect size calculations are only as informative as symptom measurements?

Somewhat related, in the comments of a previous post, I semi-seriously challenged you to forecast the effect sizes and response rates of the hopefully-published-this-year phase III intranasal s-ketamine, racemic ketamine, and r-ketamine monotherapy trials - how well do you think you'd be able to forecast them?


"If standard deviation is very high, this artificially lowers effect size."

I'm just here to say that the point of an effect size is to get a standardized estimate of the effect. ***Dividing by the SD is a feature, not a bug.*** For all the other stuff (ITT, treatment effect heterogeneity, etc.), as with all analyses, GIGO (also I don't strictly mean "garbage" for all research considerations, more precise would be that it limits either the generalizability or specificity, either way reducing usefulness).


Facile analogy:

Levi 501 jeans are a pretty good cure for lower-body nakedness. Say they come in 50 waist/leg combinations and we have no insight into matching combo to patient. So for any given combo which we try on 50 random patients it will fit say 3, do at a pinch for another 7 and be useless for 40. Contrariwise, a large towel which can be wrapped round the waist will be an inferior cure but will sort of work for everyone, and will score better in tests than any given levis combo, and therefore than levis in general.
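Under assumed utility scores (my numbers, not the commenter's), the arithmetic behind the analogy looks like this:

```python
# Assumed scoring: perfect fit = 10, "do at a pinch" = 5, useless = 0;
# the towel scores a uniform, mediocre 6 for everyone.
jeans_mean = (3 * 10 + 7 * 5 + 40 * 0) / 50  # one combo, 50 random patients
towel_mean = 6.0
print(jeans_mean, towel_mean)  # 1.3 6.0 -> the inferior-for-everyone option wins on average
```

The average hides that the right jeans combo is a complete cure for the 3 people it fits.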


"There’s no anti-ibuprofen lobby trying to rile people up about NSAIDs, so nobody’s pointed out that this is “clinically insignificant”. But by traditional standards, it is!"

It seems that in diseases where symptoms are relatively distinct and measurable, standards for evaluating efficacy are somewhat relaxed. Whereas in disorders where symptoms are fluid and difficult to quantify, standards for evaluating the efficacy of drugs are high. This would seem to make some sense if we consider the fact that it's easier to make false effect claims in the latter than the former case. Hence, the intuitive upping of the acceptance threshold.

Jun 3·edited Jun 3

Leucht et al.'s finding raises the question of trusting the observations of individual physicians. The number of patients any individual doctor sees is clearly too small to see consistent effects for many treatments; otherwise there wouldn't be the need for much larger studies (leaving aside the benefits of eliminating bias in RCTs)! I'd imagine many physicians probably use a kind of null statistical hypothesis testing approach in their assessment, which will have a null hypothesis of zero that needs to be rejected by strong evidence to the contrary. Some doctors will see solid effects while many will see zero to "small" effects around 0.3 and for the majority, a small real effect could be swallowed in the noise with most physicians being able to reasonably argue "I see no reliable effect" even if there is a small non-zero effect. That said, we probably don't want physicians to stop trusting the evidence of their eyes, which could be dependent on factors present in that physician's local patient population. What's the right balance?

Are there tools that provide physicians easy ways (i.e., easier than running your own R models, which few non-academic doctors will do) to track their own patient outcomes and to test whether their observed outcomes are in line with expected effect sizes? This could give physicians a sense of "my patients tend to generally do better or worse than the reported effects," possibly indicating problems with prescription preferences or patient population differences from the general population. Do many physicians *quantitatively* track their outcomes across patients? I'd guess many don't.

Going along with the question of how easily a doctor could detect effects in a patient, how strong does the evidence need to be to justify keeping an individual patient on a given SSRI vs. trying another one? Not being familiar with psychiatric guidelines, I'd be curious to know what the recommendations are.
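The first paragraph's intuition checks out with a standard back-of-envelope power calculation (Lehr's rule of thumb, not specific to psychiatry): an individual physician's caseload is far too small to distinguish a d = 0.3 effect from noise.

```python
# Lehr's rule of thumb: n per group ~ 16 / d^2 for 80% power at two-sided
# alpha = 0.05 in a two-group comparison. Back-of-envelope only.
def n_per_group(d):
    return 16 / d ** 2

print(round(n_per_group(0.3)))  # ~178 patients per arm to reliably detect d = 0.3
print(round(n_per_group(0.8)))  # 25: large effects can surface at clinic scale
```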


Many thanks for your post. Statistics are wonderful things, but ultimately, are the drugs worth having, especially if the trials are sponsored by big pharma?

I have had a few medications myself. Did you know that 'medication' is an anagram of 'decimation'?


I have looked at pharmaceuticals. 'pharmaceutical' is an anagram of 'uh a malpractice'.


Given these words and big pharma's record to date I am hardly surprised.


I did read one of your links. Yes, doctors are not as knowledgeable as they could be about diet. They still know far more about diet than the average person on the street, though. You seem to be conflating individual mistakes, which are common, with bad research. There are meta analyses showing that masks reduce transmission from infected individuals to healthy ones.

You're throwing out quite a lot of evidence created by tens or hundreds of thousands of people in different countries. Not even the Soviets rejected germ theory. They used bacteriophage to treat typhus in their army.
