IIRC it’s very common for depressed patients to have to try several medications before they find one that works? If you had 5 drugs that each “cured” 20% of patients, running a study of each against a placebo would show little impact. That would be the case even if cycling a patient through all five would cure everyone.
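A toy simulation (all numbers invented: HAM-D-like severity scores, each drug fully fixing a random 20% of patients) shows how unimpressive each drug looks on its own, even though cycling through the set would cure nearly everybody:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                           # patients per arm (hypothetical trial)
placebo = rng.normal(22, 4, n)     # invented HAM-D-like severity scores
drug = rng.normal(22, 4, n)
drug[rng.random(n) < 0.20] -= 10   # each drug fully helps a random 20%

pooled_sd = np.sqrt((placebo.var(ddof=1) + drug.var(ddof=1)) / 2)
d = (placebo.mean() - drug.mean()) / pooled_sd
print(f"Cohen's d for one drug vs placebo: {d:.2f}")   # ~0.4, unimpressive

# If the five drugs cure disjoint 20% slices, cycling cures everyone;
# even if each helps an independent 20%, cycling reaches about 67%:
print(f"cured after trying all five (independent case): {1 - 0.8**5:.0%}")
```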
Wouldn't the order in which the meds were tried have a big effect on that? You could get unlucky and the 5th med you take is the "right" one while someone else takes it the first time and it works.
Prior expectations with the potential to drive (variable/inconsistent) placebo effects could be reduced after a few failures to find a successful cure...
Yes. And we're talking about something that tends to be quite chronic and in many cases is not an emergency. So, assuming you're a patient who is being treated willingly in an office, as long as the side effects aren't too severe, being able to go through a bunch and find the one that works for you is a lot better than not having that option because each of them is too unlikely to help. (The costs are higher if coercion, hospitalization, or both are involved.)
Moreover, as Scott noted, the placebo effect is strong. If a drug chemically helps 20% or even 10% of patients, that's enough hope to get a placebo effect. And given that we're talking about depression, a placebo effect may be just as good as a chemical effect.
I recall a joke, which may have some basis in fact, that the requirement for double-blind studies prevents the marketing and sale of quite a few very effective placebos.
I mean, this is kind of the kicker, especially for SSRIs. I know several people for whom either the sex drive effects or weight gain effects (especially for afabs, many of whom also have weight-related dysmorphia as a component of their depression) made them voice a preference for depression over the side effects.
Given that a placebo effect is so strong, one might wonder if psychiatrist-prescribed-literal-placebo with zero side effects might be worth trying for some patients. (Which would also help rule out nocebo side effects?)
You could tell the patient that they're getting a placebo - IIRC some studies have found that the placebo effect still works even if the patient knows they're getting a placebo, which seems kinda wild.
Does it mean "affects the patient's symptoms"? Does it mean "maximally affect the patient's symptoms?" Does it mean "maximally affect the patient's symptoms and produce a minimum of unwanted side effects?"
Excellent point! I don't know the answer, but it prompts me to wonder about a related question: If cycling patients through several candidate medications stops at, perhaps, a good-enough drug, how often does clinical practice stop prematurely, missing a better drug for that patient?
At a guess, all the time. I suffer terribly from depressive episodes. I have hit on a drug therapy which sort of fixes them but leaves me (I suspect) depressed by most people's standards - as in, I no longer spend all day thinking of ways of killing myself, but I never lose the belief that it would be better if I had not been born. State 2 is so much preferable to state 1, and so many drug therapies have been completely ineffective, that I am not prepared to risk relinquishing the existing therapy for a possibly better one.
Yep, you see the same thing with other drug categories, including allergy medications (some percentage of the population responding to each of Allegra, Zyrtec, and Claritin, and some not). As Scott notes, it's only depression meds that attract such controversy and ire for working this way.
I've long believed that the easiest way to trick people is to post a number without context.
"Dihydrogen monoxide is LIFE-CHANGING! It has an effect size of .30!"
"Dihydrogen monoxide is WORTHLESS! It has an effect size of only .30!"
...to an uneducated layperson, both of those statements look pretty persuasive!
(It reminds me of the Fukushima disaster, when the public suddenly had to make sense of a bunch of unfamiliar nuclear physics jargon. I remember a friend saying, "if you'd told me last week that bottled water had a high becquerel rating, I would have thought it was a good thing.")
Yes, completely agree. A number can be thrown in to add a sense of authoritative reasoning to what is purely an emotional statement. You'll often see words just like that ("life-changing", "worthless"), either as-is or as selective quotes ("Scientists say..."), and if you want to really get buy-in to your campaign or whatever, it's emotion and not facts that get people's attention. Before the poor person even has time to determine whether or not the statement has any basis in fact, ping, the emotion takes over, and now you're part of this tribe on a righteous quest, and all you need to know is that the baddies are in that direction.
The thing that used to get me was the difference between aerosols and droplets. I assumed it was very significant based on context clues, the fact that there even was a distinction. But macroscopic sand particles can float hundreds of miles on air currents, so there are really few particles so heavy that they can't possibly float through the air in some quantity given the right circumstances. There was a similar issue with perfumes and colognes. Some molecules were labeled 'too heavy to fly.' That tended to be misleading.
That is reminiscent of the debate over Covid and mask-wearing. Skeptics would scoff that a virus on its own, being so small, would float freely through a standard mask like a bumble bee flying through an orchard. But as viruses are usually carried in water droplets, which are significantly larger and possibly charged, masks were hopefully more effective at blocking those, in either direction.
True. And masks did seem to be pretty good at blocking outgoing particles for exactly that reason. They could *potentially* be used to block incoming particles, but people's mask-related hygiene tends to be pretty awful without training, to the point that benefits against pathogens in general in some studies (not covid specifically) ranged from slightly positive to harmful, though meta-analyses showed a slight benefit to the wearer. People break the seal with their finger, fold masks up and put them in their pockets, wear them on their chin then put them back over their mouths, pressing pathogenic particles against their mouth, etc.
I had particularly encountered the issue of aerosols and droplets with face shields as replacements or supplements for masks. Face shields slowed the spread of droplets and helped protect the wearer's eyes, but droplets and aerosols eventually diffused around the shield.
It's a pity that face shields weren't more helpful in preventing spread from infected individuals to others because face shields were more comfortable and didn't hide a person's face. There were people who were comfortable with shields who didn't like masks.
Masks do, demonstrably, protect others if the wearer is sick.
Face shields are worn by some medical professionals so they're not completely useless. If nothing else, they make it less likely you'll get something in your eye. (So will glasses, to some extent.) But they may not make for good cost-benefit in normal life. And shields are not a replacement for masks. And yes, some people are very self conscious about wearing them, and for good reason, but that wasn't part of the question for me.
Masks reduce dust, sand and diesel smuts for example but will not protect someone from the 'flu. This is because the 'flu is an internal poisoning and not some microscopic mutating bug.
Face shields will reduce the likelihood of getting dust or flies in one's eyes, for example, and I will wear safety glasses or sunglasses if necessary while cycling. A face shield is good for arc welding and disc cutting.
But they are plain stupid for the 'flu. Many medical professionals are among the stupidest people on the planet today as by and large they do not understand what the 'flu is.
I had to research this properly to find out what was really going on. I was undergoing immuno-therapy in 2020 with foolish mask-wearing nurses, and one even had a mask and a face shield to attend to a lady's feet whilst the lady was having her chemo or immuno-therapy.
I never wore a mask because some idiot doctor or nurse told me I must because of COVID 19 which is the 'flu re-branded to make big pharma etc more money.
I went through my treatment, including April and May 2020 when the supposed crisis was at its peak, so when the NHS started saying masks were required I knew it was rubbish. In fact the whole thing was guidance as per the gov.uk website, but most people didn't check.
Masks will help against sand, dust and diesel fumes to a degree but not against the 'flu or COVID 19 as it is now known after re-branding to help big pharma etc. make more money.
The 'flu is an internal poisoning as I explain here.
Virology is fundamentally flawed and what is called the virus is in fact the exosome part of the body's defense system so friend not foe. Sub-links on this etc. in my Covid 19 Summary link within post above.
This is a big problem with studies of infertility. If you have a big pile of people with infertility/miscarriages and you suspect you have a big mix of underlying, poorly understood problems, how good are you going to be at detecting something that makes a big difference for a medium subset of that grab bag?
I strongly suspect that progesterone support in the luteal phase and beyond might make a difference for women whose main problem is low progesterone. But I also expect that low progesterone is an indicator for other problems which aren't fixed by treating the progesterone. (I went on progesterone injections for a week or so for a baby who turned out to be ectopic).
So how do you check whether you're making a difference to enough women to recommend the pills or shots when the numbers are low and you don't know why?
ACX wrote: "Ibuprofen (“Advil”, “Motrin”) has effect sizes between from about 0.20 (for surgical pain) to 0.42 (for arthritis)."
I'm a pretty healthy retired person. The only "medication" I take regularly is a daily multi-vitamin.
I want to point out that ibuprofen works like a Magic Bullet whenever I need an analgesic. I haven't had any pain in the last 20 years that ibuprofen didn't relieve (and quickly). So I'm very surprised by ibuprofen's low effect size.
This. I'd expect that patients with severe pain, like surgical pain, would consistently report that ibuprofen reduced their pain somewhat but not enough.
I note that this is a different *kind* of low effect size than antidepressants, where the anecdotal report is that a minority of patients report large improvement while the majority report no effect.
Yes, I believe that tylenol and ibuprofen are effective at reducing the needed dose of opiates in severe pain even if they are inadequate by themselves.
Evidently my brevity caused you to arrive at an incorrect conclusion about why I use ibuprofen. To be more specific, I don't use ibuprofen for headaches, muscle aches, joint pain, hangovers, or what-have-you because those very rarely occur to me. I didn't even take ibuprofen during my mild case of COVID last summer, despite a fevered day & a half.
The most common times I've taken ibuprofen in the last ~20 years has been after oral surgery. Does that count as surgical pain? I'd say so. I had a wisdom tooth removed in a bloody extraction (crossed roots) nearly 20 years ago, and 400 mg of ibuprofen handled the post-op pain very nicely. (I passed on the oral surgeon's offer of a scrip for Percocet.)
More recently, I've had 3 dental implants placed in the last 5-6 years. The tissue damage was usually slight in those procedures but, still, the periodontist extracted teeth and then drilled holes in my maxilla. When the doctor asked what I preferred and when I told him ibuprofen, he wrote a scrip for 600 mg tablets. One of those after each of his treatments and I was always good to go.
Still standing by my surprise at that 0.2 effect size, given my specific uses.
In my experience, postsurgical pain from oral surgery is drastically less intense and disabling than postsurgical pain from abdominal surgery. It's more comparable to nonsurgical acute pain like a broken bone or a bad scrape than to having your internal organs rearranged.
I personally prefer NSAIDs because I hate the way opioids make me feel, so I'm not disputing your overall point that ibuprofen is effective. But the day after an extraction or a dental implant, I can pop 2 ibuprofen and be good to go about my normal activities. The day after a bowel resection, prescription-strength IM Toradol barely touched the pain; I could barely walk 10 steps to the bathroom in my hospital room. On day 3, when I still wasn't going for walks or making progress on my lung capacity exercises, my doctor gave my Dilaudid clicky button to my boyfriend and told him to dose me when I looked like I was in pain. As much as I hated it, I have to admit it was more effective.
I agree with your level-of-pain comparison for abdominal vs. dental surgery. My wife's been through a bowel resection surgery, and even a laparoscopic procedure (like hers was) is no picnic. Anyone would prefer dental surgery to that experience.
If we look at the discussion below, the following question comes to mind: how many people claiming that ibuprofen works like a magic bullet actually experience only a placebo effect?
I am not doubting your story. I am just saying that there is a chance that for many people who claim that a medicine works, ibuprofen included, the benefit is due to the placebo effect.
Today someone wrote that he felt cheated that the pharmacist in Latvia sold him a homeopathic product for conjunctivitis. Many people defended the pharmacist by saying that for them this homeopathic product worked wonderfully, quickly clearing the infection. How can we reconcile their stories with the fact that the homeopathic product is basically water? Conjunctivitis tends to resolve by itself in most cases in a few days, with or without antibiotics. But the beliefs that people have are much harder to change.
Well said. What is in effect correlation is not causation. I now consider dehydration more of an issue and neuro-toxic drugs like ibuprofen pointless. If they are taken with a glass of water, then who's to say the water isn't the cure.
Big pharma won't like people saying that as it is bad for its business.
Maybe this is a problem of the tails coming apart. The tests for depression are measuring a variety of correlates. The baseline for each of the correlates may be higher than we as a society would prefer to admit. If a medication brings down one or two of the correlated variables to baseline, that still doesn't result in a depression cure according to a test that measures other things also.
To me, the height example indicates that effect-size isn't measuring what we want to measure. What we actually want to measure is whether a drug meets our expectations. E.g. 3 inches from wearing heels is "small" when compared to total height, but large when compared to our expectations. In the case of drugs, the "total-effect" is comprised of 10,000 factors on your mental state, one of which is a pill.
Unfortunately, I can't think of an easy, rigorous way to operationalize "measure against expectations". Except to maybe just contextualize effect size against other similar studies.
Have you tried taking the Hamilton? What if a pill makes it easier for me to stay asleep but makes it harder to stop perseverating about killing myself? What if it increases my tearfulness but also increases my appetite?
I'm not saying Split Tails isn't true. I'm wondering if Split Tails is the appropriate level of analysis.
In the least convenient world, suppose that the correlates for a subpopulation in Shambhala are perfectly correlated. Does this resolve the paradox? I.e. do the effect-sizes of anti-depressants match Scott's intuition?
I'm leaning toward "no". Because however correlated the correlates, the effect of the medication on mood might still be swamped by the noise of other factors. E.g. what the subject had for breakfast, what the weather is like, whether they have a meeting scheduled at work, what rush-hour traffic was like, etc. For medication to have an effect that's noticeable when judged against "total effect", it would need to be strong enough to overwrite other factors. And I sort of doubt that Alexander's "Terrible, Horrible, No Good, Very Bad Day" can go from melancholic to ecstatic in the absence of recreational drugs.
What I'm suggesting is: if we were expecting SSRI's to be able to swing up to 100% of our mood, maybe our expectations were unrealistic. Or if we *demand* meds that can swing mood up to 100%, maybe recreational drugs like MDMA and psilocybin shouldn't be illicit. Of course, few doctors are willing to prescribe these. But then it should be no surprise when the effect-sizes of "meds that don't rock the boat" seem low.
I think it is astonishing that millions of people have taken these drugs, that are supposed to affect something as significant and obvious as mood, and there is a debate about whether they have any effect at all. So my inclination is to suspect they do indeed have little effect. (This is also "supported"/prejudiced by personal experience/anecdote.)
The other thing that I draw from this is that statistics is hard, and very few people know how to do it properly.
I think there's a debate because millions of people who have taken them are common-sensically very sure they have an effect (including me!) but the studies kept showing they didn't.
Well, SSRIs have been used extensively for 40 years and the suicide rate today is HIGHER than it was back then. If psychiatry continues to use the same drugs, why are the results going to change? Maybe psychedelic medicine could be more successful. Ketamine is widely used (still no endorsement from the APA in spite of the failure of the monoamines). I read that MDMA may be approved soon (why it hasn't been approved yet, considering the suicide and overdose catastrophe, escapes me).
No RCT ever has shown a reduction in suicide completion.
I think there was one medication in a Finnish study that showed an actual reduction in suicide attempts but sadly also showed an increased number of suicide completions!
But suicide completion by depressed people is such a rare event that you would have to have a huge number of subjects to catch a reduction. Lifetime prevalence of suicide completion is about 4% (higher for men, lower for women). If we assume the average life span of a depressed person is 50 years, and the study followed people for 5 years, then we'd expect suicide completion in 0.4% of the sample. And you have to have quite large numbers of subjects to get enough suicides in the treated and untreated groups to statistically compare them. If you have 1000 subjects in the treated group and another thousand in the placebo group, you can expect about 4 suicides in each. Even differences that look notable to the naked eye (say 2 in treated, 5 in untreated) are not going to reach statistical significance.
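For concreteness, here is that 2-versus-5 comparison run through a standard Fisher exact test (two-sided; the counts are the illustrative ones above):

```python
from scipy.stats import fisher_exact

n = 1000                              # subjects per arm, as above
treated, untreated = 2, 5             # suicide completions (illustrative)
table = [[treated, n - treated],
         [untreated, n - untreated]]
odds_ratio, p = fisher_exact(table)   # two-sided by default
print(f"p = {p:.2f}")                 # ~0.45, nowhere near significance
```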
Stating the unstated here, the popular theory for why this happens is that the antidepressant works, or has a placebo effect that mimics the same, which gives the person greater physical energy and self-efficacy. This is what they need to get up the resolve to successfully attempt suicide in their still-depressed state.
What a "rare" event is depends on your ability to distinguish it. If you can analyze the decays of a million muons per hour, 10^-7 might not be particularly rare. However, humans are larger than muons (citation needed), and the logistics of medicating a million humans per hour and studying their life outcomes are significantly more challenging.
Suicide is rare but overdose deaths are not, and there is a very thin line between the two. And overdose deaths have never been higher. The monoamine hypothesis has failed, it is time for something new.
Aren't you ignoring at least one other explanation for why overdose deaths have never been higher? (i.e. that fentanyl is much easier to overdose on accidentally than its forebears.)
Two points: first, how many studies have suicide as a primary outcome? Second, there is a sad but actually quite persuasive argument that there is such a thing as being too depressed to attempt, and/or too depressed to succeed in, suicide. So an *increase* in suicide in the early stages reflects the AD improving the patient's capacity for goal-directed activity.
Is there any evidence to support the speculation that increased suicide attempts are a result of improvement in depression? I don't find that argument persuasive at all.
If "too depressed to attempt suicide" is a thing then one should be able to identify and predict that behavioral milieu. Perhaps you have research to support this, which I'd be happy to read.
No research, but from personal experience things like agency and initiative improve coming out of depression, well before affect. So it seems plausible to me.
Speaking as a depressed non-suicidal individual, I do not think that measuring the effect of depression in suicides is a good idea. I guess I would trade ten years of depressed life for eight years of non-depressed life, which would indicate a loss of 0.2 QALY per year due to depression. That might add up to be a significant amount compared to the expected QALYs lost due to suicide.
Also, suicide rates feel way easier to Goodhart than depression rates. If you tell people that they will go to hell for suicide, but to heaven if they die during a crusade, and have a high-risk crusade ongoing, that should be sufficient.
Thank you for your reply. To me the problem with this Danish study is that drugs like ibuprofen, Ambien, Ritalin, and benzodiazepines seem to have indisputable effects. And it seems that all psychoactive drugs with indisputable effects are controlled substances (more controlled than ordinary prescription drugs). There's a huge need for doctors to be able to "do something" with the many patients who come complaining of depression, and SSRI's are basically harmless (in the eyes of most medical professionals) pills that can be handed out to nearly anyone. I mean, they are worth trying, right? (And that is basically my opinion of them, and in any case I don't want to influence people not to try treatments, and this is not medical advice, etc.)
I am a doctor too. I don't work in a field where I prescribe drugs to people any more, but I worked in general medicine for a while. I guess the little clinical experience I have from patient reports suggests SSRI's have mild effects (but on the other hand, people tend to come to professionals when they are at their worst, and depression is cyclical, and then there is the placebo effect, and the pressure people feel to say something helped a bit). So I don't know.
I have also taken them for depression myself, and they didn't have any appreciable effect on my mood. I would frankly expect them to have stronger effects on non-depressed individuals as well, if they are actually effective drugs.
Again, this is just the direction my own thinking has taken me.
Not sure why you think ibuprofen has an "indisputable" effect and antidepressants don't. Some people don't respond at all to ibuprofen. Others respond great even though it turns out they were in the placebo group. You get the same muddled results with antidepressants, where some people notice an "indisputable" effect and others don't see much improvement. What's the difference?
Well, there is no debate about whether or not ibuprofen is an effective mild painkiller. (And it also lowers fever, which can be measured.) Another thing is that ibuprofen works pretty soon after it is taken, whereas antidepressants have this odd lag of a few weeks.
When you say "some people don't respond to ibuprofen", do you mean there are people who don't respond at all for any condition at all? The percentage of such people would seem to be low, compared to the number of people who don't respond on antidepressants.
> When you say "some people don't respond to ibuprofen", do you mean there are people who don't respond at all for any condition at all?
You can't possibly study that, because you can't give someone every possible condition and then see which ones ibuprofen helps with. Studies will usually just look at the efficacy of ibuprofen for a single condition.
> The outcome preferred by the International Headache Society (IHS) is being pain free after two hours. This outcome was reported by 23 in 100 people taking ibuprofen 400 mg, and in 16 out of 100 taking placebo. The result was statistically significant, but only 7 people (23 minus 16) in 100 benefited specifically because of ibuprofen 400 mg.
> The IHS also suggests a range of other outcomes, but few were reported consistently enough for them to be used. People with pain value an outcome of having no worse than mild pain, but this was not reported by any study.
So it helped 7% of people lose their headache within two hours.
I note that these measures would ignore a reduction from "severe" to "moderate" pain; I don't think you can take that measure and say that everyone who didn't have their pain totally removed is a non-responder to the drug.
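For what it's worth, translating that 7% into the number-needed-to-treat framing clinicians often use:

```python
drug_rate, placebo_rate = 0.23, 0.16   # pain-free at 2h, per the quote above
arr = drug_rate - placebo_rate         # absolute risk reduction: 0.07
print(f"NNT = {1 / arr:.0f}")          # ~14 people treated per extra success
```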
For me codeine has very little effect. In the past it used to be given to children, and sometimes children reported that it didn't work, and doctors tended to disbelieve them. Now we know that a certain population (actually a large share in some countries) are slow codeine metabolizers in the pathway that is needed for codeine to work.
Anecdatal evidence, but I've noticed that there are some headaches I get which ibuprofen solves nicely, and others which ibuprofen does basically nothing for (or which I pattern-match to the no-solution type and therefore don't take anything for). My wife is like this also, but I'm not sure I could consciously articulate what the difference is between the two categories. Migraines I think fall into the latter.
But if placebos are also pretty effective, how much of that perceived effect is itself placebo? In other words, are you very sure that (SSRI effect - placebo effect) is significant?
Of course, one can argue that, clinically, we don't ultimately care about (drug - placebo), just about (drug), because if the patient gets better, mission accomplished. But it still leaves open the question about the drug.
(drug) has lots of nasty side effects, and (Placebo Pills TM) have the advantage that it's fairly easy to have them *not* have lots of nasty side effects, so if all the benefit is placebo it would be a real and meaningful harm reduction to prescribe 'homeopathic' versions of the drugs
Unless the side effects are driving the placebo effect. Something like "feel bad" -> "take pill" -> "feel side effects" -> "pill must be working" -> "feel better through thinking the pill must be working rather than because the pill is actually working"
But the placebo half of these studies are (presumably?) not getting a placebo with side effects. Even if the effect of the SSRI is entirely the sort of placebo effect you are describing, the net result should be “better than placebo”.
ETA: Or, if they *are* using placebos with superficial side effects, like a bitter taste or whatever, and this is why the SSRI does not do appreciably better, then Thor is right and we should be studying different placebos.
Regarding the placebo effect, how do you explain that many people find that one SSRI helps them and another doesn't? It seems that the placebo effect would be equal in this scenario.
This isn't much of a puzzle. Both placebo effects and SSRI responses are highly individual & variable, with different subpopulations doing different things. At the overall population level (say, in a large RCT), it can *both* be true that placebo effects are a large portion of the response *and* particular SSRIs are effective for particular subpopulations.
But placebos have an effect too. And they (the placebos) might have even more effect if they tasted bitter (unless the placebo tastes like the proposed treatment, it may not be a blind test).
It's always a benefit-and-risk calculus.
Placebos have benefits but nearly zero risk, while many commonly prescribed Rxs have small benefits but some risk.
I don't think there is one RCT that shows any antidepressant or anti-anxiety medication actually reduces suicide completions.
So there is a question of the proper endpoint, especially when the effect size for the non-suicide endpoint is so small while the risk is still there.
An Rx that prevents suicides but has little or no effect on feelings of depression is better than an Rx that has no effect on suicides but some effect on feelings of depression.
I disagree with that last sentence, actually. For values of 'some' it might be true, trivially, but "feelings of depression" are a real QoL destroyer and curing those feelings is very valuable. Especially if the drug that 'cures' suicides is removing ability rather than desire to seek death, all it's really doing is buying time for other treatments to fix the depression.
(Stopping suicides no more cures depression than giving a person a wheelchair cures paralysis; it's a meaningful symptom being treated but it is far from the whole issue)
QoL is a pretty squishy thing. It is not well operationalized.
Should we seek to increase QoL? Of course. But increasing QoL without preventing suicide (and some of these medications in some situations increase suicides) is no good.
This seems obviously wrong? Both preventing suicides and increasing QoL are desirable endpoints, and if there's a trade-off between them it gets more complicated, but a drug that takes a population of 1000 depressed people and makes 500 of them no longer depressed without at all changing the suicide rate is a very useful drug.
I think there is a world of a difference between "having a small but significant effect size" (SSRI) and "having no measurable effect size" (homeopathy).
I went through a bad depressive spell last summer during my divorce. I could not shake it, and it was affecting my life, work, parenting, etc.
SSRI's are the only thing that allowed me to pull myself and my life together to start making changes.
As you indicate, and as a result, my bar for being convinced that these poorly understood medications do not actually help (and did not help me more than a placebo) is comically high. I'm heavily inclined to believe such a study was flawed or missed something; the complexity and specificity of most of the studies I've looked into does nothing to dissuade me from this view.
The conventional advice for medicating depression is "try a bunch of things until you find something that works."
If we take this to be a good algorithm, it immediately suggests that depression in the population is caused by a mixture of different upstream conditions, and different medications target different ones.
If any 1 drug only targets at most 10% of depression cases, you're always going to see low effect sizes even when the drugs are extraordinarily effective for the people they are suited for.
So, the debate is about an "average effect" (conditional on showing up in a psych office), and we can easily imagine models of the world where good drugs have "little effect" on the population. Doesn't imply that they're useless.
The closer a drug gets to treating the condition directly, vs. upstream causes, the larger its apparent avg effect will be. Ibuprofen seems to be such a case: regardless of the cause, it will help with pain and inflammation.
Not sure this is a good algorithm. People take many herbal remedies that do nothing or close to nothing. Which is not to say that all herbal remedies are useless but many are.
Another example - many bodybuilders take many supplements that do either nothing or almost nothing.
So if you were to measure the effect on depression as measured by HAM-D + 2d6, then Cohen's d would be lower because the averages are the same but the SD is bigger? I.e., effect size is a measure of both how powerful the intervention is and how good we are at measuring it, and we are bad at measuring depression.
Yeah, that's my understanding. That kind of makes sense though. As we get worse at measuring, the measured effect size goes down. In the extreme, if we're maximally bad at measuring such that all measurements are completely random, the effect size goes to zero.
This is partly what the Danish team in the article is suggesting: that a different measure of depression could increase effect sizes.
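A minimal sketch of that dilution, under invented numbers: a fixed true drug effect gets measured with increasing amounts of instrument noise (the HAM-D + 2d6 idea taken literally), and the measured Cohen's d shrinks accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
placebo = rng.normal(20, 5, n)   # invented "true" depression scores
drug = rng.normal(17, 5, n)      # true improvement of 3 points (true d = 0.6)

def measured_d(a, b, noise_sd):
    a = a + rng.normal(0, noise_sd, n)   # instrument noise on both arms
    b = b + rng.normal(0, noise_sd, n)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

for noise in [0, 5, 15]:
    print(f"noise sd {noise:>2}: measured d = {measured_d(placebo, drug, noise):.2f}")
# falls from ~0.60 to ~0.42 to ~0.19 as the instrument gets worse
```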
Effect size gives a somewhat better idea of whether an effect is large than a simple percent change. Imagine if we found that people in country A are on average 2.3 inches taller than country B. We could report that as people in country A are 3% taller on average. Is 3% a lot? It depends on what you're studying. For height, 3% is a pretty big difference - roughly the difference in height between white male Americans and males of the tallest countries in the world (Netherlands, Germany). In a lot of other contexts, 3% is a very small difference.
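To make the contrast explicit, here is the same 2.3-inch difference expressed both ways, assuming a height SD of about 3 inches (my assumption, not a figure from the comment):

```python
mean_height, sd_height = 69.0, 3.0   # assumed adult male height stats (inches)
diff = 2.3
print(f"percent difference: {diff / mean_height:.1%}")  # ~3.3%, sounds tiny
print(f"Cohen's d:          {diff / sd_height:.2f}")    # ~0.77, a large effect
```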
I used to have a similar objection when I'd see stats like, "ambien reduces sleep latency by an average of 15 minutes".
And, it was like... who are you measuring that on?
Normal people fall asleep within 10-20 minutes, so it couldn't possibly make a healthy person fall asleep faster than that.
And everyone else is going to have a wide range of responses. My natural sleep latency varies between 2 and 6 hours. On ambien, it went down to 10-20 minutes. That's a pretty big difference in terms of, say, being able to hold down a job.
If only 10% of people react that way to the drug, it might still be highly useful for some, regardless of what the average effect is.
The value of a drug can depend highly on the response at the tail of the distribution.
That's how I used to think about this stuff, anyways.
Years later, I eventually became dependent on sleeping pills. When I stopped taking them, I went through a horrific withdrawal experience that has lasted 18 months, so far (lunesta is what ended up harming me, not ambien).
I've since gotten involved in recovery communities and spoken to hundreds of people who've been badly injured by psychiatric drugs. Mostly by benzos, but I've met some who've had terrible withdrawal from antidepressants.
That makes me equally skeptical of any studies that might have been done on withdrawal.
I mean, I'm not sure that drug makers do much research on getting people off of drugs, in general. But, to the extent that they do, they're probably going to make a study that says something like, "the average person has mild discontinuation symptoms that last 4 weeks".
That might even be true, on average, but the worst case outcome may be years of suffering, with a possibility of permanent damage.
Both the benefits and risks of drugs are probably seen best in the tail of the distributions, and trying to explain anything about drugs in terms of mean effect sizes can be misleading.
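A sketch of that tail argument with made-up latency numbers: the population-average improvement looks like the modest figure in the ads, while the 10% of severe insomniacs gain hours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
severe = rng.random(n) < 0.10                      # 10% severe insomniacs
before = np.where(severe, rng.normal(240, 60, n),  # hours-long latency
                  rng.normal(15, 5, n))            # normal sleepers
after = np.where(severe, rng.normal(20, 5, n),     # drug works for the severe
                 rng.normal(15, 5, n))             # others have no room to improve

print(f"mean reduction, whole sample: {np.mean(before - after):.0f} min")   # ~22
print(f"mean reduction, severe cases: "
      f"{np.mean(before[severe] - after[severe]):.0f} min")                 # ~220
```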
I don't believe the effect size of 0.11 for antihistamines and sneezing. I notice that it was published in 1998, in the pre-fexofenadine age (though other excellent second-generation antihistamines existed). I want effect sizes for Allegra on seasonal allergy symptoms, stat.
I'm a statistician who's done a fair bit of professional work on test design and effects estimation (in a context different from clinical trials). I don't think I've ever found it very useful to normalize the size of an effect to its standard deviation when trying to understand the importance of that effect post hoc. You can use things like signal-to-noise to plan tests, but afterwards, you mostly care about two things:
1. Did the factor (or in this case, intervention/medication) affect the outcome I care about?
2. Do I expect to see that same effect in the future? (In this case, if I give that medication to other folks in the future)
You can round statistical significance up to the second one, but I read this post as being about the first thing. There is no coherent, generalizable definition. An effect size matters if it matters. Expertise and context are very important here. Describing an effect as "large" or "small" within the context of a particular ailment or class of drugs might make sense, but I think I agree with Cuijpers et al. when they say "Statistical outcomes cannot be equated with clinical relevance."
Going a little deeper on that Cuijpers et al. (the 0.5 people Scott cited above), they actually seem to suggest that 0.24 is the more clinically-relevant effect size. Their approach of identifying a clinically-relevant difference and suggesting that as a cutoff for whether the intervention is useful is consistent with what I've recommended in the past and sounds extremely similar to what Scott is suggesting when he says, "I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience."
Last point: When you're talking heterogeneous effects (as in the Danish team's second simulation), these notions pretty much go out the window. The average effect of a drug shouldn't be what we measure it on or how we think about prescribing it. We should probably instead think about the likelihood that the drug will have a clinically-significant effect on a given patient, or something along those lines. That's not something that will be captured by looking at the 'effect size' in cases where the drug affects some people completely differently than others.
Normalizing effect size is meaningless for calculating utility, but it provides insight on the effect size of unknown confounders. If a normalized effect size is, say, less than 0.1, that could be caused by improper blinding and researcher biases creeping in, without outright fraud. If d > 2, that's no longer a reasonable hypothesis.
In this case it might make more sense to normalize the effect by the difference between the average score of the depressed vs the healthy population. If it gets the average patient halfway to healthy, for example, that seems like it would be clinically relevant.
Absolute effect sizes, not standardized effect sizes, are what you're interested in. We need much more discussion of how big the effect is in those terms (and the necessary associated discussion of how to come up with a rough estimate of how much absolute change on these questionnaires is clinically significant).
Patients want to know how much less depressed a medication will make them. They don't care about whether the drug will improve them by a factor greater than some fraction of the standard deviation of the spread of depression scores in the population!
"Patients want to know how much less depressed a medication will make them. They don't care about whether the drug will improve them by a factor greater than some fraction of the standard deviation of the spread of depression scores in the population!"
Agreed. I'd guess that, ideally, one would want to compare how much less depressed a medication will make them with the disadvantages to them of that particular medication - presumably mostly side effects.
How much of this is just a side-effect of trying to collapse a whole distribution of patients into a single "effect size" score? I'm not a statistician, but it seems plausible that something that works *really well* but only for a few people, and something that kinda works for most people, could conceivably end up with the same score, even though those are very different in practical use. Isn't what we're really looking for something more like the effect-magnitude-vs-patients-affected curve like in the first two charts here? Boiling that down into a single number not only glosses over a lot of information, it makes it far easier to demand "make the number higher" without knowing whether that is even mathematically possible.
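That plausibility is easy to check numerically. Under invented score distributions, a drug that shifts everyone a little and a drug that shifts a random 20% a lot land on nearly the same Cohen's d:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
placebo = rng.normal(20, 5, n)        # hypothetical symptom scores

drug_a = rng.normal(18, 5, n)         # everyone improves by 2 points
drug_b = rng.normal(20, 5, n)
drug_b[rng.random(n) < 0.20] -= 10    # a random 20% improve by 10 points

def cohens_d(ctrl, treat):
    pooled = np.sqrt((ctrl.var(ddof=1) + treat.var(ddof=1)) / 2)
    return (ctrl.mean() - treat.mean()) / pooled

print(f"'kinda works for everyone': d = {cohens_d(placebo, drug_a):.2f}")  # ~0.40
print(f"'works great for a few':    d = {cohens_d(placebo, drug_b):.2f}")  # ~0.35
```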
I think this is critically important, and not recognized nearly often enough, in medical research on clinical 'syndromes' and 'disorders'.
When you're testing a treatment for a 'disease' with known pathophysiology, you can assume that it should work on almost everyone if it works at all. There will be some exceptions with weird metabolic interactions, and some people will need to discontinue for allergies/side effects, but everyone else should have some kind of positive response, and the question is whether the response is strong enough to be useful/worth the risks.
But if you're treating a clinical diagnosis with unknown pathophysiology, you can't assume uniformity. There could be significant heterogeneity in the underlying biochemical mechanisms of depression in different patients.
Not only might your measured effect size be depressed by the fact that only, say, 25% of patients have the 'right' kind of depression to be helped by your drug, but it might be further depressed if, say, 10% of patients have a kind of depression that gets worse when exposed to your drug. Which would be valuable information, but only when disaggregated; mashing it into the "effect size" stew creates a meaningless number.
Antidepressants clearly help some people some. Recently read a study that was touted as evidence that antidepressants keep on working even after a couple years, though I actually found the results pretty grim: People who'd been on any of 4 antidepressants and felt well and ready to stop were randomly assigned either to really tapering off the drug or to what one might call placebo tapering off: they continued taking their usual dose. (All subjects took identical generic-looking pills.). Outcome measure was how many in each group relapsed. At end of study, 52 weeks out, 56% of those who had tapered off had had depression relapses. (See! That stuff takes a licking and keeps on kicking!). BUT 39% of the people who had not tapered off had also relapsed (oh . . . ). Study is here: https://www.nejm.org/doi/full/10.1056/NEJMoa2106356
Also there's a kind of harm that SSRI's do as they're currently prescribed that I don't see talked about much. Here's a story about an actual patient, aged mid-20's, who came to me for her first meeting and gave the following account: In high school she was a gay kid not out to anybody. She was lonesome and miserable. Went to her MD and described how unhappy she was, but did not disclose her secret. She was put on an SSRI, and believes that it helped at the time. Seven years have passed since then, and she came out years ago and has a long-term girlfriend. But she feels like there's something wrong in their relationship because she has no interest in sex. She also weighs 40 or so pounds more than she did in high school. She isn't exactly depressed, but feels blah and unmotivated. She has tried to come off her antidepressant but when she does she feels terrible within a few days, with flu-like symptoms, irritability, insomnia and head zaps, so she has concluded that she must need the drug. Her health care providers have all routinely continued her SSRI prescription all these years. No one has asked whether she would like to try coming off. No one has told her that the SSRI she is taking is known to have a bad discontinuation syndrome (flu-like symptoms, irritability, insomnia and head zaps) and that if she is going to try stopping the drug she must decrease the dose very slowly.
I'm no big fan of antidepressants (see the rest of my post), but what I have read (various clinical trials, various lit reviews, and a coupla books) says that in antidepressant trials treatment groups show more reduction in depression than placebo groups, as measured by self-report and clinician-administered structured interviews.
Suicide is a bad endpoint to assess: it's very rare, so hard to see in the data, and it's probably only a smallish portion of the total harm depression does. It's very bad for you and the people around you if you kill yourself, but it's also really bad if you are depressed, and if there are ten thousand depressed years of life for every suicide, the ten thousand years are probably the bigger term in the total utility calculation, even though each suicide is vastly more important than any given year of being depressed.
"Also there's a kind of harm that SSRI's do as they're currently prescribed that I don't see talked about much."
I am a licensed doctor. I always go over the typical side effects of SSRIs when starting them or at follow-up appointments, and ask about them one by one. Sexual side effects in particular, as it seems to me that most doctors do not pay enough attention to them.
Rare side effects I don't give much room unless there is a specific reason to worry about them, because they are by definition rare, and might discourage the patient from starting a medication which might improve or even save their life. Not treating depression or anxiety with medications also has possible side effects.
So I sort of agree that yeah, a lot of medical practitioners aren't aware enough of the (sexual) side effects of SSRIs, but I'm willing to be the change I wish to see happen. The patient in your story would never leave my appointment without being thoroughly aware of her SSRI's possible sexual side effects or SSRI discontinuation symptoms.
I’m glad to hear of somebody who’s doing this. You should also tell them about weight gain. Last figure I saw was body weight increasing 3% a year. In fact there’s a grim joke about SSRI side effects:
Pt: I am miserable
Dr: Here are some pills that will make you fat and anorgasmic.
About the possibility that maybe each antidepressant is one of those drugs that cures certain people 100%. I believe there was a massive study done where each patient could try up to 3 or 4 antidepressants, with doctors instructed to change antidepressants after a certain period of time if a patient's improvement didn't meet a certain criterion. In fact I'm *sure* something like that was done, maybe at NIMH, but I simply cannot remember any details. Anybody here know something about that study?
I can think of two factors which lead to such ambiguous results.
1. Is depression an illness or a symptom set caused by multiple illnesses and/or bad habits?
Imagine going to the doctor and being diagnosed with "runny nose disease." Having a runny nose is not psychosomatic or imaginary. It is visible to others and even measurable. But any doctor who diagnosed a patient with "runny nose disease" would probably lose his license.
But imagine if we looked for drugs to cure "runny nose disease". An anti allergy drug would indeed reduce runny nose for those whose nose runs due to allergens. Likewise, avoiding cats would be helpful for a significant number of people. But neither therapy would help against a sinus infection. Conversely, antibiotics would help sinus infections but would do nothing for allergies to cats or pollen.
2. Even given a correct targeted cause, a drug used shotgun style would have mixed effects.
For example, I do not suffer from clinical depression, but I can get depressed if seriously deprived of sleep. Measures to increase serotonin -- such as eating high carb/low protein before bed -- can improve sleep quality in many people, including myself. But I rather like having more protein earlier in the day so I can be awake.
Serotonin boosters can thus be beneficial to help sleep, but also depressing if given during times when one wants to be awake. Who wants to be sleepy all the time?
OK, the previous paragraph was speculative, but I can say from experience that I find tryptophan supplements to be useful at times, but also quite tricky. I can take a standard half gram pill before bed and get wonderful sleep -- for a night or two. Then I get bad side effects. A quarter gram dose works better, but even then I wouldn't want to take it regularly.
So, given my personal experience with semi-natural serotonin boosters, I could see where SSRIs could both make depressed people happier AND make them want to shoot up a school or something. Dosage and timing are key.
And some people could be depressed for reasons other than insufficient serotonin at key times.
------
And all this adds up to why I sometimes put more credence in Amazon reviews than hard science. If a bunch of people really like a drug, supplement, diet, etc. then it means the product in question works for SOMEBODY. Standardized scientific testing procedures can average out signals, both for and against.
> I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience.
I know this was not you specifically, but we've come a long way from "shut up and multiply", eh?
Shut up and multiply is about utilitarianism. I don't think even Eliezer ever said you have to believe all studies. See eg this Less Wrong post ( https://www.lesswrong.com/posts/vrHRcEDMjZcx5Yfru/i-defy-the-data ) which I found really helpful in prefiguring the replication crisis and helping me stay sane until everyone admitted that most studies don't replicate.
I think part of the difficulty here is that the language of effect size is trying to put all clinical metrics* on the same footing (by dividing the s.d.). This is a useful metric/way to think about the world in general. However, this abstraction obscures at least two salient characteristics that someone might care about:
1. How large are the differences between people** in absolute terms.
2. How important are such differences to people.
Consider a drug that increases lifespan by 6 months, with minimal side effects. I imagine this is the type of drug that many people care a lot about. But this is "only" an effect size of 0.1 (US mean lifespan 78 years, sd 5 years).
While not as extreme as lifespan, I imagine both mood and sleep quality to be characteristics with a) relatively high individual variation and b) quite important to people. So relatively small effect sizes have large clinical and quality-of-life significance.
In contrast, a drug that treats relatively minor ailments with very small individual variation may need a much higher ES to be worth doing.
Much of the problem here is plausibly due to a lack of an actually comparable common metric (like DALYs), instead using an artificial metric (effect sizes) that gives the *illusion* of cross-comparability. I imagine that this is an area that public health economists may be able to impart useful tips to clinicians.
*and intervention analysis metrics in general
**I was originally thinking of the question in terms of general population distributions, but I suppose what you care about more is the population under study.
One molecule may cure only a fraction of patients, but there are many molecules available, so psychiatrists try one after another until remission. This sequential process has been studied in the STAR*D trial.
Results: "With persistent and vigorous treatment, most patients will
enter remission: about 33% after one step, 50% after two
steps, 60% after three steps, and 70% after four steps
(assuming patients stay in treatment)."
So the whole issue is to find the right molecule for each patient. Clinical trials that test one molecule on random patients cannot capture this.
Sources:
Gaynes, B. N., Rush, A. J., Trivedi, M. H., Wisniewski, S. R., Spencer, D., & Fava, M. (2008). The STAR*D study: Treating depression in the real world. Cleveland Clinic Journal of Medicine, 75(1), 57-66. https://doi.org/10.3949/ccjm.75.1.57
Warden, D., Rush, A. J., Trivedi, M. H., Fava, M., & Wisniewski, S. R. (2007). The STAR*D project results: A comprehensive review of findings. Current Psychiatry Reports, 9(6), 449-459. https://doi.org/10.1007/s11920-007-0061-3
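Working backwards from those cumulative figures (a rough calculation that assumes the quoted percentages are exact), each successive step still helps a sizeable share of the patients the previous drugs failed:

```python
cumulative = [0.33, 0.50, 0.60, 0.70]   # remission after steps 1-4 (STAR*D)
still_ill, prev = 1.0, 0.0
for step, cum in enumerate(cumulative, start=1):
    per_step = (cum - prev) / still_ill   # remission rate at this step
    print(f"step {step}: {per_step:.0%} of still-depressed patients remit")
    still_ill -= cum - prev
    prev = cum
# roughly 33%, 25%, 20%, 25%: switching keeps paying off
```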
I'm confused how we land on the underlying measurement tools for these things. The HAM-D test you describe seems (from your description, at any rate) very poorly calibrated for this application, like using my bathroom scale as a kitchen scale then wondering why the bread doesn't come out right. Same with the comment above noting that normal people take ~15 minutes to fall asleep making them a horrible reference for Ambien.
Am I missing something? I would have thought basically everyone's incentives aligned toward using tests likely to actually find something, but then what drives the selection of measurement tools for these types of studies? Is it just much, much harder to create good medical diagnostic tools (even for something like insomnia or post-surgical pain) than I'm imagining?
There are two major omissions in this post, as far as I can see: depressed people in the control arms of these studies may not respond to placebo (they spontaneously improve, which is a small but relevant difference), and the drugs listed in the chart as having low effect sizes are well known to practicing physicians to be clinically useless, but persist in practice secondary to commercial interests and institutional capture.
Depressive episodes are self-limiting in a large majority of patients, meaning that even absent a placebo intervention, in 6-8 months they will start feeling better. This is at least what my (non-American) textbooks said about the issue when I read them for licensing. This is different from a placebo effect in that it makes for an even less productive uphill battle for any intervention trying to treat it, since you don't get to profit from the placebo effect to a meaningful amount in either the intervention or the control group. There is no real reason to expect a benefit in a large number of patients receiving antidepressants or placebo to begin with; at most they may benefit by having their episode shortened by a few weeks, which is difficult to prove statistically and not something the trials assess in a meaningful way. There was never a study, afaik, comparing placebo vs. open-label doing nothing in depression; were it run, I'd wager it would return a null. This is a largely unexplored problem in the literature, one that is consistently used to inflate apparent antidepressant activity: they are a bit better than placebo, and placebo effects in depression are large. But are they really? The literature can't tell, as far as I have read it.
On the latter point, there has been endless debate about the efficacy of blood pressure meds, statins and any other tertiary prevention program for exactly that reason - the effect sizes are marginal, the NNT is enormous and the data are largely there to conclude there has been no real benefit to the deluge of prescriptions that started around the 1990s. Yet the guidelines keep expanding indications without good data - it has long been speculated that this is due to commercial influence on the committees and on medical research as a whole. So any argument that basically makes the point "marginally effective cardiovascular drug A has a similar effect size to antidepressants, so they must work" ignores what is basically a consensus amongst the more perceptive GPs - that neither have any discernible effect in actual practice and are sold due to aggressive marketing and distorted perceptions of risk and reward by the professional community. So the findings of the second study aren't all that surprising, since antidepressants end up right where they belong next to the other largely ineffective medications. It's just that this proves Kirsch's point, not his critics'...
The side effects of SSRIs are very similar to the physical symptoms of anxiety, and I believe they're being given wholly inappropriately to people with anxiety. When my mother was an outpatient for paranoid delusions caused by sleep deprivation, which was in turn caused by anxiety, she was given SSRIs. The single most common side effect of SSRIs is sleep problems, so of course they were wholly inappropriate, worsened her mental state, and she had to be hospitalised. It wasn't until she was given benzos that she was able to sleep for several nights in a row and the problem resolved itself.
When I was suffering from sleep problems caused by anxiety, I was also prescribed them. I suffered from sleeplessness, hot flashes, night sweats, and it made my heart race. The side effects are quite severe and definitely worsened my mental state, though fortunately I don't suffer from the same psychiatric sleep problem my mother does, so the consequences weren't as bad. I was given a benzo-like drug (Zopiclone) which I used sparingly (knowing the risk of addiction) and which was very helpful to get me through the worst of it. No thanks to the SSRIs!
Sorry to hear about your experience but SSRIs have a good evidence base in treating anxiety disorders, particularly at higher doses - https://pubmed.ncbi.nlm.nih.gov/30479005/
You are correct that when starting or changing doses of SSRIs, side-effects can be similar to anxiety symptoms, but these are usually transient (so for most people they are worth pushing through).
My point is that the effect size being small "because of the high drop out rate" isn't exculpatory because for this subset of the population SSRIs are genuinely bad. Having to drop out because the side effects are too severe to tolerate is a genuine signal there's a problem with the medication.
I halfway expect SSRIs "work" for anxiety because once you adjust to the dosage, you feel like your anxiety symptoms have improved, merely because the side effects have subsided, not because your original problem is actually better.
I don't think the data support this theory - in RCTs anxiety scores improve with SSRIs steadily over 0 to 12 weeks of taking the drug (at no time-point are anxiety scores higher than they were at baseline).
As medications go, SSRIs are generally well tolerated. Obviously that doesn't mean they don't have side-effects (like all medications) but for most people these are bearable.
For a strict majority of people, it's clearly bearable. But a 1/3 dropout rate is nevertheless really high. I feel misled by claims that the side effects of SSRIs are generally mild.
If the primary symptom of anxiety you are struggling with is insomnia, you should not be prescribed a medication that itself causes insomnia in 30% of people, even if that's a "minority". This is unacceptably high.
It seems to me that traditional effect size criteria underestimate how effective antidepressants are, but traditional "how bad are the side effects" standards probably also underestimate how unpleasant they are. That suggests we should be paying attention to the side effects more when prescribing antidepressants, but it also suggests a systematic problem with how we're measuring both kinds of things.
Anecdotally, on the forums I visit, whenever SSRIs come up people who used them for depression are pretty meh about them, and people who used them for GAD are exceptionally positive about them. Lexapro has effectively cured my GAD.
Same here (SSRIs destroy my anxiety disorders). One of my frustrations with the studies discussed here - and in most discussions - is that they focus on depression, and I never hear about effectiveness for anxiety treatment.
I was quite surprised at how long effective treatments for depression have been available. I'd have assumed that until around, say, the 1950s, a patient telling a doctor they were depressed would be urged in reply to pull themselves together and snap out of it, or else prescribed aspirin! But one reads in an obituary of the mathematician J. E. Littlewood (page 332):
"Throughout his long life he lost little of his will power or of his mental force and clarity. Moreover, from 1957, he was freed from his earlier periods of depression. A perspicacious psychiatrist, having traced their origin to a fault in the functioning of the central nervous system rather than to any adverse circumstances, surveyed the recent advances in knowledge of antidepressant drugs and successfully prescribed treatment giving protection from recurrence of the symptoms."
Seems like a more intuitive way to quantify effect size would be something like a "cure rate". Take P to be the outcome score for the placebo group, X to be the outcome score for the treatment group, and H to be the outcome score for healthy people. The cure rate could be defined as (X-P)/(H-P); a small sketch after the two cases below works through the numbers.
Two limiting cases for intuitively interpreting it:
* If the drug effectiveness varies from person to person but it is always either 100% effective or 0% effective, then this number gives the % of people it worked on.
* If the drug effectiveness is consistent across people, then this number gives the fraction of the way everyone was moved toward the healthy group.
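A minimal sketch of this metric with made-up numbers (the scores, and the lower-is-better scale, are assumptions for illustration, not from any study):

```python
# Hypothetical "cure rate" sketch on a lower-is-better symptom scale.
def cure_rate(P, X, H):
    """Fraction of the placebo-to-healthy gap closed by treatment."""
    return (X - P) / (H - P)

P, H = 20.0, 4.0                 # assumed placebo and healthy mean scores

# All-or-nothing drug: fully cures 30% of patients, does nothing for the rest.
X_all_or_nothing = 0.7 * P + 0.3 * H
print(cure_rate(P, X_all_or_nothing, H))   # 0.3 -> worked on 30% of people

# Uniform drug: moves everyone 30% of the way to the healthy mean.
X_uniform = P + 0.3 * (H - P)
print(cure_rate(P, X_uniform, H))          # 0.3 -> everyone moved 30% of the way
```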
You've pretty much reinvented (slightly transformed) Number Needed to Treat, if I'm reading you correctly. My impression is that NNT if anything gives more counterintuitive results - widely accepted medications have confusingly large NNTs.
Weirdly I have run into anti-ibuprofen people. Mainly in Europe where there is a strange bias against taking medication of any kind. They’ll tell you it doesn’t actually work much or at all, it’s just placebo, and there are dangerous side effects.
I am somewhat similar; ibuprofen does work on me, but requires a borderline dangerous dose to do so. Acetaminophen works fine.
EDIT: I'm also sadly immune (or the closest thing to it) to the beneficial effects of pot and alcohol. If I smoke/drink/ingest enough of it, I get dizzy, but that's all the effect I ever get (besides the usual hangover symptoms). Maybe the two phenomena are connected somehow?
I'm not "anti-ibuprofen" but I've never noticed it having much of an effect on me. Same with Tylenol and most pain-relievers you can buy easily in stores. OTOH Anti-Biotics, Recreational Drugs, and ADHD medication I've tried always have an obvious, very easily noticeable effect
Fantastic post. The part that resonated with me most was how effect sizes take the mean effect on a population rather than looking at subpopulations where the effect is large and meaningful once it's undiluted by the non-responders. A classic peril of cohort studies.
Somewhat unrelatedly, I would love if you would cover (or perhaps you already have?) people who respond immediately to SSRIs. I feel the effects of very low doses (5mg fluoxetine) the next day. I assumed it was placebo at first but since then I’ve learned that responding to increased serotonin bioavailability may be a thing in PMDD, which I qualify for. I take 5mg a night during my luteal phase and it really helps! It’s kind of a new treatment that not all psychiatrists know about, and I would love to know more about it.
The thing is, you only need to think about studies and effect sizes and stuff for things which don't work all that well. If we pick some slightly unusual gourmet mushrooms and cook and eat them with a mate who goes into convulsions, we don't dismiss the effect as an uncontrolled retrospective observational study with n=1 and no ethics clearance; we stop eating the mushrooms. Conversely, opium and antibiotics were adopted by the medical profession before well-designed studies were invented, because they obviously work.
For the practical question of "should I eat this mushroom", sure. But the effect size of a hypothetical mushroom study would be very helpful for calibrating the metrics being used.
This was a really interesting article, thanks for posting!
Recently, I was asked if I wanted to pay £1500 for "pharmacogenetic testing" (I think I got the term right). I looked it up because I was skeptical, and I found an anti-psychiatry Reddit thread [1] arguing that its effect size was small. This confirmed what I thought, and I decided not to pay. This article has made me think again.
Has SSC/ACT ever done a deep dive into Robin Hanson's argument that much of healthcare overall is wasteful? See, for example, the Oregon Medicaid Experiment.
With regard to the chart by Leucht, I was wondering what measure of effect size they use to get values above one. Certainly not Pearson's r. SMD presumably denotes standardized mean difference (i.e., Cohen's d), which, unlike r, is not bounded by one.
I think that instead of having a fixed cut-off at a given effect size, one will obviously want to do a cost-benefit analysis over likely treatment options.
I would imagine that the effect size of the "medication" of food to avoid starvation is close to r=1 over placebo (e.g. sawdust). This would highly recommend diets of caviar, french fries, radioactive onions, or what is commonly considered a healthy balanced diet as methods to prevent starvation. However, these treatments vary greatly in side effects and costs.
For depression, the sad truth is that we do not have a magic bullet. There is no medication which can simply set HAM-D the way one can adjust blood pressure or thyroxine levels. If such a medication existed, one would use it instead of the stuff with small effect sizes. I think effect sizes are useful to compare different medications (for example, I would want a patent-covered medication costing big bucks to have a significantly bigger effect size to be worth it) and to measure where in the "prescribing effective treatment" vs "grasping at straws" spectrum we are.
--
What about antibiotics for infections? From my understanding, viruses generally do not respond to them, while bacteria do. That would mean that according to the NICE criteria, antibiotics would not be indicated to treat a life-threatening infection which is caused by viruses 60% of the time (at least until the type of pathogen is confirmed). I am not an MD, but my impression is that this is not how it works.
As someone who has participated (as a patient) in antidepressant clinical trials, I will say that part of the issue is that depression presents differently in different people, it is a complex multifactorial condition, and subjectively, from my experience, scales like HAM-D basically suck at measuring depression.
If the drug clears up one factor that really bothers the patient (e.g. anhedonia, lethargy, etc) and has a large effect on their quality of life, the patient will feel like it's helping even if that effect is swamped by all the things it didn't help that are measured by the scale.
That's as opposed to e.g. metformin, near the top of the effect size chart presented, which is being scored against an objective single factor (blood glucose).
Great post! I practically agree with everything here.
Some additional considerations:
1) This is a great paper on effect sizes in psychological research (mostly correlations), where they look at well-understood benchmarks or concrete consequences, and conclude "... an effect-size r of .20 indicates a medium effect that is of some explanatory and practical use even in the short run and therefore even more important, and an effect-size r of .30 indicates a large effect that is potentially powerful in both the short and the long run. A very large effect size (r = .40 or greater) in the context of psychological research is likely to be a gross overestimate that will rarely be found in a large sample or in a replication."
2) Estimates of what constitutes a "clinically relevant" or "clinically significant" effect in depression are all over the place. This 2014 paper by Pim Cuijpers et al uses a somewhat unconventional approach but arrives at a tentative clinical relevance cutoff of SMD = 0.24. I think it's a good illustration that we can pick a variety of thresholds using different methods, and it's not clear why any one threshold should be privileged.
3) Effect size expressed as Cohen's d is an abstract and uncontextualized statistic, which makes practical relevance quite unclear. In the case of antidepressants (and psychotherapy), however, the problem goes beyond a mere reliance on Cohen's d. A Cohen's d of 0.3 corresponds to a 2 point difference on HAM-D, and critics would say "Cohen's d of 0.3 may or may not be clinically relevant, but surely a 2 point difference on HAM-D, a scale that goes from 0-52, doesn't mean anything."
The problem with this, in my view, is that
i) a reliance on *average effect* obscures meaningful heterogeneity in response
ii) we conflate a 2 point change in HAM-D from baseline with a 2 point HAM-D difference from placebo, and our intuitions regarding the former do not carry over well to the latter. A 2 point change in HAM-D may mean very little, but a 2 point difference from placebo could mean a lot (especially since antidepressant effects and placebo/expectancy effects are not summative)
iii) A 2 point change in HAM-D, depending on what exactly that change is, might still be quite meaningful. If the depressed mood item goes from a 4 (pervasive depressed mood) to a 2 (depressed mood spontaneously reported verbally), and nothing else changes, that 2 point change may very well be quite significant for the patient.
You've alluded to i) and ii) in your post as well, but I just wanted to make explicit that this goes beyond Cohen's d to actual differences on rating scales. Overall, I think the whole antidepressant efficacy controversy is a lesson in how a near-exclusive reliance on research statistics with arbitrary cut-offs can mislead us with regards to clinical significance of a phenomenon.
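To make the arithmetic behind the "2 points" figure concrete, here is a minimal sketch, assuming a pooled SD of about 7 HAM-D points (typical of acute antidepressant trials, though it varies by study; the endpoint scores are made up):

```python
# Cohen's d as a standardized mean difference on a lower-is-better scale.
def cohens_d(mean_placebo, mean_drug, pooled_sd):
    """Positive values favor the drug."""
    return (mean_placebo - mean_drug) / pooled_sd

# Assumed endpoint scores: placebo arm at 14.0, drug arm at 11.9 on HAM-D.
print(cohens_d(14.0, 11.9, 7.0))   # ~0.30, i.e. a 2.1-point HAM-D difference
```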
What is d? That is, what is the standard deviation?
The obvious thing to do is use the population standard deviation on HAM-D. The drawback is that you might worry about systematic differences in how the test is administered, so you want to use only data from your study. But then you aren't studying the population; you are studying a sample with a large standard deviation (e.g., if you cure half of the people, the sample afterwards is diverse), which inflates the denominator and suppresses d. Is this how the standard deviation is defined?
The distribution of HAM-D in the general population is bimodal, with d=5 between normal people and depressed people. You only need to cure 10% of them to achieve the d=0.5 criterion, or 20% to achieve the d=1 criterion. (Here the standard deviation is not that of the whole population, but of the normal mode.)
But what if you include mildly depressed people? If they are in the normal part of the distribution, then HAM-D registers them as not depressed. Why did you label them depressed? Not because of HAM-D. If it cannot detect their illness, it cannot detect their cure.
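The arithmetic of the bimodal claim can be checked with a toy simulation (the 5-SD separation and "cure means rejoining the normal mode" are assumptions taken from the comment above, not from real HAM-D data):

```python
import numpy as np

rng = np.random.default_rng(0)
sd = 1.0                                       # SD of the normal mode
depressed = rng.normal(5 * sd, sd, 100_000)    # depressed mode, 5 SDs out

treated = depressed.copy()
n_cured = int(0.10 * len(treated))             # cure 10% of patients
treated[:n_cured] -= 5 * sd                    # they rejoin the normal mode

# Mean improvement in units of the normal-mode SD:
print((depressed.mean() - treated.mean()) / sd)   # ~0.5, the d=0.5 criterion
```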
And almost no one takes into account the fact that repeated takings of the same drug invoke learning responses like classical conditioning. I wrote about this several years ago and should probably come back to it to see what's happened in the literature since 2017.
You should probably mention explicitly what effect size metric is being used. I'm pretty sure I know, based on familiarity with the topic, but it's not obvious.
Judicious use of psilocybin has been very helpful for me. Cyclical major depression has been a constant in my life. I have gone back and forth with sertraline since the 90s, and I would have to say it has been helpful even if the whole effect might really be placebo; I can’t rule it out. I recently stopped taking it, just because, so we will see. Microdosing psilocybin is very tangible though. And a game changer for my psyche. I just hope the mental health authorities don’t fuck it up.
Someone (Sarah Constantin?) did a lot with encouraging depressed people to keep records of their moods because depressed people are bad at noticing whether their moods improve. Does this play into efforts to evaluate anti-depressants?
"the average height difference between men and women - just a couple of inches"
Only if "a couple" = "five", which I normally would not say. Supposedly average male height is 5'9" and average female is 5'4" for the whole of the USA.
So probably best to treat it as a hypothetical saying.
There _is_ an anti-ibuprofen lobby, composed of those unfortunate individuals who thought that ibuprofen was the sort of safe medication you could use whenever you felt pain -- even though many of these people suffered from chronic pain. So they kept dosing themselves with ibuprofen, often with the approval of their doctors, who in many cases thought that any improvement their patients reported was due to the placebo effect -- but so what? It's not, they thought, as if the ibuprofen could hurt their patients. This turns out to be untrue, and their patients ended up with stomach ulcers, and liver and kidney damage. So be careful when taking the stuff.
The "drug that completely cures a fraction of its users" reminds me of Tim Leary's early finding that interventions tends to make 1/3 better, 1/3 worse, and leave 1/3 the same.
I’ve long been curious what you think of Whitaker’s “Anatomy of an Epidemic”. It’s been a while since I read it, but the gist of it was that people have life stresses that lead to depression, like death of a relative or loss of a job or what have you; in the old days, for most people, time and emotional support would be the great healer, but now they get antidepressants that might make that healing time more tolerable - but then they can’t kick the med and so are treated as somebody who is still depressed.
I found it pretty plausible even though I also feel like SSRIs is what got me through my parents’ dementia and death. But I had a terrible time kicking the drug, with two relapses, finally succeeding only by tapering it slowly over eighteen months. (Toward the end a non-psychiatrist was puzzled by the minuscule dose I reported, from cutting pills into fractions and taking one only every few days; he snorted and said, “You might as well be smelling them.” And maybe I was overdoing the gradualness of my taper, but it worked for me.)
I don’t think Whitaker was going full-Thomas-Szasz and denying that depression is a medical thing, but rather arguing that depression is vastly over-diagnosed (because/and therefore) SSRIs are vastly overprescribed. If lots more people are getting diagnosed than should be, might that help explain the feebleness of these results?
If you’ve read Whitaker and found it bosh, this would also be interesting to hear.
How does this interact with "number needed to treat"? That seems like a better metric for effectiveness in this case. (Of course, even with an NNT of 100, for that 1 person out of 100 it could be life-changing.)
Thanks Scott! I missed this the first time looking at the pictures.
These are actually lower (better) than, or about the same as, I would expect. I believe the NNT for warfarin to prevent stroke is something like 25. So an NNT of 10.5 for B isn't so bad (unless I am reading the chart incorrectly).
Thanks for this very informative post. I think it highlights the risks of allowing experts to issue binding guidance on clinical practice. We still have a lot of people (including some doctors) who think NICE is the standard of excellence we should try to implement.
Suppose there was a doctor patient conversation that went something like this:
(After trying an SSRI)
Patient: I could feel a strong effect after 4 hours.
Doctor: SSRIs don't work that way. Some drugs have an effect that rapid, but SSRIs don't. How would you feel if I said that was placebo effect?
Patient: yes, it probably was. I am well aware of the arguments that the effects of SSRIs are mostly - possibly entirely - down to the placebo effect.
Doctor: I am really very worried by that
======
Question: why would the doctor be worried?
An observation: some people get a fast reaction from SSRIs. Current medical orthodoxy seems to be that this is entirely placebo effect - which kind of implies that the therapeutic benefit is almost entirely placebo in these cases.
How well do most of these studies adjust for the fact that there are likely huge pharmacogenomic differences in how these drugs are metabolized and therefore in the actual drug concentrations ("actual dosages") between individuals? Do they test everyone? If not, isn't this a really big source of potential confusion?
This article mostly made me think that the claim "this drug has a meaningless effect size" is almost always going to be true. You have a number and you're calling it the "effect size", but the term doesn't mean anything and you don't know why you're doing it at all.
Two things leap out at me from this article. I can't be sure that they are really problems, but they make me queasy.
1. The amount of improvement measured for our hypothetical patients is based on the HAM-D scale. The points on this scale do not have any associated meaning. If you changed the HAM-D instrument to report different numbers, in a fully deterministic way -- the questions would be the same, the answer options would be the same, only the reported numbers would be different -- it appears to me that the measured effect size in the new "rescaled" HAM-D would be different from the measured effect size in the same set of surveys scored under the original HAM-D. Imagine that all scores below 10 are doubled, while all scores above 10 have ten added to them. (Old 7 becomes new 14. Old 24 becomes new 34. Why would we do that? Why not? This isn't much different from adding ten questions that are highly duplicative of existing questions.)
But the actual improvement we care about is in the different survey answers, not the numbers we arbitrarily assign to them. If our methodology gives us different effect sizes for the same set of answers based on nothing but the *labels* that we give to those answers, that's telling us that we have no actual way of assigning meaning to the effect size we measure.
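A quick numerical check of point 1, with entirely made-up scores. The relabeling below is the one described above (double below 10, add ten at 10 and above); the same answer sheets yield a different Cohen's d after relabeling:

```python
from statistics import mean, stdev

def rescale(s):
    # double scores below 10; add ten to scores of 10 and above
    return 2 * s if s < 10 else s + 10

def cohens_d(a, b):
    pooled_sd = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

placebo = [22, 20, 18, 16, 15, 13, 12, 11]    # hypothetical final scores
drug    = [9, 8, 7, 6, 5, 4, 3, 2]

print(cohens_d(placebo, drug))                            # ~3.18
print(cohens_d([rescale(s) for s in placebo],
               [rescale(s) for s in drug]))               # ~3.36, same answers
```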
2. In principle, the standard deviation is a statistical construct that is equally well defined regardless of what your probability distribution looks like.
But the conclusions you'd want to draw from a measurement of standard deviation are radically different depending on what your probability distribution looks like.
I tend to suspect that general rules about what effect sizes should count are formulated with a normal distribution in mind. I don't think they will translate well to other distributions. Here, on our scale from 0 to 54 in which, by definition, a large majority of people score in the range 0-7, we have a very odd distribution. It is certainly not normal (or close to being normal); it's also unlikely to be similar enough to the score distribution of any other instrument that a lesson from one could be usefully applied to the other.
When all the Kirsch stuff came out I looked at the data myself, but have not since. But what I noticed then was 1) for people with SEVERE depression it was clear that fluoxetine (at the time the most prescribed) worked extremely well. Kirsch would continually omit this fact. 2) the HAM-D seems faulty in that it is too wide a net. It could be argued that people who would be categorized as "mild" or even "moderate" may not actually have major depressive disorder. If the measure is faulty, all results coming from that measure are "off." Let me add that veterinarians prescribe these drugs for animals (all the time, for a wide range of animals) for a reason. They work. If they work on animals, that eliminates arguments about the placebo effect.
I don't think we should care about effect size at all, other than for assessing:
1) Does the benefit outweigh the risk (large risks should only be taken if there are large benefits)?
2) Is it greater than placebo? (If not, placebo is a clearly lower-risk treatment with as much benefit.)
Especially if the benefit of a treatment is *additive* with other treatments, then many small improvements could be a big improvement for a patient. This is why I'm annoyed when things are dismissed for "only" improving a patient pain score 1 point on a 1-10 scale. If you can stack 3 of those it's a huge QOL improvement.
Could there be a "garbage in, garbage out" effect, wherein effect size calculations are only as informative as symptom measurements?
Somewhat relatedly, in the comments of a previous post, I semi-seriously challenged you to forecast the effect sizes and response rates of the hopefully-published-this-year phase III intranasal S-ketamine, racemic ketamine, and R-ketamine monotherapy trials - how well do you think you'd be able to forecast them?
"If standard deviation is very high, this artificially lowers effect size."
I'm just here to say that the point of an effect size is to get a standardized estimate of the effect. ***Dividing by the SD is a feature, not a bug.*** For all the other stuff (ITT, treatment effect heterogeneity, etc.), as with all analyses, GIGO (though I don't strictly mean "garbage" for every research consideration; more precisely, these issues limit either generalizability or specificity, either way reducing usefulness).
Levi's 501 jeans are a pretty good cure for lower-body nakedness. Say they come in 50 waist/leg combinations and we have no insight into matching combo to patient. So for any given combo which we try on 50 random patients, it will fit say 3, do at a pinch for another 7, and be useless for 40. Contrariwise, a large towel which can be wrapped round the waist will be an inferior cure but will sort of work for everyone, and will score better in tests than any given Levi's combo, and therefore than Levi's in general.
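Putting rough numbers on the analogy (the scores are arbitrary: a perfect fit counts 2, "do at a pinch" counts 1, useless counts 0):

```python
# Expected score of one fixed jeans combo, tried on 50 random patients:
jeans_single_combo = (3 * 2 + 7 * 1 + 40 * 0) / 50   # 0.26
# The towel "sort of works" for everyone:
towel = 50 * 1 / 50                                   # 1.0
# Letting each patient try combos until one fits (assuming one always does):
jeans_fitted = 2.0
print(jeans_single_combo, towel, jeans_fitted)
```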
This is why good practitioners ask about patient experience and change the prescribed drug if it doesn't work for the individual patient. If a drug has strong effects for some people but not for others then this is the rational way to prescribe, regardless of clinical trial data. Adding together multiple populations and using a single statistical measure to try to describe the mixture is just throwing away good data on some unholy altar of "best practice".
"There’s no anti-ibuprofen lobby trying to rile people up about NSAIDs, so nobody’s pointed out that this is “clinically insignificant”. But by traditional standards, it is!"
It seems that in diseases where symptoms are relatively distinct and measurable, standards for evaluating efficacy are somewhat relaxed, whereas in disorders where symptoms are fluid and difficult to quantify, standards for evaluating the efficacy of drugs are high. This would seem to make some sense if we consider that it's easier to make false effect claims in the latter case than the former. Hence the intuitive upping of the acceptance threshold.
Leucht et al.'s finding raises the question of how far to trust the observations of individual physicians. The number of patients any individual doctor sees is clearly too small to reveal consistent effects for many treatments; otherwise there wouldn't be a need for much larger studies (leaving aside the benefit of eliminating bias in RCTs)! I'd imagine many physicians use a kind of null-hypothesis-testing approach in their assessments, with a null hypothesis of zero effect that needs to be rejected by strong evidence to the contrary. Some doctors will see solid effects while many will see none; a small real effect around 0.3 could be swallowed in the noise, with most physicians able to reasonably argue "I see no reliable effect" even if there is a small non-zero effect. That said, we probably don't want physicians to stop trusting the evidence of their eyes, which could be picking up factors specific to that physician's local patient population. What's the right balance?
Are there tools that give physicians easy ways (i.e., easier than running your own R models, which few non-academic doctors will do) to track their own patient outcomes and to test whether their observed outcomes are in line with expected effect sizes? This could give physicians a sense of "my patients tend to do better or worse than the reported effects," possibly indicating problems with prescription preferences or patient-population differences from the general population. Do many physicians *quantitatively* track their outcomes across patients? I'd guess many don't. (A sketch of the simplest version follows below.)
Going along with the question of how easily a doctor could detect effects in a patient, how strong of evidence is needed to justify keeping an individual patient on a given SSRI vs. trying another one? Not being familiar with psychiatric guidelines, I'd be curious to know what the recommendations are.
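On the tracking question above: a minimal sketch of the simplest possible self-audit, comparing one's own remission count against a literature-derived expected rate (the counts and the 0.40 benchmark are placeholders, not real figures):

```python
from scipy.stats import binomtest

# 12 remissions among 40 of my patients, vs. an expected remission rate of 40%
result = binomtest(k=12, n=40, p=0.40)
print(result.pvalue)   # a large p-value means "consistent with the literature"
```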
Many thanks for your post. Statistics are wonderful things, but ultimately, are the drugs worth having, especially if the trials are sponsored by big pharma?
I have had a few medications myself. Did you know that 'medication' is an anagram of 'decimation'?
I did read one of your links. Yes, doctors are not as knowledgeable as they could be about diet. They still know far more about diet than the average person on the street, though. You seem to be conflating individual mistakes, which are common, with bad research. There are meta analyses showing that masks reduce transmission from infected individuals to healthy ones.
You're throwing out quite a lot of evidence created by tens or hundreds of thousands of people in different countries. Not even the Soviets rejected germ theory; they used bacteriophages to treat typhus in their army.
> Statisticians have tried to put effect sizes in context by saying some effect sizes are “small” or “big” or “relevant” or “miniscule”.
I'd like to clarify that, for the most part, statisticians have been trying to explain that this is a bad idea to doctors who refuse to listen and keep putting arbitrary labels on the effect sizes
Thanks, that was very informative and cleared up a lot of things for me about the SSRI debate.
IIRC it’s very common for depressed patients to have to try several medications before they find one that works? If you had 5 drugs that each “cured” 20% of patients running a study of each against a placebo would show little impact. That would be the case even if cycling a patient through all five would cure everyone.
Wouldn't the order in which the meds were tried have a big effect on that? You could get unlucky and the 5th med you take is the "right" one while someone else takes it the first time and it works.
Prior expectations with the potential to drive (variable/inconsistent) placebo effects could be reduced after a few failures to find a successful cure...
Yes. And we're talking about something that tends to be quite chronic and in many cases is not an emergency. So, assuming you're a patient who is being treated willingly in an office, as long as the side effects aren't too severe, being able to go through a bunch and find the one that works for you is a lot better than not having that option because each of them is too unlikely to help. (The costs are higher if coercion, hospitalization, or both are involved.)
Moreover, as Scott noted, the placebo effect is strong. If a drug chemically helps 20% or even 10% of patients, that's enough hope to get a placebo effect. And given that we're talking about depression, a placebo effect may be just as good as a chemical effect.
I recall a joke, which may have some basis in fact, that the requirement for double-blind studies prevents the marketing and sale of quite a few very effective placebos.
No loss, if the placebo effect is regression to the mean.
>given that we're talking about depression, a placebo effect may be just as good as a chemical effect.
Amen
>as long as the side effects aren't too severe
I mean, this is kind of the kicker, especially for SSRIs. I know several people for whom either the sex drive effects or weight gain effects (especially for afabs, many of whom also have weight-related dysmorphia as a component of their depression) made them voice preference for depression over the side effects.
Given that a placebo effect is so strong, one might wonder if psychiatrist-prescribed-literal-placebo with zero side effects might be worth trying for some patients. (Which would also help rule out nocebo side effects?)
How do you get informed consent to a placebo (outside the context of a study)?
You could tell the patient that they're getting a placebo - IIRC some studies have found that the placebo effect still works even if the patient knows they're getting a placebo, which seems kinda wild.
Does it work on patients who know the meaning of the word "placebo"?
Yes.
What does "one that works" mean?
Does it mean "affects the patient's symptoms"? Does it mean "maximally affect the patient's symptoms?" Does it mean "maximally affect the patient's symptoms and produce a minimum of unwanted side effects?"
That is an honest question.
Excellent point! I don't know the answer, but it prompts me to wonder about a related question: If cycling patients through several candidate medications stops at, perhaps, a good-enough drug, how often does clinical practice stop prematurely, missing a better drug for that patient?
At a guess, all the time. I suffer terribly from depressive episodes. I have hit on a drug therapy which sort of fixes them but leaves me (I suspect) depressed by most people's standards - as in, I no longer spend all day thinking of ways of killing myself, but I never lose the belief that it would be better if I had not been born. State 2 is so much preferable to state 1, and so many drug therapies have been completely ineffective, that I am not prepared to risk relinquishing the existing therapy for a possibly better one.
That makes sense. Best wishes, and thanks very much for the comment!
Yep, you see the same thing with other drug categories, including allergy medications (some percentage of the population responding to each of Allegra, Zyrtec, and Claritin, and some not). As Scott notes, it's only depression meds that attract such controversy and ire for exhibiting this function.
2 posts in one day? Looks like inflation is finally over.
I think the stronger way to put this research is that a 0.30 effect size is consistent with curing roughly 30% of patients.
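A toy simulation of that reading, assuming "cured" means a one-standard-deviation improvement among the 30% who respond (a modeling choice for illustration, not something from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
placebo = rng.normal(0.0, 1.0, n)        # change scores, lower = better
drug = rng.normal(0.0, 1.0, n)
responders = rng.random(n) < 0.30        # 30% of the drug arm responds
drug[responders] -= 1.0                  # responders improve by one SD

pooled_sd = np.sqrt((placebo.var() + drug.var()) / 2)
print((placebo.mean() - drug.mean()) / pooled_sd)   # ~0.29, close to 0.30
```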
I've long believed that the easiest way to trick people is to post a number without context.
"Dihydrogen monoxide is LIFE-CHANGING! It has an effect size of .30!"
"Dihydrogen monoxide is WORTHLESS! It has an effect size of only .30!"
...to an uneducated layperson, both of those statements look pretty persuasive!
(It reminds me of the Fukushima disaster, when the public suddenly had to make sense of a bunch of unfamiliar nuclear physics jargon. I remember a friend saying, "if you'd told me last week that bottled water had a high becquerel rating, I would have thought it was a good thing.")
Yes, completely agree. A number can be thrown in to add a sense of authoritative reasoning to what is purely an emotional statement. You'll often see words just like that - "life-changing", "worthless" - either as-is or as selective quotes ("Scientists say..."), and if you want to really get buy-in to your campaign or whatever, it's emotion, not facts, that gets people's attention. Even before the poor reader has time to determine whether the statement has any basis in fact - ping - the emotion takes over, and now you're part of a tribe on a righteous quest, and all you need to know is that the baddies are in that direction.
Brains are weird, I say. And so squishy.
As an uneducated layperson myself, the phrase "an effect size" is offputting and neither statement results in anything but confused annoyance.
The thing that used to get me was the difference between aerosols and droplets. I assumed it was very significant based on context clues - the fact that there even was a distinction. But macroscopic sand particles can float hundreds of miles on air currents, so there are really few flecks so heavy they can't possibly float through the air in some quantity given the right circumstances. There was a similar issue with perfumes and colognes. Some molecules were labeled 'too heavy to fly.' That tended to be misleading.
That is reminiscent of the debate over Covid and mask-wearing. Skeptics would scoff that a virus on its own, being so small, would float freely through a standard mask like a bumble bee flying through an orchard. But as viruses are usually carried in water droplets, which are significantly larger and possibly charged, masks were hopefully more effective at blocking those, in either direction.
True. And masks did seem to be pretty good at blocking outgoing particles for exactly that reason. They could *potentially* be used to block incoming particles, but people's mask-related hygiene tends to be pretty awful without training, to the point that benefits against pathogens in general in some studies (not covid specifically) ranged from slightly positive to harmful, though meta-analyses showed a slight benefit to the wearer. People break the seal with their finger, fold masks up and put them in their pockets, wear them on their chin and then put them back over their mouths (pressing pathogenic particles against their mouth), etc.
I had particularly encountered the issue of aerosols and droplets with face shields as replacements or helpers for masks. Face shields slowed the spread of droplets and helped protect wearer's eyes, but droplets and aerosols eventually diffused around the shield.
It's a pity that face shields weren't more helpful in preventing spread from infected individuals to others because face shields were more comfortable and didn't hide a person's face. There were people who were comfortable with shields who didn't like masks.
Face shields are a waste of money unless one wants to look ridiculous.
Still, they made some people a lot of money, I suppose, like the unnecessary masks.
Masks do, demonstrably, protect others if the wearer is sick.
Face shields are worn by some medical professionals so they're not completely useless. If nothing else, they make it less likely you'll get something in your eye. (So will glasses, to some extent.) But they may not make for good cost-benefit in normal life. And shields are not a replacement for masks. And yes, some people are very self conscious about wearing them, and for good reason, but that wasn't part of the question for me.
Masks reduce dust, sand and diesel smuts for example but will not protect someone from the 'flu. This is because the 'flu is an internal poisoning and not some microscopic mutating bug.
https://baldmichael.substack.com/p/m-is-formasks
Face shields will reduce the likelihood of getting dust or flies in one's eyes, for example, and I will wear safety glasses or sunglasses if necessary while cycling. A face shield is good for arc welding and disc cutting.
But they are plain stupid for the 'flu. Many medical professionals are among the stupidest people on the planet today, as by and large they do not understand what the 'flu is.
I had to research this properly to find out what was really going on. I was undergoing immunotherapy in 2020 with foolish mask-wearing nurses; one even had a mask and a face shield on to attend to a lady's feet whilst the lady was having her chemo or immunotherapy.
I never wore a mask because some idiot doctor or nurse told me I must because of COVID 19 which is the 'flu re-branded to make big pharma etc more money.
I went through my treatment, including April-May 2020 when the supposed crisis was at its peak, so when the NHS started saying masks were required I knew it was rubbish. In fact, the whole thing was guidance, as per the gov.uk website, but most people didn't check.
https://baldmichael.substack.com/p/what-is-the-flu-aka-covid-19-and
Masks will help against sand, dust and diesel fumes to a degree but not against the 'flu or COVID 19 as it is now known after re-branding to help big pharma etc. make more money.
The 'flu is an internal poisoning as I explain here.
https://baldmichael.substack.com/p/what-is-the-flu-aka-covid-19-and
Virology is fundamentally flawed, and what is called the virus is in fact the exosome, part of the body's defense system, so friend not foe. Sub-links on this etc. are in my Covid 19 Summary link within the post above.
This is a big problem with studies of infertility. If you have a big pile of people with infertility/miscarriages and you suspect you have a big mix of underlying, poorly understood problems, how good are you going to be at detecting something that makes a big difference for a medium subset of that grab bag?
I strongly suspect that progesterone support in the luteal phase and beyond might make a difference for women whose main problem is low progesterone. But I also expect that low progesterone is an indicator for other problems which aren't fixed by treating the progesterone. (I went on progesterone injections for a week or so for a baby who turned out to be ectopic).
So how do you check whether you're making a difference to enough women to recommend the pills or shots when the numbers are low and you don't know why?
Great post, and useful for clarifying what effect size really entails; it combines well with other discussions of diet effects.
This is excellent, thank you.
ACX wrote: "Ibuprofen (“Advil”, “Motrin”) has effect sizes from about 0.20 (for surgical pain) to 0.42 (for arthritis)."
I'm a pretty healthy retired person. The only "medication" I take regularly is a daily multi-vitamin.
I want to point out that ibuprofen works like a Magic Bullet whenever I need an analgesic. I haven't had any pain in the last 20 years that ibuprofen didn't relieve (and quickly). So I'm very surprised by ibuprofen's low effect size.
Well, the effect size is being reported "for surgical pain".
You aren't using it for that. (For non-surgical pain, I'm with you - it's pretty effective; I'd say way more than Tylenol.)
This. I'd expect that patients with severe pain, like surgical pain, would consistently report that ibuprofen reduced their pain somewhat but not enough.
I note that this is a different *kind* of low effect size than antidepressants, where the anecdotal report is that a minority of patients report large improvement while the majority report no effect.
Yes, I believe that Tylenol and ibuprofen are effective at reducing the needed dose of opiates in severe pain, even if they are inadequate by themselves.
Evidently my brevity caused you to arrive at an incorrect conclusion about why I use ibuprofen. To be more specific, I don't use ibuprofen for headaches, muscle aches, joint pain, hangovers, or what-have-you because those very rarely occur to me. I didn't even take ibuprofen during my mild case of COVID last summer, despite a fevered day & a half.
The most common times I've taken ibuprofen in the last ~20 years has been after oral surgery. Does that count as surgical pain? I'd say so. I had a wisdom tooth removed in a bloody extraction (crossed roots) nearly 20 years ago, and 400 mg of ibuprofen handled the post-op pain very nicely. (I passed on the oral surgeon's offer of a scrip for Percocet.)
More recently, I've had 3 dental implants placed in the last 5-6 years. The tissue damage was usually slight in those procedures but, still, the periodontist extracted teeth and then drilled holes in my maxilla. When the doctor asked what I preferred and when I told him ibuprofen, he wrote a scrip for 600 mg tablets. One of those following his treatments and I was always good to go.
Still standing by my surprise at that 0.2 effect size, given my specific uses.
In my experience, postsurgical pain from oral surgery is drastically less intense and disabling than postsurgical pain from abdominal surgery. It's more comparable to nonsurgical acute pain like a broken bone or a bad scrape than to having your internal organs rearranged.
I personally prefer NSAIDs because I hate the way opioids make me feel, so I'm not disputing your overall point that ibuprofen is effective. But the day after an extraction or a dental implant, I can pop 2 ibuprofen and be good to go about my normal activities. The day after a bowel resection, prescription-strength IM Toradol barely touched the pain; I could barely walk 10 steps to the bathroom in my hospital room. On day 3, when I still wasn't going for walks or making progress on my lung capacity exercises, my doctor gave my Dilaudid clicky button to my boyfriend and told him to dose me when I looked like I was in pain. As much as I hated it, I have to admit it was more effective.
I agree with your level-of-pain comparison for abdominal vs. dental surgery. My wife's been through a bowel resection surgery, and even a laparoscopic procedure (like hers was) is no picnic. Anyone would prefer dental surgery to that experience.
If we look at the discussion below, the following question comes to mind: how many people claiming that ibuprofen works like a magic bullet are actually experiencing only a placebo effect?
See my reply to JDK (https://astralcodexten.substack.com/p/all-medications-are-insignificant/comment/16768009) and judge for yourself.
FWIW, I have taken other analgesics that did not relieve my pain as I'd hoped, so no placebo effect from those.
I am not doubting your story. I am just saying that there is a chance that for many people who claim that a medicine works, including ibuprofen, it is due to the placebo effect.
Today someone wrote that he felt cheated that a pharmacist in Latvia sold him a homeopathic product for conjunctivitis. Many people defended the pharmacist by saying that for them this homeopathic product worked wonderfully, quickly clearing the infection. How can we reconcile their stories with the fact that the homeopathic product is basically water? Conjunctivitis tends to resolve by itself in most cases in a few days, with or without antibiotics. But the beliefs that people have are much harder to change.
Well said. What is in effect correlation is not causation. I now consider dehydration more of an issue, and neuro-toxic drugs like ibuprofen pointless. If they are taken with a glass of water, then who's to say the water isn't the cure?
Big pharma won't like people saying that as it is bad for its business.
Maybe this is a problem of the tails coming apart. The tests for depression are measuring a variety of correlates. The baseline for each of the correlates may be higher than we as a society would prefer to admit. If a medication brings down one or two of the correlated variables to baseline, that still doesn't result in a depression cure according to a test that measures other things also.
To me, the height example indicates that effect-size isn't measuring what we want to measure. What we actually want to measure is whether a drug meets our expectations. E.g. 3 inches from wearing heels is "small" when compared to total height, but large when compared to our expectations. In the case of drugs, the "total-effect" is comprised of 10,000 factors on your mental state, one of which is a pill.
Unfortunately, I can't think of an easy, rigorous way to operationalize "measure against expectations". Except to maybe just contextualize effect size against other similar studies.
Have you tried taking the Hamilton? What if a pill makes it easier for me to stay asleep but makes it harder to stop perseverating about killing myself? What if it increases my tearfulness but also increases my appetite?
I'm not saying Split Tails isn't true. I'm wondering if Split Tails is the appropriate level of analysis.
In the least convenient world, suppose that the correlates for a subpopulation in Shambhala are perfectly correlated. Does this resolve the paradox? I.e. do the effect-sizes of anti-depressants match Scott's intuition?
I'm leaning toward "no". Because however correlated the correlates, the effect of the medication on mood might still be swamped by the noise of other factors. E.g. what the subject had for breakfast, what the weather is like, whether they have a meeting scheduled at work, what rush-hour traffic was like, etc. For medication to have an effect that's noticeable when judged against "total effect", it would need to be strong enough to overwrite other factors. And I sort of doubt that Alexander's "Terrible, Horrible, No Good, Very Bad Day" can go from melancholic to ecstatic in the absence of recreational drugs.
What I'm suggesting is: if we were expecting SSRI's to be able to swing up to 100% of our mood, maybe our expectations were unrealistic. Or if we *demand* meds that can swing mood up to 100%, maybe recreational drugs like MDMA and psilocybin shouldn't be illicit. Of course, few doctors are willing to prescribe these. But then it should be no surprise when the effect-sizes of "meds that don't rock the boat" seem low.
I think it is astonishing that millions of people have taken these drugs, that are supposed to affect something as significant and obvious as mood, and there is a debate about whether they have any effect at all. So my inclination is to suspect they do indeed have little effect. (This is also "supported"/prejudiced by personal experience/anecdote.)
The other thing that I draw from this is that statistics is hard, and very few people know how to do it properly.
I think there's a debate because millions of people who have taken them are common-sensically very sure they have an effect (including me!) but the studies kept showing they didn't.
Well, SSRIs have been used extensively for 40 years and the suicide rate today is HIGHER than it was back then. If psychiatry continues to use the same drugs, why are the results going to change? Maybe psychedelic medicine could be more successful. Ketamine is widely used (still no endorsement from the APA in spite of the failure of the monoamines). I read that MDMA may be approved soon (why, considering the suicide and overdose catastrophe, it hasn't been approved yet escapes me).
No rct every that shows reduction in suicide competition.
I think there was one medication in a Finnish study that showed an actual reduction in suicide attempts but sadly also showed an increased number of suicide completions!
But suicide completion by depressed people is such a rare event that you would have to have a huge number of subjects to catch a reduction. Lifetime prevalence of suicide completion is about 4% (higher for men, lower for women). If we assume the average lifespan of a depressed person is 50 years, and the study followed people for 5 years, then we'd expect suicide completion in about 0.4% of the sample. And you need quite large numbers of subjects to get enough suicides in the treated and untreated groups to statistically compare them. If you have 1000 subjects in the treated group and another thousand in the placebo group, you can expect about 4 suicides in each. Even differences that look notable to the naked eye (say 2 in treated, 5 in untreated) are not going to reach statistical significance.
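A quick check of that intuition, using the hypothetical counts from the comment above:

```python
from scipy.stats import fisher_exact

# 2 suicides among 1000 treated vs. 5 among 1000 on placebo
table = [[2, 998],
         [5, 995]]
odds_ratio, p = fisher_exact(table)
print(p)   # ~0.45 - nowhere near statistical significance
```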
Most of the uptick in suicide is at the very beginning of anti-depressant treatment. So that eases the burden on researchers somewhat.
Stating the unstated here: the popular theory for why this happens is that the antidepressant works, or has a placebo effect that mimics working, which gives the person greater physical energy and self-efficacy. This is what they need to get up the resolve to successfully attempt suicide in their still-depressed state.
Suicide is a rare event?
4% is not rare.
0.0004% is rare.
What a "rare" event is depends on your ability to distinguish it. If you can analyze the decays of a million muons per hour, 10^-7 might not be particularly rare. However, humans are larger than muons (citation needed), and the logistics of medicating a million humans per hour and studying their life outcomes are significantly more challenging.
Suicide is rare but overdose deaths are not, and there is a very thin line between the two. And overdose deaths have never been higher. The monoamine hypothesis has failed, it is time for something new.
By “overdose death “ do you mean accidental overdoses of antidepressants? Or opioid overdoses?
Aren't you ignoring at least one other explanation for why overdose deaths have never been higher? (i.e. that fentanyl is much easier to overdose on accidentally than its forebears.)
I assume that is "even" and "completion."
Two points: first, how many studies have suicide as a primary outcome? Second, there is a sad but actually quite persuasive argument that there is such a thing as being too depressed to attempt, and/or too depressed to succeed in, suicide. So an *increase* in suicide in the early stages reflects the AD improving the patient's capacity for goal-directed activity.
Is there any evidence to support the speculation that "increased suicide attempts" are a result of "improvement in depression"? I don't find that argument persuasive at all.
If "too depressed to attempt suicide" is a thing then one should be able to identify and predict that behavioral milieu. Perhaps you have research to support this, which I'd be happy to read.
No research, but from personal experience things like agency and initiative improve coming out of depression, well before affect. So it seems plausible to me.
Patient and friend of others.
I am not going to the stake over this theory, but it is the best solution I can see to the apparent paradox that ADs come with increased suicide risk.
Some other things have happened in the last 40 years too. As another commenter said, statistics is hard.
The suicide rate of a given country varies widely, mostly due to factors other than psychiatric medications.
Speaking as a depressed non-suicidal individual, I do not think that measuring the effect of depression in suicides is a good idea. I guess I would trade ten years of depressed life for eight years of non-depressed life, which would indicate a loss of 0.2 QALY per year due to depression. That might add up to be a significant amount compared to the expected QALYs lost due to suicide.
Also, suicide rates feel way easier to Goodhart than depression rates. If you tell people that they will go to hell for suicide, but to heaven if they die during a crusade, and have a high-risk crusade ongoing, that should be sufficient.
Thank you for your reply. To me the problem with this Danish study is that drugs like ibuprofen, Ambien, Ritalin, and benzodiazepines seem to have indisputable effects. And it seems that all psychoactive drugs with indisputable effects are controlled substances (more controlled than ordinary prescription drugs). There's a huge need for doctors to be able to "do something" with the many patients who come in complaining of depression, and SSRIs are basically harmless (in the eyes of most medical professionals) pills that can be handed out to nearly anyone. I mean, they are worth trying, right? (And that is basically my opinion of them, and in any case I don't want to influence people not to try treatments, and this is not medical advice, etc.)
I am a doctor too. I don't work in a field where I prescribe drugs to people any more, but I worked in general medicine for a while. I guess the little clinical experience I have from patient reports suggests SSRI's have mild effects (but on the other hand, people tend to come to professionals when they are at their worst, and depression is cyclical, and then there is the placebo effect, and the pressure people feel to say something helped a bit). So I don't know.
I have also taken them for depression myself, and they didn't have any appreciable effect on my mood. I would frankly expect them to have stronger effects on non-depressed individuals as well, if they are actually effective drugs.
Again, this is just the direction my own thinking has taken me.
Not sure why you think ibuprofen has an "indisputable" effect and antidepressants don't. Some people don't respond at all to ibuprofen. Others respond great even though it turns out they were in the placebo group. You get the same muddled results with antidepressants, where some people notice an "indisputable" effect and others don't see much improvement. What's the difference?
Well, there is no debate about whether or not ibuprofen is an effective mild painkiller. (And it also lowers fever, which can be measured.) Another thing is that ibuprofen works pretty soon after it is taken, whereas antidepressants have this odd lag of a few weeks.
When you say "some people don't respond to ibuprofen", do you mean there are people who don't respond at all for any condition at all? The percentage of such people would seem to be low, compared to the number of people who don't respond on antidepressants.
> When you say "some people don't respond to ibuprofen", do you mean there are people who don't respond at all for any condition at all?
You can't possibly study that, because you can't give someone every possible condition and then see which ones ibuprofen helps with. Studies will usually just look at the efficacy of ibuprofen for a single condition.
Studies often find fairly low efficacy for ibuprofen. Here's a quote from https://www.cochrane.org/CD011474/SYMPT_oral-ibuprofen-acute-treatment-episodic-tension-type-headache-adults:
> The outcome preferred by the International Headache Society (IHS) is being pain free after two hours. This outcome was reported by 23 in 100 people taking ibuprofen 400 mg, and in 16 out of 100 taking placebo. The result was statistically significant, but only 7 people (23 minus 16) in 100 benefited specifically because of ibuprofen 400 mg.
> The IHS also suggests a range of other outcomes, but few were reported consistently enough for them to be used. People with pain value an outcome of having no worse than mild pain, but this was not reported by any study.
So it helped 7% of people lose their headache within two hours.
Here's another study that also shows the results for having only mild pain after two hours in people with migraines: https://onlinelibrary.wiley.com/cms/asset/b41ea13a-8643-420f-9226-07c6b96e7c7f/ejp649-tbl-0002-m.jpg
NNT is 3.2 for 400mg of ibuprofen, meaning it helps about 1 in 3 people reduce their migraine to mild or no pain within 2 hours.
This isn't a bad result for a painkiller. But just like antidepressants, there are a lot of people who don't respond.
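For what it's worth, the two statistics quoted above are the same quantity in different clothes; NNT is just the reciprocal of the absolute risk reduction:

```python
# Tension headache: 23% pain-free on ibuprofen vs. 16% on placebo
print(1 / (0.23 - 0.16))   # NNT ~14: treat 14 people for one extra pain-free
# Migraine: an NNT of 3.2 implies the absolute benefit rate
print(1 / 3.2)             # ~0.31: about 31 extra responders per 100 treated
```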
Some people for metabolic or genetic reasons don’t respond to certain medications.
"Indisputable" doesn't mean it helps everyone or helps adequately.
I note that these measures would ignore a reduction from "severe" to "moderate" pain; I don't think you can take that measure and say that everyone who didn't have their pain totally removed is a non-responder to the drug.
Ibuprofen has no noticeable effect on pain for me.
It is very likely to be so.
For me codeine has very little effect. In the past it was routinely given to children, and when children reported that it didn't work, doctors tended to disbelieve them. Now we know that a sizable population (a large share in some countries) are poor codeine metabolizers: they lack activity in the enzyme pathway that converts codeine into its active form, which it needs in order to work.
Anecdotal evidence, but I've noticed that there are some headaches I get which ibuprofen solves nicely, and others for which ibuprofen does basically nothing (or which I pattern-match to the no-solution type and therefore don't take anything for). My wife is like this too, but I'm not sure I could consciously articulate the difference between the two categories. Migraines, I think, fall into the latter.
But if placebos are also pretty effective, how much of that perceived effect is itself placebo? In other words, are you very sure that (SSRI effect - placebo effect) is significant?
Of course, one can argue that, clinically, we don't ultimately care about (drug - placebo), just about (drug), because if the patient gets better, mission accomplished. But it still leaves open the question about the drug.
(drug) has lots of nasty side effects, while (Placebo Pills TM) can fairly easily be made *not* to have them, so if all the benefit is placebo, it would be a real and meaningful harm reduction to prescribe 'homeopathic' versions of the drugs.
Unless the side effects are driving the placebo effect. Something like "feel bad" -> "take pill" -> "feel side effects" -> "pill must be working" -> "feel better through thinking the pill must be working rather than because the pill is actually working"
But the placebo half of these studies are (presumably?) not getting a placebo with side effects. Even if the effect of the SSRI is entirely the sort of placebo effect you are describing, the net result should be “better than placebo”.
ETA: Or, if they *are* using placebos with superficial side effects, like a bitter taste or whatever, and this is why the SSRI does not do appreciably better, then Thor is right and we should be studying different placebos.
Regarding the placebo effect, how do you explain that many people find that one SSRI helps them and another doesn't? It seems that the placebo effect would be equal in this scenario.
This isn't much of a puzzle. Both placebo effects and SSRI responses are highly individual & variable, with different subpopulations doing different things. At the overall population level (say, in a large RCT), it can *both* be true that placebo effects are a large portion of the response *and* particular SSRIs are effective for particular subpopulations.
But placebos have an effect too. And they (the placebos) might have even more of one if they tasted bitter (unless the placebo tastes like the proposed treatment, it may not be a blind test).
It's always benefit and risk calculus.
Placebos have benefit but nearly zero risk. But many commonly prescribed rxs have small benefit but some risk.
I don't think there is a single RCT showing that any antidepressant or anti-anxiety drug actually reduces completed suicides.
So there is a question of the proper endpoint, especially when the effect size on the surrogate endpoint is so small but the risk is still there.
An Rx that prevents suicides but has little or no effect on feelings of depression is better than an Rx that has no effect on suicides but some effect on feelings of depression.
I disagree with that last sentence, actually. For values of 'some' it might be true, trivially, but "feelings of depression" are a real QoL destroyer and curing those feelings is very valuable. Especially if the drug that 'cures' suicides is removing ability rather than desire to seek death, all it's really doing is buying time for other treatments to fix the depression.
(Stopping suicides no more cures depression than giving a person a wheelchair cures paralysis; it's a meaningful symptom being treated but it is far from the whole issue)
QoL is a pretty squishy thing. It is not well operationalized.
Should we seek to increase QoL? Of course. But increasing QoL without preventing suicide (and some of these medications, in some situations, increase suicides) is no good.
This seems obviously wrong? Both preventing suicides and increasing QoL are desirable endpoints, and if there's a trade-off between them it gets more complicated, but a drug that takes a population of 1000 depressed people and makes 500 of them no longer depressed without at all changing the suicide rate is a very useful drug.
But this discounts potential harm that accompanies pharmaceutical intervention.
"Depression" that doesn't include risk of self harm may not need treatment at all.
The aim of medicine may not be maximizing hedonic value. Do wrinkles need treatment?
I think there is a world of a difference between "having a small but significant effect size" (SSRI) and "having no measurable effect size" (homeopathy).
Amen.
I went through a bad depressive spell last summer during my divorce. I could not shake it, and it was affecting my life, work, parenting, etc.
SSRI's are the only thing that allowed me to pull myself and my life together to start making changes.
As you indicate, and as a result, my bar for being convinced that these poorly understood medications do not actually help (and did not help me more than a placebo) is comically high. I'm heavily inclined to believe such a study was flawed or missed something; the complexity and specificity of most of the studies I've looked into does nothing to dissuade me from this view.
The conventional advice for medicating depression is "try a bunch of things until you find something that works."
If we take this to be a good algorithm, it immediately suggests that depression in the population is caused by a mixture of different upstream conditions, and different medications target different ones.
If any 1 drug only targets at most 10% of depression cases, you're always going to see low effect sizes even when the drugs are extraordinarily effective for the people they are suited for.
So, the debate is about an "average effect" (conditional on showing up in a psych office), and we can easily imagine models of the world where good drugs have "little effect" on the population. That doesn't imply they're useless.
The closer a drug gets to treating the condition directly, rather than its upstream causes, the larger its apparent average effect will be. Ibuprofen seems to be such a case: regardless of the cause, it helps with pain and inflammation.
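To put a rough number on the dilution, here's a back-of-the-envelope model (my own toy assumption, not anything from the cited studies) where a fraction of patients get a fixed benefit and the rest get nothing:

```python
import math

def diluted_d(p, delta):
    """Cohen's d when a fraction p of the treatment arm improves by delta
    (in control-SD units) and everyone else is unchanged. The responder
    mixture inflates treatment-arm variance to 1 + p*(1-p)*delta**2."""
    mean_shift = p * delta
    pooled_sd = math.sqrt((1.0 + (1.0 + p * (1 - p) * delta ** 2)) / 2.0)
    return mean_shift / pooled_sd

print(diluted_d(0.10, 2.0))  # ~0.18: a huge benefit for 10% reads as "small"
print(diluted_d(1.00, 2.0))  # 2.00: the same benefit applied to everyone
```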
Not sure this is a good algorithm. People take many herbal remedies that do nothing or close to nothing. Which is not to say that all herbal remedies are useless but many are.
Another example - many bodybuilders take many supplements that do either nothing or almost nothing.
Can someone clarify what effect size metric they are using? Apologies if I’m missing it, I didn’t see it in a quick skim of the linked blog post
It's probably percent improvement of treated over placebo.
It's Cohen's d (https://en.wikipedia.org/wiki/Effect_size#Cohen's_d).
An effect size of 0.5 means the drug produces a 0.5 standard deviations improvement compared to the control group.
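For anyone who wants that concretely, here's a minimal sketch of the standard pooled-SD calculation (the toy numbers are invented, not from any real trial):

```python
import statistics

def cohens_d(treated, control):
    """Cohen's d: difference in means divided by the pooled SD."""
    n1, n2 = len(treated), len(control)
    v1, v2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# Hypothetical HAM-D improvements: treated average 2 points more than placebo.
print(cohens_d([10, 7, 12, 9, 11, 8], [8, 6, 9, 7, 10, 5]))  # ~1.07 here
```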
So if you were to measure the effect on depression as "HAM-D plus a roll of 2d6", then Cohen's d would be lower, because the averages are the same but the SD is bigger? I.e., effect size measures both how powerful the intervention is and how good we are at measuring it, and we are bad at measuring depression.
Yeah, that's my understanding. That kind of makes sense though. As we get worse at measuring, the measured effect size goes down. In the extreme, if we're maximally bad at measuring such that all measurements are completely random, the effect size goes to zero.
This is partly what the Danish team in the article is suggesting: that a different measure of depression could increase effect sizes.
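The 2d6 thought experiment is easy to verify by simulation. This sketch uses hypothetical change scores (a true 2-point benefit, SD 4, both made up) and shows that adding dice noise shrinks d even though the drug's actual effect is unchanged:

```python
import random, statistics

random.seed(0)
n = 100_000

def d(a, b):
    pooled = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def roll():
    return random.randint(1, 6) + random.randint(1, 6)  # the "2d6"

control = [random.gauss(0, 4) for _ in range(n)]
treated = [random.gauss(2, 4) for _ in range(n)]
print(d(treated, control))  # ~0.50 under the clean measure

# Re-score everyone as "HAM-D + 2d6": same signal, extra measurement noise.
print(d([x + roll() for x in treated],
        [x + roll() for x in control]))  # ~0.43: d shrinks, drug unchanged
```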
Effect size gives a somewhat better idea of whether an effect is large than a simple percent change. Imagine if we found that people in country A are on average 2.3 inches taller than country B. We could report that as people in country A are 3% taller on average. Is 3% a lot? It depends on what you're studying. For height, 3% is a pretty big difference - roughly the difference in height between white male Americans and males of the tallest countries in the world (Netherlands, Germany). In a lot of other contexts, 3% is a very small difference.
Many Thanks! That made it much clearer.
I used to have a similar objection when I'd see stats like, "ambien reduces sleep latency by an average of 15 minutes".
And, it was like... who are you measuring that on?
Normal people fall asleep within 10-20 minutes, so it couldn't possibly make a healthy person fall asleep faster than that.
And everyone else is going to have a wide range of responses. My natural sleep latency varies between 2 and 6 hours. On ambien, it went down to 10-20 minutes. That's a pretty big difference in terms of, say, being able to hold down a job.
If only 10% of people react that way to the drug, it might still be highly useful for some, regardless of what the average effect is.
The value of a drug can depend highly on the response at the tail of the distribution.
That's how I used to think about this stuff, anyways.
Years later, I eventually became dependent on sleeping pills. When I stopped taking them, I went through a horrific withdrawal experience that has lasted 18 months, so far (lunesta is what ended up harming me, not ambien).
I've since gotten involved in recovery communities and spoken to hundreds of people who've been badly injured by psychiatric drugs. Mostly by benzos, but I've met some who've had terrible withdrawal from antidepressants.
That makes me equally skeptical of any studies that might have been done on withdrawal.
I mean, I'm not sure that drug makers do much research on getting people off of drugs, in general. But, to the extent that they do, they're probably going to make a study that says something like, "the average person has mild discontinuation symptoms that last 4 weeks".
That might even be true, on average, but the worst case outcome may be years of suffering, with a possibility of permanent damage.
Both the benefits and risks of drugs are probably seen best in the tail of the distributions, and trying to explain anything about drugs in terms of mean effect sizes can be misleading.
I don't believe the effect size of 0.11 for antihistamines and sneezing. I notice that it was published in 1998, in the pre-fexofenadine age (though other excellent second-generation antihistamines existed). I want effect sizes for Allegra on seasonal allergy symptoms, stat.
To me they made no difference during hay fever season. I envy those for whom it works so well.
I think my doctor doesn't believe me and keeps saying that I need to take them regularly. I have tried again and again with no results though.
All mostly a waste of money, which is why they're over the counter.
Second-generation antihistamines are amazingly effective. They're over the counter because they're mostly harmless.
They literally changed my life. I love them as much as my own children.
I'm a statistician who's done a fair bit of professional work on test design and effects estimation (in a context different from clinical trials). I don't think I've ever found it very useful to normalize the size of an effect to its standard deviation when trying to understand the importance of that effect post hoc. You can use things like signal-to-noise to plan tests, but afterwards, you mostly care about two things:
1. Did the factor (or in this case, intervention/medication) affect the outcome I care about?
2. Do I expect to see that same effect in the future? (In this case, if I give that medication to other folks in the future)
You can round statistical significance up to the second one, but I read this post as being about the first thing. There is no coherent, generalizable definition. An effect size matters if it matters. Expertise and context are very important here. Describing an effect as "large" or "small" within the context of a particular ailment or class of drugs might make sense, but I think I agree with Cuijpers et al. when they say "Statistical outcomes cannot be equated with clinical relevance."
Going a little deeper on that Cuijpers et al. (the 0.5 people Scott cited above): they actually seem to suggest that 0.24 is the more clinically-relevant effect size. Their approach of identifying a clinically-relevant difference and suggesting it as a cutoff for whether the intervention is useful is consistent with what I've recommended in the past, and sounds extremely similar to what Scott is suggesting when he says, "I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience."
Last point: when you're talking heterogeneous effects (as in the Danish team's second simulation), these notions pretty much go out the window. The average effect of a drug shouldn't be what we measure it on or how we think about prescribing it. We should probably instead think about the likelihood that the drug will have a clinically-significant effect on a given patient, or something along those lines. That's not something captured by looking at the "effect size" in cases where the drug affects some people completely differently than others.
Agreed. I don't think I've ever seen anyone report Cohen's d in an economics paper.
Normalizing effect size is meaningless for calculating utility, but it provides insight on the effect size of unknown confounders. If a normalized effect size is, say, less than 0.1, that could be caused by improper blinding and researcher biases creeping in, without outright fraud. If d > 2, that's no longer a reasonable hypothesis.
Sure, but that goes back to whether you think the effect is real, not whether you think it's important.
In this case it might make more sense to normalize the effect by the difference between the average score of the depressed vs the healthy population. If it gets the average patient halfway to healthy, for example, that seems like it would be clinically relevant.
Absolute effect sizes, not standardized effect sizes, are what you're interested in. We need much more discussion of how big the effect is in those terms (and the necessary associated discussion of how to come up with a rough estimate of how much absolute change on these questionnaires is clinically significant).
Patients want to know how much less depressed a medication will make them. They don't care about whether the drug will improve them by a factor greater than some fraction of the standard deviation of the spread of depression scores in the population!
"Patients want to know how much less depressed a medication will make them. They don't care about whether the drug will improve them by a factor greater than some fraction of the standard deviation of the spread of depression scores in the population!"
Agreed. I'd guess that, ideally, one would want to compare how much less depressed a medication will make them with the disadvantages to them of that particular medication - presumably mostly side effects.
How much of this is just a side-effect of trying to collapse a whole distribution of patients into a single "effect size" score? I'm not a statistician, but it seems plausible that something that works *really well* but only for a few people, and something that kinda works for most people, could conceivably end up with the same score, even though those are very different in practical use. Isn't what we're really looking for something more like the effect-magnitude-vs-patients-affected curve like in the first two charts here? Boiling that down into a single number not only glosses over a lot of information, it makes it far easier to demand "make the number higher" without knowing whether that is even mathematically possible.
I think this is critically important, and not recognized nearly often enough, in medical research on clinical 'syndromes' and 'disorders'.
When you're testing a treatment for a 'disease' with known pathophysiology, you can assume that it should work on almost everyone if it works at all. There will be some exceptions with weird metabolic interactions, and some people will need to discontinue for allergies/side effects, but everyone else should have some kind of positive response, and the question is whether the response is strong enough to be useful/worth the risks.
But if you're treating a clinical diagnosis with unknown pathophysiology, you can't assume uniformity. There could be significant heterogeneity in the underlying biochemical mechanisms of depression in different patients.
Not only might your measured effect size be depressed by the fact that only, say, 25% of patients have the 'right' kind of depression to be helped by your drug, but it might be further depressed if, say, 10% of patients have a kind of depression that gets worse when exposed to your drug. Which would be valuable information, but only when disaggregated; mashing it into the "effect size" stew creates a meaningless number.
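A quick simulation of that scenario (all response fractions and shifts invented purely for illustration) shows how a drug that helps 25% a lot and harms 10% collapses into one modest-looking number:

```python
import random, statistics

random.seed(1)
n = 100_000

def change():
    """One hypothetical patient's change score: 25% respond strongly,
    10% get worse, 65% just see noise (all parameters invented)."""
    u = random.random()
    if u < 0.25:
        return random.gauss(8, 3)    # the 'right' kind of depression
    if u < 0.35:
        return random.gauss(-4, 3)   # made worse by the drug
    return random.gauss(0, 3)        # unaffected

treated = [change() for _ in range(n)]
control = [random.gauss(0, 3) for _ in range(n)]

pooled = ((statistics.variance(treated) + statistics.variance(control)) / 2) ** 0.5
print((statistics.mean(treated) - statistics.mean(control)) / pooled)
# ~0.39: a 'modest' average hiding one dramatic benefit and one real harm
```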
Antidepressants clearly help some people some. Recently read a study that was touted as evidence that antidepressants keep on working even after a couple years, though I actually found the results pretty grim: People who'd been on any of 4 antidepressants and felt well and ready to stop were randomly assigned either to really tapering off the drug or to what one might call placebo tapering off: they continued taking their usual dose. (All subjects took identical generic-looking pills.). Outcome measure was how many in each group relapsed. At end of study, 52 weeks out, 56% of those who had tapered off had had depression relapses. (See! That stuff takes a licking and keeps on kicking!). BUT 39% of the people who had not tapered off had also relapsed (oh . . . ). Study is here: https://www.nejm.org/doi/full/10.1056/NEJMoa2106356
Also there's a kind of harm that SSRI's do as they're currently prescribed that I don't see talked about much. Here's a story about an actual patient, aged mid-20's, who came to me for her first meeting and gave the following account: In high school she was a gay kid not out to anybody. She was lonesome and miserable. Went to her MD and described how unhappy she was, but did not disclose her secret. She was put on an SSRI, and believes that it helped at the time. Seven years have passed since then, and she came out years ago and has a long-term girlfriend. But she feels like there's something wrong in their relationship because she has no interest in sex. She also weighs 40 or so pounds more than she did in high school. She isn't exactly depressed, but feels blah and unmotivated. She has tried to come off her antidepressant but when she does she feels terrible within a few days, with flu-like symptoms, irritability, insomnia and head zaps, so she has concluded that she must need the drug. Her health care providers have all routinely continued her SSRI prescription all these years. No one has asked whether she would like to try coming off. No one has told her that the SSRI she is taking is known to have a bad discontinuation syndrome (flu-like symptoms, irritability, insomnia and head zaps) and that if she is going to try stopping the drug she must decrease the dose very slowly.
"Help people" Maybe. But in what way? And how is that operationalized?
Placebos clearly "help people" too.
I'm no big fan of antidepressants (see the rest of my post), but what I have read (various clinical trials, lit reviews, and a couple of books) says that in antidepressant trials treatment groups show more reduction in depression than placebo groups, as measured by self-report and clinician-administered structured interviews.
"More reduction in depression than placebo".
But how much more, and with what harms? And without actually reducing suicides.
Perhaps considering the number needed to treat is also important.
Suicide is a bad endpoint to assess: it is very rare, so hard to see in the data, and it is probably only a smallish portion of the total harm depression does. It's very bad for you and the people around you if you kill yourself, but it's also really bad to be depressed, and if there are ten thousand depressed years of life for every suicide, those ten thousand years probably weigh more in the total utility calculation, even though each suicide is vastly more important than any given year of being depressed.
"Also there's a kind of harm that SSRI's do as they're currently prescribed that I don't see talked about much."
I am a licensed doctor. I always go over the typical side effects of SSRIs when starting them or at follow-up, and ask about them one by one. Sexual side effects in particular, as it seems to me that most doctors do not pay enough attention to them.
Rare side effects I don't give much room unless there is a specific reason to worry about them, because they are by definition rare, and might discourage the patient from starting a medication which might improve or even save their life. Not treating depression or anxiety with medications also has possible side effects.
So I sort of agree that a lot of medical practitioners aren't aware enough of the (sexual) side effects of SSRIs, but I'm willing to be the change I wish to see happen. The patient in your story would never leave my appointment without being thoroughly aware of her SSRI's possible sexual side effects and discontinuation symptoms.
I’m glad to hear of somebody who’s doing this. You should also tell them about weight gain. Last figure I saw was body weight increasing 3% a year. In fact there’s a grim joke about SSRI side effects:
Pt: I am miserable
Dr: Here are some pills that will make you fat and anorgasmic.
Pt: Perfect!
About the possibility that maybe each antidepressant is one of those drugs that cures certain people 100%. I believe there was a massive study done where each patient could try up to 3 or 4 antidepressants, with doctors instructed to change antidepressants after a certain period of time if a patient's improvement didn't meet a certain criterion. In fact I'm *sure* something like that was done, maybe at NIMH, but I simply cannot remember any details. Anybody here know something about that study?
I think you're thinking of STAR*D.
Yes, that's it!
I can think of two factors which lead to such ambiguous results.
1. Is depression an illness or a symptom set caused by multiple illnesses and/or bad habits?
Imagine going to the doctor and being diagnosed with "runny nose disease." Having a runny nose is not psychosomatic or imaginary. It is visible to others and even measurable. But any doctor who diagnosed a patient with "runny nose disease" would probably lose his license.
But imagine if we looked for drugs to cure "runny nose disease". An anti allergy drug would indeed reduce runny nose for those whose nose runs due to allergens. Likewise, avoiding cats would be helpful for a significant number of people. But neither therapy would help against a sinus infection. Conversely, antibiotics would help sinus infections but would do nothing for allergies to cats or pollen.
2. Even given a correct targeted cause, a drug used shotgun style would have mixed effects.
For example, I do not suffer from clinical depression, but I can get depressed if seriously deprived of sleep. Measures to increase serotonin -- such as eating high carb/low protein before bed -- can improve sleep quality in many people, including myself. But I rather like having more protein earlier in the day so I can be awake.
Serotonin boosters can thus be beneficial for sleep, but also depressing if taken at times when one wants to be awake. Who wants to be sleepy all the time?
OK, the previous paragraph was speculative, but I can say from experience that I find tryptophan supplements to be useful at times, but also quite tricky. I can take a standard half gram pill before bed and get wonderful sleep -- for a night or two. Then I get bad side effects. A quarter gram dose works better, but even then I wouldn't want to take it regularly.
So, given my personal experience with semi-natural serotonin boosters, I could see where SSRIs could both make depressed people happier AND make them want to shoot up a school or something. Dosage and timing are key.
And some people could be depressed for reasons other than insufficient serotonin at key times.
------
And all this adds up to why I sometimes put more credence in Amazon reviews than in hard science. If a bunch of people really like a drug, supplement, diet, etc., then the product in question works for SOMEBODY. Standardized scientific testing procedures can average out signals, both for and against.
> I would downweight all claims about “this drug has a meaningless effect size” compared to your other sources of evidence, like your clinical experience.
I know this was not you specifically, but we've come a long way from "shut up and multiply", eh?
Shut up and multiply is about utilitarianism. I don't think even Eliezer ever said you have to believe all studies. See eg this Less Wrong post ( https://www.lesswrong.com/posts/vrHRcEDMjZcx5Yfru/i-defy-the-data ) which I found really helpful in prefiguring the replication crisis and helping me stay sane until everyone admitted that most studies don't replicate.
I think part of the difficulty here is that the language of effect size is trying to put all clinical metrics* on the same footing (by dividing by the s.d.). This is a useful metric/way to think about the world in general. However, this abstraction obscures at least two salient characteristics that someone might care about:
1. How large are the differences between people** in absolute terms.
2. How important are such differences to people.
Consider a drug that increases lifespan by 6 months, with minimal side effects. I imagine this is the type of drug that many people care a lot about. But this is "only" an effect size of 0.1 (US mean lifespan 78 years, sd 5 years).
While not as extreme as lifespan, I imagine both mood and sleep quality to be characteristics with a) relatively high individual variation and b) quite important to people. So relatively small effect sizes have large clinical and quality-of-life significance.
In contrast, a drug that treats relatively minor ailments with very small individual variation may need a much higher ES to be worth doing.
Much of the problem here is plausibly due to a lack of an actually comparable common metric (like DALYs), instead using an artificial metric (effect sizes) that gives the *illusion* of cross-comparability. I imagine that this is an area that public health economists may be able to impart useful tips to clinicians.
*and intervention analysis metrics in general
**I was originally thinking of the question in terms of general population distributions, but I suppose what you care about more is the population under study.
One molecule may cure only a fraction of patients, but there are many molecules available, so psychiatrists try one after another until remission. This sequential process has been studied in the STAR*D trial.
Results: "With persistent and vigorous treatment, most patients will
enter remission: about 33% after one step, 50% after two
steps, 60% after three steps, and 70% after four steps
(assuming patients stay in treatment)."
So the whole issue is to find the right molecule for each patient. Clinical trials that test one molecule on random patients cannot capture this.
Source: Gaynes, B. N., Rush, A. J., Trivedi, M. H., Wisniewski, S. R., Spencer, D., & Fava, M. (2008). The STAR*D study : Treating depression in the real world. Cleveland Clinic Journal of Medicine, 75(1), 57‑66. https://doi.org/10.3949/ccjm.75.1.57
Warden, D., Rush, A. J., Trivedi, M. H., Fava, M., & Wisniewski, S. R. (2007). The STAR*D project results : A comprehensive review of findings. Current Psychiatry Reports, 9(6), 449‑459. https://doi.org/10.1007/s11920-007-0061-3
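Backing the per-step remission rates out of those cumulative figures is simple arithmetic, and it shows each individual step remits only about 20-33% of the patients who try it, i.e. exactly the regime where single-drug trials look weak:

```python
# Per-step remission rates from the cumulative figures quoted above
# (33%, 50%, 60%, 70% after steps 1-4). Pure arithmetic, no extra assumptions.
cumulative = [0.33, 0.50, 0.60, 0.70]
prev = 0.0
for step, c in enumerate(cumulative, start=1):
    rate = (c - prev) / (1 - prev)  # remission among those still unremitted
    print(f"step {step}: {rate:.0%} of remaining patients remit")
    prev = c
# step 1: 33%, step 2: ~25%, step 3: 20%, step 4: 25%
```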
I'm confused how we land on the underlying measurement tools for these things. The HAM-D test you describe seems (from your description, at any rate) very poorly calibrated for this application, like using my bathroom scale as a kitchen scale then wondering why the bread doesn't come out right. Same with the comment above noting that normal people take ~15 minutes to fall asleep making them a horrible reference for Ambien.
Am I missing something? I would have thought basically everyone's incentives aligned toward using tests likely to actually find something, but then what drives the selection of measurement tools for these types of studies? Is it just much, much harder to create good medical diagnostic tools (even for something like insomnia or post-surgical pain) than I'm imagining?
There are two major omissions in this post, as far as I can see: depressed people in the control arms of these studies may not be responding to placebo (they spontaneously improve, which is a small but relevant difference), and the drugs listed in the chart as having low effect sizes are well known to practicing physicians to be clinically useless, but persist in practice secondary to commercial interests and institutional capture.
Depressive episodes are self-limiting in a large majority of patients, meaning that even absent a placebo intervention, in 6-8 months they will start feeling better. This is at least what my (non-American) textbooks said about the issue when I read them for licensing. This is different from a placebo effect in that it makes for an even less productive uphill battle for any intervention trying to treat it, since you don't get to profit from the placebo effect to a meaningful degree in either the intervention or the control group. There is no real reason to expect a benefit in a large number of patients receiving antidepressants or placebo to begin with; at most they may benefit by having their episode shortened by a few weeks, which is difficult to prove statistically and not something the trials assess in a meaningful way. There has never been a study, afaik, comparing placebo vs. open-label doing nothing in depression; were it run, I'd wager it would return a null. This is a largely unexplored problem in the literature, one that is consistently used to inflate antidepressant activity: they are a bit better than placebo, and placebo effects in depression are large. But are they really? The literature can't tell, as far as I have read it.
On the latter point, there has been endless debate about the efficacy of blood pressure meds, statins, and every other tertiary prevention program for exactly that reason: the effect sizes are marginal, the NNT is enormous, and the data are largely there to conclude there has been no real benefit from the deluge of prescriptions that started around the 1990s. Yet the guidelines keep expanding indications without good data; it has long been speculated that this is due to commercial influence on the committees and on medical research as a whole. So any argument that basically says "marginally effective cardiovascular drug A has a similar effect size to antidepressants, so they must work" ignores what is basically a consensus among the more perceptive GPs: that neither has any discernible effect in actual practice, and both are sold through aggressive marketing and distorted perceptions of risk and reward in the professional community. So the findings of the second study aren't all that surprising, since antidepressants end up right where they belong, next to the other largely ineffective medications. It's just that this proves Kirsch's point, not his critics'...
A very thoughtful post. Thank you
I don't find the high drop-out rate exculpatory.
The side effects of SSRIs are very similar to the physical symptoms of anxiety, and I believe they're being given wholly inappropriately to people with anxiety. When my mother was an outpatient for paranoid delusions caused by sleep deprivation, which was in turn caused by anxiety, she was given SSRIs. The single most common side effect of SSRIs is sleep problems, so of course they were wholly inappropriate, worsened her mental state, and she had to be hospitalised. It wasn't until she was given benzos that she was able to sleep for several nights in a row and the problem resolved itself.
When I was suffering from sleep problems caused by anxiety, I was also prescribed them. I suffered from sleeplessness, hot flashes, and night sweats, and it made my heart race. The side effects are quite severe and definitely worsened my mental state, though fortunately I don't suffer from the same psychiatric sleep problem my mother does, so the consequences weren't as bad. I was given a benzo-like drug (Zopiclone) which I used sparingly (knowing the risk of addiction) and which was very helpful in getting me through the worst of it. No thanks to the SSRIs!
Sorry to hear about your experience but SSRIs have a good evidence base in treating anxiety disorders, particularly at higher doses - https://pubmed.ncbi.nlm.nih.gov/30479005/
You are correct that when starting or changing doses of SSRIs, the side effects can be similar to anxiety symptoms, but these are usually transient (so for most people they are worth pushing through).
My point is that the effect size being small "because of the high drop out rate" isn't exculpatory because for this subset of the population SSRIs are genuinely bad. Having to drop out because the side effects are too severe to tolerate is a genuine signal there's a problem with the medication.
I halfway expect SSRIs "work" for anxiety because once you adjust to the dosage, you feel like your anxiety symptoms have improved, merely because the side effects have subsided, not because your original problem is actually better.
I don't think the data support this theory - in RCTs anxiety scores improve with SSRIs steadily over 0 to 12 weeks of taking the drug (at no time-point are anxiety scores higher than they were at baseline).
As medications go, SSRIs are generally well tolerated. Obviously that doesn't mean they don't have side-effects (like all medications) but for most people these are bearable.
For a strict majority of people, it's clearly bearable. But a 1/3 dropout rate is nevertheless really high. I feel misled by claims that the side effects of SSRIs are generally mild.
If the primary symptom of anxiety you are struggling with is insomnia, you should not be prescribed a medication that itself causes insomnia in 30% of people, even if that's a "minority". That is unacceptably high.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC181155/
It seems to me that traditional effect size criteria underestimate how effective antidepressants are, but traditional "how bad are the side effects" standards probably also underestimate how unpleasant they are. That suggests we should be paying attention to the side effects more when prescribing antidepressants, but it also suggests a systematic problem with how we're measuring both kinds of things.
Anecdotally, on the forums I visit, whenever SSRIs come up people who used them for depression are pretty meh about them, and people who used them for GAD are exceptionally positive about them. Lexapro has effectively cured my GAD.
Same here (SSRIs destroy my anxiety disorders). One of my frustrations with the studies discussed here - and in most discussions - is that they focus on depression, and I never hear about effectiveness for anxiety treatment.
The best antidepressants are the oldest ones — the MAO Inhibitors Parnate and Nardil.
I was quite surprised at how long effective treatments for depression have been available. I'd have assumed that until around, say, the 1950s a patient telling a doctor they were depressed would be urged in reply to pull themselves together and snap out of it, or else prescribed aspirin! But one reads in an obituary of mathematician J E Littlewood on page 332 of:
https://royalsocietypublishing.org/doi/pdf/10.1098/rsbm.1978.0010
"Throughout his long life he lost little of his will power or of his mental force and clarity. Moreover, from 1957, he was freed from his earlier periods of depression. A perspicacious psychiatrist, having traced their origin to a fault in the functioning of the central nervous system rather than to any adverse circumstances, surveyed the recent advances in knowledge of antidepressant drugs and successfully prescribed treatment giving protection from recurrence of the symptoms."
Seems like a more intuitive way to quantify effect size would be something like "cure rate". Like take P to be the outcome score for the placebo group, X to be the outcome score for the treatment group, and H to be the outcome score for healthy people. The cure rate could be defined as (X-P)/(H-P).
Two limiting cases for intuitively interpreting it:
* If the drug effectiveness varies from person to person but it is always either 100% effective or 0% effective, then this number gives the % of people it worked on.
* If the drug effectiveness is consistent across people then this number gives fraction by which they were moved to the healthy group.
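A minimal sketch of the proposed metric, with hypothetical HAM-D numbers just to show the units work out:

```python
def cure_rate(x, p, h):
    """The commenter's proposed metric: fraction of the placebo-to-healthy
    gap closed by treatment. x = treated outcome, p = placebo outcome,
    h = healthy outcome (all on the same scale)."""
    return (x - p) / (h - p)

# Hypothetical end-of-trial HAM-D means: placebo 14, treated 12, healthy 3.
print(cure_rate(12, 14, 3))  # ~0.18, i.e. ~18% of the gap closed
```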
You've pretty much reinvented (slightly transformed) Number Needed to Treat, if I'm reading you correctly. My impression is that NNT if anything gives more counterintuitive results - widely accepted medications have confusingly large NNTs.
Weirdly I have run into anti-ibuprofen people. Mainly in Europe where there is a strange bias against taking medication of any kind. They’ll tell you it doesn’t actually work much or at all, it’s just placebo, and there are dangerous side effects.
I'm one of the people for whom ibuprofen does nothing. Paracetamol (acetaminophen) works, and aspirin works best of all but kills my stomach.
There are some serious side effects; I imagine they're rare, but MedlinePlus and NHS Inform both warn about them:
https://medlineplus.gov/druginfo/meds/a682159.html
https://www.nhsinform.scot/tests-and-treatments/medicines-and-medical-aids/types-of-medicine/ibuprofen
I am somewhat similar; ibuprofen does work on me, but requires a borderline dangerous dose to do so. Acetaminophen works fine.
EDIT: I'm also sadly immune (or the closest thing to it) to the beneficial effects of pot and alcohol. If I smoke/drink/ingest enough of it, I get dizzy, but that's all the effect I ever get (besides the usual hangover symptoms). Maybe the two phenomena are connected somehow?
I'm not "anti-ibuprofen" but I've never noticed it having much of an effect on me. Same with Tylenol and most pain-relievers you can buy easily in stores. OTOH Anti-Biotics, Recreational Drugs, and ADHD medication I've tried always have an obvious, very easily noticeable effect
Fantastic post. The part that resonated with me most was how effect sizes take the mean effect on a population rather than looking at sub populations where the effect is large and meaningful once it’s undiluted by the non responders. A classic peril of cohort studies
https://open.substack.com/pub/zantafakari/p/5-the-perils-of-the-cohort-study?r=p7wqp&utm_medium=ios&utm_campaign=post
Somewhat unrelatedly, I would love if you would cover (or perhaps you already have?) people who respond immediately to SSRIs. I feel the effects of very low doses (5mg fluoxetine) the next day. I assumed it was placebo at first but since then I’ve learned that responding to increased serotonin bioavailability may be a thing in PMDD, which I qualify for. I take 5mg a night during my luteal phase and it really helps! It’s kind of a new treatment that not all psychiatrists know about, and I would love to know more about it.
It would be interesting if you could devise a placebo study for yourself.
lol my husband would probably not appreciate that
The thing is, you only need to think about studies and effect sizes and stuff for things which don't work all that well. If we pick some slightly unusual gourmet mushrooms and cook and eat them with a mate who goes into convulsions, we don't dismiss the effect as an uncontrolled retrospective observational study with n=1 and no ethics clearance, we stop eating the mushrooms. Conversely opium and antibiotics were adopted by the medical profession before well designed studies were invented, because they obviously work.
For the practical question of "should I eat this mushroom", sure. But the effect size of a hypothetical mushroom study would be very helpful for calibrating the metrics being used.
This was a really interesting article, thanks for posting!
Recently, I was asked if I wanted to pay £1500 for "pharmacogenetic testing" (I think I got the term right). I looked it up because I was skeptical, and I found an anti-psychiatry Reddit thread [1] arguing that its effect size was small. This confirmed what I thought, and I decided not to pay. This article has made me think again.
[1] - https://www.reddit.com/r/Antipsychiatry/comments/w103f9/has_anyone_done_the_pharmacogenetics_gene_test_to/
Has SSC/ACT ever done a deep dive into Robin Hanson's argument that much of healthcare overall is wasteful? See, for example, the Oregon Medicaid Experiment.
Maybe we should start with Ivan Illich's Medical Nemesis.
With regard to the chart by Leucht, I was wondering what measure of effect size they use to get values above one. Certainly not Pearson's r. SMD presumably denotes standardized mean difference (essentially Cohen's d), which can exceed one.
https://en.wikipedia.org/wiki/Effect_size
--
I think that instead of having a fixed cut-off at a given effect size, one will obviously want to do a cost-benefit analysis over likely treatment options.
I would imagine that the effect size of the "medication" of food to avoid starvation is close to r=1 over placebo (e.g. sawdust). This would highly recommend diets of caviar, french fries, radioactive onions, or what is commonly considered a healthy balanced diet as a method to prevent starvation. However, these treatments vary very much by side effects and costs.
For depression, the sad truth is that we do not have a magic bullet. There is no medication which can simply set HAM-D the way one can adjust blood pressure or thyroxine levels. If such a medication existed, one would use it instead of the stuff with small effect sizes. I think effect sizes are useful to compare different medications (for example, I would want a patent-covered medication costing big bucks to have a significantly bigger effect size to be worth it) and to measure where in the "prescribing effective treatment" vs "grasping at straws" spectrum we are.
--
What about antibiotics for infections? From my understanding, viruses generally do not respond to them, while bacteria do. That would mean that according to the NICE criteria, antibiotics would not be indicated to treat a life-threatening infection which is caused by viruses 60% of the time (at least until the type of pathogens is confirmed). I am not a MD, but my impression is that this is not how it works.
As someone who has participated (as a patient) in antidepressant clinical trials, I will say that part of the issue is that depression presents differently in different people, it is a complex multifactorial condition, and subjectively, from my experience, scales like HAM-D basically suck at measuring depression.
If the drug clears up one factor that really bothers the patient (e.g. anhedonia, lethargy, etc) and has a large effect on their quality of life, the patient will feel like it's helping even if that effect is swamped by all the things it didn't help that are measured by the scale.
That's as opposed to e.g. metformin, near the top of the effect size chart presented, which is being scored against an objective single factor (blood glucose).
Great post! I practically agree with everything here.
Some additional considerations:
1) This is a great paper on effect sizes in psychological research (mostly correlations), where they look at well-understood benchmarks or concrete consequences, and conclude "... an effect-size r of .20 indicates a medium effect that is of some explanatory and practical use even in the short run and therefore even more important, and an effect-size r of .30 indicates a large effect that is potentially powerful in both the short and the long run. A very large effect size (r = .40 or greater) in the context of psychological research is likely to be a gross overestimate that will rarely be found in a large sample or in a replication."
https://journals.sagepub.com/doi/full/10.1177/2515245919847202
2) Estimates of what constitutes a "clinically relevant" or "clinically significant" effect in depression are all over the place. This 2014 paper by Pim Cuijpers et al uses a somewhat unconventional approach but arrives at a tentative clinical relevance cutoff of SMD = 0.24. I think it's a good illustration that we can pick a variety of thresholds using different methods, and it's not clear why any one threshold should be privileged.
https://onlinelibrary.wiley.com/doi/abs/10.1002/da.22249
3) Effect size expressed as Cohen's d is an abstract and uncontextualized statistic, which makes practical relevance quite unclear. In the case of antidepressants (and psychotherapy), however, the problem goes beyond a mere reliance on Cohen's d. A Cohen's d of 0.3 corresponds to a 2 point difference on HAM-D, and critics would say "Cohen's d of 0.3 may or may not be clinically relevant, but surely a 2 point difference on HAM-D, a scale that goes from 0-52, doesn't mean anything."
The problem with this, in my view, is that
i) a reliance on *average effect* obscures meaningful heterogeneity in response
https://awaisaftab.substack.com/p/the-case-for-antidepressants-in-2022
ii) we conflate a 2 point change in HAM-D from baseline with a 2 point HAM-D difference from placebo, and our intuitions regarding the former do not carry over well to the latter. A 2 point change in HAM-D may mean very little, but a 2 point difference from placebo could mean a lot (especially since antidepressant effects and placebo/expectancy effects are not summative)
iii) A 2 point change in HAM-D, depending on what exactly that change is, might still be quite meaningful. If the depressed mood item goes from a 4 (pervasive depressed mood) to a 2 (depressed mood spontaneously reported verbally), and nothing else changes, that 2 point change may very well be quite significant for the patient.
You've alluded to i) and ii) in your post as well, but I just wanted to make explicit that this goes beyond Cohen's d to actual differences on rating scales. Overall, I think the whole antidepressant efficacy controversy is a lesson in how a near-exclusive reliance on research statistics with arbitrary cut-offs can mislead us with regards to clinical significance of a phenomenon.
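One small footnote to ii): the d-to-points conversion is itself just a statement about the SD in play, which a two-line check makes explicit:

```python
# The equivalence above (d = 0.3 <-> 2 HAM-D points) pins down the SD assumed:
d, points = 0.3, 2.0
print(points / d)  # ~6.7 HAM-D points of spread in the trial samples
```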
What is d? That is, what is the standard deviation?
The obvious thing to do is use the population standard deviation on HAM-D. The drawback is that you might worry about systematic differences in how the test is administered, so you want to use only data from your study. But then you aren't studying the population. You are studying a sample with a large standard deviation (e.g., if you cure half of the people, the sample afterwards is diverse), which inflates the denominator and suppresses d. Is this how the standard deviation is defined?
The distribution of HAM-D on the general population is bimodal with d=5 between normal people and depressed people. You only need to cure 10% of them to achieve the d=0.5 criterion, or 20% to achieve the d=1 criterion. (Here the standard deviation is not for the population, but of the normal mode.)
But what if you include mildly depressed people? If they are in the normal part of the distribution, then HAM-D registers them as not depressed. Why did you label them depressed? Not because of HAM-D. If it cannot detect their illness, it cannot detect their cure.
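That bimodal claim is easy to check by simulation (mode separation and SD taken from the comment above; everything else invented). It also illustrates the denominator problem from the parent comment, since the cured 10% inflate the sample SD:

```python
import random, statistics

random.seed(2)
SD = 3.0       # SD of the normal mode (hypothetical units)
GAP = 5 * SD   # modes separated by d=5 in normal-mode SDs, per the comment
n = 100_000

depressed = [random.gauss(GAP, SD) for _ in range(n)]
# "Cure" a random 10% of patients: move them into the normal mode.
treated = [random.gauss(0, SD) if random.random() < 0.10 else x
           for x in depressed]

shift = statistics.mean(depressed) - statistics.mean(treated)
print(shift / SD)                         # ~0.5, measured in normal-mode SDs
print(shift / statistics.stdev(treated))  # ~0.28: the cured 10% inflate the SD
```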
Typo: “even a drug that significantly improves 100% of patients improve”
And almost no one takes into account the fact that repeated takings of the same drug invoke learning responses like classical conditioning. I wrote about this several years ago and should probably come back to it to see what's happened in the literature since 2017.
http://www.intergalacticmedicineshow.com/cgi-bin/mag.cgi?do=columns&vol=randall_hayes&article=014
You should probably mention explicitly what effect size metric is being used. I'm pretty sure I know, based on familiarity with the topic, but it's not obvious.
Cures for depression as mentioned in the Danish simulation study may not be just a theoretical construct. Psychedelics are showing a lot of promise.
https://jamanetwork.com/journals/jamaoncology/article-abstract/2803623
PubMed
https://pubmed.ncbi.nlm.nih.gov/37052904/
Judicious use of psilocybin has been very helpful for me. Cyclical major depression has been a constant in my life. I have gone back and forth with sertraline since the 90s, and I would have to say it has been helpful even if the whole effect might really be placebo; I can’t rule it out. I recently stopped taking it, just because, so we will see. Microdosing psilocybin is very tangible though. And a game changer for my psyche. I just hope the mental health authorities don’t fuck it up.
It happened before, in the 60s, and it might very well again, despite the incredible care that researchers have been taking this time around.
Someone (Sarah Constantin?) did a lot with encouraging depressed people to keep records of their moods because depressed people are bad at noticing whether their moods improve. Does this play into efforts to evaluate anti-depressants?
I think this is important. See my comment about noticing.
"the average height difference between men and women - just a couple of inches"
Only if "a couple" = "five", which I normally would not say. Supposedly average male height is 5'9" and average female is 5'4" for the whole of the USA.
So probably best to treat it as a hypothetical saying.
There _is_ an anti-ibuprofen lobby, composed of those unfortunate individuals who thought that ibuprofen was the sort of safe medication you could use whenever you felt pain -- even though many of these people suffered from chronic pain. So they kept dosing themselves with ibuprofen, often with the approval of their doctors, who in many cases thought that any improvement their patients reported was due to the placebo effect -- but so what? It's not, they thought, as if the ibuprofen could hurt their patients. This turns out to be untrue, and their patients ended up with stomach ulcers, and liver and kidney damage. So be careful when taking the stuff.
The "drug that completely cures a fraction of its users" reminds me of Tim Leary's early finding that interventions tends to make 1/3 better, 1/3 worse, and leave 1/3 the same.
http://webseitz.fluxent.com/wiki/InterventionRoulette
I’ve long been curious what you think of Whitaker’s “Anatomy of an Epidemic”. It’s been a while since I read it, but the gist was that people have life stresses that lead to depression, like the death of a relative or the loss of a job; in the old days, for most people, time and emotional support would be the great healer, but now they get antidepressants that might make that healing time more tolerable — but then can’t kick the med and so are treated as somebody who is still depressed.
I found it pretty plausible even though I also feel like SSRIs is what got me through my parents’ dementia and death. But I had a terrible time kicking the drug, with two relapses, finally succeeding only by tapering it slowly over eighteen months. (Toward the end a non-psychiatrist was puzzled by the minuscule dose I reported, from cutting pills into fractions and taking one only every few days; he snorted and said, “You might as well be smelling them.” And maybe I was overdoing the gradualness of my taper, but it worked for me.)
I don’t think Whitaker was going full-Thomas-Szasz and denying that depression is a medical thing, but rather arguing that depression is vastly over-diagnosed (because/and therefore) SSRIs are vastly overprescribed. If lots more people are getting diagnosed than should be, might that help explain the feebleness of these results?
If you’ve read Whitaker and found it bosh, this would also be interesting to hear.
> time and emotional support would be the great healer
Exactly. Two things in short supply in our culture.
I feel like this is one the most important things I've ever read. It makes me trust psychiatry a lot more.
How does this interact with "number needed to treat". That seems like a better metric to look at for effectiveness in this case. (Of course even with a NNT of 100, for that 1 person out of 100, it could be life changing.)
You can see the NNT numbers in the pictures of their simulations. As always, it's higher (ie worse) than you would expect.
Thanks Scott! I missed this the first time looking at the pictures.
These are actually lower (better) than, or about the same as, what I would expect. I believe the NNT for warfarin to prevent stroke is something like 25. So an NNT of 10.5 for B isn't so bad (unless I am reading the chart incorrectly).
According to: https://www.sciencedirect.com/science/article/pii/S0924977X16000675#:~:text=The%20number%2Dneeded%2Dto%2D,)%20in%20the%20meta%2Danalysis.
the NNT for Lexapro is between 5 and 10 depending on a variety of factors.
I'm not a medical professional though, so my perspective on NNT isn't based on any experience or training.
Thanks for this very informative post. I think it highlights the risks of allowing experts to issue binding guidance on clinical practice. We still have a lot of people (including some doctors) who think NICE is the standard of excellence we should try to implement.
“Individual variation is the norm”
Lisa Feldman Barrett
Suppose there was a doctor patient conversation that went something like this:
(After trying an SSRI)
Patient: I could feel a strong effect after 4 hours.
Doctor: SSRIs don't work that way. Some drugs have an effect that rapid, but SSRIs don't. How would you feel if I said that was placebo effect?
Patient: yes, it probably was. I am well aware of the arguments that the effects of SSRIs are mostly - possibly entirely - down to the placebo effect.
Doctor: I am really very worried by that
======
Question: why would the doctor be worried?
An observation: some people get a fast reaction from SSRIs. Current medical orthodoxy seems to be that this is entirely placebo effect - which kind of implies that the therapeutic benefit is almost entirely placebo in these cases.
Maybe we should be taking sugar pills :-)
How well do most of these studies adjust for the fact that there are likely huge pharmacogenomic differences in how these drugs are metabolized and therefore in the actual drug concentrations ("actual dosages") between individuals? Do they test everyone? If not, isn't this a really big source of potential confusion?
This article mostly made me think that the claim "this drug has a meaningless effect size" is almost always going to be true. You have a number and you're calling it the "effect size", but the term doesn't mean anything and you don't know why you're doing it at all.
Two things leap out at me from this article. I can't be sure that they are really problems, but they make me queasy.
1. The amount of improvement measured for our hypothetical patients is based on the HAM-D scale. The points on this scale do not have any associated meaning. If you changed the HAM-D instrument to report different numbers, in a fully deterministic way -- the questions would be the same, the answer options would be the same, only the reported numbers would be different -- it appears to me that the measured effect size in the new "rescaled" HAM-D would be different from the measured effect size in the same set of surveys scored under the original HAM-D. Imagine that all scores below 10 are doubled, while all scores above 10 have ten added to them. (Old 7 becomes new 14. Old 24 becomes new 34. Why would we do that? Why not? This isn't much different from adding ten questions that are highly duplicative of existing questions.)
But the actual improvement we care about is in the different survey answers, not the numbers we arbitrarily assign to them. If our methodology gives us different effect sizes for the same set of answers based on nothing but the *labels* that we give to those answers, that's telling us that we have no actual way of assigning meaning to the effect size we measure.
2. In principle, the standard deviation is a statistical construct that is equally well defined regardless of what your probability distribution looks like.
But the conclusions you'd want to draw from a measurement of standard deviation are radically different depending on what your probability distribution looks like.
I tend to suspect that general rules about what effect sizes should count are formulated with a normal distribution in mind. I don't think they will translate well to other distributions. Here, on our scale from 0 to 54 in which, by definition, a large majority of people score in the range 0-7, we have a very odd distribution. It is certainly not normal (or close to being normal); it's also unlikely to be similar enough to the score distribution of any other instrument that a lesson from one could be usefully applied to the other.
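A minimal sketch of the point-1 rescaling argument, using made-up illustrative scores rather than real trial data: the same set of survey answers, relabeled monotonically, yields a different Cohen's d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HAM-D scores: controls, and treated patients who
# improve by a few points on the same underlying answer sheets.
control = rng.integers(8, 30, size=200).astype(float)
treatment = control - rng.integers(0, 6, size=200)

def cohens_d(a, b):
    """Mean difference divided by pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

def rescale(scores):
    """The relabeling from the comment: double scores below 10,
    add ten to the rest. Order-preserving on integer scores, so
    no survey answer changes -- only the label attached to it."""
    return np.where(scores < 10, scores * 2, scores + 10)

print("original d:", cohens_d(control, treatment))
print("rescaled d:", cohens_d(rescale(control), rescale(treatment)))
# The two numbers differ even though every underlying answer sheet is
# identical: the measured effect size depends on the arbitrary labels.
```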
When all the Kirsch stuff came out I looked at the data myself, but have not since. What I noticed then: 1) For people with SEVERE depression, it was clear that fluoxetine (at the time the most-prescribed antidepressant) worked extremely well; Kirsch continually omitted this fact. 2) The HAM-D seems faulty in that it casts too wide a net. It could be argued that people who would be categorized as "mild" or even "moderate" may not actually have major depressive disorder. If the measure is faulty, all results derived from that measure are "off." Let me add that veterinarians prescribe these drugs for animals (all the time, for a wide range of animals) for a reason: they work. And if they work on animals, that undercuts the placebo-effect argument.
I don't think we should care about effect size at all, other than for assessing:
1) Does the benefit outweigh the risk (large risks should only be taken if there are large benefits)?
2) Is it greater than placebo? (If not, placebo itself is a clearly lower-risk treatment with at least as much benefit.)
Especially if the benefit of a treatment is *additive* with other treatments, many small improvements can add up to a big improvement for a patient. This is why I'm annoyed when things are dismissed for "only" improving a patient's pain score by 1 point on a 1-10 scale. If you can stack 3 of those, it's a huge QOL improvement.
Could there be a "garbage in, garbage out" effect, wherein effect size calculations are only as informative as symptom measurements?
Somewhat related: in the comments of a previous post, I semi-seriously challenged you to forecast the effect sizes and response rates of the hopefully-published-this-year phase III intranasal S-ketamine, racemic ketamine, and R-ketamine monotherapy trials. How well do you think you'd be able to forecast them?
"If standard deviation is very high, this artificially lowers effect size."
I'm just here to say that the point of an effect size is to get a standardized estimate of the effect. ***Dividing by the SD is a feature, not a bug.*** For all the other stuff (ITT, treatment effect heterogeneity, etc.), as with all analyses, GIGO applies (though I don't strictly mean "garbage" for every research consideration; more precisely, such issues limit either the generalizability or the specificity, either way reducing usefulness).
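For reference, the standardized quantity being defended here is, in its simplest two-group, equal-sample-size form, Cohen's d:

```latex
d = \frac{\bar{x}_{\text{treatment}} - \bar{x}_{\text{control}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{s_{\text{treatment}}^2 + s_{\text{control}}^2}{2}}
```

Dividing by the pooled SD is exactly what lets effects be compared across instruments with different raw units.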
Facile analogy:
Levi 501 jeans are a pretty good cure for lower-body nakedness. Say they come in 50 waist/leg combinations and we have no insight into matching combo to patient. Any given combo, tried on 50 random patients, will fit say 3, do at a pinch for another 7, and be useless for the other 40. Contrariwise, a large towel that can be wrapped round the waist will be an inferior cure but will sort of work for everyone, and so will score better in tests than any given Levi's combo, and therefore than Levi's in general.
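A quick sketch of this arithmetic (the 3/7/40 split is the comment's; the 0-10 "fit scores" are my own invention):

```python
import numpy as np

# One jeans combo, tried on 50 random patients: fits 3 perfectly
# (score 10), does at a pinch for 7 (score 5), useless for 40 (score 0).
jeans = np.array([10] * 3 + [5] * 7 + [0] * 40, dtype=float)

# The towel: an inferior fit, but it sort of works for everyone (score 4).
towel = np.full(50, 4.0)

print("mean fit, one jeans combo:", jeans.mean())  # 1.3
print("mean fit, towel          :", towel.mean())  # 4.0
# Averaged over the whole population, the towel "beats" any single combo,
# even though cycling through combos would eventually fit everyone.
```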
This is why good practitioners ask about patient experience and change the prescribed drug if it doesn't work for the individual patient. If a drug has strong effects for some people but not for others then this is the rational way to prescribe, regardless of clinical trial data. Adding together multiple populations and using a single statistical measure to try to describe the mixture is just throwing away good data on some unholy altar of "best practice".
"There’s no anti-ibuprofen lobby trying to rile people up about NSAIDs, so nobody’s pointed out that this is “clinically insignificant”. But by traditional standards, it is!"
It seems that in diseases where symptoms are relatively distinct and measurable, standards for evaluating efficacy are somewhat relaxed, whereas in disorders where symptoms are fluid and difficult to quantify, standards for evaluating drug efficacy are high. This makes some sense if we consider that it's easier to make false effect claims in the latter case than in the former; hence the intuitive raising of the acceptance threshold.
Leucht et al.'s finding raises the question of how far to trust the observations of individual physicians. The number of patients any individual doctor sees is clearly too small to detect consistent effects for many treatments; otherwise there wouldn't be a need for much larger studies (leaving aside the benefits of eliminating bias in RCTs)! I'd imagine many physicians use a kind of null-hypothesis-testing approach in their assessments, with a null hypothesis of zero effect that needs strong contrary evidence to reject. Some doctors will see solid effects, while many will see anywhere from zero to "small" effects around 0.3; for the majority, a small real effect could be swallowed in the noise, leaving most physicians able to reasonably argue "I see no reliable effect" even if a small non-zero effect exists. That said, we probably don't want physicians to stop trusting the evidence of their eyes, which could reflect factors specific to that physician's local patient population. What's the right balance?
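To put rough numbers on that intuition (the parameters here are my own illustrative choices, not anything from the comment): here's what a two-sample t-test's power looks like for a true effect of d = 0.3 at panel sizes a single physician might plausibly see.

```python
# Requires statsmodels (pip install statsmodels).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (10, 25, 50, 200):
    # Probability of detecting a true d = 0.3 at alpha = 0.05,
    # with equal group sizes.
    power = analysis.power(effect_size=0.3, nobs1=n_per_group,
                           alpha=0.05, ratio=1.0)
    print(f"n = {n_per_group:3d} per group -> power = {power:.2f}")

# With a few dozen patients per arm, power sits well below 50%, so most
# individual doctors would honestly report "no reliable effect" even
# when a small real one exists.
```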
Are there tools that give physicians easy ways (i.e., easier than running your own R models, which few non-academic doctors will do) to track their own patient outcomes and to test whether their observed outcomes are in line with expected effect sizes? This could give physicians a sense of "my patients tend to do better or worse than the reported effects," possibly indicating problems with prescription preferences or differences between their patient population and the general one. Do many physicians *quantitatively* track their outcomes across patients? I'd guess many don't.
Going along with the question of how easily a doctor could detect effects in a patient, how strong of evidence is needed to justify keeping an individual patient on a given SSRI vs. trying another one? Not being familiar with psychiatric guidelines, I'd be curious to know what the recommendations are.
Many thanks for your post. Statistics are wonderful things, but ultimately, are the drugs worth having, especially if the trials are sponsored by big pharma?
I have had a few medications myself. Did you know that 'medication' is an anagram of 'decimation'?
https://alphaandomegacloud.wordpress.com/2022/09/22/medication-and-decimation/
I have looked at pharmaceuticals. 'pharmaceutical' is an anagram of 'uh a malpractice'.
https://alphaandomegacloud.wordpress.com/2023/03/04/pharmaceuticals-whats-in-them/
Given these words and big pharma's record to date, I am hardly surprised.
I did read one of your links. Yes, doctors are not as knowledgeable as they could be about diet. They still know far more about diet than the average person on the street, though. You seem to be conflating individual mistakes, which are common, with bad research. There are meta-analyses showing that masks reduce transmission from infected individuals to healthy ones.
You're throwing out quite a lot of evidence created by tens or hundreds of thousands of people in different countries. Not even the Soviets rejected germ theory; they used bacteriophages to treat typhus in their army.
> Statisticians have tried to put effect sizes in context by saying some effect sizes are “small” or “big” or “relevant” or “miniscule”.
I'd like to clarify that, for the most part, statisticians have been trying to explain to doctors that this is a bad idea; it's the doctors who refuse to listen and keep putting arbitrary labels on effect sizes.