In their answer to Question 24 from the first paper of 2018, the college describes a non-inferiority trial as "an active control trial which tests whether an experimental treatment is not worse than the control treatment by more than a specified margin". This is almost exactly the first sentence of an excellent article by Steven Snapinn (2000), which is therefore the most sensible single reference to be recommended to time-poor exam candidates. The college question also asked for details of why one might run a non-inferiority trial, and what the limitations of such a trial might be. The objective of this chapter is to render this difficult topic into small chunks for easy storage, for the 95.9% of exam candidates who couldn't answer Question 24 in the second paper of 2018, or for the 95.5% who couldn't answer the very similar Question 23 from the first paper of 2019. It appeared again as Question 10 from the cursed first paper of 2021, but because at the time of writing we do not have official pass rates from the Plague Year, it is impossible to say whether the pass rate is getting better.
Definition of a non-inferiority trial
- Superiority trials aim to demonstrate that there is a difference between treatments, i.e. that one treatment is better than another.
- Equivalence trials aim to demonstrate that the effects differ by no more than a specific amount (the "equivalence margin").
- Non-inferiority trials aim to demonstrate that an experimental treatment is not worse than an active control by more than the equivalence margin.
In superiority trials, the hypothesis is that the experimental treatment is different from (ideally better than) the standard treatment. The null hypothesis is therefore that there really is no difference, and two-sided statistical tests are used to test it (because the experimental treatment could be better or worse). In equivalence trials the null hypothesis is that the treatments differ by at least a specified margin (the "equivalence margin"). In non-inferiority trials the null hypothesis is that the experimental treatment is worse than the standard treatment by at least that margin, i.e. the equivalence margin determines how much worse is too much.
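For those who prefer symbols, the three sets of hypotheses can be written out as follows. This is only a sketch under an assumed convention which is not from the original sources: Δ is the true difference in the rate of a bad outcome (experimental minus standard, so a positive Δ means the experimental treatment is worse), and δ > 0 is the equivalence margin.

```latex
\begin{align*}
\text{Superiority:}     \quad & H_0: \Delta = 0          \quad \text{vs.} \quad H_1: \Delta \neq 0 \\
\text{Equivalence:}     \quad & H_0: |\Delta| \geq \delta \quad \text{vs.} \quad H_1: |\Delta| < \delta \\
\text{Non-inferiority:} \quad & H_0: \Delta \geq \delta   \quad \text{vs.} \quad H_1: \Delta < \delta
\end{align*}
```

In each case the trial "succeeds" by rejecting the null hypothesis. Note that for the equivalence and non-inferiority designs, the null hypothesis is the pessimistic one (the treatments are not similar enough), which is the reverse of the familiar superiority setup.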
The diagram below is borrowed and modified from Ian A Scott (2009), and demonstrates the results and confidence interval ranges expected of the three different types of trials, when they have demonstrated that the null hypothesis is false.
Superiority trials have to have their results, confidence interval included, well over to the "favours experimental treatment" side, usually by a pre-specified margin. Equivalence trials need to have their results and confidence intervals within that margin on both sides to confirm that the two treatments are in fact equivalent. Non-inferiority trials only need their results and confidence intervals to stay on the right side of the "not much worse" margin (the "+1%" line in the diagram); there is no need to prove that the treatment is superior, and no penalty if it happens to be much better.
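The logic of reading those confidence intervals can be sketched in a few lines of Python. This is an illustration rather than anybody's official method: the function names are made up, the difference is assumed to be the rate of a bad outcome in the experimental group minus the rate in the control group (so positive numbers mean the experimental treatment is worse), the confidence interval is a simple Wald interval, and superiority is taken to mean only that the interval excludes zero.

```python
from math import sqrt


def wald_ci(events_exp, n_exp, events_ctl, n_ctl, z=1.96):
    """Approximate 95% CI for the difference in event rates (experimental minus control)."""
    p_exp = events_exp / n_exp
    p_ctl = events_ctl / n_ctl
    diff = p_exp - p_ctl
    se = sqrt(p_exp * (1 - p_exp) / n_exp + p_ctl * (1 - p_ctl) / n_ctl)
    return diff - z * se, diff + z * se


def classify(ci_low, ci_high, margin):
    """Report which conclusions a confidence interval supports.

    margin is the (positive) equivalence margin, e.g. 0.01 for "+1%".
    Positive differences mean the experimental treatment is worse.
    """
    return {
        # Whole interval on the "favours experimental treatment" side of zero
        "superior": ci_high < 0,
        # Upper limit stays below the "not much worse" line
        "non_inferior": ci_high < margin,
        # Interval sits entirely between -margin and +margin
        "equivalent": -margin < ci_low and ci_high < margin,
    }


if __name__ == "__main__":
    # Hypothetical confidence intervals, with a +1% margin
    for ci in [(-0.03, -0.01), (-0.005, 0.005), (-0.02, 0.005), (-0.005, 0.02)]:
        print(ci, classify(*ci, margin=0.01))
```

Notice that the non-inferiority check only ever looks at one end of the interval, which is why a treatment that turns out to be vastly better still passes.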
Obviously, those margins are under the control of the investigators, and therefore it would be easy to force the results into successful publishable trial territory by manipulating the equivalence margin.
Advantages of non-inferiority trials
A non-inferiority trial is appropriate when:
- A placebo treatment is unethical
- The standard treatment is exceptionally effective
- The experimental treatment is thought to be equivalent or at least not worse but not superior to the current treatment (i.e. everybody is convinced that a superiority trial would show no difference)
- The experimental treatment is expected to be similar to the standard treatment in terms of the primary outcome, but has other unrelated advantages (e.g. is cheaper, less invasive or more convenient), in which case it would be helpful to demonstrate that its efficacy is not worse.
Disadvantages of non-inferiority trials
- The standard of care you test against may be more harmful than placebo.
- Because you are not testing against placebo, a situation may arise where both treatments are similarly harmful, and you have merely demonstrated that your experimental treatment is not any more harmful than the current harmful standard of care.
- Because you are not testing against placebo, the difference in effect size between the groups is smaller, and in order to achieve satisfactory power the sample size needs to be larger (and your trial becomes more expensive).
- If the effect of the standard treatment is very close to the effect of a placebo, then the effect of the supposedly non-inferior experimental treatment may end up being very close to the placebo.
- If you test one treatment and prove that it is not much worse than the standard, and then test another treatment and prove that it is not much worse than the last, you may eventually come to a point where, after multiple non-inferiority trials, you have demonstrated that your terrible useless treatment is not much worse than the other terrible useless treatment. This is described as "biocreep", or the acceptance of progressively worse treatments.
- Equipoise is ethically necessary to run these trials, but there may be no equipoise with regard to non-inferiority (i.e. some may genuinely believe that the standard treatment is substantially superior to the experimental treatment). Considering that the null hypothesis is that the experimental treatment is much worse, some ethicists may argue that true equipoise is impossible. You basically end up asking your enrolled patients to consent to being randomised to a treatment which is believed to be inferior, or which at best might turn out to be no better.
- A poorly conducted trial (i.e. one with many protocol violations and drop-outs) will have a result which trends towards non-inferiority, because through intention-to-treat analysis the difference between the groups will be diluted. In a superiority trial this sort of sloppiness makes it harder to get a positive result; in a non-inferiority trial it makes it easier.
- The investigators are in control of the equivalence margin, which means they could decide on an inappropriately wide margin. If the margin is established after the results become available, the experimental treatment can be made to appear "not much worse" simply by adjusting the threshold for how much worse is acceptable. Even pre-specified margins might be completely arbitrary and inappropriate. There is some pressure to select an inappropriately wide limit: the wider the limit, the smaller the sample size you will require, and the cheaper your trial. This may lead to truly ridiculous conclusions. For instance, Silvio Garattini (2007) describes the COMPASS trial where "the thrombolytic saruplase was judged equivalent to streptokinase for post-myocardial infarction, even though the saruplase group had 50% more deaths than the control group".
- For a drug company, proving the non-inferiority of a new drug is less risky than trying to demonstrate its superiority. Failure to demonstrate superiority may stop the product from making its way into the market, and doesn't look as good in the promotional literature.
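Two of the points in the list above lend themselves to some quick arithmetic: the link between the width of the margin and the sample size, and biocreep. The sketch below is illustrative only. The function names and all the numbers are invented, the sample size formula is the usual normal approximation for a binary outcome (assuming both arms truly have the same event rate, a one-sided alpha of 0.025 and 90% power), and the biocreep function simply assumes the worst case, where each new treatment is worse than its predecessor by the full margin.

```python
from math import ceil
from statistics import NormalDist


def noninferiority_sample_size(event_rate, margin, alpha=0.025, power=0.9):
    """Approximate number of patients per group for a binary outcome.

    Assumes the true event rate is identical in both arms, a one-sided
    significance level alpha, and a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * event_rate * (1 - event_rate)
    return ceil(variance * (z_alpha + z_beta) ** 2 / margin ** 2)


def worst_case_benefit(benefit_over_placebo, margin, generations):
    """Worst-case benefit over placebo after a chain of non-inferiority trials.

    Each generation is assumed to lose the whole margin relative to the
    treatment it was compared against (the pessimistic "biocreep" scenario).
    """
    return benefit_over_placebo - generations * margin


if __name__ == "__main__":
    # Hypothetical trial: 10% event rate, margins of 2% and 4%
    for margin in (0.02, 0.04):
        n = noninferiority_sample_size(0.10, margin)
        print(f"margin {margin:.0%}: about {n} patients per group")

    # Hypothetical standard treatment which is 5% better than placebo
    for generation in range(4):
        benefit = worst_case_benefit(0.05, 0.02, generation)
        print(f"generation {generation}: worst-case benefit {benefit:+.0%}")
```

Because the margin is squared in the denominator, doubling it cuts the required sample size to roughly a quarter, which is exactly the temptation described above. In the biocreep example, by the third generation the "non-inferior" treatment could be worse than placebo without any individual trial having failed.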