Critical appraisal of clinical trials

Some might be surprised that the otherwise seemingly evidence-free environment of the ICU calls occasionally for the ability to critically evaluate clinical trials. To reflect this need, the college has asked us to describe our analytic tools on no less than four occasions:

  • Question 28 from the second paper of 2017 (Trial features which would cause you to change your practice)
  • Question 8 from the second paper of 2012 (Desirable features of a "good trial")
  • Question 6 from the first paper of 2004 (Assessment of methodological quality)
  • Question 2c from the first paper of 2001 (Criteria used to assess validity)

Validity of clinical trials

  • Internal validity: the extent to which the study design eliminates bias; the degree of confidence that the causal relationship being tested is not influenced by other factors or variables.
  • External validity: the extent to which the study results can be generalised to the greater population, which is influenced by a vast array of factors:
    • The setting and the population from which the sample was selected
    • The inclusion and exclusion criteria
    • The "randomness" of the sample, and the baseline characteristics of the patients
    • The difference between the trial control group and the routine practice
    • The changes in practice since the publication of the trial
    • The use of patient-centered outcomes
    • The degree to which the surrogate outcome measures are related to patient-centered outcomes

As always, LITFL does it better. Anyway, that's what "validity" implies.  So maybe this article you are reading meets validity criteria. But how do you know a trial is worth anything?

This is why we have checklists like the one presented below.

Now, I am hardly an authority on EBM. One may rightly ask, "What the hell is this and where did you get it?"

Well. The checklist presented below is a lightly salted and fried version of Table 10.3 from Chapter 10 of Oh's Manual; that table is itself reproduced from the JAMA series (Users' Guides to the Medical Literature) which is now behind an impenetrable paywall. However, the mercenary world of evidence-based medicine is not without its good Samaritans. I refer the budget-conscious reader to the excellent quality resources offered by Oxford University; specifically the well-named CEBM (Centre for Evidence Based Medicine) which offers several checklists for critical appraisal. LITFL also has a thorough checklist, starting from the title and abstract. Beyond that, there are literally thousands of resources, offering granularity well beyond this author's stamina for statistics. Finfer and Delaney's chapter from Oh's Manual warns us that this rabbit hole is rather deep. Without submerging into the obscene swamp of statistical mathematics, I will summarize the opinions of the authors of the Oh's chapter. They only had a few main things to "articulate" when it came to this topic, and hopefully those things emerge into prominent positions in the text below.

Is the premise sound? Should I even be reading this?

Is the primary hypothesis biologically plausible?

This sounds like a totally stupid opening question, but it is relevant. A lack of explainable mechanism can be used to completely discredit a study. It forms a part of the Bradford-Hill Criteria, a group of minimal conditions necessary to prove a causal relationship. Of course, though the absence of a plausible mechanism may derail one's interpretation of a study, it does not seem to stop people from performing them, as is well demonstrated by the extensive literature examining the effect of intercessory prayer at a distance.

Is the research ethical?

It better be, otherwise how did it get published?... Most would agree that any taint of accusation is enough for people to abandon the use of your data. However, this is not always the case. An excellent precedent is set by Nazi scientists. Post (1991) laments the fact that between 1945 and 1991 forty-five studies were published which shamelessly used Nazi data on hypothermia, including such morbid applications as the estimation of cold weather survival for testing of arctic diving suits. The author argues that it is our moral duty to never use unethically acquired data, because it crosses a certain cultural boundary and carries medicine into a dark new territory. "Abomination" is the term he uses to describe this forbidden area, which "as a cultural concept has to do with establishing the line between civilisation and the moral abyss (the summum malum) around which ethics builds fences".

In summary, ignore studies published by the Nazis. A similar taint is shared by the people who perform unethical animal experiments and people who fabricate data.

Is there a sensible question posed?

The PICO framework is the basic structure for the question which one attempts to answer with one's research.

  • P - Patient population
  • I - Intervention - the treatment or test to be investigated
  • C - Comparison - some sort of standard against which the intervention is compared (or, nothing at all)
  • O - Outcome measure

One can phrase one's research question in these terms; eg. "In a population of WW1 infantrymen, does military camouflage apparel (as compared to pink feather boas) improve 28-day survival?".

Is the methodology of high quality?

Let us say I have a biologically plausible trial, and it reports some sort of result. That result is either due to a genuine treatment effect, or it is due to bias and confounding, or it is due to the trick of mischievous chance. A well designed trial minimises the effect of the latter two.

Were the inclusion/exclusion criteria appropriate?

If one has ridiculous inclusion and exclusion criteria, one may sabotage one's results. Consider: if patients who are plainly inappropriate are enrolled into an intervention trial, that intervention will appear unhelpful. Conversely, if only carefully picked patients are enrolled, external generalisability will suffer.

Was the assignment of patients to treatments randomised? 

It should have been. Non-randomised trials will always be criticised for selection bias.

Was randomisation truly random?

The best kind of randomisation is by a centralised computer. Failing that, it ends up being an "impartial third person" like the hospital pharmacist. Failing that, one ends up in the same boat as Bernard et al (2002), who demonstrated a survival benefit in out-of-hospital VF arrest survivors but were criticised for allocating therapy according to days of the week.
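As a rough illustration of what the "centralised computer" might be doing, here is a minimal sketch of permuted-block randomisation, which keeps the two arms balanced as recruitment proceeds. The function name, block size and seed are my own assumptions for the purpose of the example; a real trial would use a concealed, audited allocation service rather than a script with a known seed.

```python
import random

def permuted_block_allocation(n_patients, block_size=4, seed=2002):
    """Return a 1:1 allocation list, balanced within each block (hypothetical helper)."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)  # seeded only so that this sketch is reproducible
    allocations = []
    while len(allocations) < n_patients:
        # Each block contains equal numbers of both arms, in shuffled order
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_patients]

schedule = permuted_block_allocation(12)
print(schedule.count("treatment"), schedule.count("control"))  # 6 6
```

Contrast this with allocation by day of the week, where anybody with a calendar can predict the next assignment.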

Were the study groups homogeneous?

Unless something is horribly wrong with the randomization process, your study groups should be basically homogeneous. Furthermore, good trials can reassure their readers by presenting the baseline characteristics of each group, sometimes with a statistical test of the differences between them (where, for once, a nice high p-value is the reassuring result).

Were the groups treated equally?

In short, the only difference between the treatment group and the control group should be the actual treatment. The study should disclose any additional "PRN" treatment going on in the background.

Are there any missing patients? Is every enrolled patient accounted for? 

Losses to follow-up should be minimal. If only a few of the patients have the outcome of interest, even small losses to follow-up tend to bias the results. Ideally, no more than 20% of patients should have been lost to the investigators.

Was follow-up complete? Is the drop-out rate explained? Do we know what happened to the dropouts?

A failure to explain a large drop-out rate is suspicious. One should report it as a potential source of bias, and perform the data analysis in two ways (one where the dropouts are assumed to have had negative outcomes, and one where they are assumed to have had positive outcomes).
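The "two ways" analysis can be sketched as a best-case/worst-case calculation. The numbers below are invented for illustration, and the function names are hypothetical; the "event" is assumed to be something bad, like death.

```python
def risk(events, total):
    return events / total

def dropout_scenarios(events_t, followed_t, lost_t, events_c, followed_c, lost_c):
    """Risk difference (treatment minus control) under three assumptions about dropouts."""
    # Observed: dropouts are simply ignored
    observed = risk(events_t, followed_t) - risk(events_c, followed_c)
    # Best case for the treatment: no lost treatment patient had the event,
    # and every lost control patient did
    best = risk(events_t, followed_t + lost_t) - risk(events_c + lost_c, followed_c + lost_c)
    # Worst case for the treatment: every lost treatment patient had the event,
    # and no lost control patient did
    worst = risk(events_t + lost_t, followed_t + lost_t) - risk(events_c, followed_c + lost_c)
    return {"observed": observed, "best_case": best, "worst_case": worst}

# Made-up trial: 100 patients per arm, 10 lost to follow-up in each
result = dropout_scenarios(events_t=20, followed_t=90, lost_t=10,
                           events_c=30, followed_c=90, lost_c=10)
for scenario, difference in result.items():
    print(scenario, round(difference, 3))
```

In this made-up example the apparent benefit disappears entirely in the worst case, which is the sort of fragility a reader should be looking for when losses to follow-up are large relative to the number of events.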

Is the reporting of an appropriate quality?

This set of criteria basically sets a certain standard for trial reporting. Long gone are the days where an Einstein might be able to submit a hand-written manuscript to a scientific journal. No longer do medical articles feature rambling digressions by the author. These days, publications must follow a certain set of rules in order to meet acceptability criteria. The criteria are sensible, and ensure that the research can be understood and reproduced.

  • Methods description should be complete: the trial should be reproducible
  • Do the results have confidence intervals?
  • Results should present relative and absolute effect sizes
  • Is a CONSORT-style flow diagram of patient selection available?
  • Discussion should address limitations, bias and imprecision
  • Funding sources and the full trial protocol should be disclosed

Are the results of the study valid?

Was there blinding? Was blinding even possible? Was it double-blind? If not, at least were the data interpreters and statisticians blinded?

The methods section should contain some information about the blinding process. Basically, it becomes more relevant as your outcome measures become more subjective. Blinding is less crucial if the outcome is unambiguous and objective, like survival.

Was there allocation concealment?

You need to ensure that nobody can predict which group any given patient is going to be allocated to; otherwise whoever is recruiting could steer sicker or healthier patients into one arm, and the groups would no longer have an equal chance of developing the outcome.

Was there intention-to-treat analysis?

This is the gold standard of statistical analysis. That means that the patients' outcomes should all be analysed in the groups to which they were randomised. This decreases the attrition bias. Otherwise, how do you know the authors didn't exclude from their treatment group all the patients that were doing poorly? This makes sense when you consider that in a well designed trial the dropout loss of patients from both groups should be equal. In short, the allocation of patients should determine which group they are being analysed in, and this should not change during the trial.
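To see why this matters, consider a toy dataset (entirely invented) in which five patients randomised to the drug were too sick to receive it and subsequently died. A minimal sketch, with hypothetical function names, comparing intention-to-treat with a per-protocol analysis:

```python
# Each record: (group assigned at randomisation, treatment actually received, died?)
patients = (
    [("drug", "drug", False)] * 40
    + [("drug", "drug", True)] * 5
    + [("drug", "none", True)] * 5       # randomised to drug, never received it, died
    + [("placebo", "none", False)] * 40
    + [("placebo", "none", True)] * 10
)

def mortality(records):
    return sum(died for _, _, died in records) / len(records)

def intention_to_treat(records, group):
    # Analyse everybody in the group they were randomised to
    return mortality([r for r in records if r[0] == group])

def per_protocol(records, group, expected_treatment):
    # Analyse only those who actually received what they were allocated
    return mortality([r for r in records if r[0] == group and r[1] == expected_treatment])

itt_drug = intention_to_treat(patients, "drug")        # 10 deaths out of 50
itt_placebo = intention_to_treat(patients, "placebo")  # 10 deaths out of 50
pp_drug = per_protocol(patients, "drug", "drug")       # 5 deaths out of 45

print(itt_drug, itt_placebo, round(pp_drug, 2))
```

By intention-to-treat the drug does nothing; by dropping the five protocol violators, the same data appear to show a mortality benefit.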

If there were sub-groups, were they identified a priori?

If you pick post hoc groups, people will say that the patients in those groups were hand-picked in order to generate the impression of an effect. In scientific-sounding terms, post-randomisation group allocation introduces a risk of selection bias. Validity suffers. Better to identify the groups at the time of randomisation.

There should not be too many subgroups.

Beware the subgroup analysis. The more of these there are, the more influence random chance has on their analysis. Let's say you have a vast number of subgroups into which you have divided the study sample. The sample on its own was large enough to yield statistically significant results, but each subgroup is probably small enough to have wild and random variation in the quality of the patients, and analysis of the subgroups will yield invalid results. Each subgroup will also probably be too small, and subtle effects will not be detected.
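The influence of chance can be put into numbers. A minimal sketch, assuming the subgroup tests are independent (which they rarely are exactly) and that there is no real effect anywhere:

```python
def chance_of_false_positive(n_subgroups, alpha=0.05):
    """Probability that at least one subgroup looks 'significant' by chance alone."""
    return 1 - (1 - alpha) ** n_subgroups

for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 2))
# 1 0.05
# 5 0.23
# 10 0.4
# 20 0.64
```

In other words, with twenty subgroups one is more likely than not to find a "significant" result somewhere, even when the treatment is completely inert.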

There should be some test of interaction, rather than within-subgroup analysis.

The investigators should ask: "is the treatment effect in the subgroup different to the treatment effect outside the subgroup?" This is the right question. The wrong question would be "is the outcome in the subgroup different between the treatment part of the subgroup vs the control part of the subgroup?" In other words, the treatment effect in the subgroup should be compared to the treatment effect in the rest of the sample, rather than being tested in isolation within the subgroup.
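One simple way of asking the right question is to compare the two treatment effect estimates directly, by dividing the difference between the log relative risks by its standard error. The sketch below is one common approach rather than necessarily the method used by any given trial; the function names and the numbers are invented.

```python
import math

def log_relative_risk(events_t, n_t, events_c, n_c):
    """Log relative risk and its standard error for one subgroup."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se

def interaction_test(subgroup_a, subgroup_b):
    """Two-sided z-test: does the treatment effect differ between two subgroups?"""
    log_rr_a, se_a = log_relative_risk(*subgroup_a)
    log_rr_b, se_b = log_relative_risk(*subgroup_b)
    z = (log_rr_a - log_rr_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value from the normal distribution
    return z, p

# (events in treatment arm, n treatment, events in control arm, n control)
inside_subgroup = (10, 100, 25, 100)   # relative risk 0.40
outside_subgroup = (30, 200, 40, 200)  # relative risk 0.75

z, p = interaction_test(inside_subgroup, outside_subgroup)
print(round(z, 2), round(p, 2))
```

With these made-up numbers, the treatment looks "significant" when tested within the subgroup alone and "non-significant" outside it, yet the test of interaction gives no convincing evidence that the two effects are actually different.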

Was the statistical analysis pre-determined?

Rather than perform a whole array of statistical acrobatics, only reporting the suitable findings, one ought to decide how one will analyse one's data, and stick to it. This removes the temptation to only publish positive results.

What were the results?

How large was the treatment effect?

One can view the magnitude of the treatment effect in a number of ways:

  • ARR (absolute risk reduction)
  • RRR (relative risk reduction)
  • NNT (number needed to treat)
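These three measures are all derived from the same two event rates. A minimal sketch, using invented numbers and a hypothetical function name; exact fractions are used so that floating-point rounding does not nudge the NNT up by one.

```python
from fractions import Fraction
import math

def effect_sizes(events_control, n_control, events_treatment, n_treatment):
    cer = Fraction(events_control, n_control)      # control event rate
    eer = Fraction(events_treatment, n_treatment)  # experimental event rate
    arr = cer - eer                                # absolute risk reduction
    rrr = arr / cer                                # relative risk reduction
    # NNT is the reciprocal of the ARR, conventionally rounded up;
    # it is left undefined here if the treatment shows no benefit
    nnt = math.ceil(1 / arr) if arr > 0 else None
    return {"ARR": float(arr), "RRR": float(rrr), "NNT": nnt}

# Made-up trial: mortality of 30% in controls, 20% in the treatment group
sizes = effect_sizes(events_control=30, n_control=100,
                     events_treatment=20, n_treatment=100)
print(sizes["ARR"], round(sizes["RRR"], 2), sizes["NNT"])  # 0.1 0.33 10
```

Note how the same result can be advertised as a "33% reduction in mortality" or as "ten patients treated to save one life", depending on what the authors are trying to sell.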

How precisely was the effect estimated? (i.e. what was the 95% confidence interval)

The confidence interval should be reported, and it should be narrow. In fact, one should consider what the "no effect" value is (typically, it's either 1 for a ratio or zero for a difference). If that value falls inside the confidence interval, the results have no statistical significance.
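As a worked illustration, here is a minimal sketch of a 95% confidence interval for an absolute risk difference, using the simple normal approximation (one of several available methods, and not necessarily the one a given trial would use). The numbers and function names are invented.

```python
import math

def risk_difference_ci(events_c, n_c, events_t, n_t, z=1.96):
    """Risk difference (control minus treatment) with a normal-approximation 95% CI."""
    p_c, p_t = events_c / n_c, events_t / n_t
    diff = p_c - p_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return diff, diff - z * se, diff + z * se

def excludes_no_effect(lower, upper, no_effect=0.0):
    # For a difference the "no effect" value is zero; for a ratio it would be 1
    return not (lower <= no_effect <= upper)

# Same event rates (30% vs 20%), two different sample sizes
small = risk_difference_ci(30, 100, 20, 100)
large = risk_difference_ci(300, 1000, 200, 1000)

for name, (diff, lower, upper) in (("small", small), ("large", large)):
    print(name, round(diff, 3), round(lower, 3), round(upper, 3),
          excludes_no_effect(lower, upper))
```

The effect estimate is identical in both made-up trials; only the larger one has an interval narrow enough to exclude zero.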

Was the p value sufficiently low? The more outcome measures are used, the lower the p value should be.

The p value of 0.05, being the accepted limit of probability for significant results, is calibrated for a single outcome. If there are more outcomes being measured, the threshold for significance should decrease. Oh's mentions a Bonferroni correction as one method of adjusting it (it's far from fancy, you just divide the threshold p value by the number of outcomes being measured).
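The Bonferroni arithmetic is simple enough to fit in a few lines. In this sketch, the outcome names and their p values are invented for illustration, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_outcomes):
    """Significance threshold after dividing alpha by the number of outcomes."""
    return alpha / n_outcomes

def significant_after_correction(p_values, alpha=0.05):
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p <= threshold for name, p in p_values.items()}

# Hypothetical trial reporting five outcomes
p_values = {
    "mortality": 0.004,
    "ICU length of stay": 0.03,
    "ventilator days": 0.20,
    "delirium": 0.012,
    "renal replacement": 0.049,
}

results = significant_after_correction(p_values)
print(bonferroni_threshold(0.05, len(p_values)))  # threshold is 0.05 divided by 5
print([name for name, ok in results.items() if ok])
```

Four of these five made-up outcomes would pass an uncorrected 0.05 threshold, but only mortality survives the correction.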

Is this study helpful for me?

Is this applicable to my patient?

Basically, this is a question of generalisability. You have a patient in front of you. If the study was running in your unit, would this patient be eligible to enrol in it? Would your patient fit the selection criteria?

Does the population studied correspond with the population to which my patient belongs?

A good example of this is the FEAST study which has acted as a catalyst for people to question fluid resuscitation in sepsis. These days, it gets mentioned around the bedsides of elderly Anglo ICU patients even though the trial was performed among African children most of whom had malaria.

Is the study of effectiveness, or of efficacy?

There are two very different concepts here. A study of efficacy discards all practicality, and examines whether a treatment will work under ideal conditions. A study of effectiveness is more robust: it examines whether the treatment will work under practical bedside conditions. The way one designs an efficacy trial will be very different from the way one designs an effectiveness trial. Oh's Manual points us to a 2002 article by Hébert et al, where there is a beautiful table comparing the design features of effectiveness and efficacy trials. For one, the efficacy trial will have a very narrow range of inclusion criteria (and will exclude a whole bunch of "non-ideal" patients) whereas an effectiveness trial will be as inclusive as possible.

Were all the clinically meaningful outcomes considered?

Outcome measures may not have been sensible. Ideally, clinically meaningful patient-centered outcomes should be measured. Surrogate outcomes are usually poor substitutes, unless the surrogate outcome has been validated as a good surrogate for a clinically meaningful outcome. Consider: one may be fixated on oxygenation (which improves with HFOV) while ignoring survival (which does not).

Was practice misalignment an issue?

This occurs in RCTs when randomization disrupts the normal relationship between clinically important characteristics and therapy titration. It happens when you are studying a therapy which is titrated according to clinical factors, eg. noradrenaline dose. Randomisation of patients might disrupt your normal noradrenaline titration, and cause adverse outcomes in a subgroup of patients. Thus, misalignment may cause poor outcomes due to the inflexibility of the study protocol (see LITFL for some examples). To avoid practice misalignment, one ought to simulate the protocol to look for it, and be familiar with the routine of titration for this therapy. It would be important to include a true control group representing routine practice rather than some sort of weird control group.

Does the benefit outweigh the cost and risk?

This, unfortunately, relies on personal judgment more than easily accessible formulae.

Will this trial cause me to change my practice?

Question 28 from the second paper of 2017 asked the candidates about the features of clinical trials which would cause you to change your practice. This is slightly different to asking "what makes a valid trial" or "how do you judge high-quality evidence". There are situations where practice is changed by methodologically inferior but otherwise compelling studies; or where expertly designed trials make minimal impact in the daily practice of individuals.

A good read on this specific subject is a wonderfully titled 2016 article by John Ioannidis, "Why most clinical research is not useful." The author lists his recommendations in a table, which is reproduced here as a list in an inversion of the DerangedPhysiology norms.

Problem base. The clinical trial needs to be addressing something which is a real problem, and which needs to be fixed in some way. If there is no problem, then the trial was pointless because existing practice is already good enough (i.e. no matter how good the methodological quality, the trial can be safely ignored because your practice does not need to change). Similarly, if the problem is not sufficiently serious, the cost and consequences of changing practice outweigh the benefit.

Information Gain. The clinical trial should have offered an answer which we don't already know. 

Pragmatism. The trial should be related to a real-life population and realistic settings, rather than some idealised scenario.

Patient-centered outcome. Some might argue that research should be aligned with the priorities of patients rather than those of investigators or sponsors. 

Transparency. The trial should be reported transparently enough for its results to inspire the confidence needed to change practice.

Where did all the free EBM tools go?

Back in the time of writing their chapter for Oh's Manual, Finfer and Delaney found the JAMA User's Guide to the Medical Literature to be a free resource, available for all to see online. How the times have changed.

Thankfully, Johns Hopkins Medical School offers some of these articles for free.

I link to the separate articles below.

JAMA User's guide to the Medical Literature


Oh's Intensive Care manual: Chapter 10 (p83), Clinical trials in critical care by Simon Finfer and Anthony Delaney.

Hébert, Paul C., et al. "The design of randomized clinical trials in critically ill patients." CHEST Journal 121.4 (2002): 1290-1300.

Lachin, John M. "Properties of simple randomization in clinical trials." Controlled Clinical Trials 9.4 (1988): 312-326.

JAMA: User's guides to the medical literature; see if you can get institution access to these articles.

The CONSORT statement has its own website and is available for all to peruse.

CASP (Critical Appraisal Skills Programme) has checklists for the appraisal of many different sorts of studies; these actually come with tickboxes. One imagines reviewers wandering around a trial headquarters, ticking these boxes on their little clipboards.

CEBM (Centre for Evidence Based Medicine) also has checklists, which (in my opinion) are more informative.

Here is a link to their checklist for the critical appraisal of an RCT.

The JAMA User's Guide to the Medical Literature collection is reproduced at the Johns Hopkins Medical School.

Bernard, Stephen A., et al. "Treatment of comatose survivors of out-of-hospital cardiac arrest with induced hypothermia." New England Journal of Medicine 346.8 (2002): 557-563. The famous study from Melbourne.

Masters, Kevin S., and Glen I. Spielmans. "Prayer and health: Review, meta-analysis, and research agenda." Journal of behavioral medicine 30.4 (2007): 329-338.

Post, Stephen G. "The echo of Nuremberg: Nazi data and ethics." Journal of medical ethics 17.1 (1991): 42-44.

Ioannidis, John PA. "Why most clinical research is not useful." PLoS medicine 13.6 (2016): e1002049.