Quality and levels of evidence

This chapter is a part of Section A(a) from the old 2011 Primary Syllabus: "Describe the features of evidence-based medicine, including levels of evidence (eg. NH&MRC), meta-analysis and systematic review". The levels of evidence are also rehashed briefly in the "Levels of evidence and grading the quality of recommendations" chapter from the Required Reading section for the Fellowship Exam. For whatever reason, this topic keeps changing between the primary and fellowship curricula, as do many of these statistics topics.

In the Primary exam, this comes up as Question 19 from the second paper of 2010 and the virtually identical Question 8 from the second paper of 2013. Additionally, Question 17 from the first Fellowship paper of 2012 asked for "a classification for the levels of evidence used for therapeutic studies in EBM". It also comes up in Viva 2 from the second paper of 2012, but because I wrote them it is hardly a fact worth mentioning.

In the college answer to Question 19, a text reference was offered (Myles & Gin, Statistical Methods for Anaesthesia and Intensive Care, pp. 114-118), and some attempt was made to use this canonical resource as the main source for this summary. Unfortunately, levels of evidence are essentially untouched by that book up until the very end of page 117, where they are listed and then discussed over the span of a paragraph. Probably the only insight afforded by Myles and Gin was that "the dimensions of evidence are all important: level, quality, relevance, strength and magnitude of effect." Fortunately, good analysis can be found in "Levels of evidence" by Wright et al, 2006.

Apart from NHMRC, we have several systems of rating evidence to choose from, and some of these are added to the list below, even though according to the primary examiners that would not have attracted any marks as a part of a "good answer".

NHMRC levels of evidence

For a "good answer", the examiners wanted you to regurgitate the following:

  • Level I (evidence obtained from a systematic review of all (at least 2) relevant randomized controlled trials),
  • Level II (evidence obtained from at least one properly designed randomized controlled trial),
  • Level III (evidence obtained from other well-designed experimental or analytical studies (not RCTs)),
  • Level IV (evidence obtained from descriptive studies, reports of expert committees or from opinions of respected authorities based on clinical experience).  

That would probably be enough. One can safely stop reading here.

However, if one is cursed with an insatiable lust for hierarchical grading systems, one might already recognise that in actual fact the NHMRC levels of evidence have more strata than the college answer might suggest. The entire classification system is discussed in the NHMRC document "NHMRC additional levels of evidence and grades for recommendations for developers of guidelines". Instead of wading through the entire 23-page morass, the time-poor candidate is invited to explore Table 3 on page 15. In brief:

  • Level I: systematic review of RCTs
  • Level II: RCT
  • Level III-1: pseudorandomised trial of high quality
  • Level III-2: cohort studies or case control studies - but with a control group
  • Level III-3: cohort studies with historical controls, or no control group
  • Level IV: case series
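For those who prefer their hierarchies machine-readable, the summary list above can be sketched as a simple lookup table. This is only a minimal illustration: the study design labels and the function names (`nhmrc_level`, `is_higher_level`) are hypothetical names made up for this example, and the mapping is no more authoritative than the abbreviated list it was transcribed from.

```python
# Simplified NHMRC hierarchy, transcribed from the abbreviated list above.
# The design labels (dictionary keys) are hypothetical, invented for this sketch.
NHMRC_LEVELS = {
    "systematic_review_of_rcts": "I",
    "rct": "II",
    "pseudorandomised_trial": "III-1",
    "cohort_or_case_control_with_control_group": "III-2",
    "historical_controls_or_no_control_group": "III-3",
    "case_series": "IV",
}

# Levels ordered from the top of the hierarchy to the bottom.
NHMRC_ORDER = ["I", "II", "III-1", "III-2", "III-3", "IV"]


def nhmrc_level(design: str) -> str:
    """Return the NHMRC level for a study design label."""
    try:
        return NHMRC_LEVELS[design]
    except KeyError:
        raise ValueError(f"unknown study design: {design!r}") from None


def is_higher_level(a: str, b: str) -> bool:
    """True if level `a` sits above level `b` in the hierarchy."""
    return NHMRC_ORDER.index(a) < NHMRC_ORDER.index(b)


print(nhmrc_level("pseudorandomised_trial"))
print(is_higher_level("II", "III-2"))
```

The ordering is kept as a separate list because the level labels themselves (eg. "III-1" versus "IV") do not sort correctly as plain strings.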

To make things more confusing, there are other grading systems, which appear to have no weaker validity than the Australian NHMRC, and which have some degree of international recognition. For example:

The original 1979 Canadian Task Force levels

  • Level I: Evidence from at least one randomized controlled trial
  • Level II-1: Evidence from at least one well designed cohort study or case control study, i.e. a controlled trial which is not randomized
  • Level II-2: Comparisons between times and places with or without the intervention
  • Level III: Opinions of respected authorities, based on clinical experience, descriptive studies or reports of expert committees.

The United States Preventive Services Task Force (1989) levels

  • Level I: Evidence obtained from at least one properly designed randomized controlled trial.
  • Level II-1: Evidence obtained from well-designed controlled trials without randomization.
  • Level II-2: Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
  • Level II-3: Evidence obtained from multiple time series designs with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
  • Level III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.

Oxford Centre for Evidence-based Medicine Levels of Evidence (2009)

  • Levels:
    • 1a: Systematic reviews (with homogeneity) of randomized controlled trials
    • 1b: Individual randomized controlled trials (with narrow confidence interval)
    • 1c: All or none randomized controlled trials
    • 2a: Systematic reviews (with homogeneity) of cohort studies
    • 2b: Individual cohort study or low quality randomized controlled trials (e.g. <80% follow-up)
    • 2c: "Outcomes" Research; ecological studies
    • 3a: Systematic review (with homogeneity) of case-control studies
    • 3b: Individual case-control study
    • 4: Case series (and poor quality cohort and case-control studies)
    • 5: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"
  • Grades:
    • A – consistent level 1 studies
    • B – consistent level 2 or 3 studies or extrapolations from level 1 studies
    • C – level 4 studies or extrapolations from level 2 or 3 studies
    • D – level 5 evidence or troublingly inconsistent or inconclusive studies of any level
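The relationship between the Oxford levels and grades listed above can be sketched as a small function. This is one possible reading of the four grade definitions rather than an official algorithm: the function name `oxford_grade` and its parameters are hypothetical, the level is passed as its main number (so 1a, 1b and 1c are all 1), and the handling of extrapolation from level 4 studies is an assumption, since the list says nothing about it.

```python
def oxford_grade(level: int, consistent: bool = True, extrapolated: bool = False) -> str:
    """Map an Oxford CEBM (2009) main level to a grade of recommendation.

    A sketch of the grade list above, not an official algorithm.
    level        -- main level number, 1 to 5 (1a, 1b and 1c all count as 1)
    consistent   -- False for inconsistent or inconclusive studies
    extrapolated -- True if the evidence is being extrapolated
    """
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("level must be an integer from 1 to 5")
    # D: level 5 evidence, or inconsistent/inconclusive studies of any level
    if level == 5 or not consistent:
        return "D"
    # A: consistent level 1 studies; B if extrapolated from level 1
    if level == 1:
        return "B" if extrapolated else "A"
    # B: consistent level 2 or 3 studies; C if extrapolated from them
    if level in (2, 3):
        return "C" if extrapolated else "B"
    # C: level 4 studies. ASSUMPTION: extrapolation from level 4 is not
    # covered by the list, so this sketch leaves it at C.
    return "C"


for args in [(1,), (1, True, True), (3,), (4,), (2, False)]:
    print(args, "->", oxford_grade(*args))
```

Checking for level 5 and inconsistency first means that grade D overrides everything else, which matches the "of any level" wording in the list.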

US Agency for Healthcare Research and Quality (AHRQ, 2014) 

  • Grades:
    • High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
    • Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
    • Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.
    • Insufficient: Evidence either is unavailable or does not permit a conclusion.

What's the point of all this?

All of these classification systems are easily and quickly available from Wikipedia. With such an array of classification systems and definitions, what is one to do? When reading an article, one might see a reference to a Level II study. Is that an RCT, a systematic review, a "well-designed controlled trial without randomization", or something else entirely, defined arbitrarily by yet another bureaucratic health agency?
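The ambiguity of a bare "Level II" can be made concrete with a small sketch which collects every second-tier entry from the systems listed above. The entries are abridged from those lists; the `Entry` type, the `tier` field and the function `meanings_of_tier` are hypothetical constructs invented purely for this illustration.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    system: str
    tier: int      # main tier number: I or 1 -> 1, II or 2 -> 2
    label: str     # the label as printed by that system
    meaning: str   # abridged from the lists above


ENTRIES = [
    Entry("NHMRC", 1, "I", "systematic review of RCTs"),
    Entry("NHMRC", 2, "II", "RCT"),
    Entry("Canadian Task Force 1979", 1, "I", "at least one RCT"),
    Entry("Canadian Task Force 1979", 2, "II-1", "cohort or case control study"),
    Entry("Canadian Task Force 1979", 2, "II-2", "comparisons between times and places"),
    Entry("USPSTF 1989", 1, "I", "at least one properly designed RCT"),
    Entry("USPSTF 1989", 2, "II-1", "controlled trials without randomization"),
    Entry("USPSTF 1989", 2, "II-2", "cohort or case-control analytic studies"),
    Entry("USPSTF 1989", 2, "II-3", "multiple time series designs"),
    Entry("Oxford CEBM 2009", 1, "1a", "systematic reviews of RCTs"),
    Entry("Oxford CEBM 2009", 2, "2a", "systematic reviews of cohort studies"),
    Entry("Oxford CEBM 2009", 2, "2b", "individual cohort study or low quality RCT"),
    Entry("Oxford CEBM 2009", 2, "2c", "outcomes research; ecological studies"),
]


def meanings_of_tier(tier: int) -> Dict[str, List[str]]:
    """Group every entry of the given tier by grading system."""
    result: Dict[str, List[str]] = {}
    for entry in ENTRIES:
        if entry.tier == tier:
            result.setdefault(entry.system, []).append(f"{entry.label}: {entry.meaning}")
    return result


level_two = meanings_of_tier(2)
for system, items in level_two.items():
    print(f"{system} -> " + "; ".join(items))
```

Only under NHMRC does the second tier mean an RCT; in the other three systems an RCT of decent quality already sits in the first tier, which is why a level quoted without its parent system is close to meaningless.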

One could go quite mad reading too deeply about this. In order to gain insight while retaining vestiges of sanity, the exam candidate may wish to limit themselves to the three editorials on the levels of evidence which were published (of all places!) in the British volume of the Journal of Bone and Joint Surgery. The specific articles were by Tovey and Bognolo, by Carr, and by Horan. The editorials offer arguments and counterarguments regarding the use of this classification system.

Arguments in favour of classifying evidence into levels

  • The quality of the research must factor into the decision of whether or not the findings will be incorporated into clinical practice
  • Evaluation of research quality takes into account research methodology, and this is represented by the grading of evidence into levels.
  • Time is limited, and classification systems help clinicians whittle down their otherwise unmanageable reading lists so as to include only the highest quality of evidence.
  • Using a system of graded evidence to include or exclude studies in meta-analysis increases the confidence of the clinician in the quality of the meta-analysis 
  • The use of a simple grading system is quick and intuitive
  • There seems to be good inter-observer agreement in applying these grading systems (at least within the same system and the same journal - eg Bhandari et al, 2004)

Arguments against the hierarchical classification of evidence

  • In many cases, sound clinical practice is based on evidence which cannot be collected prospectively for ethical or logistic reasons - eg. CPR in cardiac arrest, parachutes in skydiving, or pretty much all major advances in orthopaedic surgery. In fact Harvie et al (2005) found that during a ten-year period to 2002 only 19 papers on shoulder surgery were RCTs, whereas 538 were single-centre case series which would have graded very low according to the grading system.
  • Labeling an article with a level of evidence creates a "pre-reading bias" which may produce prejudice against the paper
  • Studies of otherwise high quality and scientific merit may not find their way to publication if editors insist on publishing only high-level studies


Sackett, David L., et al. "Evidence based medicine: what it is and what it isn't." (1996): 71-72.

Brown, Gary C., Melissa M. Brown, and Sanjay Sharma. "Value-based medicine: evidence-based medicine and beyond." Ocular immunology and inflammation 11.3 (2003): 157-170.

Wood, Beverly P. "What's the Evidence? 1." Radiology 213.3 (1999): 635-637.

Cook, D. J., and M. K. Giacomini. "The integration of evidence based medicine and health services research in the ICU." Evaluating Critical Care. Springer Berlin Heidelberg, 2002. 185-197.

Kotur, P. F. "Evidence-Based Medicine in Critical Care." Intensive and Critical Care Medicine. Springer Milan, 2009. 47-57.

NHMRC additional levels of evidence and grades for recommendations for developers of guidelines

Robert Lawrence; U. S. Preventive Services Task Force Edition (1989). Guide to Clinical Preventive Services

Canadian Task Force on the Periodic Health Examination. (3 November 1979). "Task Force Report: The periodic health examination." Can Med Assoc J 121 (9): 1193–1254.

Wright, J. G., M. Swiontkowski, and J. D. Heckman. "Levels of evidence." Bone & Joint Journal 88.9 (2006): 1264-1264.

Horan, F. T. "Judging the evidence." J Bone Joint Surg [Br] 2005;87-B:1589–90

Tovey D, Bognolo G. Levels of evidence and the orthopaedic surgeon. J Bone Joint Surg [Br] 2005;87-B:1591–2

Carr AJ. Evidence-based orthopaedic surgery: what type of research will best improve clinical practice? J Bone Joint Surg [Br] 2005;87-B:1593–4

Harvie, P., et al. "The use of outcome scores in surgery of the shoulder." Bone & Joint Journal 87.2 (2005): 151-154.

Bhandari, Mohit, et al. "Interobserver agreement in the application of levels of evidence to scientific papers in the American volume of the Journal of Bone and Joint Surgery." J Bone Joint Surg Am 86.8 (2004): 1717-1720.