Question 2b - 2001, Paper 1
You have taken over the directorship of a district hospital ICU. Part of your mandate is to establish a Quality Assurance program.
(b) What is the relevance of Evidence Based Medicine to your patients and how will you apply this?
College Answer
Evidence Based Medicine has been defined as the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. It is not new let alone revolutionary. Its relevance to the candidate’s practice is its ability to add to clinical experience, basic science and physiological principle.
Unfortunately an individual would be unable to review and critically assess all the literature available in all languages. Practitioners are dependent on reviews, meta-analyses and expert opinions. Many questions have yet to be answered effectively or in many cases are yet to be addressed at all. Other questions are beyond scientific assessment eg the use of no antibiotic in pneumonia. A complete appreciation of EBM requires review of the literature, audit of local practice ie techniques/management in one’s own ICU, implementation of EBM-based practice and follow-up audit of results. Although not itself assessed by trials, EBM, by scientific appraisal and review, formalises an aspect of quality improvement which should be relevant to ICU practice.
Discussion
Evidence based medicine is the system of critical evaluation of published data for applicability to the management of individual patients, or something.
The college regales us with the Sackett definition of EBM:
"Evidence based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research."
Again, one could digress extensively here, scoring virtually no marks.
Were such an essay-style question ever to return to CICM fellowship papers, one would rant creatively, using the following points as a skeleton:
Relevance of EBM to ICU practice
- Adds to clinical experience and physiological science
- Informs non-abstract bedside decision-making as well as broader department policy
- Forms an aspect of quality improvement
"How will you apply this?"
- Framing a question or series of questions, which are focused and answerable
- Literature review
- Critical appraisal of the literature
- Audit of local practice
- Integration into local practice
- Audit of outcomes and refinement of implementation strategy
References
Cook, D. J., and M. K. Giacomini. "The integration of evidence based medicine and health services research in the ICU." Evaluating Critical Care. Springer Berlin Heidelberg, 2002. 185-197.
Kotur, P. F. "Evidence-Based Medicine in Critical Care." Intensive and Critical Care Medicine. Springer Milan, 2009. 47-57.
Question 2c - 2001, Paper 1
An article appears reporting the positive effects of a new agent in a trial of 50 patients with septic shock.
(c) What criteria will you use to assess the validity of this article to your ICU?
College Answer
The criteria for assessment of such an article include:
• Is the trial's design valid and powered to achieve a result? It seems doubtful in this case, but a large effect in a specific group may be detected.
• Was the hypothesis based on valid evidence?
• Were all the entered patients accounted for?
• Were the groups equivalent after randomisation?
• Was there proper blinding of study personnel?
• Apart from the experimental intervention were the groups treated equivalently?
• Was the statistical analysis appropriate?
• How large was the treatment effect?
• Can the results be applied to my patients?
Discussion
Though not word-for-word identical, this question closely resembles Question 8 from the second paper of 2012, as well as Question 8 from the first paper of 2004. It discusses the assessment of the validity of a randomised controlled trial, which is discussed in greater detail elsewhere.
The answer is reproduced below, to simplify revision and damage SEO:
Is the premise sound?
- Is the primary hypothesis biologically plausible?
- Is the research ethical?
- If the results are valid, are there any disastrous logistical, ethical or financial implications to a change in practice?
Is the methodology of high quality?
- Were the inclusion/exclusion criteria appropriate?
- Was the assignment of patients to treatments randomised? If yes, then was it truly random?
- Were the study groups homogenous?
- Were the groups treated equally?
- Are there any missing patients? Is every enrolled patient accounted for?
- Was follow-up complete? Is the dropout rate explained? Do we know what happened to the dropouts?
Is the reporting of an appropriate quality?
- The methods description should be complete: the trial should be reproducible
- Do the results have confidence intervals?
- Results should present relative and absolute effect sizes
- Is a CONSORT-style flow diagram of patient selection available?
- The discussion should address limitations, bias and imprecision
- Funding sources and the full trial protocol should be disclosed
Are the results of the study valid?
- Was there blinding? Was blinding even possible? Was it double-blind? If not, were at least the data interpreters and statisticians blinded?
- Was there allocation concealment?
- Was there intention-to-treat analysis?
- If there were subgroups, were they identified a priori?
What were the results?
- How large was the treatment effect?
- How precisely was the effect estimated? (i.e. what was the 95% confidence interval?)
Is this study helpful for me?
- Is this applicable to my patient? i.e. would my patient have been enrolled in this study?
- Does the population studied correspond with the population to which my patient belongs?
- Were all the clinically meaningful outcomes considered?
- Does the benefit outweigh the cost and risk?
Question 14 - 2002, Paper 2
Outline the way you would calculate and how you might use the following features of a diagnostic test: sensitivity, specificity, positive predictive value and negative predictive value.
College Answer
                 Disease present    Disease absent    Total
Test positive    A                  B                 A+B
Test negative    C                  D                 C+D
Total            A+C                B+D               A+B+C+D
Sensitivity = proportion of patients with disease detected by positive test = A/(A+C)
Specificity = proportion of patients without disease detected by negative test = D/(B+D)
Positive predictive value = proportion of patients with positive test who have disease = A/(A+B)
Negative predictive value = proportion of patients with negative test who do not have disease = D/(C+D)
Very high sensitivity means few false negatives. Very high specificity means few false positives.
Discussion
This question closely resembles a whole mass of other questions:
- Question 13 from the first paper of 2005
- Question 29.2 from the first paper of 2008
- Question 19.1 from the first paper of 2010
The questions may not be identical, but they test the exact same concepts. Here's a helpful list of equations the college expects us to memorise.
Sensitivity = true positives / (true positives + false negatives)
This is the proportion of patients with disease who are correctly identified by the test.
Specificity = true negatives / (true negatives + false positives)
This is the proportion of patients without disease in whom the disease is correctly excluded by the test.
Positive predictive value = true positives / (true positives + false positives)
This is the proportion of patients with positive test results who are correctly diagnosed.
Negative predictive value = true negatives / (true negatives + false negatives)
This is the proportion of patients with negative test results who are correctly diagnosed.
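These four formulae reduce to a few lines of arithmetic. Here is a minimal sketch (the function name and all counts are invented for illustration):

```python
# A minimal sketch of the four diagnostic test metrics, computed from
# the cells of the standard 2x2 contingency table. All counts are invented.

def diagnostic_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, PPV, NPV) from 2x2 table counts."""
    sensitivity = tp / (tp + fn)  # proportion of diseased patients with a positive test
    specificity = tn / (tn + fp)  # proportion of disease-free patients with a negative test
    ppv = tp / (tp + fp)          # proportion of positive tests that are true positives
    npv = tn / (tn + fn)          # proportion of negative tests that are true negatives
    return sensitivity, specificity, ppv, npv

# Invented example: 90 true positives, 20 false positives,
# 10 false negatives, 80 true negatives
sens, spec, ppv, npv = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
print(f"Sensitivity {sens:.2f}, Specificity {spec:.2f}, PPV {ppv:.2f}, NPV {npv:.2f}")
```

Note that PPV and NPV, unlike sensitivity and specificity, change with the prevalence of disease in the tested population.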
Question 4 - 2003, Paper 2
Compare and contrast the use of the Chi-squared test, Fisher’s Exact Test and logistic regression when analysing data.
College Answer
All these tests are widely used in the statistical reporting of data and give a representation of the likelihood that a given spread of data occurs by chance.
The Chi-square(d) statistic is used when comparing categorical data (e.g. counts). Often, these data are simply displayed in a “contingency table” with R rows and C columns. Its use is less appropriate where total numbers are small (e.g. N < 20) or the smallest expected value is less than 5.
Fisher’s Exact test is used when comparing categorical data (e.g. counts), but is only generally applicable to a 2 x 2 contingency table (2 columns and 2 rows). It is specifically indicated when total numbers are small (e.g. N < 20) or the smallest expected value is less than 5.
Logistic regression is used when comparing a binary outcome (e.g. yes/no, lived/died) with other potential variables. Logistic regression is most commonly used to perform multivariable analysis (“controlling for” various factors), and these variables can be either categorical (e.g. gender) or continuous (e.g. weight), or any combination of these. The standard ICU mortality predictions are based on logistic regression analysis.
Discussion
When one is invited to "compare and contrast" things, one is well served by a table structure.
First, the prose form: much of what follows is heavily borrowed from LITFL.
Additional reading can be done, if one wishes to actually understand these concepts.
I recommend the following free online resources:
- The Chi Square Statistic from The Mathbeans Project
- Fisher's Exact Test from Wolfram Mathworld
- What is (Multivariate) Logistic Regression from LogisticRegressionAnalysis.com (which is an excellent name).
Additionally, I invite everybody to visit this page, where the author Steve Simon (presumably, somebody qualified in statistics) responds to an email he received which asked him to comment on the differences between a Chi-square test, Fisher's Exact test, and logistic regression.
Chi-square test
A statistical test commonly used to compare observed data with data we would expect to obtain according to a specific hypothesis. The chi-square test can be used to test for the "goodness of fit" between observed and expected data.
- Chi-square is the sum of the squared differences between the observed (o) and the expected (e) data: χ² = Σ(o − e)²/e
- May be inappropriate if the sample numbers are small.
- Cannot be calculated if the expected value in any category is less than 5.
Fisher's Exact Test
Another test like the Chi-square test, to compare observed data with expected data.
- Used for small data sets (where Chi-square is useless)
- Only applicable to a 2x2 contingency table
Logistic regression
- Method of predicting a binary variable (eg. dead or alive) on the basis of numerous predictive factors, to compare observed and predicted data.
- ICU mortality is predicted using logistic regression analysis
- Regression coefficients allow the contribution of different predictor variables to be analysed.
- Goodness of fit can be estimated using a variety of mathematical methods.
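To make the contrast concrete, here is a minimal sketch of the two contingency-table tests, assuming scipy is available; the 2x2 table of counts is invented for illustration:

```python
# Hedged sketch: applying the chi-square test and Fisher's exact test
# to the same invented 2x2 contingency table using scipy.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: treatment/control; columns: survived/died (invented counts)
table = [[30, 10],
         [20, 20]]

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

print(f"Chi-square: statistic={chi2:.2f}, p={p_chi2:.3f}, dof={dof}")
print(f"Fisher's exact: OR={odds_ratio:.2f}, p={p_fisher:.3f}")
```

With small expected cell counts one would prefer the Fisher p-value; logistic regression (e.g. via statsmodels or sklearn) becomes the tool of choice once additional covariates need to be adjusted for.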
Now that the prose is finished, let us tabulate the differences and similarities between these tests.
                 Chi-square test               Fisher's Exact Test           Logistic regression
Application      All three "give a representation of the likelihood that a given spread of data occurs by chance"
Specific uses    Nominal data: large samples   Nominal data: small samples   Binary outcome variables
References
The ideal reference for this is the BMJ, with their combination of rich statistics info and Old-World credibility. I link to the relevant sections of their Statistics at Square One, by T D V Swinscow.
Question 6 - 2004, Paper 1
Outline the techniques you would use to assess the methodological quality of a placebo controlled prospective randomised clinical trial.
College Answer
Various checklists are available for assessing methodological quality. One such list is that proposed by David Sackett. It includes 3 main questions: was assignment randomised and was the randomisation list concealed (minimise potential for bias)?; was follow up of patients sufficiently long and complete (ensure endpoints accurately assessed)?; were patients analysed in the groups to which they were randomised (maintain benefits of randomisation)? It also includes 3 finer points to address: were patients and clinicians (and outcome assessors) kept blind to treatment (minimise bias)?; were groups treated equally apart from the experimental treatment (ensure intervention effect is only thing being assessed)?; were the groups similar at the start of the trial (were there any potentially confounding effects that randomisation did not eliminate)? In addition to these, the study should have enrolled enough patients to be sufficiently powered to detect the perceived clinically important benefit in the primary outcome variable! Standardised criteria have also been published (CONSORT) that were recommended to facilitate consistency and clarity in studies submitted for publication, allowing the reader to more readily assess the internal and external validity of a study.
(Sackett DL et al (eds.). Evidence-based medicine. Churchill Livingstone, London. 2000.
Begg C et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996 Aug 28;276(8):637-9.)
Discussion
Though not wordforword identical, this question closely resembles Question 8 from the second paper of 2012. It discusses the assessment of the validity of a randomised controlled trial, which is discussed in greater detail elsewhere.
In brief:
Is the premise sound?
- Is the primary hypothesis biologically plausible?
- Is the research ethical?
- If the results are valid, are there any disastrous logistical, ethical or financial implications to a change in practice?
Is the methodology of high quality?
- Were the inclusion/exclusion criteria appropriate?
- Was the assignment of patients to treatments randomised? If yes, then was it truly random?
- Were the study groups homogenous?
- Were the groups treated equally?
- Are there any missing patients? Is every enrolled patient accounted for?
- Was follow-up complete? Is the dropout rate explained? Do we know what happened to the dropouts?
Is the reporting of an appropriate quality?
- The methods description should be complete: the trial should be reproducible
- Do the results have confidence intervals?
- Results should present relative and absolute effect sizes
- Is a CONSORT-style flow diagram of patient selection available?
- The discussion should address limitations, bias and imprecision
- Funding sources and the full trial protocol should be disclosed
Are the results of the study valid?
- Was there blinding? Was blinding even possible? Was it double-blind? If not, were at least the data interpreters and statisticians blinded?
- Was there allocation concealment?
- Was there intention-to-treat analysis?
- If there were subgroups, were they identified a priori?
What were the results?
- How large was the treatment effect?
- How precisely was the effect estimated? (i.e. what was the 95% confidence interval?)
Is this study helpful for me?
- Is this applicable to my patient? i.e. would my patient have been enrolled in this study?
- Does the population studied correspond with the population to which my patient belongs?
- Were all the clinically meaningful outcomes considered?
- Does the benefit outweigh the cost and risk?
Question 4 - 2004, Paper 2
Compare and contrast the roles of parametric and nonparametric tests in analysing data, including examples of types of data and appropriate tests.
College Answer
Parametric tests are used to compare different groups of continuous variables when the data is normally (or near-normally) distributed. Nonparametric tests do not make any assumptions about the distribution of data. They focus on order rather than absolute values, and are used to analyse data that is abnormally distributed (eg. significantly skewed) or data which represent ordered categories but may not be linear (eg. pain scores, ASA score, NYHA score). Commonly used parametric tests include the unpaired t-test (comparing 2 different groups with continuous variables [eg. age in males/females]) and variations of the ANalysis Of VAriance (ANOVA: comparing multiple groups with continuous variables [eg. PaO2:FiO2 ratio in medical/surgical/trauma patients]). Commonly used nonparametric tests include the Mann-Whitney U test (comparing 2 different groups with continuous variables [eg. ICU stay in males/females]) and the Kruskal-Wallis test (comparing continuous variables in more than 2 groups [eg. pain score with PCA/epidural/sc morphine]).
Discussion
You use these to figure out the p-value, i.e. the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. There are parametric and nonparametric tests.
Parametric tests
Description of parametric tests
Parametric tests are more accurate, but require assumptions to be made about the data, eg. that the data is normally distributed (in a bell curve). If the data deviate strongly from the assumptions, the parametric test could lead to incorrect conclusions.
If the sample size is too small, parametric tests may lead to incorrect conclusions due to the loss of "normality" of sample distribution.
Examples of parametric tests:
- Student's t-test (which assumes a normal distribution)
- Analysis of variance (ANOVA)
- Pearson correlation coefficient
- Regression or multiple regression
Nonparametric tests
Description of nonparametric tests
Nonparametric tests make no assumptions about the distribution of the data. If the assumptions for a parametric test are not met (eg. the distribution has a lot of skew in it), one may be able to use an analogous nonparametric test.
Nonparametric tests are particularly good for small sample sizes (<30). However, nonparametric tests have less power.
Examples of nonparametric tests:
- Mann-Whitney U test
- Wilcoxon rank-sum test
- Wilcoxon signed-rank test
- Kruskal-Wallis test
- Friedman's test
- Spearman's rank-order correlation
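As an illustration of the parametric/nonparametric pairing, here is a minimal sketch comparing the same two invented samples with an unpaired t-test and its nonparametric analogue, the Mann-Whitney U test (scipy is assumed to be available):

```python
# Hedged sketch: the unpaired Student's t-test (parametric) and the
# Mann-Whitney U test (nonparametric) applied to the same invented samples.
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [4.1, 5.3, 4.8, 5.0, 4.6, 5.2]  # e.g. ICU stay in days (invented)
group_b = [5.9, 6.4, 5.7, 6.1, 6.8, 6.0]

t_stat, p_t = ttest_ind(group_a, group_b)     # assumes near-normal distributions
u_stat, p_u = mannwhitneyu(group_a, group_b)  # compares ranks, no such assumption

print(f"t-test: p = {p_t:.4f}")
print(f"Mann-Whitney U: p = {p_u:.4f}")
```

With well-behaved data like this the two tests agree that the groups differ; with heavily skewed data, the p-values could diverge substantially, and the rank-based test would be the safer choice.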
References
Hoskin, Tanya. "Parametric and Nonparametric: Demystifying the Terms." Mayo Clinic CTSA BERD Resource (2012). Retrieved from http://www.mayo.edu/mayoedudocs/centerfortranslationalscienceactivitiesdocuments/berd56.pdf
Question 13 - 2005, Paper 1
For each of the following terms, provide a definition, outline their derivation and outline their role:
- Sensitivity,
- Specificity,
- Positive Predictive Value,
- Negative Predictive Value.
College Answer
Test             Disease present    Disease absent    Total
Positive         A                  B                 A+B
Negative         C                  D                 C+D
Total            A+C                B+D               A+B+C+D
Using the presence or absence of a disease, and the result of a specific test, as an example: Sensitivity = proportion of patients with disease detected by positive test = A/(A+C). Very high values are essential if one wishes to catch everybody with the disease, and to allow a negative result to virtually rule out the diagnosis.
Specificity = proportion of patients without disease detected by negative test = D/(B+D). Very high values of specificity are essential if one wishes to catch everybody without the disease, and to allow a positive result to rule in the diagnosis.
Positive predictive value = proportion of patients with positive test who have disease = A/(A+B). PPV allows estimate of certainty around positive result.
Negative predictive value = proportion of patients with negative test who do not have disease = D/(C+D). NPV allows estimate of certainty about a negative result.
Discussion
Later papers focus merely on the candidate's ability to apply the formulae.
One can make a strong argument for a return to questions which test one's understanding of the actual concept, rather than demanding the regurgitation of rotelearned equations.
To rote-learn the above-mentioned equations, here is a helpful list.
Sensitivity = true positives / (true positives + false negatives)
This is the proportion of patients with disease who are correctly identified by the test.
Specificity = true negatives / (true negatives + false positives)
This is the proportion of patients without disease in whom the disease is correctly excluded by the test.
Positive predictive value = true positives / (true positives + false positives)
This is the proportion of patients with positive test results who are correctly diagnosed.
Negative predictive value = true negatives / (true negatives + false negatives)
This is the proportion of patients with negative test results who are correctly diagnosed.
References
Altman, Douglas G., and J. Martin Bland. "Statistics Notes: Diagnostic tests 2: predictive values." BMJ 309.6947 (1994): 102.
Question 21 - 2005, Paper 2
“The absence of evidence of effect does not imply evidence of absence of effect”. Please explain how this statement applies to evaluation of the medical literature.
College Answer
Candidates were expected to think more broadly than just the “power” of a study. Consider:
- No evidence: the question was never asked
- Low-level evidence: physiological data only, animal data only
- Ethical barriers to conducting the definitive study
- Unanswerable for logistic reasons
- Retrospective studies / case series only
- Poorly designed existing studies (related to blinding, allocation concealment, loss to follow-up, intention to treat, uniform management apart from intervention, appropriate statistical methods, etc.)
- Meta-analysis pitfalls: significant disagreements with subsequent RCTs
- Type 2 error: false acceptance of the null hypothesis due to inadequate power, e.g. small single-centre studies
Discussion
This question recalls a more uncivilised time, when bewildered CICM fellowship candidates were assailed by vaguely worded essay questions in an attempt to wring some sort of creative lateral thinking from their algorithmic reptile brains. The resulting confusion can be observed even in the college answer, which, rather than defending any particular argument, instead exhorts us to think "broadly", and then presents us with a word salad of key phrases to consider. The modern papers are thankfully free from this sort of thing.
If one were to take this question seriously, one would structure one's response in the following manner:
Definition
“The absence of evidence of effect does not imply evidence of absence of effect” is a rebuttal to the Argument from Ignorance, which (put simply) states that if something has not been proven true, then it must be false. The rebuttal addresses the third possibility, that the currently available evidence has failed to detect a phenomenon. In the interpretation of medical literature, this means that a study that has failed to demonstrate the evidence of a risk has not succeeded in demonstrating the absence of risk. Similarly, a study which has failed to demonstrate a significant difference between two treatments has not demonstrated the absence of difference, only the absence of evidence of a difference.
Rationale
The idea that the absence of evidence for a phenomenon should imply that there is no such phenomenon is known in the form of the Kehoe principle, named after Robert Kehoe who argued that the use of leaded petrol was safe because at that stage there was no evidence to the contrary. The opposite view is known as the Precautionary Principle. It holds that in the absence of evidence, one must take a conservative stance and manage uncertain risks in a manner which most effectively serves human safety.
Advantages
In the absence of evidence, the precautionary principle recommends that the clinician takes reasonable measures to avoid threats that are serious and plausible. In this, it may be a more humanistic principle than the alternatives (such as the Expected Utility Theory).
In brief:
- Safest and most humanistic approach
- Risk-averse
- The burden of proof of safety is on the investigator
- The burden of risk and benefit analysis is on the clinician
Disadvantages
In its strongest formulation, the Precautionary Principle calls for absolute proof of safety before new treatments or techniques are adopted. Such stringent standards may result in an excessive regulation of potentially useful treatment strategies. One may envision a reductio ad absurdum where table salt is outlawed because there is insufficient evidence for its safety. Some authors have suggested that the precautionary principle "replaces the balancing of risks and benefits with what might best be described as pure pessimism". Furthermore, not all experimental questions can be answered with highlevel evidence (eg. in the case of rare diseases with insufficient sample size for RCTs, or in the cases where it is unethical to randomise intervention).
Published data may not offer sufficient evidence. The power of a study influences its ability to discern an effect of a given size, and it is possible that small studies are inadequately powered to detect a small treatment effect. Type 2 errors can be committed in this way.
In brief:
- Potentially useful treatments may be discarded for lack of evidence
- Not all treatments can be the subject of RCTs, particularly:
  - where sample size is by necessity small
  - where randomisation is unethical
  - where blinding is impossible
- Not all studies of effective treatments are appropriately powered to detect an effect of appropriate size
- Not all meta-analyses are able to find all the available evidence, due to publication bias
In summary:
There is a danger of misinterpreting "negative studies", because studies which have not found statistically significant differences in effect may have been inadequately powered to detect such an effect. In careful interpretation of medical literature one must be alert to the idea that not all negative studies are truly "negative". Decision-making in uncertainty should be guided by humanistic principles and careful risk-versus-benefit analysis.
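The type 2 error point can be demonstrated with a small simulation. In this invented scenario a treatment genuinely reduces mortality from 40% to 30%, but a 25-patient-per-arm trial usually fails to detect it (all numbers are arbitrary, and scipy is assumed to be available):

```python
# Hedged simulation: an underpowered trial of a genuinely effective treatment
# usually returns a "negative" result - absence of evidence, not evidence of absence.
import random
from scipy.stats import fisher_exact

random.seed(0)  # fixed seed so the simulation is reproducible

def simulate_trial(n_per_arm, p_control=0.40, p_treatment=0.30):
    """Simulate one two-arm trial; return True if p < 0.05 (a 'positive' trial)."""
    deaths_t = sum(random.random() < p_treatment for _ in range(n_per_arm))
    deaths_c = sum(random.random() < p_control for _ in range(n_per_arm))
    table = [[deaths_t, n_per_arm - deaths_t],
             [deaths_c, n_per_arm - deaths_c]]
    _, p = fisher_exact(table)
    return p < 0.05

detection_rate = {}
for n in (25, 500):
    detection_rate[n] = sum(simulate_trial(n) for _ in range(200)) / 200
    print(f"n={n} per arm: real effect detected in "
          f"{detection_rate[n]:.0%} of 200 simulated trials")
```

The small trials come back overwhelmingly "negative" even though the effect is real; only the large trials have the power to find it.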
References
Foster, Kenneth R., Paolo Vecchia, and Michael H. Repacholi. "Science and the precautionary principle." Science 288.5468 (2000): 979-981.
Alban, S. "The 'precautionary principle' as a guide for future drug development." European Journal of Clinical Investigation 35.s1 (2005): 33-44.
Peterson, Martin. "The precautionary principle should not be used as a basis for decision-making." EMBO Reports 8.4 (2007): 305-308.
Altman, Douglas G., and J. Martin Bland. "Statistics notes: Absence of evidence is not evidence of absence." BMJ 311.7003 (1995): 485.
Resnik, David B. "The precautionary principle and medical decision making." Journal of Medicine and Philosophy 29.3 (2004): 281-299.
Rabin, Matthew. "Risk aversion and expected-utility theory: A calibration theorem." Econometrica 68.5 (2000): 1281-1292.
Alderson, Phil. "Absence of evidence is not evidence of absence." BMJ 328.7438 (2004): 476-477.
Question 25 - 2006, Paper 1
In the context of clinical trials, define the following terms:
(a) Relative risk
(b) Absolute risk
(c) Number needed to treat
(d) Power of the study
College Answer
A number of potential definitions exist. One example for each is listed below:
Relative risk: the difference in event rates between 2 groups expressed as proportion of the event rate in the untreated group.
Absolute risk: this is the actual event rate in the treatment or the placebo group. The absolute risk reduction is the arithmetical difference between the event rates between the two groups
Number Needed to Treat: The NNT is the number of patients to whom a clinician would need to administer a particular treatment for 1 patient to receive benefit from it. It is calculated as 100 divided by the absolute risk reduction when expressed as a percentage or 1 divided by the absolute risk reduction when expressed as a proportion.
Power of the study: The probability that a study will produce a significant difference at a given significance level is called the power of the study. It will depend on the difference between the populations compared, the sample size and the significance level chosen.
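Given some invented event rates, the first three definitions reduce to a few lines of arithmetic:

```python
# Worked example with invented event rates, following the definitions above.
control_rate = 0.20    # event rate in the control (untreated) group
treatment_rate = 0.15  # event rate in the treatment group

arr = control_rate - treatment_rate            # absolute risk reduction: 0.05
relative_risk = treatment_rate / control_rate  # relative risk: 0.75
rrr = arr / control_rate                       # relative risk reduction: 0.25
nnt = 1 / arr                                  # number needed to treat: 20

print(f"ARR = {arr:.2%}, RR = {relative_risk:.2f}, RRR = {rrr:.2f}, NNT = {nnt:.0f}")
```

Power, by contrast, cannot be read off the event rates alone: it must be computed from the expected effect size, the sample size and the chosen significance level.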
Discussion
This question is a verbatim copy of Question 9 from the second paper of 2010.
Question 10 - 2006, Paper 2
In the context of a clinical trial, define and explain the significance of the following terms:
a) Intention to treat analysis.
b) Randomization.
College Answer
ITT is the process by which the patients are analysed in the group to which they are randomised.
There are four major lines of justification for intentiontotreat analysis.
1. Intentiontotreat simplifies the task of dealing with suspicious outcomes, that is, it guards against conscious or unconscious attempts to influence the results of the study by excluding odd outcomes.
2. Intentiontotreat guards against bias introduced when dropping out is related to the outcome.
3. Intentiontotreat preserves the baseline comparability between treatment groups achieved by randomization.
4. Intentiontotreat reflects the way treatments will perform in the population by ignoring adherence when the data are analyzed.
RANDOMISATION is the process of assigning clinical trial participants to treatment groups. Randomisation gives each participant a known (usually equal) chance of being assigned to any of the groups. Successful randomisation requires that group assignment cannot be predicted in advance.
Randomisation aims to obviate the possibility that there is a systematic difference (or bias) between the groups due to factors other than the intervention. Allocation of participants to specific treatment groups in a random fashion ensures that each group is, on average, as alike as possible to the other group(s). The process of randomisation aims to ensure similar levels of all risk factors in each group; not only known, but also unknown, characteristics are rendered comparable, resulting in similar numbers or levels of outcomes in each group, except for either the play of chance or a real effect of the intervention(s). Concealment of randomisation is vital.
Discussion
A brief answer to these questions is possible. However, by asking that the candidate "explain the significance" of these concepts, the college has authorised a torrent of gibberish. One could really get carried away with this.
a)
Definition of intention to treat analysis: This is the practice of grouping patient data according to the randomised allocation of the patient, rather than according to the treatment which they received.
According to Fisher et al,
"ITT analysis includes all randomized patients in the groups to which they were randomly assigned, regardless of their adherence with the entry criteria, regardless of the treatment they actually received, and regardless of subsequent withdrawal from treatment or deviation from the protocol."
Significance of intention to treat analysis:
- All enrolled patients have to be a part of the final analysis
- Maintains the prognostic balance generated by the original random treatment allocation (preserving the bias-reducing effects of randomisation)
- Avoids over-optimistic estimates of the treatment's efficacy
- Accurately models the effect of non-compliance and protocol deviations in clinical practice
- Prevents bias introduced due to outcome-associated dropouts
- Prevents bias by resisting the post-hoc manipulation of data to eliminate inconvenient outcomes
- Preserves the sample size, thus preserving the statistical power
- Minimises Type 1 error
- Allows for the greatest external validity
- Supported by the CONSORT statement
- Essential for a superiority trial
However:
- Heterogeneity may be introduced if dropouts and compliant subjects are mixed together in the final analysis
- Patients who never received the treatment are analysed together with those who did, which dilutes the treatment effect
- A large number of dropouts and non-compliant subjects may cause a massive variation in outcome data, and could make an effective treatment appear ineffective.
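The dilution point is easy to show with invented numbers: suppose 100 patients are randomised to treatment, 20 of them never receive it, and mortality is 20% with treatment and 40% without.

```python
# Toy arithmetic (all numbers invented) illustrating dilution of the treatment
# effect under intention-to-treat analysis when some patients are non-compliant.
n_arm = 100                 # patients randomised to the treatment arm
non_compliant = 20          # never actually received the treatment
mortality_treated = 0.20    # mortality if the treatment is received
mortality_untreated = 0.40  # mortality without the treatment (as in the control arm)

# Per-protocol analysis: only those who actually received the treatment
per_protocol_rate = mortality_treated

# Intention-to-treat analysis: everyone randomised to the arm, compliant or not
itt_rate = ((n_arm - non_compliant) * mortality_treated
            + non_compliant * mortality_untreated) / n_arm

print(f"Per-protocol mortality: {per_protocol_rate:.0%}")
print(f"ITT mortality: {itt_rate:.0%} (pulled toward the control arm's 40%)")
```

The ITT estimate (24%) understates the biological effect of the drug, but honestly represents the effect of the decision to treat.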
b)
Definition of randomisation: This is the practice of deliberately haphazard allocation of patients to study groups, in order to simulate the effect of chance. Randomisation gives each participant an equal chance of being assigned to any of the groups. Successful randomisation involves a process of allocation which cannot be predicted or "gamed" prior to allocation.
Significance of randomisation:
 Minimises selection bias
 Minimises group heterogeneity
 Controls unknown confounders, which should be randomly and evenly distributed among the groups
 Allows probability theory to be used to express the likelihood that chance is responsible for the differences in outcome among groups.
 Failure to use random allocation and concealment of allocation were associated with relative increases in estimates of effects of 150% or more.
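To illustrate an allocation process which cannot be "gamed", here is a minimal sketch of permuted-block randomisation; the block size, arm labels and seed are arbitrary choices for the example:

```python
# Sketch: permuted-block randomisation, one of the standard ways to achieve
# unpredictable allocation while keeping the group sizes balanced.
import random

def permuted_block_allocation(n_patients, block_size=4, seed=None):
    """Allocate patients to arms 'A'/'B' in shuffled blocks, so the groups
    stay balanced and the next allocation cannot be predicted without
    knowledge of the block contents."""
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_patients:
        block = ['A'] * (block_size // 2) + ['B'] * (block_size // 2)
        rng.shuffle(block)  # each block contains equal numbers of A and B
        allocations.extend(block)
    return allocations[:n_patients]

alloc = permuted_block_allocation(20, seed=42)
print(alloc)
print("A:", alloc.count('A'), "B:", alloc.count('B'))
```

Because every block of four contains two of each arm, the final groups are exactly balanced, yet the within-block order is unpredictable.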
References
Montori, Victor M., and Gordon H. Guyatt. "Intention-to-treat principle." Canadian Medical Association Journal 165.10 (2001): 1339-1341.
Gupta, Sandeep K. "Intention-to-treat concept: A review." Perspectives in Clinical Research 2.3 (2011): 109.
Fisher LD, Dixon DO, Herson J, Frankowski RK, Hearron MS, Peace KE. Intention to treat in clinical trials. In: Peace KE, editor. Statistical issues in drug research and development. New York: Marcel Dekker; 1990. pp. 331-50. (not even a sample exists online! I was forced to quote from Gupta et al.)
Beller, Elaine M., Val Gebski, and Anthony C. Keech. "Randomisation in clinical trials." Medical Journal of Australia 177.10 (2002): 565-567.
Moher, David, Kenneth F. Schulz, and Douglas G. Altman. "The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials." BMC Medical Research Methodology 1.1 (2001): 2.
Herbert, Robert D. "Randomisation in clinical trials." Australian Journal of Physiotherapy 51.1 (2005): 58-60.
Kunz, Regina, and Andrew D. Oxman. "The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials." BMJ 317.7167 (1998): 1185-1190.
Altman, D. G., and C. J. Dore. "Randomisation and baseline comparisons in clinical trials." The Lancet 335.8682 (1990): 149-153.
Zelen, Marvin. "The randomization and stratification of patients to clinical trials." Journal of Chronic Diseases 27.7 (1974): 365-375.
Question 15 - 2007, Paper 1
To evaluate a new biomarker as an early index of bacteraemia, you perform the measurement in a consecutive series of 200 critically ill septic patients. You find that 100 of these patients had subsequently proven bacteraemia. Of these, 70 had a positive biomarker result. Of the remaining 100 patients without bacteraemia, 40 had a positive biomarker result.
Using the above data, show how you would calculate:
a) sensitivity
b) specificity
c) Positive predictive value
d) Negative predictive value
e) Positive Likelihood ratio
              Bacteremia present    Bacteremia absent
Biomarker +          70                    40
Biomarker -          30                    60
Total               100                   100
College Answer
a) Sensitivity= (TP/ {TP + FN}) = 70/100
b) Specificity= (TN/{TN + FP}) = 60/100
c) PPV = (TP/{TP+FP}) = 70/110
d) NPV = (TN/{TN + FN}) = 60/90
e) Positive likelihood ratio = sensitivity / (1 - specificity) = 70/40
Discussion
This question is very similar to Question 19.1 from the first paper of 2010, and almost entirely identical to Question 29.2 from the first paper of 2008.
However, it also presents one with a 2×2 table breakdown of results, and there is the added question (e), which asks the candidate to calculate a positive likelihood ratio.
That formula, along with other relevant ones, is presented in the helpful list of equations one must memorise for the fellowship.
Thus, going through the motions...
true positives = 70
false positives = 40
true negatives = 60
false negatives = 30
a) Sensitivity = True positives / ( true positives + false negatives)
= 70 / (70 + 30) = 70%
b) Specificity = True negatives / (true negatives + false positives)
= 60 / (60 + 40) = 60%
c) Positive predictive value = True positives / (true positives + false positives)
= 70 / (70 + 40) = 63.6%
d) Negative predictive value = True negatives / (true negatives + false negatives)
= 60 / (60+30) = 66.6%
e) Positive Likelihood ratio = sensitivity / (1 - specificity)
= 0.7 / (1  0.6) = 1.75
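The whole set of calculations can be condensed into a short snippet, using the question's own figures (TP = 70, FP = 40, TN = 60, FN = 30):

```python
# Sketch: the standard 2x2-table statistics computed from the question's data.

def diagnostic_stats(tp, fp, tn, fn):
    """Return the usual diagnostic test metrics from a 2x2 table."""
    sensitivity = tp / (tp + fn)               # TP / all with disease
    specificity = tn / (tn + fp)               # TN / all without disease
    ppv = tp / (tp + fp)                       # TP / all positive tests
    npv = tn / (tn + fn)                       # TN / all negative tests
    positive_lr = sensitivity / (1 - specificity)
    return sensitivity, specificity, ppv, npv, positive_lr

sens, spec, ppv, npv, plr = diagnostic_stats(tp=70, fp=40, tn=60, fn=30)
print(f"Sensitivity {sens:.0%}, Specificity {spec:.0%}, "
      f"PPV {ppv:.1%}, NPV {npv:.1%}, LR+ {plr:.2f}")
```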
Question 30 - 2007, Paper 2
a) What is a meta-analysis?
b) What is the role of meta-analysis in evidence based medicine?
c) What are the features you look for in a meta-analysis to determine if it has been well conducted?
College Answer
a) A form of systematic review that uses statistical methods to combine the results from different studies
b) roles:
1. ↑ statistical power by ↑ sample size
2. Resolve uncertainty when studies disagree
3. Improve estimates of effect size
4. Establish questions for future PRCTs
c)
1. Are the research questions defined clearly?
2. Are the search strategy and inclusion criteria described?
3. How did they assess the quality of studies?
4. Have they plotted the results?
5. Have they inspected the data for heterogeneity?
6. How have they calculated a pooled estimate?
7. Have they looked for publication bias?
Discussion
This question, though not entirely identical, is very similar to Question 5 from the second paper of 2013. The key difference is the inclusion of the nebulous question about the role of meta-analysis in EBM. In the later paper, this was focused specifically on the advantages of meta-analysis over the analysis of a single study. If one compares the above answer to (b) with the answer (b) in Question 5, one will discover similarities, which suggests that the college was looking for a list of advantages here as well.
Thus, much of the below is a direct copy of Question 5.
a) What is a meta-analysis?
Meta-analysis is a tool of quantitative systematic review.
It is used to weigh the available evidence from RCTs and other studies based on the numbers of patients included, the effect size, and on statistical tests of agreement with other trials.
b) What is the role of meta-analysis in evidence based medicine?
 It offers an objective quantitative appraisal of evidence
 It reduces the probability of false negative results
 The combination of samples leads to an improvement of statistical power
 Increased sample size may increase the accuracy of the estimate
 It may explain heterogeneity between the results of different studies
 Inconsistencies among trials may be quantified and analysed
c) What are the features you look for in a meta-analysis to determine if it has been well conducted?
 Research questions clearly defined
 Transparent search strategy
 Thorough search protocol
 Authors contacted and unpublished data collected
 Definition of inclusion and exclusion criteria for studies
 Sensible exclusion and inclusion criteria
 Assessment of methodological quality of the included studies
 Transparent methodology of assessment
 Calculation of a pooled estimate
 Plot of the results (Forest Plot)
 Measurement of heterogeneity
 Assessment of publication bias (Funnel Plot)
 Reproducible meta-analysis strategy (eg. multiple reviewers perform the same meta-analysis, according to the same methods)
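To make the "calculation of a pooled estimate" criterion concrete, here is a minimal sketch of the fixed-effect inverse-variance method, with invented study data (log odds ratios and standard errors):

```python
# Sketch: fixed-effect inverse-variance pooling, a common way a meta-analysis
# combines study results. Each study is weighted by the precision
# (1/variance) of its effect estimate. All numbers are hypothetical.
import math

def pooled_estimate(effects, standard_errors):
    """Fixed-effect inverse-variance pooled effect and its standard error."""
    weights = [1 / se**2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Three hypothetical studies reporting log odds ratios
log_ors = [-0.4, -0.1, -0.3]
ses = [0.2, 0.1, 0.3]
est, se = pooled_estimate(log_ors, ses)
print(f"Pooled log(OR) = {est:.3f} +/- {1.96 * se:.3f} (95% CI half-width)")
```

Note how the middle study, with the smallest standard error, dominates the pooled estimate.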
References
Sauerland, Stefan, and Christoph M. Seiler. "Role of systematic reviews and meta-analysis in evidence-based medicine." World Journal of Surgery 29.5 (2005): 582-587.
DerSimonian, Rebecca, and Nan Laird. "Meta-analysis in clinical trials." Controlled Clinical Trials 7.3 (1986): 177-188.
Rockette, H. E., and C. K. Redmond. "Limitations and advantages of meta-analysis in clinical trials." Cancer Clinical Trials. Springer Berlin Heidelberg, 1988. 99-104.
Walker, Esteban, Adrian V. Hernandez, and Michael W. Kattan. "Meta-analysis: Its strengths and limitations." Cleveland Clinic Journal of Medicine 75.6 (2008): 431-439.
Methodological Expectations of Cochrane Intervention Reviews
Question 29.1 - 2008, Paper 1
A Phase III study of a drug was undertaken to determine if it improved mortality in severe sepsis. The study design was a randomized, doubleblind, placebocontrolled, multicenter trial (n=1200). The mortality rates in the placebo arm and the trial drug arm were 32% and 26% respectively. There were no adverse effects noted in relation to the trial drug.
a) What do you understand by the term Phase III?
b) What was the absolute risk reduction?
c) What was the relative risk reduction?
d) Calculate the “number needed to treat”?
College Answer
a) What do you understand by the term Phase III?
Phase III trials compare new treatments with the best currently available treatment (the standard treatment). They use much larger sample sizes than Phase II trials, and are usually randomised. They are aimed at being the definitive assessment of how effective the drug is, in comparison with current 'gold standard' treatment.
b) What was the absolute risk reduction?
6%
c) What was the relative risk reduction?
18.75%
d) Calculate the “number needed to treat”?
16.66
Discussion
Again, the candidate is called upon to recall equations and to perform basic mathematics. A helpful list of such equations is available.
a) A Phase III trial is a study of the treatment effect of the drug, which is performed in a large group of patients, all of whom have the disease being studied. The purpose of a Phase III trial is to test the efficacy of an experimental treatment in comparison to standard of care or "gold standard" therapy.
One can find more information about the phases of clinical research in brief in this 2011 BMJ statistics question by Philip Sedgwick, in greater detail in this article by M.A. Rogers, and in great detail in this 2013 publication from the IJPCBS.
b) Absolute risk reduction (ARR) = (AR in control group - AR in treatment group)
In this trial, the ARR = (32% - 26%) = 6%
c) The relative risk reduction (RRR) = (ARR / control group AR)
In this trial, RRR = (0.06 / 0.32) = 18.75%
d) The Numbers Needed to Treat (NNT) = (1/ARR),
In this trial, NNT = (1 / 0.06) = 16.6
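The same arithmetic, as a sketch in code using the trial's 32% and 26% mortality figures:

```python
# Sketch: ARR, RRR and NNT computed from the trial's event rates.

def risk_reductions(control_risk, treatment_risk):
    """Return the absolute risk reduction, relative risk reduction and NNT."""
    arr = control_risk - treatment_risk   # absolute risk reduction
    rrr = arr / control_risk              # relative risk reduction
    nnt = 1 / arr                         # number needed to treat
    return arr, rrr, nnt

arr, rrr, nnt = risk_reductions(control_risk=0.32, treatment_risk=0.26)
print(f"ARR = {arr:.0%}, RRR = {rrr:.2%}, NNT = {nnt:.1f}")
```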
References
Sedgwick, Philip. "Phases of clinical trials." BMJ 343 (2011).
Rogers, M. A. "What are the phases of intervention research." Access Academics and Research (2009).
Rohilla, Ankur, D. Sharma, and R. Keshari. "Phases of clinical trials: a review."IJPCBS 3 (2013): 7003.
Question 29.2 - 2008, Paper 1
You have been approached by a company which has developed a new biomarker of sepsis. They would like it tested in a cohort of critically ill septic patients. You test this biomarker in a cohort of 100 patients with proven bacteremia. You also test this biomarker in a cohort of 100 patients with drug overdose whom you use as a control. In the bacteremic group 70 patients had abnormal biomarker results. In the control group 60 patients had abnormal biomarker results.
Calculate
a) Sensitivity
b) specificity
c) Positive predictive value
d) Negative predictive value
College Answer
Values below expressed as a percentage.
a) 70/100
b) 40/100
c) 70/130
d) 40/70
Discussion
This question is identical to Question 19.1 from the first paper of 2010. However, the college changed the numbers a little, and made the question about pancreatic necrosis.
Going through the motions,
true positives = 70
false positives = 60
true negatives = 40
false negatives = 30
a) Sensitivity = True positives / ( true positives + false negatives)
= 70 / (70 + 30) = 70%
b) Specificity = True negatives / (true negatives + false positives)
= 40 / (40 + 60) = 40%
c) Positive predictive value = True positives / (true positives + false positives)
= 70 / (70 + 60) = 53.8%
d) Negative predictive value = True negatives / (true negatives + false negatives)
= 40 / (40 + 30) = 57.1%
Question 23 - 2008, Paper 2
In the context of a randomised control trial comparing a trial drug with placebo:
a) briefly explain the following terms:
 Type 1 error
 Type 2 error
 Study power
 Effect size
b) List the factors that influence sample size.
College Answer
Type 1 error
The null hypothesis is incorrectly rejected. Type 1 errors may result in the implementation of therapy that is in fact ineffective or a false positive test result.
Type 2 error
The null hypothesis is incorrectly accepted. Type 2 errors may result in rejection of effective treatment strategies or a false negative test result.
Study power
Power is equal to 1 - β. Thus if β = 0.2, the power is 0.8, and the study has an 80% probability of detecting a difference if one exists.
Effect size
Effect size (∆) is the clinically significant difference the investigator wants to detect between the study groups. This is arbitrary but needs to be reasonable and accepted by peers. It is harder to detect a small difference than a large difference. The effect size helps us to know whether the difference observed is a difference that matters.
Factors influencing sample size
• Selected values for significance level α, power (1 - β) and effect size ∆ (smaller values mean a larger sample size)
• Variance/SD in the underlying population (a larger variance means a larger sample size)
Discussion
The college presents a concise and effective answer to this question, which should serve as a model. Below is a non-model answer, overgrown with the unnecessary fat of references and digressions.
a)
Type 1 error: The incorrect rejection of a null hypothesis.
 A false positive study.
 Finding a treatment effect where there actually is none.
 Results in the implementation of an ineffective treatment.
Type 2 error: The incorrect acceptance of the null hypothesis (i.e. the failure to reject it when it is in fact false).
 A false negative study.
 Finding no treatment effect, when there actually is one.
 Results in an effective treatment being wrongly discarded.
Study power: The probability that the study correctly rejects the null hypothesis, when the null hypothesis is false.
 Expressed as (1 - β), where β is the probability of Type 2 error (i.e. the probability of incorrectly accepting the null hypothesis).
 Generally, the power of a study is agreed to be 80% (i.e. β = 0.2), because anything less would incur too great a risk of Type 2 error, and anything more would be prohibitively expensive in terms of sample size.
Effect size: a quantitative reflection of the magnitude of a phenomenon; in this case, the magnitude of the positive effects of a drug on the study population.
 In this case, it is the difference in the incidence of an arbitrarily defined outcome between the treatment group and the placebo group.
 Effect size suggests the clinical relevance of an outcome
 The effect size is agreed upon a priori so that a sample size can be calculated (as the study needs to be powered appropriately to detect a given effect size)
Factors which influence sample size:
There is a good article on this in Radiology (2003)
 Alpha value: the level of significance (normally 0.05)
 Beta value: the probability of incorrectly accepting the null hypothesis (normally 0.2)
 The statistical test one plans to use
 The variance of the population (the greater the variance, the larger the sample size)
 Estimated measurement variability (similar to population variance)
 The effect size (the smaller the effect size, the larger the required sample)
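How these factors interact can be sketched with the standard normal-approximation sample size formula for comparing two proportions (the choice of formula is an assumption; neither the college answer nor the Radiology article is being quoted here):

```python
# Sketch: a normal-approximation sample size calculation showing how alpha,
# power, effect size and variance each drive the required n per group.
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per group for a two-sided comparison of
    two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # larger variance -> larger n
    effect = (p1 - p2) ** 2                        # smaller effect -> larger n
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect)

# e.g. to detect a mortality difference of 32% vs 26%:
n = sample_size_two_proportions(p1=0.32, p2=0.26)
print(f"approximately {n} patients per group")
```

Halving the effect size roughly quadruples the required sample, which is why small expected differences make for enormous trials.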
References
There is an online Handbook of Biological Statistics which has an excellent overview of power analysis.
Kelley, Ken, and Kristopher J. Preacher. "On effect size." Psychological Methods 17.2 (2012): 137.
Moher, David, Corinne S. Dulberg, and George A. Wells. "Statistical power, sample size, and their reporting in randomized controlled trials." JAMA 272.2 (1994): 122-124.
Cohen, Jacob. "A power primer." Psychological Bulletin 112.1 (1992): 155.
Dupont, William D., and Walton D. Plummer Jr. "Power and sample size calculations: a review and computer program." Controlled Clinical Trials 11.2 (1990): 116-128.
Eng, John. "Sample size estimation: how many individuals should be studied?" Radiology 227.2 (2003): 309-313.
Question 10 - 2009, Paper 1
Inspect the data representation shown below.
10.1. What form of data representation is depicted here?
10.2. With respect to the study plots what is represented by:
 The horizontal lines?
 The position of the square?
 The size of the square?
10.3. From the data depicted what could be inferred with regard to the effectiveness of the treatment under investigation?
10.4. What further information relating to the performance of this analysis would you require in order to gauge the accuracy of the conclusions?
College Answer
10.1. What form of data representation is depicted here?
Forest Plot or Meta Analysis Graph
10.2. With respect to the study plots what is represented by: The horizontal lines?
The position of the square? The size of the square?
The position of the square and the horizontal line indicate the point estimate and the 95%
confidence intervals of the odds ratio respectively. The size of the square indicates the weight of the study.
10.3. From the data depicted what could be inferred with regard to the effectiveness of the treatment under investigation?
The depicted data suggest the treatment is not more effective than control as the 95% confidence limits of the combined odds ratio cross the vertical line.
10.4. What further information relating to the performance of this analysis would you require in order to gauge the accuracy of the conclusions?
Definition of inclusion criteria for studies
Adequate search protocol
Assessment of methodological quality
Measurement of heterogeneity
Assessment of publication bias
Discussion
This topic is explored in LITFL, where they call it a "forrest plot", perhaps out of respect for Pat Forrest. This is substantially better than Wikipedia, where this form of data representation is referred to as a blobbogram. The example LITFL use for their explanation is derived from the college question.
Anyway. The college answer is correct but very brief, and probably represents something like the "passing grade" for this 10-mark question. With that in mind, and free from the need to be concise, one can launch into an exhaustingly verbose dissection of this question.
10.1 - This is a forest plot. It represents the results of a meta-analysis of studies.
10.2 - The standards for labelling and graphical representation are well summarised by this Cochrane document (however, it appears that careful adherence to standards is no defence against the absence of useful content).
 The horizontal lines: the confidence interval of the individual study
 The position of the square: a point estimate of the odds ratio (OR)
 The size of the square: the weight of the study according to the weighting rules of the meta-analysis, likely representing the sample size and statistical power. This is a powerful tool of psychological manipulation. A paper by a couple of psychiatrists dissected this practice, and suggested that a failure to use square size to identify study weight "may result in unnecessary attention being attracted to those smaller studies with wider confidence intervals that put more ink on the page (or more pixels on the screen)".
10.3 - From the forest plot, one can infer that though there is a statistical trend towards a positive treatment effect, it still does not achieve statistical significance, because the range of the 95% confidence interval for their odds ratio crosses the vertical line (the vertical line being an OR of 1.0, which means "no association"). Thus, on the basis of this meta-analysis one would be forced to conclude that the treatment has no effect.
10.4 - "What further information relating to the performance of this analysis would you require in order to gauge the accuracy of the conclusions?" This is a thinly veiled question about the assessment of the validity of a meta-analysis. The college answer demonstrates this in the points they used. In that context, one would theoretically be interested in every aspect of the analysis.
Generic points in the assessment of the validity of a meta-analysis include the following:
 Research questions are clearly defined.
 Definition of inclusion criteria for studies is clear.
 Search protocol is adequate.
 Methodological quality of the included studies is rigorously assessed, and the assessment method is transparent.
 A pooled estimate is calculated, and the calculation is transparent.
 A graphical representation of the results is available (Forest plot).
 A measurement of heterogeneity is carried out, with appropriate corrections for heterogeneity (eg. use of fixed-effects or random-effects analysis)
 An assessment of publication bias is attempted (Funnel Plot)
If one were to only consider the presented graph, one would be more likely to respond with relevant questions for the meta-analysis authors.
 Inclusion and exclusion criteria. Study 2 is a massive outlier; it would be interesting to learn why it was included, and whether other excluded studies had similar characteristics. Potentially, the exclusion of this study would shift the overall OR off the vertical line.
 Assessment of methodological quality. Again, if the methodology of Study 2 was called into question and it were excluded, this meta-analysis would reach substantially different conclusions. It would be important to learn how the authors of the meta-analysis evaluated its methodology, and whether they were correct to include this study.
 Search strategy and attempts to detect publication bias. There are only 4 studies in the meta-analysis. The addition of another 1 or 2 studies may have a significant impact on the overall OR. If the search strategy was somehow inadequate, studies which might meet the inclusion criteria may have been missed.
 Dealing with heterogeneity. This is important, because there is substantial heterogeneity (again I point to Study 2). Excluding studies simply because they do not agree with the majority defeats the purpose of the meta-analysis, but it is important to correct for heterogeneity-inducing differences between trials. This can be done with the use of a random-effects model, which uses a "heterogeneity parameter" as a coefficient to downgrade the precision and weighting of each individual study's effect estimate. This model assumes that in each study the intervention had a different effect, and views each study as a random sample from a hypothetical population of similar studies. The effect of this on the forest plot may not be magical; it merely redistributes the weighting (usually giving more weight to smaller studies and less to large ones; Cochrane's handbook suggests that this is because "small studies are more informative for learning about the distribution of effects across studies than for learning about an assumed common intervention effect"). Having used such a heterogeneity correction technique, one can be more confident that the resulting summed OR is not damaged by the inclusion of a garbage study. However, the use of a random-effects model can exacerbate publication bias if the results of smaller studies are systematically different from the results of larger ones (eg. small studies are independent and find no treatment effect, but large studies are funded by Big Pharma and find a treatment effect where there is none). Cochrane recommends meta-analysis authors compare the results of a fixed-effects model and a random-effects model analysis to see whether the smaller studies have a significant effect on the effect size.
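The random-effects machinery described above can be sketched with the DerSimonian-Laird method; the four studies and their standard errors below are invented for illustration, with an outlying "study 2":

```python
# Sketch: DerSimonian-Laird random-effects pooling. A between-study variance
# (tau^2) is estimated from Cochran's Q and added to each study's own
# variance, which flattens the weights relative to a fixed-effect analysis.
# All study data are hypothetical.

def dersimonian_laird(effects, ses):
    """Return (fixed-effect estimate, random-effects estimate, tau^2)."""
    w = [1 / se**2 for se in ses]                       # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Cochran's Q statistic and the moment estimator of tau^2
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights incorporate tau^2
    w_re = [1 / (se**2 + tau2) for se in ses]
    random_effect = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return fixed, random_effect, tau2

# Four hypothetical studies; "study 2" (effect +0.8) is the outlier
fixed, random_eff, tau2 = dersimonian_laird(
    effects=[-0.3, 0.8, -0.2, -0.4], ses=[0.15, 0.25, 0.2, 0.3])
print(f"fixed = {fixed:.3f}, random-effects = {random_eff:.3f}, "
      f"tau^2 = {tau2:.3f}")
```

With tau² added, the weights flatten, and the outlying study gains relative influence over the pooled estimate.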
References
Schriger, David L., et al. "Forest plots in reports of systematic reviews: a cross-sectional study reviewing current practice." International Journal of Epidemiology 39.2 (2010): 421-429.
Lewis, Steff, and Mike Clarke. "Forest plots: trying to see the wood and the trees." BMJ 322.7300 (2001): 1479-1480.
Anzures-Cabrera, Judith, and Julian Higgins. "Graphical displays for meta-analysis: An overview with suggestions for practice." Research Synthesis Methods 1.1 (2010): 66-80.
Cochrane: "Considerations and recommendations for figures in Cochrane reviews: graphs of statistical data" 4 December 2003 (updated 27 February 2008)
Reade, Michael C., et al. "Bench-to-bedside review: Avoiding pitfalls in critical care meta-analysis - funnel plots, risk estimates, types of heterogeneity, baseline risk and the ecologic fallacy." Critical Care 12.4 (2008): 220.
DerSimonian, Rebecca, and Nan Laird. "Meta-analysis in clinical trials." Controlled Clinical Trials 7.3 (1986): 177-188.
Biggerstaff, B. J., and R. L. Tweedie. "Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis." Statistics in Medicine 16.7 (1997): 753-768.
The Cochrane Handbook: 9.5.4 "Incorporating heterogeneity into random-effects model"
Question 24 - 2009, Paper 2
What is a receiver operating characteristic plot (ROC curve) as applied to a diagnostic test? What are its advantages?
College Answer
An ROC plot is a graphical representation of sensitivity vs. (1 - specificity) for all the observed data values for a given diagnostic test.
Advantages:
• Simple and graphical
• Represents accuracy over the entire range of the test
• It is independent of prevalence
• Tests may be compared on the same scale
• Allows comparison of accuracy between several tests.
How it may be used:
• Can give a visual assessment of test accuracy
• May be used to generate decision thresholds or “cut off” values
• Can be used to generate confidence intervals for sensitivity and specificity and likelihood ratios.
Discussion
In this LITFL article, ROC curves are discussed in detail, but without apocryphal gibberish.
If one were to restrict oneself to what is manageable within a 10-minute timeframe while mentioning all the important points, one would produce an answer which resembles the following:
 The ROC curve is a plot of sensitivity versus false positive rate (1 - specificity) for all observed values of a diagnostic test.
 It is a graphical representation of a test's diagnostic accuracy
 It allows the comparison of accuracy between tests
 It allows the determination of cutoff values
 It can be used to generate confidence intervals for sensitivity and specificity and likelihood ratios.
Advantages:
 Simple and graphical
 Independent of prevalence
 Allows comparison between tests, on the same scale
That, of course, is the bare bones of the answer. If one were to succumb to basic human urges, one would produce an answer which resembles the following:
 The ROC curve is a plot of sensitivity vs. false positive rate, for a range of diagnostic test results.
 Sensitivity is on the y-axis, from 0% to 100%
 The ROC curve graphically represents the compromise between sensitivity and specificity in tests which produce results on a numerical scale, rather than binary (positive vs. negative results)
 ROC analysis can be used for diagnostic tests with outcomes measured on ordinal, interval or ratio scales.
 The ROC curve can be used to determine the cut off point at which the sensitivity and specificity are optimal.
 All possible combinations of sensitivity and specificity that can be achieved by changing the test's cutoff value can be summarised using a single parameter, the area under the ROC curve (AUC).
 The higher the AUC, the more accurate the test
 An AUC of 1.0 means the test is 100% accurate
 An AUC of 0.5 (50%) means the ROC curve is a straight diagonal line, which represents the "ideal bad test", one which is only ever accurate by pure chance.
 When comparing two tests, the more accurate test is the one with an ROC curve further to the top left corner of the graph, with a higher AUC.
 The best cutoff point for a test (which separates positive from negative values) is the point on the ROC curve which is closest to the top left corner of the graph.
 The cutoff values can be selected according to whether one wants more sensitivity or more specificity.
Advantages of the ROC curves:
 A simple graphical representation of the diagnostic accuracy of a test: the closer the apex of the curve toward the upper left corner, the greater the discriminatory ability of the test.
 Allows a simple graphical comparison between diagnostic tests
 Allows a simple method of determining the optimal cutoff values, based on what the practitioner thinks is a clinically appropriate (and diagnostically valuable) tradeoff between sensitivity and false positive rate.
 Also, allows a more complex (and more exact) measure of the accuracy of a test, which is the AUC
 The AUC in turn can be used as a simple numeric rating of diagnostic test accuracy, which simplifies comparison between diagnostic tests.
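The construction of the curve and its AUC can be sketched in a few lines of code; the biomarker scores and disease labels below are invented for illustration:

```python
# Sketch: tracing a ROC curve by sweeping the cut-off over all observed
# values of a (hypothetical) biomarker, then computing the AUC by the
# trapezoidal rule.

def roc_curve(scores, labels):
    """Return (false positive rate, sensitivity) pairs for every cut-off."""
    thresholds = sorted(set(scores), reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical biomarker values: higher scores in the diseased group
scores = [9, 8, 7, 6, 5, 4, 3, 2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]   # 1 = disease present
print(f"AUC = {auc(roc_curve(scores, labels)):.4f}")
```

Each threshold yields one point on the curve; the closer the curve hugs the top left corner, the higher the AUC.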
References
Bewick, Viv, Liz Cheek, and Jonathan Ball. "Statistics review 13: receiver operating characteristic curves." Critical Care 8.6 (2004): 508.
Sedgwick, Philip. "Receiver operating characteristic curves." BMJ 343 (2011). Rather than an article, this is more of a "self-directed learning" question with an elaborate explanatory answer.
Fan, Jerome, Suneel Upadhye, and Andrew Worster. "Understanding receiver operating characteristic (ROC) curves." CJEM 8.1 (2006): 19-20.
Akobeng, Anthony K. "Understanding diagnostic tests 3: receiver operating characteristic curves." Acta Paediatrica 96.5 (2007): 644-647.
Ling, Charles X., Jin Huang, and Harry Zhang. "AUC: a statistically consistent and more discriminating measure than accuracy." IJCAI. Vol. 3. 2003.
Greiner, M., D. Pfeiffer, and R. D. Smith. "Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests." Preventive Veterinary Medicine 45.1 (2000): 23-41.
Question 19.1 - 2010, Paper 1
To evaluate a new biomarker as an early index of infected pancreatic necrosis, you perform the measurement in a consecutive series of 200 critically ill patients with pancreatitis. You find that 100 of these patients had subsequently proven necrosis. Of these, 60 had a positive biomarker result. Of the remaining 100 patients without necrosis, 35 had a positive biomarker result.
Using the above data, show how you would calculate
a) Sensitivity
b) Specificity
c) Positive predictive value
d) Negative predictive value
College Answer
a) Sensitivity = (TP/ {TP + FN}) = 60/100
b) Specificity = (TN/{TN + FP}) = 65/100
c) Positive predictive value = (TP/{TP+FP}) = 60/95
d) Negative predictive value = (TN/{TN + FN}) = 65/105
Discussion
It's not easy to overdo this discussion, given that the premise of this question rests in basic arithmetic. Given that the question is essentially maths, it is difficult to produce a "model answer" which is somehow an improvement on the already correct college answer (the only possible correct answer).
However, many people (myself included) are biologically unsuited to memorising equations. For this reason, a short list of equations to memorise has been compiled. Perhaps that's an improvement.
Thus, for this biomarker, we have the following spread of data:
 60 true positives
 35 false positives
 65 true negatives
 40 false negatives
a) Sensitivity: True positives / (true positives + false negatives)
= 60 / (60 + 40) = 60%
b) Specificity: True negatives / (true negatives + false positives)
= 65 / (65 + 35) = 65%
c) Positive predictive value: True positives / total positives
= 60 / (60 + 35) = 63%
d) Negative predictive value: True negatives / total negatives
= 65 / (65 + 40) = 62%
Question 19.2 - 2010, Paper 1
A randomized controlled clinical trial was performed to evaluate the effect of a new hormone called Rejuvenon on mortality in septic shock. 3400 patients with septic shock were studied (1700 placebo and 1700 in the Rejuvenon arms). The mortality rates in the placebo and the treatment arms were 30% and 25% respectively.
Calculate:
(a) The absolute risk reduction
(b) The relative risk reduction
(c) The number needed to treat
College Answer
Using the above data, show how you would calculate:
a) The absolute risk reduction
b) The relative risk reduction
c) The number needed to treat
ARR = 5%
RRR = 5/30 × 100 = 16.6%
NNT = 1/0.05 = 20
Discussion
This question also relies on the candidate's ability to memorise equations.
Here is a helpful list of equations the candidate is expected to memorise.
a) ARR = (risk in control group - risk in treatment group)
= 30% - 25%
= 5%
b) RRR = (ARR / control group AR)
= 0.05 / 0.3
= 0.166, or 16.6%
c) Numbers needed to treat (NNT) = (1 / ARR)
= 1 / 0.05
= 20.
Question 9 - 2010, Paper 2
In the context of clinical trials, define the following terms:
a) Relative risk
b) Absolute risk
c) Number needed to treat
d) Power of the study
College Answer
A number of potential definitions exist. One example for each is listed below:
Relative risk: the difference in event rates between 2 groups expressed as proportion of the event rate in the untreated group.
Absolute risk: this is the actual event rate in the treatment or the placebo group.
Number Needed to Treat: The NNT is the number of patients to whom a clinician would need to administer a particular treatment for 1 patient to receive benefit from it. It is calculated as 100 divided by the absolute risk reduction when expressed as a percentage or
1 divided by the absolute risk reduction when expressed as a proportion.
Power of the study: The probability that a study will produce a significant difference at a given significance level is called the power of the study. It will depend on the difference between the populations compared, the sample size and the significance level chosen.
Discussion
Some of this ground is covered in Question 23 from the second paper of 2011. It also asks about risk ratio and NNT.
Here is a link to my summary of basic terms in EBM.
Risk ratio: risk in treatment group / risk in control or placebo group
Absolute risk: Risk of event in a group (any group). Essentially, it is the incidence rate.
NNT: Numbers needed to treat; 1/ absolute risk reduction.
Power of a study: The power of a statistical test is the probability that it correctly rejects the null hypothesis, when the null hypothesis is false. This is the chance that a study is able to discern a treatment effect, if there is an actual treatment effect. It is influenced by the level of statistical significance one expects, the sample size, the variance within the studied population, and the magnitude of the effect size.
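Those dependencies (significance level, sample size, variance and effect size) can be made concrete with the usual normal-approximation power formula for comparing two proportions. A sketch using only the standard library; the 30%-vs-25% mortality figures are borrowed from Question 19.2 as an assumed effect size:

```python
import math

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion test
    (normal approximation; z_alpha = 1.96 for alpha = 0.05)."""
    p_bar = (p1 + p2) / 2
    se0 = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)          # SE under H0
    se1 = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)  # SE under H1
    return norm_cdf((abs(p1 - p2) - z_alpha * se0) / se1)

# Power rises with sample size, for the same 30% vs 25% effect:
print(power_two_proportions(0.30, 0.25, n_per_arm=1700))  # ~0.90
print(power_two_proportions(0.30, 0.25, n_per_arm=300))   # ~0.28: underpowered
```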
References
Cohen, Jacob. "Statistical power analysis." Current Directions in Psychological Science 1.3 (1992): 98-101.
Cook, Richard J., and David L. Sackett. "The number needed to treat: a clinically useful measure of treatment effect." BMJ 310.6977 (1995): 452-454.
Viera, Anthony J. "Odds ratios and risk ratios: what's the difference and why does it matter?" Southern Medical Journal 101.7 (2008): 730-734.
Malenka, David J., et al. "The framing effect of relative and absolute risk." Journal of General Internal Medicine 8.10 (1993): 543-548.
Gail, Mitchell H., and Ruth M. Pfeiffer. "On criteria for evaluating models of absolute risk." Biostatistics 6.2 (2005): 227-239.
Question 29 - 2011, Paper 1
With reference to a randomized controlled trial, briefly describe the terms “blinding” and “allocation concealment”.
College Answer
• Blinding and allocation concealment are methods used to reduce bias in clinical trials.
• Blinding: a process by which trial participants and their relatives, caregivers, data collectors and those adjudicating outcomes are unaware of which treatment is being given to the individual participants.
- Prevents clinicians from consciously or subconsciously treating patients differently based on treatment allocation
- Prevents data collectors from introducing bias when there is a subjective assessment to be made, e.g. a "pain score"
- Prevents outcome assessors from introducing bias when there is a subjective outcome assessment to be made, e.g. the Glasgow outcome score.
• Traditionally, blinded RCTs have been classified as "single-blind," "double-blind," or "triple-blind". The 2010 CONSORT Statement specifies that authors and editors should not use these terms; instead, reports of blinded RCTs should state "If done, who was blinded after assignment to interventions (for example, participants, care providers, those assessing outcomes) and how."
Allocation concealment is an important component of the randomization process and refers to the concealment of the allocation of the randomization sequence from both the investigators and the patient. Poor allocation concealment may potentially exaggerate treatment effects.
Methods used for allocation concealment include the sealed envelope technique, and telephone- or web-based randomization.
Allocation concealment effectively ensures that the treatment to be allocated is not known before the patient is entered into the study. Blinding ensures that the patient / physician is blinded to the treatment allocation after enrollment into the study.
Discussion
The question is a 10-mark question, but for some reason it asks one to "briefly describe" these concepts. Judging from the college answer, a truly brief description was not the expected response.
LITFL has a thorough summary, which is not brief.
If one were to briefly describe these concepts, one would produce something like this:
Allocation concealment:
- Ensures that the patients and investigators cannot predict which treatment will be allocated to which patient before they are enrolled in the study.
- Prevents selection bias
Blinding:
- Ensures that the patients and investigators remain unaware of which treatment is being administered to which individual patient.
- Prevents detection bias and observer bias
And if one were to go to town on this topic, one would produce something like this:
Allocation concealment
- Allocation concealment is the technique of reducing bias by preventing the prediction of treatment allocation before the allocation is completed.
- Thus, neither the investigators nor the patients can predict which patient will be selected for the treatment group and which for the control or placebo group.
- This ensures that no selection bias influences the treatment allocation; i.e. patients are allocated randomly, which avoids the problem of investigators choosing the most suitable candidates for the treatment group.
- Allocation concealment can be performed with sealed envelopes, web-based randomisation, or random number generation.
- Alternative techniques also exist; for instance, one can randomise patients according to the day of the week of their presentation to hospital, or according to the odd or even date of the calendar month. The allocation is not concealed, but the investigators still have little control over it.
Blinding:
- Blinding is the technique of reducing bias by concealing the allocation of the treatment and control groups from the patients, the investigators, the statistical analysts, or everyone involved.
- Reduces bias by preventing anybody from knowing which patient is receiving which treatment, thus decreasing the likelihood that a particular group will receive preferential treatment, be assessed differently, or develop expectations of their treatment.
- Reduces detection bias by blinding the investigators
- Reduces observer bias by blinding the observers
- Reduces recall bias by blinding the patients
- The exact method of blinding should be transparently reported (as per the CONSORT statement). Thus, the reader of the article should be able to immediately discern who was blinded and how.
References
Schulz, Kenneth F., and David A. Grimes. "Allocation concealment in randomised trials: defending against deciphering." The Lancet 359.9306 (2002): 614-618.
Forder, Peta M., Val J. Gebski, and Anthony C. Keech. "Allocation concealment and blinding: when ignorance is bliss." Med J Aust 182.2 (2005): 87-89.
Schulz, Kenneth F. "Assessing allocation concealment and blinding in randomised controlled trials: why bother?" Evidence Based Mental Health 3.1 (2000): 4-5.
Question 23 - 2011, Paper 2
In the context of statistical analysis of randomised controlled trials, explain the following terms:
a) Risk ratio
b) Number needed to treat
c) P-value
d) Confidence intervals
College Answer
a) Risk ratio
A risk ratio is simply a ratio of risk, for example, [risk of mortality in the intervention group] / [risk of mortality in the control group].
It indicates the relative likelihood of experiencing the outcome if the patient received the intervention, compared with the outcome if they received the control therapy.
b) Odds ratio
Odds ratio is the odds of an event occurring in one group to the odds of it occurring in another
c) Number needed to treat (NNT)
Number of patients that need to be treated for one patient to benefit compared with a control not receiving the treatment
1/(Absolute Risk Reduction)
Used to measure the effectiveness of a healthcare intervention, the higher the NNT the less effective the treatment
d) P-value
A p-value indicates the probability that the observed result or something more extreme occurred by chance. It might be referred to as the probability that the null hypothesis has been rejected when it is true.
e) Confidence intervals
The confidence intervals indicate the level of certainty that the true value for the parameter of interest lies between the reported limits.
For example:
The 95% confidence intervals for a value indicate a range where, with repeated sampling and analysis, these intervals would include the true value 95% of the time
Discussion
This is a straightforward question about the definitions of basic everyday statistics terms.
Judging by the relatively high pass rate, over two thirds of us already have a fair grasp of this.
Additionally, please note the model answer to the odds ratio question. Clearly we are not expected to demonstrate a genius-level understanding of these concepts. In fact, there is no odds ratio mentioned in the college question, and its very existence is inferred from the fact that there is an odds ratio answer.
Anyway, it never hurts to revise the basics.
Here is a link to my summary of basic terms in EBM.
In brief:
Risk ratio: risk in treatment group / risk in control or placebo group
Odds ratio: The odds of an outcome in one group / odds of that outcome in another group.
NNT: Numbers needed to treat; 1/ absolute risk reduction.
p-value: the probability of obtaining the same (or a more extreme) study result, assuming that the null hypothesis is true. It is not the probability that the null hypothesis is true. As a single-value assessment of error rate, the p-value has its opponents.
Confidence interval: CI gives a range of results and the percentage chance that the same experimental design would produce results within this range if the experiment were repeated. Thus, a CI of 95% means that in 95% of repeated experiments the results would fall within the specified range.
The CI is a pain in the arse to calculate for the mathematically averse Homo vulgaris. A good impression of the difficulty involved can be formed by reading either of these two BMJ articles.
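For a relative risk, the standard approach is the log method: take ln(RR), attach ±1.96 standard errors, and exponentiate back. A sketch, reusing the Rejuvenon counts from Question 19.2 (425/1700 deaths on treatment vs 510/1700 on placebo):

```python
import math

def risk_ratio_ci(a, n1, c, n2, z=1.96):
    """Risk ratio with its 95% CI by the log method:
    SE(ln RR) = sqrt(1/a - 1/n1 + 1/c - 1/n2),
    where a/n1 and c/n2 are the event rates in the two groups."""
    rr = (a / n1) / (c / n2)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

rr, lower, upper = risk_ratio_ci(a=425, n1=1700, c=510, n2=1700)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
# RR = 0.83 (95% CI 0.75 to 0.93)
```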
References
Viera, Anthony J. "Odds ratios and risk ratios: what's the difference and why does it matter?" Southern Medical Journal 101.7 (2008): 730-734.
Szumilas, Magdalena. "Explaining odds ratios." Journal of the Canadian Academy of Child and Adolescent Psychiatry 19.3 (2010): 227.
Cook, Richard J., and David L. Sackett. "The number needed to treat: a clinically useful measure of treatment effect." BMJ 310.6977 (1995): 452-454.
Goodman, Steven N. "Toward evidence-based medical statistics. 1: The P value fallacy." Annals of Internal Medicine 130.12 (1999): 995-1004.
Morris, Julie A., and Martin J. Gardner. "Statistics in medicine: Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates." British Medical Journal (Clinical Research Ed.) 296.6632 (1988): 1313.
Campbell, Michael J., and Martin J. Gardner. "Statistics in medicine: Calculating confidence intervals for some non-parametric analyses." British Medical Journal (Clinical Research Ed.) 296.6634 (1988): 1454.
Question 17 - 2012, Paper 1
- Briefly explain what is meant by “Evidence Based Medicine”?
- Give a classification for the levels of evidence used for therapeutic studies in EBM.
- Explain what is meant by the term “intention to treat analysis”
College Answer
a) EBM
Evidence-based medicine is the process of systematically reviewing, appraising and using clinical research findings to aid the delivery of optimum clinical care to patients.
It involves considering research and other forms of evidence on a routine basis when making healthcare decisions. Such decisions include the clinical decisions about choice of treatment, test, or risk management for individual patients, as well as policy decisions for groups and populations.
b) Levels of evidence
(Any recognised system acceptable)
- Level I - High-quality, multicentre or single-centre randomized controlled trial with adequate power; or systematic review of these studies
- Level II - Lesser-quality randomized controlled trial; prospective cohort study; or systematic review of these studies
- Level III - Retrospective comparative study; case-control study; or systematic review of these studies
- Level IV - Case series
- Level V - Expert opinion; case report or clinical example; or evidence based on physiology, bench research.
Oxford Centre for Evidence-Based Medicine levels (Therapy/Prevention, Aetiology/Harm):
- 1a: Systematic review (with homogeneity) of RCTs
- 1b: Individual RCT (with narrow confidence interval)
- 1c: All or none (i.e. all patients died before the Rx became available, but some now survive on it; or some patients died before the Rx became available, but none now die on it)
- 2a: Systematic review (with homogeneity) of cohort studies
- 2b: Individual cohort study (including low-quality RCT; e.g. <80% follow-up)
- 2c: "Outcomes" research or ecologic studies (studies of group characteristics)
- 3a: Systematic review (with homogeneity) of case-control studies
- 3b: Individual case-control study
- 4: Case series (and poor-quality cohort and case-control studies)
- 5: Expert opinion, or based on physiology, bench research or "first principles"

NHMRC levels:
- I: Evidence from a systematic review of all relevant randomised controlled trials
- II: Evidence from at least one properly designed randomised controlled trial
- III.1: Evidence from well-designed pseudo-randomised controlled trials
- III.2: Evidence obtained from comparative studies with concurrent controls and allocation not randomised (cohort studies) or case-control studies
- III.3: Evidence obtained from comparative studies with historical controls
- IV: Evidence from case series, opinions of respected authorities, descriptive studies, reports of expert (i.e. consensus) committees, case studies.
c) Intention to treat analysis
Analysis based on the initial treatment intent, not the treatment eventually administered. Everyone who begins treatment is considered to be part of the trial, whether he/she completes the trial or not. ITT analysis avoids the effects of crossover and dropout.
Discussion
Evidence based medicine is the system of critical evaluation of published data for applicability to the management of individual patients. David Sackett, a great pioneer of EBM, came up with a definition which seems to be frequently quoted, and therefore probably meets with the approval of the CICM examiners:
"Evidence based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research."
As for levels of evidence, we have several systems to choose from. Here are a couple:
Oxford Centre for Evidence-Based Medicine:
- Levels:
  - I - systematic review of all relevant RCTs
  - II - randomized trial or observational study with dramatic effect
  - III - non-randomized controlled cohort/follow-up study
  - IV - case series, case-control studies, or historically controlled studies
  - V - mechanism-based reasoning (expert opinion, based on physiology, animal or laboratory studies)
- Grades:
  - A - consistent level 1 studies
  - B - consistent level 2 or 3 studies, or extrapolations from level 1 studies
  - C - level 4 studies, or extrapolations from level 2 or 3 studies
  - D - level 5 evidence, or troublingly inconsistent or inconclusive studies of any level
NHMRC levels:
- Level I: systematic review of RCTs
- Level II: RCT
- Level III-1: pseudo-randomised trial of high quality
- Level III-2: cohort studies or case-control studies - but with a control group
- Level III-3: cohort studies with historical controls, or no control group
- Level IV: case series
Intention to treat analysis:
This is the practice of preserving the bias-controlling benefits of randomisation by performing analysis of all patients according to the group to which they were randomised, rather than according to the treatment they actually received.
- "Once randomised, always analysed"
- All enrolled patients have to be a part of the final analysis
- This preserves the bias-protective effect of randomisation
Advantages
- A more reliable estimate of treatment effectiveness
- Prevents bias
- Minimises Type 1 errors (false positives)
- Supported by the CONSORT statement
- When intention-to-treat analysis agrees with per-protocol analysis, it increases the validity of the study
Disadvantages
- Treatment effect is diluted (ends up underestimated)
- ITT is inaccurate unless there are negligible protocol violations
- ITT alone is inappropriate for non-inferiority trials
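The dilution point is easy to demonstrate numerically. In this invented toy trial, 20 of 100 treatment-arm patients never received the drug and died at control-arm rates; per-protocol analysis excludes them and flatters the drug, while ITT keeps them in their randomised arm and shrinks the apparent effect:

```python
# Invented toy trial, 100 patients per arm.
control_deaths = 30    # 30% mortality among 100 controls
adherent_deaths = 20   # 25% mortality among the 80 who actually got the drug
crossover_deaths = 6   # 30% mortality among the 20 who never got it

# Intention-to-treat: crossovers stay in their randomised (treatment) arm
itt_risk = (adherent_deaths + crossover_deaths) / 100   # 0.26
# Per-protocol: crossovers are excluded from the treatment arm
pp_risk = adherent_deaths / 80                          # 0.25

control_risk = control_deaths / 100
print(f"ARR by ITT:          {control_risk - itt_risk:.0%}")  # 4%
print(f"ARR by per-protocol: {control_risk - pp_risk:.0%}")   # 5%
```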
References
Sackett, David L. "Evidence-based medicine." Seminars in Perinatology. Vol. 21. No. 1. WB Saunders, 1997.
Sackett, David L., et al. "Evidence based medicine: what it is and what it isn't." BMJ 312.7023 (1996): 71-72.
Question 8 - 2012, Paper 2
A colleague directs your attention to a recently published randomised trial on a therapeutic intervention.
Outline the features of the trial that would lead you to change your practice.
College Answer
Points to consider in the answer would be:
- Does the population studied correspond with the population the candidate expects to treat?
- Were the inclusion/exclusion criteria appropriate?
- Was the trial methodology appropriate - was there adequate blinding and randomisation?
- Was the primary outcome a clinically relevant or a surrogate endpoint?
- Was the length of follow-up adequate?
- Was the trial sufficiently powered to detect a clinically relevant effect?
- Were the groups studied equivalent at baseline?
- Is the statistical analysis appropriate - was there an intention to treat analysis, and have differences between groups at baseline been adjusted for? Are there multiple subgroup analyses, and if so were they specified a priori?
- Is this a single-centre or a multicentre study?
- Were the results clinically significant rather than just statistically significant?
- Is the primary hypothesis biologically plausible, with pre-existing supporting evidence?
- Are the findings supported by other evidence - have these results been replicated?
- Would there be logistical and/or financial implications in a practice change?
- Are there important adverse effects of the treatment?
Discussion
This question really asks, "how do you assess an RCT for validity?"
This is addressed in greater detail elsewhere.
In brief:
Is the premise sound?
- Is the primary hypothesis biologically plausible?
- Is the research ethical?
- If the results are valid, are there any disastrous logistical, ethical or financial implications to a change in practice?
Is the methodology of high quality?
- Were the inclusion/exclusion criteria appropriate?
- Was the assignment of patients to treatments randomised? If yes, then was it truly random?
- Were the study groups homogeneous?
- Were the groups treated equally?
- Are there any missing patients? Is every enrolled patient accounted for?
- Was follow-up complete? Is the dropout rate explained? Do we know what happened to the dropouts?
Is the reporting of an appropriate quality?
- The methods description should be complete: the trial should be reproducible
- Do the results have confidence intervals?
- Results should present relative and absolute effect sizes
- Is a CONSORT-style flow diagram of patient selection available?
- The discussion should address limitations, bias and imprecision
- Funding sources and the full trial protocol should be disclosed
Are the results of the study valid?
- Was there blinding? Was blinding even possible? Was it double-blind? If not, were at least the data interpreters and statisticians blinded?
- Was there allocation concealment?
- Was there intention-to-treat analysis?
- If there were subgroups, were they identified a priori?
What were the results?
- How large was the treatment effect?
- How precisely was the effect estimated? (i.e. what was the 95% confidence interval?)
Is this study helpful for me?
- Is this applicable to my patient, i.e. would my patient have been enrolled in this study?
- Does the population studied correspond with the population to which my patient belongs?
- Were all the clinically meaningful outcomes considered?
- Does the benefit outweigh the cost and risk?
References
Oh's Intensive Care manual: Chapter 10 (p83), Clinical trials in critical care by Simon Finfer and Anthony Delaney.
The JAMA collection via the Johns Hopkins Medical School
CASP (Critical Appraisal Skills Programme) has checklists for the appraisal of many different sorts of studies; these actually come with tickboxes. One imagines reviewers wandering around a trial headquarters, ticking these boxes on their little clipboards.
CEBM (Centre for Evidence Based Medicine) also has checklists, which (in my opinion) are more informative.
Here is a link to their checklist for the critical appraisal of an RCT.
Question 5 - 2013, Paper 2
With reference to the reporting of clinical trials in the literature:
a) What is a meta-analysis?
b) What are the advantages of a meta-analysis over the interpretation of an individual study?
c) List the features of a well-conducted meta-analysis.
d) What is “publication bias” and how can this impact on the validity of a meta-analysis?
College Answer
a)
- A form of systematic review that uses statistical methods to combine the results from different studies
b)
- ↑ Statistical power by ↑ sample size.
- Resolve uncertainty when studies disagree.
- Improve estimates of effect size.
- Inconsistency of results across studies can be quantified and analysed, e.g. heterogeneity of studies, sampling error.
- Presence of publication bias can be investigated.
- Establish questions for future RCTs.
- May provide information regarding generalisability of results.
c)
- Clearly defined research question.
- Thorough search strategy that makes it unlikely that significant studies have been missed.
- Reproducible and clear criteria for inclusion in the meta-analysis.
- Adequate and reproducible assessment of the methodological quality of the included studies.
- Use of appropriate statistical methods to assess for heterogeneity between studies and pooling of the results of studies when appropriate.
- Utilisation of methods to ensure that the results of the meta-analysis are reproducible; e.g. two reviewers perform aspects of the study (the search, the application of the inclusion/exclusion criteria, the assessment of validity, the data extraction).
- Assessment for the presence of publication bias/small study bias with report of the results of these analyses.
d)
- Publication bias is the publication or non-publication of studies depending on the direction and statistical significance of the results. A meta-analysis evaluating studies where there has been publication bias will be flawed, no matter how well conducted in other aspects.
- Publication bias may also extend to bias in the selection of studies for inclusion in a meta-analysis based on language, journal of publication, ease of access, field of research etc. (dissemination bias).
Salient points
- Meta-analysis = tool of quantitative systematic review
- Advantages:
  - increased statistical power
  - resolves heterogeneity
  - avoids Simpson's paradox
- A good meta-analysis has:
  - a well-structured question
  - broad search strategy
  - transparent methodology
  - attempt to exclude publication bias
  - Forest plot
  - measures of heterogeneity
Discussion
LITFL have an excellent resource for this.
a) What is a meta-analysis?
Meta-analysis is a tool of quantitative systematic review.
It is used to weigh the available evidence from RCTs and other studies, based on the numbers of patients included, the effect size, and on statistical tests of agreement with other trials.
b) What are the advantages of a meta-analysis over the interpretation of an individual study?
- A more objective quantitative appraisal of evidence
- Reduces the probability of false negative results
- The combination of samples leads to an improvement in statistical power
- Increased sample size may "normalise" the sample distribution and render the results more generalisable, i.e. increase the external validity of the findings
- Increased sample size may increase the accuracy of the estimate
- May explain heterogeneity between the results of different studies
- Inconsistencies among trials may be quantified and analysed
- RCT heterogeneity may be resolved
- Publication bias may be revealed
- Future research directions may be identified
- Avoids Simpson's paradox, in which a consistent effect in constituent trials is reversed when results are simply pooled.
c) List the features of a well-conducted meta-analysis.
- Research questions clearly defined
- Transparent search strategy
- Thorough search protocol
- Authors contacted and unpublished data collected
- Definition of inclusion and exclusion criteria for studies
- Sensible exclusion and inclusion criteria
- Assessment of the methodological quality of the included studies
- Transparent methodology of assessment
- Calculation of a pooled estimate
- Plot of the results (Forest plot)
- Measurement of heterogeneity
- Assessment of publication bias (funnel plot)
- Reproducible meta-analysis strategy (e.g. multiple reviewers perform the same meta-analysis, according to the same methods)
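"Measurement of heterogeneity" usually means Cochran's Q and the I² statistic. A sketch of the standard fixed-effect calculation (the effect sizes and standard errors here are invented for illustration):

```python
import math

def heterogeneity(effects, ses):
    """Cochran's Q and the I^2 statistic for a set of per-study effect
    estimates (e.g. log odds ratios) with their standard errors, using
    fixed-effect inverse-variance weights."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2

# Four invented trials: three agree, one is an outlier
pooled, q, i2 = heterogeneity(
    effects=[-1.2, -0.9, -1.1, -0.2],   # log odds ratios
    ses=[0.40, 0.35, 0.50, 0.30],
)
print(f"pooled log OR = {pooled:.2f}, Q = {q:.2f}, I^2 = {i2:.0%}")
```

The outlying fourth trial pushes Q above its degrees of freedom, giving a moderate I².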
d) What is “publication bias” and how can this impact on the validity of a meta-analysis?
- Publication bias is the influence of study results on the likelihood of their publication
- A funnel plot can be used to identify publication bias.
- A meta-analysis can be invalidated if publication bias has influenced the included studies.
- Publication bias leads to the selection of mostly positive (or mostly negative) studies, which in turn biases the meta-analysis results in that direction. Studies with the opposite effect may not have been selected for publication, and may not be available to the meta-analysis authors.
- Meta-analysis authors may introduce their own publication bias by only using English-language studies, only free-access articles, or only focusing their search within a narrow field of research.
- Publication bias can be overcome by contacting relevant authors and requesting unpublished trial data, by searching for publications in all languages, and by searching broadly in multiple cross-specialty databases.
References
DerSimonian, Rebecca, and Nan Laird. "Meta-analysis in clinical trials." Controlled Clinical Trials 7.3 (1986): 177-188.
Rockette, H. E., and C. K. Redmond. "Limitations and advantages of meta-analysis in clinical trials." Cancer Clinical Trials. Springer Berlin Heidelberg, 1988. 99-104.
Walker, Esteban, Adrian V. Hernandez, and Michael W. Kattan. "Meta-analysis: Its strengths and limitations." Cleveland Clinic Journal of Medicine 75.6 (2008): 431-439.
Methodological Expectations of Cochrane Intervention Reviews
Question 26 - 2014, Paper 1
With reference to clinical studies:
a) Define the term "external validity".
b) Define the term "bias".
c) Briefly explain selection bias and measures to reduce it.
College Answer
a) External validity is the extent to which the results of a study can be generalised to other situations, e.g. a different case-mix.
b) Bias in statistics is defined as systematic distortion of the observed result away from the "truth", caused by inadequacies in the design, conduct, or analysis of a trial.
c) Selection bias is caused by a systematic error in creating intervention groups, such that they differ with respect to prognosis. The study groups differ in measured or unmeasured baseline characteristics because of the way participants were selected or assigned. Selection bias also means that the study population does not reflect a representative sample of the target population. Selection bias undermines the external validity of the study, and the conclusions drawn by the study should not be extended to other patients.
Measures to reduce selection bias include:
- Randomisation: randomisation assigns patients to treatment arms by chance, avoiding any systematic imbalance in characteristics between patients receiving the experimental versus the control intervention.
- Allocation concealment: the allocation sequence is the order in which participants are to be allocated to treatment. Allocation concealment involves not disclosing the allocation sequence to patients, and to those involved in recruiting trial participants, before random allocation occurs.
Discussion
External validity: the extent to which the study results can be generalised to the greater population, which is influenced by a vast array of factors:
- The setting and the population from which the sample was selected
- The inclusion and exclusion criteria
- The "randomness" of the sample, and the baseline characteristics of the patients
- The difference between the trial control group and routine practice
- The changes in practice since the publication of the trial
- The use of patient-centered outcomes
- The degree to which the surrogate outcome measures are related to patient-centered outcomes
Bias: a systematic error which distorts study findings
- It is caused by flaws in study design, data collection or analysis
- It is not altered by sample size (increasing the sample size only decreases random variation and the influence of chance)
- It can creep in at any stage of research, from the literature search to the publishing of the results.
Selection bias: the selection of specific patients, resulting in a sample group which is not random and not representative of the population. This can be avoided by randomisation, allocation concealment and blinding.
The college answer actually comes from the CONSORT Statement glossary:
"Systematic error in creating intervention groups, such that they differ with respect to prognosis. That is, the groups differ in measured or unmeasured baseline characteristics because of the way participants were selected or assigned. Also used to mean that the participants are not representative of the population of all possible participants."
References
Higgins, Julian PT, and Sally Green, eds. Cochrane handbook for systematic reviews of interventions. Vol. 5. Chichester: WileyBlackwell, 2008.
Question 13 - 2014, Paper 2
a) With respect to meta-analysis of randomised controlled trials, what is a funnel plot?
b) In the funnel plot above:
i. What do the outer dashed lines indicate?
ii. To what does the solid vertical line correspond?
c) List three factors that result in asymmetry in funnel plots.
College Answer
a) A funnel plot is a scatter plot of the effect estimates from individual studies against some measure of each study’s size or precision. The standard error of the effect estimate is often chosen as the measure of study size and plotted on the vertical axis with a reversed scale that places the larger, most powerful studies towards the top. The effect estimates from smaller studies should scatter more widely at the bottom, with the spread narrowing among larger studies.
b)
Outer dashed lines: the triangular region where 95% of studies are expected to lie
Solid vertical line: no intervention effect
c)
i) Heterogeneity
- Size of effect differs according to study size
- Clinical differences
- Methodological differences
ii) Reporting bias
- Publication bias: delayed publication, language, citation, multiple publication bias
- Selective outcome reporting
- Selective analysis/inadequate analysis reporting
- Poor design
- Fraud
iii) Chance
It was expected that candidates regularly attending journal club would have the knowledge to answer this question, but overall it was not well answered and the explanation of terms was poor.
Discussion
The above-depicted plot is not the gospel plot from the CICM paper, but one which I have confabulated myself. Hopefully, it bears some resemblance to the original.
a) is answered by the college in a manner which precisely reflects the wording of the Cochrane Handbook. That is indeed "a simple scatter plot of the intervention effect estimates from individual studies against some measure of each study’s size or precision".
b)
The lines? What do they mean? Said best by the laconic college:
 Outer dashed lines - triangular region where 95% of studies are expected to lie. This triangle is centred on a fixed effect summary estimate, and extends 1.96 standard errors in each direction. If no bias is present, this triangle will include about 95% of studies, provided the true treatment effect is the same in each study (i.e. none were using some sort of dodgy homemade levosimendan, for instance).
 Solid vertical line - no intervention effect. This corresponds to an OR of 1.00.
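Those two reference lines can be reproduced numerically. The sketch below uses entirely invented per-study numbers (and a made-up `funnel_limits` helper, not anything from the CICM paper): it computes a fixed-effect summary of some log odds ratios, and the ±1.96 SE pseudo-confidence contour that forms the dashed triangle.

```python
import math

# Hypothetical per-study data: log odds ratios and their standard errors.
# (Invented numbers, purely for illustration.)
log_or = [-1.2, -0.8, -1.0, -1.1, -0.9]
se = [0.6, 0.4, 0.3, 0.2, 0.1]

# Fixed-effect summary estimate: inverse-variance weighted mean of the log-ORs
weights = [1.0 / s ** 2 for s in se]
summary = sum(w * y for w, y in zip(weights, log_or)) / sum(weights)

# Pseudo 95% limits of the funnel: summary +/- 1.96 SE for any given SE.
# Plotted against SE (with the axis reversed), these trace the dashed triangle.
def funnel_limits(standard_error):
    half_width = 1.96 * standard_error
    return summary - half_width, summary + half_width

lo, hi = funnel_limits(0.3)
```

Evaluating `funnel_limits` over a range of standard errors (with the SE axis reversed, so the most precise studies sit at the top) traces out the triangular region the college describes.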
c)
Causes of asymmetry are well summarised by Sterne et al (2011), whose Box 1 I have shamelessly stolen:
Reporting biases
Poor methodological quality
True heterogeneity
Artefactual
Chance
References
DerSimonian, Rebecca, and Nan Laird. "Meta-analysis in clinical trials." Controlled Clinical Trials 7.3 (1986): 177-188.
Rockette, H. E., and C. K. Redmond. "Limitations and advantages of meta-analysis in clinical trials." Cancer Clinical Trials. Springer Berlin Heidelberg, 1988. 99-104.
Walker, Esteban, Adrian V. Hernandez, and Michael W. Kattan. "Meta-analysis: its strengths and limitations." Cleveland Clinic Journal of Medicine 75.6 (2008): 431-439.
Methodological Expectations of Cochrane Intervention Reviews
Sterne, Jonathan AC, et al. "Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials." BMJ 343 (2011): d4002.
Question 8 - 2015, Paper 1
A systematic review of the literature was undertaken comparing proton pump inhibitors with H2-receptor blockers for the prevention of gastrointestinal bleeding in ICU patients.
a) Name the type of graph illustrated in the above figure. (10% marks)
b) What does it show? (25% marks)
c) What are the benefits of this type of analysis? (25% marks)
d) What are the disadvantages of this analysis? (40% marks)
College Answer
a)
Forest plot
b)
Combining the trials together, PPI use results in an odds ratio of 0.35, or a reduction in the risk of bleeding compared to H2RA. Alternatively, PPI use results in a 65% reduction (1 - 0.35) in bleeding.
c)
Combines small studies with limited power, increasing the number and thus the ability to pick up a positive effect. Small studies with low power (due to small effect, small numbers) run the risk of a Type II error.
d)
Individual studies might have different patient populations (with different risk of bleeding) or different definitions of outcome.
Individual studies might have been conducted with different degrees of rigour (blinding, etc.)
There is publication bias towards positive studies, so that negative studies are not reported. Full disclosure is needed of how the studies were selected, their scientific grading, subgroup analyses and assessment of heterogeneity.
Discussion
I have no idea whether the college actually used this exact image, but certainly the paper was correctly identified by LITFL. My hat is off to Chris Nickson, who managed to track down the exact PPI vs H2A study which had this exact forest plot and OR / RRR. It was indeed the Alhazzani study from 2013.
So:
a) and b) are actually a part of the Primary exam Syllabus, and are reviewed in greater detail in the chapter on forest and box-and-whisker plots. In short:
 This is a forest plot.
 The horizontal lines – confidence intervals of the OR
 The position of the square – point estimate of the OR
 The size of the square – the weight of the study
 The vertical line: OR of 1 (no association)
 If the CI of the summed results crosses the vertical line, the treatment is no more effective than control.
 This study shows that PPIs are better than H2As in reducing the risk of bleeding.
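For the mechanically minded, the summary diamond of a forest plot can be sketched as an inverse-variance weighted average of the log odds ratios. The study rows and the `pooled_or` function below are entirely hypothetical (this is not the Alhazzani data), just an illustration of how the diamond's position relates to the boxes and whiskers:

```python
import math

# Hypothetical forest-plot rows: (OR, lower 95% CI, upper 95% CI) per study.
# These numbers are invented and are not from any published meta-analysis.
studies = [(0.30, 0.10, 0.90), (0.40, 0.20, 0.80), (0.35, 0.15, 0.82)]

def pooled_or(rows):
    # Recover the SE of each log-OR from the CI width:
    # se = (ln(upper) - ln(lower)) / (2 * 1.96)
    num, den = 0.0, 0.0
    for or_, lo, hi in rows:
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
        weight = 1.0 / se ** 2          # inverse-variance weight (the box size)
        num += weight * math.log(or_)
        den += weight
    return math.exp(num / den)          # position of the summary diamond

summary = pooled_or(studies)
```

Note that the pooled estimate is pulled towards the most precise (narrowest-CI) studies, which is exactly why the largest boxes dominate the diamond.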
c) and d)
Advantages of meta-analysis
 A more objective appraisal of evidence
 Reduces the probability of false negative results
 May explain heterogeneity between the results of different studies
 Avoids Simpson’s paradox, in which a consistent effect in constituent trials is reversed when results are simply pooled.
Disadvantages of meta-analysis
 Frustrated by heterogeneity of population samples and methodologies
 Selection of studies may be biased
 Negative studies are rarely published, and thus may not be included
 The meta-analysis uses summary data rather than individual data
References
Alhazzani, Waleed, et al. "Proton pump inhibitors versus histamine 2 receptor antagonists for stress ulcer prophylaxis in critically ill patients: a systematic review and meta-analysis." Critical Care Medicine 41.3 (2013): 693-705.
Methodological Expectations of Cochrane Intervention Reviews
Schriger, David L., et al. "Forest plots in reports of systematic reviews: a cross-sectional study reviewing current practice." International Journal of Epidemiology 39.2 (2010): 421-429.
Lewis, Steff, and Mike Clarke. "Forest plots: trying to see the wood and the trees." BMJ 322.7300 (2001): 1479-1480.
Anzures-Cabrera, Judith, and Julian Higgins. "Graphical displays for meta-analysis: an overview with suggestions for practice." Research Synthesis Methods 1.1 (2010): 66-80.
Cochrane: "Considerations and recommendations for figures in Cochrane reviews: graphs of statistical data", 4 December 2003 (updated 27 February 2008).
Reade, Michael C., et al. "Bench-to-bedside review: Avoiding pitfalls in critical care meta-analysis - funnel plots, risk estimates, types of heterogeneity, baseline risk and the ecologic fallacy." Critical Care 12.4 (2008): 220.
DerSimonian, Rebecca, and Nan Laird. "Meta-analysis in clinical trials." Controlled Clinical Trials 7.3 (1986): 177-188.
Biggerstaff, B. J., and R. L. Tweedie. "Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis." Statistics in Medicine 16.7 (1997): 753-768.
The Cochrane Handbook: 9.5.4 "Incorporating heterogeneity into random-effects model"
Question 19 - 2016, Paper 1
Explain the following terms as applied to a randomised controlled clinical trial:
a) Allocation concealment. (25% marks)
b) Block randomisation, using block sizes of 4, in a trial of drug A versus drug B. (25% marks)
c) Stratification. (25% marks)
d) Minimisation algorithm. (25% marks)
College Answer
a)
Procedure for protecting the randomization process and ensuring that the clinical investigators and those involved in the conduct of the trial are not aware of the group to which the subject has been allocated
b)
Simple randomisation may result in unequal treatment group sizes; block randomisation is a method that may protect against this problem and is particularly useful in small trials.
In the context of a trial evaluating drug A or drug B and with block sizes of 4, there are 6 possible blocks of randomisation: AABB, ABAB, ABBA, BAAB, BABA, BBAA.
One of the 6 possible blocks is selected randomly and the next 4 study participants are assigned according to the order of the block. The process is then repeated as needed to achieve the necessary sample size.
c)
Stratification is a process that protects against imbalance in prognostic factors that are present at the time of randomisation.
A separate randomisation list is generated for each prognostic subgroup. Usually limited to 23 variables because of increasing complexity with more variables.
d)
This is an alternative to stratification for maintaining balance in several prognostic variables. The minimisation algorithm maintains a running total of the prognostic variables in patients that have already been randomised and then subsequent patients are assigned using a weighting system that minimizes imbalance in those prognostic variables.
Discussion
In this paper, only one candidate (2.5% of the cohort) managed to just pass this question (i.e. they got 5 marks out of 10).
a) Allocation concealment:
 This is a technique of preventing selection bias.
 The selection of patients is randomised, and nobody knows what treatment the next enrolled patient will receive.
 A truly random sequence of allocations prevents the investigators from being able to predict the allocated treatment on the basis of previous allocated treatments.
 The difference between blinding and allocation concealment is that allocation concealment prevents the investigators from predicting who is getting what treatment before the patient is enrolled, whereas blinding prevents the investigators from knowing who is getting what treatment after the patient is enrolled.
b) Block randomisation:
 Arrangement of experimental subjects in blocks, designed to keep the group numbers the same.
 Usually, the block size is a multiple of the number of treatments (i.e. if it is a binary Drug A vs Drug B trial, the blocks would be in multiples of two).
 Small blocks are better than large blocks.
 The example where block sizes of 4 are used in a trial of drug A versus drug B is the same example used by Bland and Altman in their classic 1999 article, "How to randomise".
 That example now, verbatim:
"...sometimes we want to keep the numbers in each group very close at all times. Block randomisation (also called restricted randomisation) is used for this purpose. For example, if we consider subjects in blocks of four at a time there are only six ways in which two get A and two get B: 1:AABB 2:ABAB 3:ABBA 4:BBAA 5:BABA 6:BAAB. We choose blocks at random to create the allocation sequence. Using the single digits of the previous random sequence and omitting numbers outside the range 1 to 6 we get 5623665611. From these we can construct the block allocation sequence BABA/BAAB/ABAB/ABBA/BAAB, and so on. The numbers in the two groups at any time can never differ by more than half the block length. Block size is normally a multiple of the number of treatments."
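The procedure in that quote can be sketched in a few lines of code. This is a hypothetical illustration (the `block_randomise` function and its defaults are my own): shuffling each block of AABB is equivalent to picking one of the six enumerated permutations at random.

```python
import random

def block_randomise(n_patients, treatments=("A", "B"), block_size=4, seed=1):
    # Each block contains every treatment block_size / len(treatments) times.
    # Shuffling the block is equivalent to choosing one of the six permutations
    # (AABB, ABAB, ABBA, BAAB, BABA, BBAA) at random.
    rng = random.Random(seed)
    per_block = block_size // len(treatments)
    sequence = []
    while len(sequence) < n_patients:
        block = list(treatments) * per_block
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]

allocation = block_randomise(12)
```

As the quote says, the numbers in the two groups at any time can never differ by more than half the block length, because every completed block is perfectly balanced.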
c) Stratification:
 Stratification is the partitioning of subjects and results by a factor other than the treatment given.
 Stratification ensures that pre-identified confounding factors are equally distributed, to achieve balance. The objective is to remove "nuisance variables", eg. the presence of neutropenia in a trial performed on septic patients. One would want to ensure that the treatment group and the placebo group had equal numbers of these haematology disasters.
d) Minimisation algorithm:
 Minimisation is a method of adaptive stratified sampling.
 The objective is to minimise the imbalance between groups of patients in a clinical trial by ensuring that the treatment group and placebo group each get an equal number of patients with some sort of predetermined characteristics which might act as confounding factors.
 The minimisation algorithm carefully places patients in groups according to these pre-identified confounding factors. Only the first patient is randomly allocated.
 Minimisation is methodologically equivalent to true randomisation, but does not correct for unknown confounders (only the known predetermined ones)
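A toy version of such an algorithm might look like this. Everything here is hypothetical (the `minimise` function, the factor names, the simple count-based scoring); real implementations, such as the Pocock-Simon method, use more sophisticated weighting:

```python
import random

def minimise(patients, factors, seed=42):
    # Running totals of each prognostic factor per arm
    rng = random.Random(seed)
    counts = {"A": {f: 0 for f in factors}, "B": {f: 0 for f in factors}}
    allocations = []
    for patient in patients:  # each patient is a set of factors they possess
        # Imbalance score each arm already carries for this patient's factors
        score = {arm: sum(counts[arm][f] for f in patient) for arm in ("A", "B")}
        if score["A"] == score["B"]:
            arm = rng.choice(("A", "B"))      # tie: allocate at random
        else:
            arm = min(score, key=score.get)   # otherwise: the less-loaded arm
        for f in patient:
            counts[arm][f] += 1
        allocations.append(arm)
    return allocations

factors = ("neutropenia", "age>65")
patients = [{"neutropenia"}, {"neutropenia"}, {"age>65"}, {"neutropenia", "age>65"}]
arms = minimise(patients, factors)
```

Note how only ties are broken randomly: after the first patient, each allocation is largely determined by the running imbalance, which is exactly why minimisation keeps the known confounders balanced but cannot help with unknown ones.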
References
Altman, Douglas G., and J. Martin Bland. "How to randomise." BMJ 319.7211 (1999): 703-704.
Question 20 - 2016, Paper 2
Evaluation of a novel serum biomarker for the rapid diagnosis of sepsis is performed in a sample of 100 patients with fever. The biomarker is compared with positive culture results as the gold standard and yields the following information:
                      Sepsis present        Sepsis absent
                      (culture positive)    (culture negative)
Biomarker positive    30                    10
Biomarker negative    30                    30
n                     60                    40
With reference to these results, define the following and give the values for the performance of the test:
a) Sensitivity. (20% marks)
b) Specificity. (20% marks)
c) Positive predictive value. (20% marks)
d) Negative predictive value. (20% marks)
e) Accuracy. (20% marks)
College answer
                      Sepsis present        Sepsis absent
                      (culture positive)    (culture negative)
Biomarker positive    30 (a)                10 (b)                (a + b)
Biomarker negative    30 (c)                30 (d)                (c + d)
n                     60 (a + c)            40 (b + d)            (a + b + c + d)
a) Ability of test to identify true positives,
or the probability that the test will be positive in individuals who do have the disease.
Sensitivity = a/(a+c) = 30/60 = 50%
b) Ability of test to identify true negatives,
or the probability that the test will be negative in individuals who do not have the disease.
Specificity = d/(b+d) = 30/40 = 75%
c) Likelihood of a positive test meaning the patient has sepsis.
PPV = a/(a+b) = 30/40 = 75%
d) Likelihood of a negative test meaning the patient does not have sepsis.
NPV = d/(c+d) = 30/60 = 50%
e) The ability to differentiate patient and healthy cases correctly.
Accuracy = (a+d)/(a+b+c+d) = 60/100 = 60%
Additional Examiners' Comments:
The question clearly stated that a definition was required. Many candidates either could not define the terms or just missed this part of the question and therefore missed out on marks. This question has come up a number of times in past exams and these are basic statistical concepts that some candidates clearly do not understand.
Discussion
This question closely resembles all other previous questions about the measures of diagnostic test accuracy:
 Question 19.2 from the first paper of 2010 (Calculate sensitivity, specificity, PPV and NPV)
 Question 29.2 from the first paper of 2008 (Calculate sensitivity, specificity, PPV and NPV)
 Question 15 from the first paper of 2007 (Calculate sensitivity, specificity, PPV, NPV and PLR)
 Question 13 from the first paper of 2005 (Define sensitivity, specificity, PPV and NPV)
 Question 14 from the second paper of 2002 (Define sensitivity, specificity, PPV and NPV)
After being absent from the papers for over five years, one might have been forgiven for thinking that such calculator-intense statistics questions were demoted to the level of primary exam material (as most recent statistics questions in the Fellowship Exam have been more about interpretation of meta-analysis data and other such ultra-clever "fellow level" uses of EBM). The main difference in 2016 was the addition of accuracy as one of the examined parameters. This has never been examined previously, and is not a frequently mentioned measure (even though colloquially we might use the term near-constantly). An excellent 2008 article was used to define it for the purposes of this model answer.
Clearly, at least one candidate remembered all the definitions, and got 10 marks.
a)
 Sensitivity = true positives / (true positives + false negatives)
 This is the proportion of disease which was correctly identified.
 In this case, Sn = 30 / (30 + 30) = 50%
b)
 Specificity = true negatives / (true negatives + false positives)
 This is the proportion of healthy patients in whom disease was correctly excluded
 In this case, Sp = 30 / (30 + 10) = 75%
c)
 Positive Predictive Value = true positives / total positives (true and false)
 This is the proportion of the positive tests results which are actually positive
 In this case, PPV = 30 / (30 + 10) = 75%
d)
 Negative Predictive Value = true negatives / total negatives (true and false)
 This is the proportion of negative test results which are actually negative
 In this case, NPV = 30 / (30 + 30) = 50%
e)
 Accuracy = (true positives + true negatives) / (total)
 This is the proportion of correctly classified subjects among all subjects
 In this case, accuracy = (30 + 30) / 100 = 60%
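All five calculations can be checked with a few lines of arithmetic; this sketch simply restates the formulas above using the question's 2x2 table:

```python
# The question's 2x2 table, restated as the four cell counts
tp, fp = 30, 10   # biomarker positive: culture positive / culture negative
fn, tn = 30, 30   # biomarker negative: culture positive / culture negative

sensitivity = tp / (tp + fn)                    # 30/60 = 0.50
specificity = tn / (tn + fp)                    # 30/40 = 0.75
ppv = tp / (tp + fp)                            # 30/40 = 0.75
npv = tn / (tn + fn)                            # 30/60 = 0.50
accuracy = (tp + tn) / (tp + fp + fn + tn)      # 60/100 = 0.60
```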
References
Šimundić, Ana-Maria. "Measures of diagnostic accuracy: basic definitions." Med Biol Sci 22.4 (2008): 61-5.
Question 11 - 2017, Paper 1
The following table gives information on the proportions of a population that have been exposed to a risk factor for a disease and then subsequently developed the disease.
Exposure + indicates the proportion exposed to the risk factor (A+B)
Exposure - indicates the proportion not exposed to the risk factor (C+D)
Disease + indicates the proportion that subsequently developed the disease (A+C)
Disease - indicates the proportion that did not subsequently develop the disease (B+D)
              Disease +    Disease -
Exposure +    A            B
Exposure -    C            D
a) Define prevalence AND, with reference to A, B, C, D in the table above, give the prevalence of the disease in this population. (20% marks)
b) Define relative risk (RR) AND, with reference to A, B, C, D in the table above, derive the relative risk of developing the disease after exposure to the risk factor. (40% marks)
c) Define attributable risk (AR) AND, with reference to A, B, C, D in the table above, give the attributable risk of exposure to the risk factor on developing the disease in this population. (40% marks)
College answer
a) Prevalence: the number of events (e.g. cases of disease) in a specific population at a particular time point.
Prevalence of the Disease in this population:
(A+C) / (A+B+C+D)
b) Relative risk is the ratio of the probability of an event occurring (e.g. developing a disease) in an exposed group to the probability of the event occurring in a comparison, non-exposed group.
[A / (A+B) ] / [ C / (C+D)]
c) Attributable risk is the difference in the rate of a condition between an exposed and unexposed population.
A/(A+B) - C/(C+D)
Discussion
This is another SAQ which makes it very easy to earn high marks, as it asks for unambiguous memorised definitions and has a clearcut right answer.
Somebody got 9.2.
Prevalence:
 The proportion of individuals in a population having a disease or characteristic in a particular population at a given time.
 Prevalence = number of affected individuals / total number in population
= (A+C) / (A+B+C+D)
Relative risk:
 This is the ratio of the probability of an event occurring in the exposed group to the probability of the event occurring in the non-exposed group. The slightly broken English of the college answer probably comes from an article similar to the 2017 article by Tenny et al, and was probably meant to say "relative risk is a ratio of the probability of an event occurring in the exposed group versus the probability of the event occurring in the non-exposed group."
 RR = absolute risk in treatment group / absolute risk in control group
 (absolute risk = number of cases in group / total number in group)
 Thus, RR = [A/(A+B)] / [C/(C+D)]
Attributable risk:
 This is a measure of the absolute effect of the risk of those exposed compared to unexposed. It indicates the number of cases of a disease among exposed individuals that can be attributed to that exposure
 AR = Incidence(exposed) – Incidence(unexposed)
 (incidence = number of cases / population at risk)
 Thus, AR = [A/(A+B)] - [C/(C+D)]
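With some hypothetical counts substituted for A, B, C and D (the numbers below are invented purely for illustration), the three formulas work out as follows:

```python
# Hypothetical counts standing in for A, B, C and D (invented for illustration)
A, B, C, D = 40, 60, 10, 90

prevalence = (A + C) / (A + B + C + D)       # disease in the whole population
risk_exposed = A / (A + B)                   # incidence among the exposed
risk_unexposed = C / (C + D)                 # incidence among the unexposed
relative_risk = risk_exposed / risk_unexposed        # a ratio of risks
attributable_risk = risk_exposed - risk_unexposed    # a difference in risks
```

The distinction is worth internalising: relative risk is a ratio (here 4.0), while attributable risk is a difference (here 0.3, i.e. 30 extra cases per 100 exposed).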
References
Question 28 - 2017, Paper 2
A colleague directs your attention to a recently published randomised trial on a therapeutic intervention.
Outline the features of the trial that would lead you to change your practice.
College answer
Points to consider in the answer would be:
 Does the population studied correspond with the population the candidate expects to treat?
 Were the inclusion/exclusion criteria appropriate?
 Was the trial methodology appropriate – was there adequate blinding and randomisation?
 Was the primary outcome a clinically relevant or a surrogate endpoint?
 Was the length of follow up adequate?
 Was the trial sufficiently powered to detect a clinically relevant effect?
 Were the groups studied equivalent at baseline?
 Is the statistical analysis appropriate – was there an intention to treat analysis, have differences between groups at baseline been adjusted for? Are there multiple sub group analyses, and if so were they specified a priori?
 Is this a single-centre study or multi-centre?
 Were the results clinically significant rather than just statistically significant?
 Is the primary hypothesis biologically plausible with preexisting supporting evidence?
 Are the findings supported by other evidence – have these results been replicated?
 Would there be logistical and/or financial implications in practice change?
 Are there important adverse effects of the treatment?
Discussion
This is slightly different to asking "what makes a valid trial" or "how do you judge highquality evidence", even though these clearly play a role (and in fact the college answer consists of a boring list of such criteria). There are situations where practice is changed by methodologically inferior but otherwise compelling studies; or where expertly designed trials make minimal impact in the daily practice of individuals. A good read on this specific subject is a wonderfully titled 2016 article by John Ioannidis, "Why most clinical research is not useful."
In short, a trial should possess the following features in order to affect practice:
Answers to a real problem. The clinical trial needs to be addressing something which is a problem, and which needs to be fixed in some way. If there is no problem, then the trial was pointless because existing practice is already good enough (i.e. no matter how good the methodological quality, the trial can be safely ignored because your practice does not need to change). Similarly, if the problem is not sufficiently serious, the cost and consequences of changing practice outweighs the benefit.
Information Gain. The clinical trial should have offered an answer which we don't already know.
Pragmatism. The trial should be related to a reallife population and realistic settings, rather than some idealised scenario.
Patientcentered outcome. Some might argue that research should be aligned with the priorities of patients rather than those of investigators or sponsors.
Transparency. The trial authors should be transparent in order for the results to inspire enough confidence to change practice on the basis of its results.
Validity. The trial should be constructed with sufficient methodological quality for its results to be taken seriously.
References
Oh's Intensive Care manual: Chapter 10 (p83), Clinical trials in critical care by Simon Finfer and Anthony Delaney.
JAMA: Users' Guides to the Medical Literature; see if you can get institutional access to these articles.
The CONSORT statement has its own website and is available for all to peruse.
CASP (Critical Appraisal Skills Program) has checklists for the appraisal of many different sorts of studies; these actually come with tickboxes. One imagines reviewers wandering around a trial headquarters, ticking these boxes on their little clipboards.
Ioannidis, John PA. "Why most clinical research is not useful." PLoS medicine 13.6 (2016): e1002049.
Question 24 - 2018, Paper 1
Regarding randomised clinical trials:
a) What is a noninferiority trial? (10% marks)
b) What is the null hypothesis in a noninferiority trial? (10% marks)
c) Why would a noninferiority trial be undertaken instead of a superiority trial? (40% marks)
d) What are the limitations of noninferiority trials? (40% marks)
College answer
a)
An active control trial which tests whether an experimental treatment is not worse than the control treatment by more than a specified margin. Originally conceived as “a safe alternative” treatment.
b)
The null hypothesis states that the primary end point for the new treatment is worse than that of the active control by a prespecified margin, and rejection of the null hypothesis at a prespecified level of statistical significance permits a conclusion of noninferiority
c)
Typically, a placebo controlled trial would be considered unethical as an established treatment already exists.
The investigators may consider the experimental treatment unlikely to be superior to established treatment or the current treatment is highly effective.
The experimental treatment may offer advantages such as safety (reduced adverse effects), better compliance, lower cost or more convenience.
d)
Proving that two treatments are equivalent could mean that they are both ineffective or even harmful. Could lead to the acceptance of progressively worse treatments if noninferiority is blindly accepted with repeated noninferiority trials ('biocreep').
Conditions and practice may have changed since the original placebo trial of the current standard treatment.
Equipoise is more complex.
Analysis is more complex
A poorly conducted study tends towards "noninferiority", as missing data and protocol violations favour noninferiority.
The margin by which noninferiority is determined is arbitrarily decided by the researchers and may not be clinically appropriate
Sample sizes are larger than in placebo-controlled trials.
Examiners Comments:
Very poorly answered. Evidence based medicine is an important part of the curriculum and the examiners were concerned at the low level of knowledge displayed. Some candidates appeared to list unrelated phrases from the EBM literature without any appearance of understanding.
The level of detail given in the template was not required to obtain a passing mark in this question.
Discussion
a) What is a noninferiority trial?
 Superiority trials aim to demonstrate that there is a difference between treatments, i.e. that one treatment is better than another
 Equivalence trials aim to demonstrate that the effects differ by no more than a specific amount (the "equivalence margin").
 Noninferiority trials aim to demonstrate that an experimental treatment is not worse than an active control by more than the equivalence margin
b) What is the null hypothesis in a noninferiority trial?
In superiority trials, the hypothesis is that the experimental treatment is different from (better than) the standard treatment, and two-sided statistical tests are used to test the null hypothesis (because the experimental treatment could be better or worse). The null hypothesis is therefore that there really is no difference. In equivalence trials, the null hypothesis is that the treatments differ by more than a specified margin (the "equivalence margin"). In noninferiority trials, the null hypothesis is that the experimental treatment is worse than the standard treatment, and the prespecified equivalence margin determines how much worse.
The diagram below is borrowed and modified from Ian A Scott (2009), and demonstrates the results and confidence interval ranges expected of the three different types of trials, when they have demonstrated that the null hypothesis is false.
Superiority trials have to have their results well over to the "favours experimental treatment" side, usually by a prespecified margin. Equivalence trials need to have their results and confidence intervals within that margin to confirm that the two treatments are in fact equivalent. Noninferiority trials also need to have their results within that margin, but there is no need to prove that the treatment is superior (i.e. the confidence intervals and results simply need to remain within the "not much worse" margin, the "+1%" line in the diagram).
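The margin logic can be illustrated with a crude sketch: compute the risk difference, its 95% confidence bound, and compare against the margin. Everything here (the `non_inferior` function, the normal-approximation confidence interval, the example numbers) is a hypothetical simplification for illustration, not a substitute for a proper trial analysis:

```python
import math

# Crude sketch: is the new treatment non-inferior on an event-rate outcome?
def non_inferior(events_new, n_new, events_std, n_std, margin=0.01):
    p_new, p_std = events_new / n_new, events_std / n_std
    diff = p_new - p_std                 # > 0 means the new treatment is worse
    # Normal-approximation standard error of the risk difference
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    upper_95 = diff + 1.96 * se          # upper confidence bound of the difference
    # Non-inferiority is claimed only if the whole interval sits below the margin
    return upper_95 < margin

result = non_inferior(98, 1000, 100, 1000, margin=0.03)
```

Note how the verdict depends entirely on the chosen margin: the same data that pass a generous 3% margin would fail a strict 1% one, which is precisely the "arbitrary margin" problem raised in d) below.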
c) Why would a noninferiority trial be undertaken instead of a superiority trial?
A noninferiority trial is appropriate when:
 A placebo treatment is unethical
 The standard treatment is exceptionally effective
 The experimental treatment is thought to be equivalent or at least not worse but not superior to the current treatment (i.e. everybody is convinced that a superiority trial would show no difference)
 The experimental treatment is expected to be similar to the standard treatment in terms of the primary outcome, but has other unrelated advantages (eg. is cheaper, less invasive or more convenient), in which case it would be helpful to demonstrate that its efficacy is not worse.
d) What are the limitations of noninferiority trials?
 The standard of care you test against may be more harmful than placebo.
 Because you are not testing against placebo, a situation may arise where both treatments are similarly harmful, and you have merely demonstrated that your experimental treatment is not any more harmful than the current harmful standard of care.
 Because you are not testing against placebo the effect size difference is smaller, and in order to achieve satisfactory power the sample size needs to be larger (and your trial becomes more expensive).
 If the effect of the standard treatment is very close to the effect of a placebo, then the effect of the supposedly noninferior experimental treatment may end up being very close to the placebo.
 If you test one treatment and prove that it is not much worse, and then test another treatment proving that it is not much worse than the last, you may eventually come to a point where after multiple noninferiority trials you have demonstrated that your terrible useless treatment is not much worse than the other terrible useless treatment, something described as "biocreep", or the acceptance of progressively worse treatments.
 Equipoise is ethically necessary to run these trials, but there may be no equipoise with regards to noninferiority (i.e. some may genuinely believe that the standard treatment is substantially superior to the experimental treatment). Considering that the null hypothesis is that the experimental treatment is much worse, some ethicists may argue that true equipoise is impossible. You basically end up consenting your enrolled patients to agree that they may be randomised to a treatment which is believed to be inferior, or which at best might turn out to be no better.
 A poorly conducted superiority trial (i.e. with many protocol violations and dropouts) will have a result which trends towards noninferiority because through intentiontotreat analysis the effect size of the experimental treatment will be diluted.
 The investigators are in control of the equivalence margin, which means they could have decided on an inappropriately wide margin. If the margin is established after the results become available, the experimental treatment could appear "not much worse" by manipulating how much worse you would accept as a threshold. Even prespecified margins might be completely arbitrary and inappropriate. There is some pressure to select an inappropriately wide limit - the wider the limit, the smaller the sample size you will require, and the cheaper your trial. This may lead to truly ridiculous conclusions. For instance, Silvio Garattini (2007) describes the COMPASS trial where "the thrombolytic saruplase was judged equivalent to streptokinase for post-myocardial infarction, even though the saruplase group had 50% more deaths than the control group".
 For a drug company, to prove noninferiority of a new drug is less risky than to try to demonstrate their superiority. Failure to demonstrate superiority may stop the product from making its way into the market, and doesn't look as good on the promotional literature.
References
Snapinn, Steven M. "Noninferiority trials." Trials 1.1 (2000): 19.
Lesaffre, Emmanuel. "Superiority, equivalence, and non-inferiority trials." Bulletin of the NYU Hospital for Joint Diseases 66.2 (2008): 150-154.
Murray, Gordon D. "Switching between superiority and non-inferiority." British Journal of Clinical Pharmacology 52.3 (2001): 219.
Scott, Ian A. "Non-inferiority trials: determining whether alternative treatments are good enough." Med J Aust 190.6 (2009): 326-330.
Piaggio, Gilda, et al. "Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement." JAMA 308.24 (2012): 2594-2604.
Garattini, Silvio. "Non-inferiority trials are unethical because they disregard patients' interests." The Lancet 370.9602 (2007): 1875-1877.
Fleming, Thomas R. "Current issues in non-inferiority trials." Statistics in Medicine 27.3 (2008): 317-332.
Question 6 - 2018, Paper 2
Give the rationale for using the following techniques in a randomised controlled clinical trial:
a) Allocation concealment. (30% marks)
b) Block randomization. (30% marks)
c) Stratification. (30% marks)
d) Minimisation algorithm. (10% marks)
College answer
a) Allocation concealment
Procedure for protecting the randomization process and ensuring that the clinical investigators and those involved in the conduct of the trial are not aware of the group to which the subject has been allocated
b) Block randomisation
Simple randomisation may result in unequal treatment group sizes; block randomisation is a method that may protect against this problem and is particularly useful in small trials.
In the context of a trial evaluating drug A or drug B and with block sizes of 4, there are 6 possible blocks of randomisation: AABB, ABAB, ABBA, BAAB, BABA, BBAA.
One of the 6 possible blocks is selected randomly, and the next 4 study participants are assigned according to the order of the block. The process is then repeated as needed to achieve the necessary sample size.
c) Stratification
Stratification is a process that protects against imbalance in prognostic factors/confounders that are present at the time of randomisation.
A separate randomisation list is generated for each prognostic subgroup. Usually limited to 2-3 variables because of increasing complexity with more variables.
d) Minimisation algorithm
This is an alternative to stratification for maintaining balance in several prognostic variables.
The minimisation algorithm maintains a running total of the prognostic variables in patients that have already been randomised and then subsequent patients are assigned using a weighting system that minimizes imbalance in those prognostic variables.
Discussion
This question is virtually identical to Question 19 from the first paper of 2019, where the trainees were expected to explain these terms rather than offer a rationale for them. The college answer to both questions is identical, suggesting that the examiners do not see any distinction in their wording (or, that they are indifferent to the candidates' interpretation of the question). Either way, for whatever reason the first time around this SAQ did very poorly (only one candidate passed, and barely at that), whereas this time it seems 49.3% scored over 5.0, and some EBM genius scored 8.5.
Without further ado:
Allocation concealment:
 This is a technique of preventing selection bias.
 The selection of patients is randomised, and nobody knows what treatment the next enrolled patient will receive.
 A truly random sequence of allocations prevents the investigators from being able to predict the allocated treatment on the basis of previously allocated treatments.
 Allocation concealment prevents the investigators from predicting who is getting what treatment before the patient is enrolled, whereas blinding prevents the investigators from knowing who is getting what treatment after the patient is enrolled.
Block randomisation:
 The arrangement of experimental subjects in blocks, designed to keep the group numbers the same.
 Usually, the block size is a multiple of the number of treatments (i.e. if it is a binary Drug A vs Drug B trial, the blocks would be in multiples of two).
 Small blocks are better than large blocks.
The example offered by the college answer is the same example used by Bland and Altman in their classic 1999 article, "How to randomise". Their example, verbatim:

"...sometimes we want to keep the numbers in each group very close at all times. Block randomisation (also called restricted randomisation) is used for this purpose. For example, if we consider subjects in blocks of four at a time there are only six ways in which two get A and two get B: 1:AABB 2:ABAB 3:ABBA 4:BBAA 5:BABA 6:BAAB. We choose blocks at random to create the allocation sequence. Using the single digits of the previous random sequence and omitting numbers outside the range 1 to 6 we get 5623665611. From these we can construct the block allocation sequence BABA/BAAB/ABAB/ABBA/BAAB, and so on. The numbers in the two groups at any time can never differ by more than half the block length. Block size is normally a multiple of the number of treatments."
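The block-selection procedure described above can be sketched in a few lines of Python. This is a toy illustration only (the treatment labels, block size and seed are arbitrary), not a production randomisation system:

```python
import itertools
import random

def block_sequence(treatments=("A", "B"), block_size=4, n_blocks=3, seed=1):
    """Generate an allocation sequence from randomly chosen permuted blocks.
    Each block contains every treatment an equal number of times, so group
    sizes can never differ by more than half the block length."""
    per_block = block_size // len(treatments)
    base = list(treatments) * per_block  # e.g. ['A', 'B', 'A', 'B']
    # The six distinct orderings: AABB, ABAB, ABBA, BAAB, BABA, BBAA
    blocks = sorted(set(itertools.permutations(base)))
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        sequence.extend(rng.choice(blocks))  # pick one block at random
    return sequence

seq = block_sequence()
```

With a block size of 4 and two treatments there are exactly six candidate blocks, matching the example above, and after any completed block the two groups are perfectly balanced.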
Stratification:
 Stratification is the partitioning of subjects and results by a factor other than the treatment given.
Stratification ensures that pre-identified confounding factors are equally distributed, to achieve balance. The objective is to remove "nuisance variables", eg. the presence of neutropenic bone marrow transplant recipients in a trial performed on septic patients. One would want to ensure that the treatment group and the placebo group had equal numbers of these haematology disasters.
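The "separate randomisation list per subgroup" idea can be sketched as follows (the stratum names and list sizes are invented for illustration):

```python
import random

def stratified_lists(strata, n_per_stratum=8, seed=2):
    """Generate an independent, balanced randomisation list for each
    prognostic subgroup (stratum)."""
    rng = random.Random(seed)
    lists = {}
    for stratum in strata:
        allocations = ["A", "B"] * (n_per_stratum // 2)  # balanced list
        rng.shuffle(allocations)  # shuffled independently for each stratum
        lists[stratum] = allocations
    return lists

lists = stratified_lists(["neutropenic", "non-neutropenic"])
```

Each stratum gets its own balanced list, so the confounder cannot end up concentrated in one arm.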
Minimisation algorithm:
 Minimisation is a method of adaptive stratified sampling.
 The objective is to minimise the imbalance between groups of patients in a clinical trial by ensuring that the treatment group and placebo group each get an equal number of patients with some sort of predetermined characteristics which might act as confounding factors.
 The minimisation algorithm carefully places patients in groups according to the preidentified confounding factors. Only the first patient is randomly allocated.
Minimisation is regarded as methodologically equivalent to true randomisation, but does not correct for unknown confounders (only the known, pre-identified ones).
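A deterministic sketch of a minimisation algorithm is below. Real implementations allocate the first patient (and break ties) at random; the factors, patients and arm labels here are invented:

```python
def minimise(patients, factors):
    """Assign each patient to the arm that minimises imbalance in the
    pre-identified prognostic factors. A running total of factor levels
    is kept for each arm, as described above."""
    counts = {arm: {f: {} for f in factors} for arm in ("A", "B")}
    sizes = {"A": 0, "B": 0}
    assignments = []
    for patient in patients:
        def imbalance(arm):
            # How many patients sharing this patient's factor levels
            # are already in the arm
            return sum(counts[arm][f].get(patient[f], 0) for f in factors)
        arm = "A" if (imbalance("A"), sizes["A"]) <= (imbalance("B"), sizes["B"]) else "B"
        for f in factors:
            counts[arm][f][patient[f]] = counts[arm][f].get(patient[f], 0) + 1
        sizes[arm] += 1
        assignments.append(arm)
    return assignments

# Invented patients with two prognostic factors
patients = [{"age": "old", "sex": "m"}, {"age": "old", "sex": "f"},
            {"age": "young", "sex": "m"}, {"age": "old", "sex": "m"}]
arms = minimise(patients, ["age", "sex"])
```

Each new patient goes to whichever arm currently contains fewer patients with their particular combination of factor levels.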
References
Altman, Douglas G., and J. Martin Bland. "How to randomise." Bmj 319.7211 (1999): 703704.
Question 11 - 2019, Paper 1
a) What is a Standardised Mortality Ratio (SMR) and how is it calculated? (20% marks)
b) The SMR in your ICU has increased from 0.95 to 1.05 in the past 12 months. Outline the possible causes. (80% marks)
College answer
a) Overview of SMR (20% marks)
SMR is one of the quality indicators that reflect the performance of an ICU.
Definition of SMR = ratio of observed deaths in the study group to expected deaths in the general population based on APACHE or other severity of illness
SMR values of 1 indicate expected performance, whereas values below 1 and above 1 indicate respectively better and worse performances than expected
b) Causes for increase (80% marks)
Lower than expected predicted mortality
Errors in predicted/expected mortality due to gaps in data, changes in casemix etc
Change in data collection systems or personnel – e.g., change in the way the expected mortality is estimated
Lead-time bias (pre-ICU care) – patients transferred from other facilities may have become more stable after receiving appropriate management at the original hospital.
Increases in observed mortality
Based on hospital mortality, not ICU mortality – therefore, influenced by pre-ICU and post-ICU care in the hospital
Change in casemix, so changes in case mix may account for increase in SMR and increased other hospital admissions
Oneoff events such as mass disasters, epidemics etc
Variations in practice, changes in clinical protocols either in the hospital or in the ICU
Changes in personnel – e.g., new intensivist, new surgeon etc
Changes in staffing levels and training
New services introduced such as ECMO etc.
Examiner’s Comments:
The candidates rarely considered the denominator. Often wrote "admitted sicker patients" without considering these likely to also have higher predicted mortality. Rarely any structure.
Discussion
In brief:
 SMR is the ratio of the observed mortality vs. predicted mortality for a specified time period.
 The formula is SMR = observed number of deaths / expected number of deaths, where the expected number of deaths is predicted by an illness severity scoring system
 One can use this to compare hospitals and ICUs
 One needs to first calculate the predicted hospital mortality using an illness severity scoring system.
 An SMR of 1 means the mortality is as expected.
 An SMR of < 1 is better than expected, and >1 is worse than expected.
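The calculation itself is trivial; a minimal sketch with invented predicted risks (the expected number of deaths is the sum of the per-patient predicted mortality risks from a severity score such as APACHE):

```python
def smr(predicted_mortalities, deaths_observed):
    """SMR = observed deaths / expected deaths, where expected deaths are
    the sum of each patient's predicted risk of death."""
    expected = sum(predicted_mortalities)
    return deaths_observed / expected

# 5 patients with invented severity-score-derived risks of death
risks = [0.10, 0.25, 0.40, 0.60, 0.15]   # expected deaths = 1.5
ratio = smr(risks, deaths_observed=2)    # 2 / 1.5, i.e. worse than expected
```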
Causes for an elevation of the SMR were separated into two categories by the college; either the predicted mortality has dropped, or the actual mortality has increased. Another way of looking at this is whether the SMR elevation is "true", or whether it is spurious, i.e. where the change in SMR is not representative of a change in the quality of care being provided by the ICU.
 Spurious elevation of SMR
 Poor data entry (i.e. true illness severity is not captured by lazy registrars failing to dutifully record every last drop of urine in the APACHE form)
 "Lead time bias" - treatment received prior to ICU admission may result in artificially normalised acute physiology scores
 "Healthy worker effect" - a change towards selective ICU admission practices may be favouring patients who score low on illness severity scales, eg. young elective surgical patients
 True elevation due to internal ICU issues
 A new staffing model is in place (inexperienced staff)
 Understaffing has impaired patient care
 Junior people are not following unfamiliar protocols, or the new protocols are of a poor quality
 New equipment or technique is less useful than advertised
 True elevation due to external problems
 Increase in the prehospital morbidity of admitted patients (eg. increased acuity, where you suddenly become a trauma centre or an organ transplant service)
 Play of chance, eg. mass casualty event
 Deterioration of the quality of preICU care
 Parameters which govern ICU admission have changed (eg. administrative pressure is being placed on the ICU to rapidly admit ED patients who have had little management or workup)
 Discharge arrangements have changed (eg. a local palliative care ward had shut down, and you keep dying patients in the ICU because it would be insensitive to transfer them to the next nearby palliative care unit)
References
Young, Paul, et al. "End points for phase II trials in intensive care: recommendations from the Australian and New Zealand Clinical Trials Group consensus panel meeting." Critical Care and Resuscitation 15.3 (2013): 211 – this one is not available for free, but the 2012 version still is:
Young, Paul, et al. "End points for phase II trials in intensive care: recommendations from the Australian and New Zealand Clinical Trials Group consensus panel meeting." Critical Care and Resuscitation 14.3 (2012): 211.
Suter, P., et al. "Predicting outcome in ICU patients." Intensive Care Medicine 20.5 (1994): 390-397.
Martinez, Elizabeth A., et al. "Identifying Meaningful Outcome Measures for the Intensive Care Unit." American Journal of Medical Quality (2013): 1062860613491823.
Tipping, Claire J., et al. "A systematic review of measurements of physical function in critically ill adults." Critical Care and Resuscitation 14.4 (2012): 302.
Gunning, Kevin, and Kathy Rowan. "Outcome data and scoring systems." BMJ 319.7204 (1999): 241-244.
Woodman, Richard, et al. Measuring and reporting mortality in hospital patients. Australian Institute of Health and Welfare, 2009.
Vincent, J.L. "Is Mortality the Only Outcome Measure in ICU Patients?" Anaesthesia, Pain, Intensive Care and Emergency Medicine - APICE. Springer Milan, 1999. 113-117.
Rosenberg, Andrew L., et al. "Accepting critically ill transfer patients: adverse effect on a referral center's outcome and benchmark measures." Annals of Internal Medicine 138.11 (2003): 882-890.
Burack, Joshua H., et al. "Public reporting of surgical mortality: a survey of New York State cardiothoracic surgeons." The Annals of Thoracic Surgery 68.4 (1999): 1195-1200.
Hayes, J. A., et al. "Outcome measures for adult critical care: a systematic review." Health Technology Assessment (Winchester, England) 4.24 (1999): 1-111.
Rubenfeld, Gordon D., et al. "Outcomes research in critical care: results of the American Thoracic Society critical care assembly workshop on outcomes research." American Journal of Respiratory and Critical Care Medicine 160.1 (1999): 358-367.
Turnbull, Alison E., et al. "Outcome Measurement in ICU Survivorship Research From 1970 to 2013: A Scoping Review of 425 Publications." Critical Care Medicine (2016).
Solomon, Patricia J., Jessica Kasza, and John L. Moran. "Identifying unusual performance in Australian and New Zealand intensive care units from 2000 to 2010." BMC Medical Research Methodology 14.1 (2014): 1.
Liddell, F. D. "Simple exact analysis of the standardised mortality ratio." Journal of Epidemiology and Community Health 38.1 (1984): 85-88.
Ben-Tovim, David, et al. "Measuring and reporting mortality in hospital patients." Canberra: Australian Institute of Health and Welfare (2009).
McMichael, Anthony J. "Standardized Mortality Ratios and the 'Healthy Worker Effect': Scratching Beneath the Surface." Journal of Occupational and Environmental Medicine 18.3 (1976): 165-168.
Wolfe, Robert A. "The standardized mortality ratio revisited: improvements, innovations, and limitations." American Journal of Kidney Diseases 24.2 (1994): 290-297.
Kramer, Andrew A., Thomas L. Higgins, and Jack E. Zimmerman. "Comparing observed and predicted mortality among ICUs using different prognostic systems: why do performance assessments differ?" Critical Care Medicine 43.2 (2015): 261-269.
Spiegelhalter, David J. "Funnel plots for comparing institutional performance." Statistics in Medicine 24.8 (2005): 1185-1202.
Teres, Daniel. "The value and limits of severity adjusted mortality for ICU patients." Journal of Critical Care 19.4 (2004): 257-263.
Question 22 - 2019, Paper 1
Outline the features and list the advantages and disadvantages of each of the following clinical trial designs:
a) Cluster randomised trial. (50% marks)
b) Noninferiority trial. (50% marks)
College answer
Cluster randomised trial (10%)
Unit of randomisation is the cluster (e.g. one hospital or ICU) rather than individual patients. Individual clusters may be matched / paired with similar clusters to increase power
Power increased more by increasing number of clusters rather than increased numbers of patients within clusters
Advantages (20%)
Ability to test interventions directed at systems rather than individuals (e.g. MET, SDD, education campaigns)
Where individual patients are not consented, this may lead to recruitment of ‘all’ patients with the entry criteria – increased recruitment and external validity
Disadvantages (20%)
Larger numbers of patients required when compared to conventional individual patient RCT i.e. reduced statistical efficiency
Complex statistics: power calculations require knowledge or an estimate of the intercluster correlation coefficient
Chance of getting imbalance is greater depending on the characteristics of the cluster
b)
Noninferiority trial (10%)
The null hypothesis in a noninferiority study states that the primary end point for the experimental treatment is worse than that for the positive control treatment by a specified margin. Rejection of the null hypothesis supports a claim of noninferiority to the control treatment
Advantages: (20%)
Allows investigation of a new therapy to be compared to an existing accepted therapy
Does not require a placebo group, where this may be unethical
Allows cheaper or less toxic therapies to be introduced in place of older therapies
Disadvantages: (20%)
Does not prove efficacy of tested therapy
Relies upon known / accepted benefit of control
Needs to be performed under similar conditions in which the active control has demonstrated benefit
No clear consensus on what margin of noninferiority should be accepted
Repeated noninferiority trials may lead to acceptance of inferior therapies – ‘biocreep’
Examiners Comments:
Significant knowledge gap. Disappointing, since several important trials have followed these designs.
Discussion
The disappointment felt at the 4.5% pass rate for this question underscores the need to promote formal training in statistics and literature analysis. Other colleges have already moved to such a strategy, where their trainees may dispense with the increasingly pointless formal project (a mandatory requirement to generate meaningless papers) by satisfying their research requirements through a university unit of study in the interpretation of evidence-based medicine.
In summary:
Features of a clusterrandomised trial:
 Groups of patients rather than individuals are randomised
 A group may be as large as a hospital or an ICU
 This is done because sometimes, it would be totally impractical to randomise an intervention to each individual patient; for example where the intervention is a large scale organisational change
 The number of patients in each cluster does not matter as much as the total number of clusters, and power design involves deciding how many clusters one requires (patients within a cluster are more likely to have similar outcomes).
The outcome for each patient can no longer be assumed to be independent of that for any other patient.
Advantages of a clusterrandomised trial:
 Able to test interventions applied to whole services or communities
 Increased logistical convenience (less difficulty than individual randomisation)
 Greater acceptability by participants (when something viewed as a worthwhile intervention is delivered to a large group rather than to individuals)
Both the direct and indirect effects of an intervention can be captured in a population, i.e. the study is more pragmatic (a good example is a study of infectious disease: not only do the randomised participants benefit from a decontaminating treatment, but so does the population exposed to them)
 This increases the external validity
Disadvantages of a clusterrandomised trial:
 The statistical power of a cluster randomised trial is greatly reduced in comparison with a similar sized individually randomised trial (Campbell & Grimshaw, 1998)
 The number of patients required may be twice or thrice that of a comparable individually randomised trial
To calculate the power of such a trial requires a specialised approach which takes the intracluster correlation coefficient into account; a standard power calculation will produce an underpowered trial once the analysis properly accounts for clustering.
Analysis needs to take the cluster design into account: "If the clustering effect is ignored p values will be artificially extreme, and confidence intervals will be over-narrow, increasing the chances of spuriously significant findings and misleading conclusions". Apparently, this adjustment does not routinely happen.
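The sample size penalty can be sketched with the usual design effect formula, DE = 1 + (m - 1) × ICC, where m is the average cluster size. All numbers below are invented for illustration:

```python
def design_effect(cluster_size, icc):
    """Variance inflation caused by cluster randomisation:
    DE = 1 + (m - 1) * ICC, where m is the average cluster size and
    ICC is the intracluster correlation coefficient."""
    return 1 + (cluster_size - 1) * icc

def clustered_sample_size(n_individual, cluster_size, icc):
    """Sample size from a standard (individually randomised) power
    calculation, inflated for clustering."""
    return n_individual * design_effect(cluster_size, icc)

# Even a small ICC nearly doubles the required sample size with large clusters:
# 400 * (1 + 49 * 0.02) = 400 * 1.98 = 792
n = clustered_sample_size(n_individual=400, cluster_size=50, icc=0.02)
```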
Features of a noninferiority trial
 Noninferiority trials aim to demonstrate that an experimental treatment is not worse than an active control by more than the equivalence margin.
 In superiority trials, the hypothesis is that the experimental treatment is different from (better than) the standard treatment, and two-sided statistical tests are used to test the null hypothesis that there really is no difference (because the experimental treatment could turn out to be better or worse). In equivalence trials, the null hypothesis is that the treatments differ by more than a specified margin (the "equivalence margin"). In noninferiority trials, the null hypothesis is that the experimental treatment is worse than the standard treatment by more than the margin, which determines how much worse is acceptable.
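In practice the decision usually comes down to comparing a one-sided confidence bound for the treatment difference against the pre-specified margin; a minimal sketch, with invented margin and confidence intervals:

```python
def noninferior(ci_lower, margin):
    """Declare noninferiority if the lower CI bound for the treatment
    difference (new minus standard, positive = new better) lies above
    -margin, i.e. the data exclude the new treatment being worse by
    more than the margin."""
    return ci_lower > -margin

# Margin of 5 percentage points (0.05); hypothetical 95% CIs for the difference:
result_a = noninferior(-0.03, margin=0.05)  # CI (-0.03, 0.04): noninferior
result_b = noninferior(-0.08, margin=0.05)  # CI (-0.08, 0.01): not demonstrated
```

Note how everything hinges on the choice of margin: widen it, and result_b becomes "noninferior" too, which is the manipulation discussed below.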
Advantages of a noninferiority trial:
A noninferiority trial is appropriate when:
 A placebo treatment is unethical
 The standard treatment is exceptionally effective
 The experimental treatment is thought to be equivalent, or at least not worse, but not superior to the current treatment (i.e. everybody is convinced that a superiority trial would show no difference)
 The experimental treatment is expected to be similar to the standard treatment in terms of the primary outcome, but has other unrelated advantages (eg. it is cheaper, less invasive or more convenient), in which case it would be helpful to demonstrate that its efficacy is not worse.
Disadvantages of noninferiority trials
 The standard of care you test against may be more harmful than placebo.
 Because you are not testing against placebo, a situation may arise where both treatments are similarly harmful, and you have merely demonstrated that your experimental treatment is not any more harmful than the current harmful standard of care.
 Because you are not testing against placebo the effect size difference is smaller, and in order to achieve satisfactory power the sample size needs to be larger (and your trial becomes more expensive).
 If the effect of the standard treatment is very close to the effect of a placebo, then the effect of the supposedly noninferior experimental treatment may end up being very close to the placebo.
If you test one treatment and prove that it is not much worse, then test another against it and prove that it too is not much worse, after multiple noninferiority trials you may eventually demonstrate that one terrible, useless treatment is not much worse than another terrible, useless treatment. This is described as "biocreep": the acceptance of progressively worse treatments.
 Equipoise is ethically necessary to run these trials, but there may be no equipoise with regards to noninferiority (i.e. some may genuinely believe that the standard treatment is substantially superior to the experimental treatment). Considering that the null hypothesis is that the experimental treatment is much worse, some ethicists may argue that true equipoise is impossible. You basically end up consenting your enrolled patients to agree that they may be randomised to a treatment which is believed to be inferior, or which at best might turn out to be no better.
A poorly conducted superiority trial (i.e. with many protocol violations and dropouts) will have a result which trends towards noninferiority, because through intention-to-treat analysis the effect size of the experimental treatment will be diluted.
The investigators are in control of the equivalence margin, which means they could have decided on an inappropriately wide margin. If the margin is established after the results become available, the experimental treatment could be made to appear not much worse by manipulating how much worse you would accept as a threshold. Even prespecified margins might be completely arbitrary and inappropriate. There is some pressure to select an inappropriately wide limit: the wider the limit, the smaller the sample size you will require, and the cheaper your trial. This may lead to truly ridiculous conclusions. For instance, Silvio Garattini (2007) describes the COMPASS trial, where "the thrombolytic saruplase was judged equivalent to streptokinase for post-myocardial infarction, even though the saruplase group had 50% more deaths than the control group".
For a drug company, proving noninferiority of a new drug is less risky than attempting to demonstrate its superiority. A failed superiority trial may stop the product from making its way to market, and does not look good in the promotional literature.
References
Snapinn, Steven M. "Noninferiority trials." Trials 1.1 (2000): 19.
Lesaffre, Emmanuel. "Superiority, equivalence, and noninferiority trials." Bulletin of the NYU Hospital for Joint Diseases 66.2 (2008): 150-154.
Murray, Gordon D. "Switching between superiority and non-inferiority." British Journal of Clinical Pharmacology 52.3 (2001): 219-219.
Scott, Ian A. "Noninferiority trials: determining whether alternative treatments are good enough." Med J Aust 190.6 (2009): 326-330.
Piaggio, Gilda, et al. "Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement." JAMA 308.24 (2012): 2594-2604.
Garattini, Silvio. "Non-inferiority trials are unethical because they disregard patients' interests." The Lancet 370.9602 (2007): 1875-1877.
Fleming, Thomas R. "Current issues in non-inferiority trials." Statistics in Medicine 27.3 (2008): 317-332.
Campbell, Marion K., and Jeremy M. Grimshaw. "Cluster randomised trials: time for improvement: the implications of adopting a cluster design are still largely being ignored." (1998): 1171-1172.
Question 24 - 2019, Paper 2
In the context of clinical trials what is meant by the following terms:
a) Stratification. (20% marks)
b) Intention to treat analysis. (20% marks)
c) Sensitivity analysis. (20% marks)
d) Kaplan-Meir curve. (20% marks)
e) Analysis of competing risk. (20% marks)
College answer
a) Stratification of clinical trials is the partitioning of subjects and results by a factor other than the treatment given
b) Intention to treat analysis is the analysis of all participants allocated to a treatment group irrespective of whether they completed the treatment, withdrew, or deviated from protocol.
c) A sensitivity analysis is the analysis of data from the trial with a change or alteration to one or more underlying assumptions used in the original analysis.
d) A Kaplan-Meir curve is a plot of probability of survival against time.
e) Analysis of competing risk is used when there are multiple endpoints of which the occurrence of one prevents the occurrence of another (e.g. death prevents the occurrence of shock reversal)
Discussion
Stratification
 Stratification is the partitioning of subjects and results by a factor other than the treatment given.
Stratification ensures that pre-identified confounding factors are equally distributed, to achieve balance. The objective is to remove "nuisance variables", eg. the presence of neutropenic patients in a trial performed on septic patients. One would want to ensure that the treatment group and the placebo group had equal numbers of these haematology disasters.
 According to Question 19 from the first paper of 2016, the official Delaney definition of stratification is as follows:
"Stratification is a process that protects against imbalance in prognostic factors that are present at the time of randomisation.
A separate randomisation list is generated for each prognostic subgroup. Usually limited to 2-3 variables because of increasing complexity with more variables"
Intention to treat analysis
 "Once randomised, always analysed"
 All enrolled patients have to be a part of the final analysis
 This preserves the biasprotective effect of randomisation
 Minimises Type 1 errors (false positives)
 When intentiontotreat analysis agrees with perprotocol analysis, it increases the validity of the study
Sensitivity analysis
 Analysis of the data from a clinical trial where some of the assumptions are intentionally changed
 One example of this is to assume that all the patients lost to followup or who dropped out of the study have failed treatment.
"Kaplan-Meir" curve (it's usually spelled "Meier", after Paul Meier):
A Kaplan-Meier curve estimates the probability of surviving a given length of time, considering time in many small intervals
 The curve itself is a plot of the fraction of patients surviving in each group over time
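The product-limit calculation behind the curve can be sketched directly; the follow-up times and censoring flags below are invented:

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival. `times` are follow-up times and
    `events` flags (1 = death observed, 0 = censored at that time).
    Returns (time, survival probability) after each distinct event time."""
    curve = []
    survival = 1.0
    for t in sorted(set(t for t, e in zip(times, events) if e)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        at_risk = sum(1 for ti in times if ti >= t)  # still under observation
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Invented data: 5 patients; one censored at day 4 and one at day 8
curve = kaplan_meier([2, 4, 4, 8, 10], [1, 1, 0, 0, 1])
```

At each event time the running survival probability is multiplied by the fraction of at-risk patients who survived that interval, which is how censored patients contribute information right up until they disappear from view.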
Analysis of competing risk:
 A competing risk is an event that either hinders the observation of the event of interest or modifies the chance that this event occurs
An example is death while on dialysis and getting a kidney transplant (the two events interfere with one another)
Conventional methods (eg. Kaplan-Meier and standard Cox regression) ignore the competing events and may not be appropriate; competing risk analysis methods must be employed instead
References
Morris, Tim P., Brennan C. Kahan, and Ian R. White. "Choosing sensitivity analyses for randomised trials: principles." BMC medical research methodology 14.1 (2014): 11.
Rich, Jason T., et al. "A practical guide to understanding KaplanMeier curves." Otolaryngology—Head and Neck Surgery 143.3 (2010): 331336.
Noordzij, Marlies, et al. "When do we need competing risks methods for survival analysis in nephrology?." Nephrology Dialysis Transplantation 28.11 (2013): 26702677.
Question 26.1 - 2020, Paper 1
A randomised controlled trial examining a treatment for septic shock reports the following results:
"At 90 days after randomization, 27.9% patients who had been assigned to receive the treatment had died, as had 28.8% who had been assigned to receive placebo (odds ratio 0.95; 95% confidence interval [CI], 0.82 to 1.10; P value = 0.50)."
a) Explain the meaning of the underlined terms. Interpret the result of the trial.
(40% marks)
College answer
Odds ratio: The odds of a patient in the treatment group dying within 90 days divided by the odds of patients in the placebo group dying within 90 days.
95% confidence interval: The range of values which is 95% certain to contain the population parameter of interest (in this case, Odds Ratio)
P Value: The probability of obtaining the observed, or more extreme results, assuming the null hypothesis is true. (3 marks)
Discussion
In case it matters to anybody, in this SAQ the examiners are using the findings of the ADRENAL trial (Venkatesh et al, 2018).
Odds ratio:
 The Odds Ratio represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
 An OR =1 suggests there is no association.
 If the CI for an OR includes 1, then the OR is not significant (i.e. there might not be an association)
Confidence interval:
 The range of values within which the "actual" result is found.
A CI of 95% means that if the trial were repeated an infinite number of times, 95% of the intervals so calculated would contain the true population value.
 The CI gives an indication of the precision of the sample mean as an estimate of the "true" population mean
 A wide CI can be caused by small samples or by a large variance within a sample.
pvalue:
 The probability of the observed result arising by chance
 The pvalue is the chance of getting the reported study result (or one even more extreme) when the null hypothesis is actually true.
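All three quantities can be illustrated from a 2x2 table. The counts below are invented to roughly match the quoted proportions (they are not the actual trial data), and the Wald method shown is only one of several ways to construct the CI:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = deaths in treatment,  b = survivors in treatment,
    c = deaths in placebo,    d = survivors in placebo."""
    oratio = (a / b) / (c / d)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(oratio) - z * se_log)
    hi = math.exp(math.log(oratio) + z * se_log)
    return oratio, lo, hi

# Invented counts approximating 27.9% vs 28.8% mortality in 1000 per arm
oratio, lo, hi = odds_ratio_ci(279, 721, 288, 712)
significant = not (lo <= 1 <= hi)  # a CI containing 1 means no association shown
```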
Interpretation of results:
The CI includes 1.0 and the p-value is 0.50, so no treatment effect was detected here. Even if the result had been statistically significant, the OR is so close to 1 that any association would have been marginal.
References
Venkatesh, Balasubramanian, et al. "Adjunctive glucocorticoid therapy in patients with septic shock." New England Journal of Medicine 378.9 (2018): 797-808.
Question 26.2 - 2020, Paper 1
A randomised controlled trial examining a treatment for lung injury reports the following results:
"The primary outcome was change in SOFA score over 96 hours. The mean SOFA score from baseline to 96 hours decreased from 9.8 to 6.8 in the treatment group (3 points) and from 10.3 to 6.8 in the placebo group (3.5 points) (difference, -0.10; 95% CI, -1.23 to 1.03; P = 0.86).
There were 30 prespecified secondary outcomes. Twenty-nine were not significantly different between the treatment and the placebo group. In exploratory analyses that did not adjust for multiple comparisons, day 28 mortality was 46.3% in the placebo group vs 29.8% in the treatment group (P = 0.03; between-group difference, 16.58% [95% CI, 2% to 31.1%])."
a) Interpret these results. (30% marks)
College answer
a)
The primary outcome does not demonstrate a significant difference between the two groups, and so the overall result of the trial is negative. A secondary outcome of 28-day mortality does show a significant difference in favour of the treatment – however, as this is one of 30 secondary outcomes, with no adjustment for multiplicity of testing, this is likely a false positive result and should be interpreted cautiously. (3 marks)
Discussion
The findings borrowed for this SAQ come from the CITRIS-ALI trial (Truwit et al, 2019), in case anybody cares.
The primary outcome is not statistically significant because of the high p-value (0.86 is pretty terrible) and because the confidence interval for the difference crosses zero.
As for the secondary outcome: if you have thirty (and ultimately CITRIS-ALI had forty-six), some of them are bound to produce some sort of publishable information. The day 28 mortality difference was statistically significant (p = 0.03), but because this is one of many unadjusted secondary outcomes, it should be viewed as hypothesis-generating. On an unrelated note, mortality of 46% in sepsis or ARDS is so 1990s.
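The multiplicity problem is easy to demonstrate. One simple (and conservative) adjustment is the Bonferroni correction, under which a p of 0.03 is nowhere near the threshold required for 30 comparisons:

```python
def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Bonferroni correction: each of n_tests comparisons must meet
    alpha / n_tests to keep the family-wise error rate at alpha."""
    return p_value < alpha / n_tests

# Day 28 mortality: p = 0.03, one of 30 secondary outcomes.
# The corrected threshold is 0.05 / 30, roughly 0.0017.
survives_correction = bonferroni_significant(0.03, 30)
```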
References
Truwit, Jonathon D., et al. "Effect of vitamin C infusion on organ failure and biomarkers of inflammation and vascular injury in patients with sepsis and severe acute respiratory failure: the CITRIS-ALI randomized clinical trial." JAMA 322.13 (2019): 1261-1270.
Question 26.3 - 2020, Paper 1
A prospective observational study examining the association between fluid therapy and outcome reports the following results:
"Crude 90-day mortality of patients who received colloids was higher than in patients treated exclusively with crystalloids (25.5% vs. 15.4%, odds ratio (OR) 1.84, 95% confidence interval (CI) 1.56 to 2.18). After multiple logistic regression analysis, the adjusted OR was 0.923, 95% CI (0.87 to 1.19), p = 0.09."
a) Interpret these results. (30% marks)
College answer
a)
There was a significantly higher mortality in patients who received colloids compared to those who received crystalloids. However, when other factors likely to influence mortality were taken into account by multiple logistic regression analysis, the difference was no longer statistically significant. The interpretation is that fluid choice is not significantly associated with 90-day mortality. (3 marks)
Discussion
The data here comes from Ertmer et al, 2018.
The crude odds ratio here appears statistically significant, as the CI is well away from 1.0, and the effect size is also substantial. There is no p-value reported, which is unhelpful. The adjusted OR is very different, in fact in the opposite direction from the crude OR, which raises major concerns. The "multiple logistic regression analysis" would have to be more carefully scrutinised to determine which variables they threw into the soup. Usually, the investigators just choose whichever variables had a p-value below 0.05 in the first univariate analysis. The more intelligent method would be to test the independent variables in pairs and in groups to understand the meaning behind their interaction, and then pick only the meaningful variables for the multivariate analysis. In short, there was no difference in mortality, according to the presented fragments of data.
References
Ertmer, Christian, et al. "Fluid therapy and outcome: a prospective observational study in 65 German intensive care units between 2010 and 2011." Annals of intensive care 8.1 (2018): 27.