Review Evidence-based Medicine

Aims to apply evidence from the highest-quality research studies to the practice of medicine

□

Also known as evidence-based practice

□

Findings from the best-designed and most rigorous studies have the greatest influence on clinical decision making.

□

The levels of evidence in medical research (see Fig. 13.1): a hierarchy for various research applications and questions based on several factors affecting the quality of a research design

□

Diagnostic, prognostic, or therapeutic research designs with higher levels of evidence have a greater influence on clinical recommendations. Many factors may affect the quality of a research design (see later discussion of flaws in research designs); the levels of evidence of various study designs are as follows:

•

Level I: high-quality clinical trials (randomized, controlled, blinded, etc.)

•

Level II: cohort studies or lesser-quality clinical trials

•

Level III: case-control studies

•

Level IV: case series studies

•

Level V: expert opinions

Evidence based medicine table

Review Clinical Research Designs

Clinical study design is an essential element of research that the study team must determine in advance of initiating the study.

▪

Prospective studies are designed to start in the present and collect data forward in time. For example, an exposure or potential risk factor has occurred and patients are followed forward in time to determine the occurrence of an outcome of interest.

▪

Retrospective studies are designed to assess outcomes that have already occurred or data that have been collected in the past. Chart review of medical records is a typical application of retrospective research designs in orthopaedic research.

▪

Longitudinal studies involve repeated assessments over a long period. A longitudinal study can also be performed on historical (retrospective) data.

▪

Observational research designs can be prospective, retrospective, or longitudinal.

Types of observational studies

Case reports:

•

Descriptions of unique injuries, disease occurrences, or outcomes in a single patient

•

No attempts at advanced data analysis are made.

•

Cause-and-effect relationships and generalizability are not determined.

□

Case series:

•

Outcomes are assessed retrospectively in a group of patients with a similar disease/injury.

•

No attempts are made to estimate frequencies or distributions.

□

Case-control studies:

•

Outcomes measured in patients with similar disease/injury are compared with a control group (see later discussion of flaws in research designs for more information about control groups).

•

Odds ratios (not relative risks) are appropriate measures of association from data collected in these study designs (see later Concepts in Epidemiologic Research Studies).

□

Cohort study:

•

Groups of patients with a similar characteristic or exposure/risk factor are studied forward in time (prospective) or from existing data (retrospective).

•

Cohort studies are appropriate for estimating incidence of disease/injury and relative risks.

□

Cross-sectional study:

•

A specific patient population is studied at a given point in time.

•

All measurements are made at once with no follow-up period.

•

Considered a “snapshot” that is useful for describing the prevalence of a particular injury/disease of interest at a particular point in time.

Review experimental study designs

A clinical trial is designed to allocate treatments and track outcomes prospectively to test a specific hypothesis. Clinical trials are costly: they require a great deal of time, money, and resources, as well as a comprehensive research team (often at multiple patient enrollment sites), to accomplish their aims.

•

The gold standard, and the type of clinical trial that produces the highest level of evidence, is the randomized controlled trial (RCT).

□

Clinical trials with parallel design: treatments are allocated to different subjects/patients in random or nonrandom manner.

•

Example: each patient is randomly assigned to receive only one of the study interventions. This allocation is typically randomized and blinded (see later discussion of blinding and randomization).

□

Clinical trials with crossover designs: each subject receives two or more interventions in a predetermined or random order.

•

Patients are followed prospectively for a period while receiving treatment A, then start receiving treatment B and are followed for an additional period. One of the “treatment conditions” can be a control condition.

□

Clinical studies can be designed to determine superiority of one treatment over another or to determine whether one treatment is no worse than another (noninferiority) or just as effective (equivalency).

What are common flaws in research design?

**Confounding variables are factors extraneous to a research design that potentially influence the outcome.** Conclusions regarding cause-and-effect relationships may be explained by confounding variables instead of by the treatment/intervention being studied and must therefore be controlled for in the research design (via matching, randomization, etc.) or accounted for in statistical analyses (see later discussion of ANCOVA).

▪

**Bias is unintentional systematic error that will threaten the internal validity of a study. Kinds of bias include selection (sampling) bias, nonresponder (loss to follow-up) bias, observer/interviewer bias, and recall bias.**

▪

Protection against these threats can be achieved through randomization (i.e., random allocation of treatment) to ensure that bias and confounding factors are equally distributed among the study groups. Single blinding (examiner or patient) or double blinding (examiner and patient) is important for minimizing bias.

▪

Control groups can help account for potential placebo effect of interventions.

□

Control groups may receive a standard-of-care intervention, no intervention, a placebo (i.e., inactive substance), or sham intervention.

□

Control data may have been collected in the past (historical controls) or may occur in sequence with other study interventions (crossover design).

□

Control subjects are often matched on the basis of specific characteristics (e.g., gender, age), a process that helps account for potential confounding sources that may influence the impact of research findings.

▪

**The strongest clinical trial design uses randomly allocated, blinded and concurrent, matched controls.**

**□**

Descriptive and controlled laboratory studies are common in basic science research and may include similar designs and statistical methods to protect against sources of bias and confounding.

□

Design flaws may challenge the internal or external validity of a research study. Internal validity describes the quality of a research design and how well the study is controlled and can be reproduced. External validity is the ability of a study’s results to be generalized or applied to a whole population of interest.

▪

**Study populations in clinical research studies are delimited by inclusion and exclusion criteria. During a screening process, clinical researchers carefully review all inclusion and exclusion criteria to determine eligibility for participation in a clinical research study or clinical trial.**

□

Inclusion and exclusion criteria are written to target a specific patient population for a clinical research study. The narrower a patient population becomes, the less confounded or biased, but also the less generalizable, study findings will be.

□

Inclusion criteria are specific characteristics that are identified to best describe a target population. Sex, age, race, primary diagnosis, and procedure are all examples of inclusion criteria. In clinical research, to be included in a study, the response to all inclusion criteria must be affirmative (i.e., “yes”).

□

Exclusion criteria are specific characteristics that, when present, would disqualify a potential participant from the study. For the participant to be included in a clinical research study, all exclusion criteria must be negative or ruled out.

How many subjects are needed to complete a research study?

**Research studies should have enough subjects/samples to get valid results that can be generalized to a population while minimizing unnecessary work or risk to subjects.**

**▪**

**Sample size estimates are based on the desired statistical power (often termed power analyses).**

**□**

**Statistical power is the probability of finding differences among groups when differences actually exist (i.e., avoiding type II error).**

**□**

**We want to be able to find these differences with our statistical tests 80% of the time or more.**

**▪**

**Sample sizes are justified as the number of subjects needed to find a statistically significant difference or association (i.e., P <0.05) while maintaining statistical power greater than 80%.**

**▪**

**Larger sample sizes and/or highly precise measurements (lower variability) are necessary to find small differences between study groups.**

**▪**

**Power analyses can be done before the study starts (a priori) or after the study has been completed (post hoc).**

**▪**

**Studies with low power have higher likelihood of missing statistical differences (or relationships) when they actually exist (i.e., type II error).**

**▪**

**Sample sizes are calculated to determine the number of subjects needed to study a specific outcome measure. It is important to identify a primary outcome measure in order to determine sample size for a research study.**

**▪**

**Studies that have multiple outcome measures may need multiple sample size estimates to ensure all outcomes are appropriately “powered.”**
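
The relationship among sample size, variability, and the size of the difference to be detected can be sketched with a short calculation. The function below is only an illustration under stated assumptions: two equal-sized groups, a continuous primary outcome, a two-sided test, and a normal approximation (dedicated power software based on the t distribution gives slightly larger numbers). The function and parameter names are hypothetical and are not from the source.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a mean difference.

    delta: smallest between-group difference worth detecting (assumed value)
    sd:    expected standard deviation of the primary outcome (assumed value)
    Uses the normal approximation n = 2 * ((z_alpha + z_power) * sd / delta)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return ceil(n)


# Halving the detectable difference roughly quadruples the required sample.
n_large_effect = sample_size_per_group(delta=5.0, sd=10.0)
n_small_effect = sample_size_per_group(delta=2.5, sd=10.0)
```

With these illustrative inputs, the larger difference needs 63 subjects per group and the smaller difference needs 252, which matches the point that small differences require larger samples or less variable measurements.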

What outcomes should be included in your research study?

Selecting the most appropriate outcome for a study is an important decision made in advance by the research team.

□

Primary outcome measures match the primary purpose of the study.

□

Secondary and tertiary outcomes may be included as additional (sometimes exploratory) measures that are important to achieve the goals of the study.

•

Typically, sample size estimates for a study are based on the primary outcome measure.

▪

Subjective data are opinions, judgments, or feelings (e.g., in clinical research, patient-reported outcomes are subjective). Objective data are measured by a valid or reliable instrument (see later discussion of accuracy and reliability).

Parametric vs Non-parametric

Parametric statistics are appropriate for continuous data and rely on the assumption that data are normally distributed.

•

Nonparametric statistics are appropriate for categorical and non–normally distributed data.

What is the confidence interval?

The confidence interval (CI) quantifies the precision of the mean or other statistic, such as an odds ratio (OR) or relative risk (RR).

•

Datasets that are highly variable (large SDs) have wider CIs and hence yield less precise estimates of the characteristics of a population.

•

A 95% CI consists of a range of values within which we are 95% certain that the actual population parameter [mean/OR/RR] lies.

•

Example: mean = 40.5 [95% CI, 35.5–45.5] indicates that we are 95% confident that the population mean lies somewhere between 35.5 and 45.5.
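
A minimal sketch of how a CI for a mean can be computed, and how variability widens it. This assumes a normal approximation (z of about 1.96 for 95%); with small samples a t-based interval would be somewhat wider. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, level=0.95):
    """Return (lower, upper) bounds of a normal-approximation CI for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for 95%
    standard_error = stdev(values) / sqrt(len(values))
    center = mean(values)
    return center - z * standard_error, center + z * standard_error


# Two made-up samples with the same mean (40.5) but different variability.
low_variability = [39, 40, 41, 42]
high_variability = [30, 40, 41, 51]

narrow_ci = mean_confidence_interval(low_variability)
wide_ci = mean_confidence_interval(high_variability)
```

Both intervals are centered on 40.5, but the more variable sample produces a much wider interval, that is, a less precise estimate of the population mean.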

What is the difference between incidence and prevalence?

Prevalence is the proportion of existing injuries/disease cases/conditions within a particular population.

□

Incidence (absolute risk) is the proportion of new injuries/disease cases within a specified time interval (requires a follow-up period).

•

Can be reported with respect to the number of exposures.

•

Example: if 12 of 100 athletes on a sports team experience a sports injury over a 10-game season, there are 100 × 10 = 1000 athlete exposures, so the incidence rate would be 12 injuries per 1000 athlete exposures.
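
The arithmetic in the example above can be written out as a small helper. This is a sketch that assumes every athlete takes part in every game, so that athlete exposures equal athletes multiplied by games; the function name is hypothetical.

```python
def incidence_rate_per_1000(new_cases, athletes, exposures_per_athlete):
    """Incidence rate expressed per 1000 athlete exposures.

    Assumes each athlete participates in every exposure (game or practice).
    """
    athlete_exposures = athletes * exposures_per_athlete
    return 1000 * new_cases / athlete_exposures


# 12 injuries among 100 athletes over a 10-game season.
rate = incidence_rate_per_1000(new_cases=12, athletes=100, exposures_per_athlete=10)
```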

What is Relative Risk?

RR is a ratio between the incidences of an outcome in two cohorts. Typically a treated/exposed cohort (in the numerator of the ratio) is compared with an untreated (control) group/unexposed group (in the denominator of the ratio). Values can range from 0 to infinity and are interpreted as follows:

•

RR = 1.0: indicates the incidences of an outcome are equal in the two groups.

•

RR >1.0: indicates the incidence of an outcome is greater in the treated/exposed group (higher incidence value in the numerator).

•

RR <1.0: indicates the incidence of an outcome is greater in the untreated/unexposed group (higher incidence value in the denominator).

What is Odds Ratio?

OR is calculated as a ratio between the odds of an outcome in two groups, where the odds are the probability that the outcome occurs divided by the probability that it does not.

•

ORs are well suited for binary data or studies in which only prevalence can be calculated.

Know this!

How to interpret RR vs OR?

Interpreting RR and OR

•

OR and RR values are interpreted similarly.

•

In the comparison of outcomes between two groups, an RR or OR value of 0.5 would indicate that treated/exposed patients have half the likelihood of experiencing a particular outcome compared with the untreated/control group.

•

A value of 2.5 would indicate that a treated/exposed group would have a 2.5 times greater likelihood of experiencing the outcome than the untreated/control group.

•

An RR or OR whose CI crosses 1 is not considered to be “significant.”
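
The following sketch computes RR, OR, and a CI for the OR from a 2 × 2 table. The counts are made up for illustration, the function names are hypothetical, and the CI uses a common log-based (Woolf) approximation that assumes no cell count is zero.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# 2 x 2 table layout (hypothetical counts):
#                 outcome   no outcome
# exposed            a          b
# unexposed          c          d


def relative_risk(a, b, c, d):
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed group divided by odds in the unexposed group."""
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Approximate CI for the OR, computed on the log scale (Woolf method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = log(odds_ratio(a, b, c, d))
    return exp(log_or - z * standard_error), exp(log_or + z * standard_error)


# 10 of 100 exposed and 20 of 100 unexposed patients experience the outcome.
rr = relative_risk(10, 90, 20, 80)
odds = odds_ratio(10, 90, 20, 80)
lower, upper = odds_ratio_ci(10, 90, 20, 80)
crosses_one = lower < 1.0 < upper
```

Here the RR is 0.5 and the OR is about 0.44, but the 95% CI for the OR runs from roughly 0.20 to just above 1.0. Because the interval crosses 1, this result would not be considered significant despite the apparently large effect.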

Review Sensitivity

Sensitivity:

•

The likelihood of positive test results in patients who actually DO have the disease/condition of interest (i.e., ability to detect true positives among those with a disease)

•

Calculated as the proportion of patients with a disease/condition of interest who have a positive diagnostic test result: sensitivity = true positives / (true positives + false negatives)

•

Total patients with the disease of interest = true positives + false negatives

•

Sensitive tests are used for screening because they have few false-negative results. They are unlikely to miss an affected individual.

•

When the result of a highly sensitive (Sn) test is negative, the condition can be ruled OUT (mnemonic: SnOUT).

Review Specificity

The likelihood of negative test results in patients who actually DO NOT have the disease/condition of interest (i.e., ability to detect true negatives among those without a disease)

•

Calculated as the proportion of patients without a disease/condition of interest who have a negative test result: specificity = true negatives / (true negatives + false positives)

•

Total patients without the disease or condition of interest = true negatives + false positives

•

Specific tests are used for confirmation because they are tests that have few false-positive results and are therefore unlikely to result in false treatment of a healthy individual.

•

When the result of a highly specific (Sp) test is positive, the condition can be ruled IN (mnemonic: SpIN).

Review Positive Predictive Value

Positive predictive value: the likelihood that patients with positive test results actually DO have the disease/condition of interest

•

Calculated as the proportion of patients who have a positive test result and actually have the disease of interest (i.e., correctly diagnosed with a positive test result): positive predictive value = true positives / (true positives + false positives)

•

Total number of patients who tested positive = true positives + false positives

Review Negative Predictive value

Negative predictive value: the likelihood that patients with a negative test result actually DO NOT have the disease/condition of interest

•

Calculated as the proportion of patients with a negative test result who do not have the disease of interest (i.e., correctly diagnosed with a negative test): negative predictive value = true negatives / (true negatives + false negatives)

•

Total number of patients who tested negative = true negatives + false negatives
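
The four measures above all come from the same 2 × 2 table of test result versus true disease status. A minimal sketch, using made-up counts and a hypothetical function name:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV, and NPV from 2 x 2 diagnostic counts.

    tp: true positives, fn: false negatives,
    fp: false positives, tn: true negatives.
    """
    return {
        "sensitivity": tp / (tp + fn),  # among patients WITH the disease
        "specificity": tn / (tn + fp),  # among patients WITHOUT the disease
        "ppv": tp / (tp + fp),          # among patients who tested positive
        "npv": tn / (tn + fn),          # among patients who tested negative
    }


# Hypothetical study: 100 patients with the disease, 100 without.
metrics = diagnostic_metrics(tp=90, fn=10, fp=20, tn=80)
```

With these counts, sensitivity is 0.90, specificity is 0.80, positive predictive value is about 0.82, and negative predictive value is about 0.89. Sensitivity and specificity divide by the columns of the table (true disease status), whereas the predictive values divide by the rows (test result).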

Review likelihood ratios

Probability that a disease exists, given a test result; likelihood ratios consider both specificity and sensitivity of a given test.

•

Likelihood ratios close to 1.0 provide little confidence regarding presence/absence of a disease.

•

Positive likelihood ratios greater than 1.0 indicate higher probability of disease when diagnostic test result is positive.

•

Calculated as the ratio between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity): positive likelihood ratio = sensitivity / (1 − specificity)

•

Negative likelihood ratios less than 1.0 indicate higher probability that the disease is absent given a negative test result.

•

Calculated as the ratio between the false-negative rate (1 − sensitivity) and the true-negative rate (specificity): negative likelihood ratio = (1 − sensitivity) / specificity
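
As a brief illustration of the two ratios described above, the sketch below computes them from a test's sensitivity and specificity. The input values are hypothetical, and the function assumes specificity is strictly between 0 and 1 so that neither denominator is zero.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a diagnostic test."""
    positive_lr = sensitivity / (1 - specificity)  # true-positive rate / false-positive rate
    negative_lr = (1 - sensitivity) / specificity  # false-negative rate / true-negative rate
    return positive_lr, negative_lr


# Hypothetical test with 90% sensitivity and 80% specificity.
positive_lr, negative_lr = likelihood_ratios(sensitivity=0.90, specificity=0.80)
```

For this hypothetical test the positive likelihood ratio is about 4.5 and the negative likelihood ratio is about 0.125; both are well away from 1.0, so either result shifts the probability of disease.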

Review Receiver Operating Characteristic Curves

Receiver operating characteristic curves are graphical representations of the overall clinical utility of a particular diagnostic test that can be used to compare accuracy of different tests in diagnosing a particular condition (Fig. 13.5).

•

Tradeoffs between sensitivity and specificity must be considered in the identification of the best diagnostic tests.

•

ROC curves plot the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) across the range of possible test thresholds.

•

The area under the ROC curve ranges from 0.5 (useless test, no better than a random guess) to 1.0 (perfect diagnostic ability).
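
A sketch of how an ROC curve and its area can be built from raw test scores. It assumes higher scores indicate disease and that disease status is coded 1 (present) or 0 (absent); the scores, labels, and function names are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # strictest threshold: no one tests positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical test scores for three diseased (1) and three healthy (0) patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_points(scores, labels))
```

Each threshold gives one sensitivity/specificity tradeoff, and the area summarizes the test across all of them. For this toy data the area is 8/9 (about 0.89).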

Stats Stats and More stats

Statistical tests are prescribed to match the purpose and design of a particular research study. Statistical tests are used to answer research questions. Statistics are merely tools to describe data and make inferences. Interpretation of statistical findings is left to expert scientists and clinicians.

▪

Statistical analyses differ according to whether a researcher wants to compare groups to identify differences, establish relationships between groups, and so on (Table 13.1).

▪

Inferential statistics are used to test specific hypotheses about associations and/or differences among groups of subject/sample data.

□

The dependent variable is what is being measured as the outcome. There can be multiple dependent variables depending on how many outcome measures are desired.

□

The independent variables include the conditions or groupings of the experiment that are systematically manipulated by the investigator.

•

For example, a researcher is measuring pain and prescription medication use in patients with shoulder pain who receive treatment A, B, or C. The dependent variables are “pain” and “prescription medication use.” The independent variable is “treatment condition” with three levels, “A,” “B,” and “C.”

□

Inferential statistics can be generally divided into parametric tests and nonparametric tests. The goal of inferential statistics is to estimate parameters; therefore the default should be parametric tests. Nonparametric alternatives are justified if the basic underlying assumptions for using parametric statistics are violated or if the sample sizes are very small.

•

Parametric statistics are appropriate for continuous data and rely on the assumption that data are normally distributed.

•

They use the mean and SD when comparing groups or identifying associations.

•

The mean of a dataset is greatly influenced by outliers, so these tests may not be as robust for skewed datasets.

•

Nonparametric statistics are appropriate for categorical and non–normally distributed data.

•

They use the median and ranks as more robust alternatives when data are non-normally distributed.

What statistical test do you use for categorical data?

•

Chi-square (χ2) test

•

Used for two or more groups of categorical data

•

Example: to compare treatment A versus B when the outcome is either “satisfied or unsatisfied,” the chi-square test can be used to identify relationships between “treatment condition” and “outcome category.”

•

If the result of the test is statistically significant, frequencies of each outcome in the two treatment groups can be visually compared to describe which treatment is superior.

•

Fisher exact test

•

Similar to the chi-square test but better for small sample sizes or when the number of occurrences in one of the categories is low (e.g., if only one patient in treatment group A had an unsatisfactory outcome, this test is preferred)
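
The treatment A versus B example can be sketched as a chi-square test on a 2 × 2 table. The counts are made up, the function name is hypothetical, and the calculation uses the standard shortcut formula without a continuity correction. For one degree of freedom the P value can be obtained from the complementary error function, so no statistics package is assumed.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value for a 2 x 2 table (1 df).

    Rows are treatment groups, columns are outcome categories:
        treatment A: a satisfied, b unsatisfied
        treatment B: c satisfied, d unsatisfied
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(statistic / 2))  # upper tail of chi-square with 1 df
    return statistic, p_value


# Hypothetical trial: 30 of 40 satisfied with A, 20 of 40 satisfied with B.
statistic, p_value = chi_square_2x2(30, 10, 20, 20)
```

Here the statistic is about 5.33 with a P value of roughly 0.02, so the relationship between treatment and outcome would be called statistically significant, and the frequencies (75% vs. 50% satisfied) show which treatment is favored. With very small cell counts, the Fisher exact test would be the better choice.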

Review the flow chart for choosing the right statistical test

What statistical tests can you use for continuous data testing?

When two groups of data are compared, the t-test is used; there are two variations:

•

Dependent (paired) samples t-test:

•

Appropriate for comparing continuous, normally distributed data collected two times on the same subjects

•

Example: two time points measured in the same patient (e.g., before/after intervention)

•

Also appropriate for side-by-side comparison within the same subject or in matched pairs of subjects

•

Nonparametric equivalent: Wilcoxon signed rank test.

•

Independent samples t-test

•

Appropriate for comparing continuous, normally distributed data from two separate groups

•

Example: two groups of patients who received different treatments

•

Nonparametric equivalent: Mann-Whitney U test

•

ANOVA is appropriate to compare three or more groups of continuous, normally distributed data.

•

Nonparametric equivalent: Kruskal-Wallis test

•

Repeated measures ANOVA is a variation of the ANOVA test that is appropriate for sequential measurements recorded on the same subjects.

•

For example, this test would be used to compare a dependent variable (outcome measure) recorded at three or more time points (baseline, 1 month post intervention, 2 months post intervention).

•

Nonparametric alternative: Friedman test

•

Multivariate ANOVA (MANOVA): variation of the ANOVA test that is used when multiple dependent variables are compared among three or more groups

•

Analysis of covariance (ANCOVA) is an appropriate test when confounding factors must be accounted for in the statistical test.

•

Post hoc testing is necessary after any ANOVA test to determine the exact location of differences among groups.

•

ANOVA tests describe whether or not a statistically significant difference exists somewhere among the study groups.

•

For example, in a comparison of three levels of the independent variable treatment condition (A, B, or C), post hoc testing will specifically compare A vs. B, B vs. C, and A vs. C to determine the exact locations of group differences. Post hoc testing is appropriate only if the ANOVA test is statistically significant (see later section).

•

Common post hoc tests: Tukey HSD, Šidák, Dunnett, Scheffe

•

Factorial designs for multiple independent variables

•

Hypotheses regarding an interaction among three different treatment groups from pre/post intervention will have a 2 × 3 factorial design.

•

“2 × 3” indicates two independent variables; for example, the first (time) has two levels, pretest and posttest, and the second (treatment condition) has three levels, treatments A, B, and C.
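
To make the distinction between the two t-test variations described earlier concrete, the sketch below computes both test statistics on the same made-up numbers. It is an illustration only: the function names are hypothetical, the independent version assumes equal variances (pooled), and converting t to a P value would require a t distribution table or statistics software.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(first, second):
    """t statistic and degrees of freedom for paired (dependent) samples."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    t = mean(differences) / (stdev(differences) / sqrt(n))
    return t, n - 1


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two separate groups."""
    na, nb = len(group_a), len(group_b)
    pooled_variance = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_variance * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical pain scores for five patients before and after an intervention.
before = [5, 6, 7, 8, 9]
after = [4, 4, 6, 6, 8]

t_paired, df_paired = paired_t(before, after)
t_independent, df_independent = independent_t(before, after)
```

Treated as paired measurements on the same patients, these data give t of about 5.72 with 4 degrees of freedom. Treated (incorrectly) as two unrelated groups, the same numbers give t of about 1.36 with 8 degrees of freedom, which shows why the test must match the study design.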

Describe Correlation

Correlation coefficients

•

Describe the strength of a relationship between two variables

•

Pearson product correlation coefficient (r) used for continuous normally distributed data

•

Spearman rho correlation coefficient (ρ) is the nonparametric equivalent.

•

Values range from −1.0 to 1.0. Coefficients with an absolute value below 0.33 are “weak,” those between 0.33 and 0.66 are “moderate,” and those above 0.66 are “strong.” The sign gives the direction of the relationship.

•

Positive correlation coefficients indicate direct relationships suggesting that patients who scored high on one scale also score high on the other.

•

Negative correlation coefficients indicate inverse/indirect relationships suggesting that patients who score high on one scale score low on the other.
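
A sketch of both coefficients on a small made-up dataset. Spearman rho is computed here as the Pearson coefficient of the ranks, with tied values given their average rank. The function names are hypothetical, and the strength labels simply apply the 0.33 and 0.66 cutoffs given above.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product correlation coefficient."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def strength(coefficient):
    """Label a coefficient using the 0.33 / 0.66 cutoffs on its absolute value."""
    size = abs(coefficient)
    return "weak" if size < 0.33 else "moderate" if size <= 0.66 else "strong"


# Hypothetical scores on two scales for five patients.
scale_1 = [1, 2, 3, 4, 5]
scale_2 = [2, 4, 5, 4, 5]
r = pearson_r(scale_1, scale_2)
rho = spearman_rho(scale_1, scale_2)
```

For these values r is about 0.77 and rho is about 0.74, both positive (direct) and "strong" by the cutoffs above.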

Review regression

Simple linear regression

•

Describes the ability of one independent (predictor) variable to predict a dependent (outcome) variable

•

The coefficient of determination (R2) is the square of r (Pearson product correlation coefficient) and indicates the proportion of variance explained in one variable by another.

•

R2 ranges from 0 to 1.0, in which higher values indicate more variance explained.

•

Multiple linear regression (often loosely called multivariate linear regression) describes the ability of several independent variables to predict a dependent variable.

•

Logistic regression is used when the outcome is categorical (most often binary); the predictor variables can be categorical or continuous, including non–normally distributed continuous data.
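
For the simple linear regression case, a least-squares fit and its coefficient of determination can be sketched as follows. The data and function name are hypothetical; the point is only to show that R2 equals the square of the Pearson r between the predictor and the outcome.

```python
from statistics import mean


def simple_linear_regression(x, y):
    """Least-squares slope, intercept, and R-squared for one predictor."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # square of the Pearson r
    return slope, intercept, r_squared


# Hypothetical predictor (x) and outcome (y) values for five patients.
predictor = [1, 2, 3, 4, 5]
outcome = [2, 4, 5, 4, 5]
slope, intercept, r_squared = simple_linear_regression(predictor, outcome)
```

The fitted line is outcome = 2.2 + 0.6 × predictor, and R2 is 0.6, meaning that 60% of the variance in the outcome is explained by the predictor in this toy dataset.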

What is the difference between accuracy and reliability?

Accuracy and reliability can be assessed using statistical techniques similar to correlation coefficients.

□

The intraclass correlation coefficient evaluates agreement between two measures on the same scale.

▪

Accuracy/validity

□

An instrument or test with the ability to accurately describe truth/reality is said to be valid.

□

A validation study is designed to compare measures recorded from a gold-standard method with a new or experimental method. The data should be on the same measurement scale to determine agreement between the two instruments or techniques.

▪

Precision/reliability

□

The ability to precisely describe a characteristic with repeated measurements can be tested statistically.

□

The precision of an instrument or technique can be tested for interobserver (measures taken by different examiners on the same patient) or intraobserver (reliability of measures recorded by the same examiner at consecutive times) reliability. Measures should be on the same scale to determine agreement.

▪

The intraclass correlation coefficient (ICC) is a common method for statistically testing the agreement between two sets of data. Values range from 0 to 1.0 (1.0 = perfect accuracy/precision).

▪

For binary or categorical data, a κ (kappa) statistic can be used to determine agreement. The κ statistic has the same scale (0 to 1.0) as the ICC.
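
A sketch of the κ statistic for two examiners rating the same patients. This uses the usual Cohen formulation (observed agreement corrected for the agreement expected by chance), which is an assumption since the text does not name a specific variant; the ratings and function name are made up.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical data."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(counts_a[category] * counts_b[category] for category in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical binary ratings (1 = positive finding) on ten patients.
examiner_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
examiner_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(examiner_1, examiner_2)
```

The two examiners agree on 8 of 10 patients (80%), but half of that agreement would be expected by chance alone, so κ is 0.6 rather than 0.8.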