Biostatistics & Epidemiology — Part 3

Hypothesis Testing, Test Selection & Clinical Relevance

Austin Meyer, MD, PhD, MS, MPH, MS

2026-03-06

Lecture Overview

What We’ll Cover Today

Part 1: Framing Inference Correctly
- Null hypothesis language and interpretation
- Confounding and matching
Part 2: Choosing the Right Statistical Test
- Student t (unpaired t), chi-square, and Mann-Whitney U
Part 3: Clinical Impact vs Statistical Significance
- Number needed to treat (NNT)
- Statistical significance vs clinical relevance at the bedside
Part 4: Type II Error and Study Power
- Why “no significant difference” can be misleading in underpowered studies

How We’ll Work Through Questions

Present a clinical vignette and answer options
20–30 seconds of think time (pair‑share)
Reveal the answer and debrief with 2–3 teaching points
Where helpful, we’ll add a simple visual to anchor the concept

Part 1: Framing Inference Correctly

Question 1

A new study evaluates the effect of a novel antibiotic, compound A, in treating urinary tract infections in children. Children and adolescents with a confirmed urinary tract infection are randomized to receive either compound A or trimethoprim-sulfamethoxazole. The outcome of interest is microbiologic cure at day 3 of therapy.

Of the following, the BEST statement of a null hypothesis with regard to this study is that at 3 days, compound A

Debrief: Null Hypothesis Language

Correct Answer: B. has the same rate of microbiologic cure as trimethoprim-sulfamethoxazole

The null hypothesis is the “no difference” statement.
In this vignette, H0 is that cure rates are the same between compound A and trimethoprim-sulfamethoxazole.
The alternative hypothesis would be that cure rates differ.
If the prespecified alpha is .05, we reject H0 when p < .05; otherwise we fail to reject H0.

Debrief: Why The Other Choices Are Incorrect

A describes an alternative hypothesis (difference), not the null.
C and D mix hypothesis wording with post-hoc p-value interpretation.
“Equivalent” and “noninferior” require specific trial design assumptions and prespecified margins, not just p > .05.

Question 2

A researcher is interested in studying the effects of preterm birth on math and reading achievement. She plans to recruit children ages 6 to 12 years with a history of preterm birth who have a term-born sibling within the same age range to serve as a matched control.

Of the following, the use of a term-born sibling to serve as a matched control is BEST described as a means of minimizing

Debrief: Matching Minimizes Confounding

Correct Answer: B. confounding

Confounding occurs when a third variable is associated with both the predictor and the outcome and is not on the causal pathway between them.
Matching preterm children to term-born siblings helps reduce shared environmental confounders (eg, socioeconomic context, parental education).
This improves validity by reducing spurious causal interpretations.

Debrief: Confounding vs Other Errors

A (power): power depends on sample size, effect size, and variability; matching can affect precision, but here its primary purpose is confounding control.
C (type I error): false-positive risk is controlled by alpha/p-value thresholds.
D (type II error): false-negative risk is reduced by adequate power/sample size.
Matching is a design-stage confounding-control strategy.
In matched studies, analysis should account for the matching to preserve the design benefit.
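One way to see what "accounting for the matching" means: in a sibling-matched design, each pair can be reduced to a single within-family difference, and the test is run on those differences. The sketch below computes a paired t statistic using only the Python standard library. The scores are hypothetical values invented for illustration, not data from any study, and the variable names are assumptions.

```python
import math
import statistics

# Hypothetical math scores for five sibling pairs (illustrative only).
preterm = [92, 88, 101, 95, 85]
term_sibling = [97, 90, 104, 101, 88]

# Matched analysis: one within-pair difference per family.
diffs = [p - t for p, t in zip(preterm, term_sibling)]

n_pairs = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample SD of the differences

# Paired t statistic: mean difference divided by its standard error.
t_paired = mean_diff / (sd_diff / math.sqrt(n_pairs))
df = n_pairs - 1

print(f"mean within-pair difference = {mean_diff:.2f}")
print(f"paired t = {t_paired:.2f} on {df} degrees of freedom")
```

Because shared family factors cancel within each pair, the differences are usually less variable than the raw scores, which is how the design benefit carries into the analysis.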

Part 2: Choosing The Right Statistical Test

Question 3

A researcher is developing a double-blind, placebo-controlled, parallel-group randomized controlled trial assessing the effect of a new leukotriene receptor antagonist on forced expiratory volume in the first second (FEV1) rounded to the nearest 0.1 liter.

Of the following, the BEST statistical test to compare the FEV1 between the 2 groups (new drug vs placebo) at the end of this study is

Debrief: Student t Test For Two Independent Groups

Correct Answer: D. Student t test

Outcome is continuous (FEV1).
Design is parallel-group, so groups are independent.
Student t test compares means between 2 independent groups when its assumptions are met (approximately normal data with similar variance in each group).
In this context, “Student t test” and “unpaired t test” are synonymous.
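For readers who want to see the arithmetic, here is a minimal sketch of the pooled-variance (Student) t statistic for two independent groups, written with the Python standard library only. The FEV1 values are hypothetical and the function name `student_t` is an assumption made for this example. The sketch stops at the t statistic and degrees of freedom; a p-value would come from a t table or a statistics package.

```python
import math
import statistics

def student_t(group_a, group_b):
    """Return (t, df) for the pooled-variance two-sample t test."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df

# Hypothetical end-of-study FEV1 values in liters (illustrative only).
drug = [2.9, 3.1, 3.4, 3.0, 3.3, 3.2]
placebo = [2.7, 2.9, 3.0, 2.8, 3.1, 2.6]

t_stat, df = student_t(drug, placebo)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

With 10 degrees of freedom, the two-sided critical value at alpha .05 is about 2.23, so a t statistic larger than that in absolute value would lead to rejecting H0.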

Explanatory Figure: Quick Test Selection Matrix

Outcome Type	Groups	Independent?	Typical Test
Continuous, approximately normal	2	Yes	Student/unpaired t test
Continuous, approximately normal	2	No (paired/before-after)	Paired t test
Categorical	2+	Yes	Chi-square (or Fisher exact if sparse)
Categorical paired	2	No	McNemar test
Ordinal/ranked	2	Yes	Mann-Whitney U
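The matrix can also be read as a simple lookup keyed on outcome type and independence. The sketch below is one possible encoding; the key labels and the function name `choose_test` are assumptions made for illustration, and the lookup covers only the five rows shown above.

```python
# Keys: (outcome type, groups independent?).
# Values: the "Typical Test" column of the matrix.
TEST_MATRIX = {
    ("continuous_normal", True): "Student/unpaired t test",
    ("continuous_normal", False): "Paired t test",
    ("categorical", True): "Chi-square (or Fisher exact if sparse)",
    ("categorical", False): "McNemar test",
    ("ordinal", True): "Mann-Whitney U",
}

def choose_test(outcome, independent):
    """Return the typical test, or None if the matrix has no matching row."""
    return TEST_MATRIX.get((outcome, independent))

# Continuous outcome, parallel (independent) groups.
print(choose_test("continuous_normal", True))
# Yes/no outcome compared across independent groups.
print(choose_test("categorical", True))
# A combination the matrix does not cover returns None.
print(choose_test("ordinal", False))
```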

What the Data Actually Look Like for Each Test

Question 4

The medical director of a large neonatal intensive care service discovers that about half of the medical records she reviewed lacked a graphic growth chart, contrary to unit protocol. Several of her colleagues thought that plotting daily growth parameters on paper became unnecessary since the advent of an electronic records system. Nevertheless, she hypothesized that weight gain might be enhanced with the use of such graphs and proposed studying the question.

She designed a retrospective study of consecutive admissions of premature infants who met the following criteria: born after the advent of electronic charting, hospital stay of at least 21 days, no congenital infection or anomaly, and no enterocolitis. She proposed examining the charts of 100 such infants. The presence or absence of a growth chart would be noted as well as the average daily weight gain during the second and third weeks after birth in terms of grams per kilogram of birthweight per day. She believes that weight gain will have an approximately normal distribution.

Of the following, the MOST appropriate statistical test to be used to address the hypothesis is

Debrief: Unpaired t Test

Correct Answer: D. unpaired t test

Dependent variable is continuous (weight gain) and expected to be approximately normal.
Groups are independent (growth chart present vs absent across different infants).
Paired t test would require matched pairs or repeated measures on the same participants.

Debrief: Why Other Tests Do Not Fit

A (Fisher exact): for categorical outcomes with small expected cell counts.
B (Mann-Whitney U): nonparametric alternative when continuous data are not normal.
C (paired t test): requires dependent/paired observations.

Question 5

A nephrology fellow wants to assess the prevalence of influenza in vaccinated and unvaccinated children 8 to 15 years of age who have nephrotic syndrome or chronic kidney disease or are kidney transplant recipients during the upcoming winter season. His hypothesis is that the vaccine will be associated with a lower prevalence of influenza infection in the 3 groups.

Of the following, the MOST appropriate statistical test to analyze the results is

Debrief: Chi-square For Categorical Data

Correct Answer: A. chi-square test

Outcome is categorical (influenza infection: yes/no).
Comparison is across independent groups.
Chi-square is the standard statistical test for association in categorical data.
Fisher exact test is preferred when expected cell counts are very small (a common rule of thumb is any expected count below 5).
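The chi-square statistic compares observed counts with the counts expected if the row and column variables were unrelated. The sketch below computes the expected counts and the statistic, and flags sparse tables using the common "expected count below 5" rule of thumb. The counts are hypothetical and the function names are assumptions made for this example.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Hypothetical counts (illustrative only).
# Rows: vaccinated, unvaccinated. Columns: influenza yes, influenza no.
table = [[4, 36], [12, 28]]

expected = expected_counts(table)
stat = chi_square_statistic(table)
df = (len(table) - 1) * (len(table[0]) - 1)
sparse = any(exp < 5 for row in expected for exp in row)

print(f"chi-square = {stat:.2f} on {df} degree(s) of freedom")
print("consider Fisher exact" if sparse else "expected counts adequate")
```

For 1 degree of freedom the critical value at alpha .05 is about 3.84, so a statistic above that corresponds to p < .05.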

Question 6

A researcher is designing a study to look at parent preferences in the treatment of juvenile arthritis. Parents will be asked to fill out a survey with the factors that are involved in their choice of treatment, and factors will be ranked and compared with medical staff responses. The hypothesis is that parent factors in treatment decision-making are different from medical staff factors.

Of the following, the BEST statistical test to evaluate this hypothesis is

Debrief: Ranked Data -> Mann-Whitney U

Correct Answer: C. Mann-Whitney U test

Preferences are ranked (ordinal), not continuous interval data.
Two independent groups are compared (parents vs medical staff).
Mann-Whitney U is the nonparametric 2-group test for ordinal/ranked outcomes.
Kruskal-Wallis is the nonparametric analog for 3+ groups.
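Mann-Whitney U works on ranks rather than raw values: pool both groups, rank every observation (tied values share the average rank), and compare the rank sums. A minimal standard-library sketch follows. The ordinal ratings are hypothetical and the function name `mann_whitney_u` is an assumption; the sketch returns only the U statistics, not a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b), assigning average ranks to tied values."""
    combined = sorted(sample_a + sample_b)

    # Map each distinct value to the mean of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(rank_of[x] for x in sample_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Hypothetical importance ranks for one treatment factor (1 = most important).
parents = [1, 2, 2, 3, 1, 2]
staff = [3, 4, 2, 5, 4, 3]

u_parents, u_staff = mann_whitney_u(parents, staff)
print(f"U (parents) = {u_parents}, U (staff) = {u_staff}")
```

The smaller of the two U values is the one compared against a critical value table; a very small U means the two groups' ranks barely overlap.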

Part 3: Clinical Impact vs Statistical Significance

Question 7

A physician is reviewing a journal article with a group of medical students. The article describes a randomized controlled trial comparing 2 warm-up exercise programs designed to prevent knee injuries in high school soccer athletes. The researchers found that 5% of the athletes who participated in a strengthening warm-up program sustained knee injuries during their soccer season compared to 15% of control group participants who used a static stretching warm-up program. The researchers concluded that for every 10 athletes who participated in the strengthening program, there was 1 fewer injury.

Of the following, the CONCEPT expressed in this last conclusion is known as

Debrief: Number Needed to Treat (NNT)

Correct Answer: B. number needed to treat

ARR = 15% − 5% = 10% = 0.10
NNT = 1 / ARR = 1 / 0.10 = 10
The statement “1 fewer injury for every 10 athletes” is exactly NNT language.
NNT is intuitive, but it does not by itself capture overall clinical context or harms.
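The arithmetic above can be checked in a few lines. This sketch uses exact fractions rather than floating-point numbers so that rounding noise cannot push the NNT up to 11. The function name `risk_metrics` is an assumption; rounding NNT up to the next whole person is the usual reporting convention.

```python
import math
from fractions import Fraction

def risk_metrics(control_risk, treatment_risk):
    """Return (ARR, RR, NNT) for risks given as exact fractions."""
    arr = control_risk - treatment_risk   # absolute risk reduction
    rr = treatment_risk / control_risk    # relative risk, treatment vs control
    nnt = math.ceil(1 / arr)              # round up to a whole person
    return arr, rr, nnt

# Vignette values: 15% injured with static stretching, 5% with strengthening.
arr, rr, nnt = risk_metrics(Fraction(15, 100), Fraction(5, 100))

print(f"ARR = {float(arr):.2f}, RR = {float(rr):.2f}, NNT = {nnt}")
```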

Explanatory Figure: Risk Metrics Relationship

Metric	Formula	Value In Vignette
Risk in treatment group	Injuries with strengthening warm-up	5%
Risk in control group	Injuries with static stretching	15%
Absolute risk reduction (ARR)	0.15 - 0.05	0.10
Relative risk (RR, treatment vs control)	0.05 / 0.15	0.33
Number needed to treat (NNT)	1 / 0.10	10

Question 8

A 6-year-old girl with a deep vein thrombosis is treated with twice-daily injectable anticoagulants, which is the standard of care. Her parents report that every time she receives a dose of the drug, she screams and cries hysterically and will push, punch, and bite her parents as they try to give her the injection. In researching possible alternative treatments, the pediatrician identifies a large, randomized controlled, clinical trial in which the injectable standard-of-care drug was compared with a once-daily oral anticoagulant approved by the US Food and Drug Administration. There was no statistical difference in efficacy between the two therapeutic options. However, the rate of bleeding complications was 0.5% in the oral anticoagulant group vs 0.3% in the standard-of-care injectable drug group (p = .04).

Of the following, the MOST accurate statement about the use of the oral anticoagulant for this patient is that it

Debrief: Statistical Significance vs Clinical Significance

Correct Answer: A. may still be considered because the increased risk for bleeding complications is not clinically important despite the statistical significance

Trial shows similar efficacy between oral and injectable options.
Bleeding risk increase is statistically significant (p = .04) but very small in absolute terms (0.5% vs 0.3%).
That absolute difference is 0.2 percentage points, or about 2 additional bleeding events per 1,000 patients treated (number needed to harm, NNH, approximately 500).
Evidence-based decisions at the bedside must integrate patient burden, quality of life, harms, and benefits.
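The number needed to harm follows the same logic as NNT, applied to an adverse outcome. A short sketch using the vignette's bleeding rates, with exact fractions to avoid floating-point noise (variable names are assumptions made for this example):

```python
from fractions import Fraction

# Bleeding complication rates from the vignette.
oral_bleed = Fraction(5, 1000)        # 0.5% with the oral anticoagulant
injectable_bleed = Fraction(3, 1000)  # 0.3% with the injectable drug

# Absolute risk increase attributable to the oral drug.
ari = oral_bleed - injectable_bleed
extra_events_per_1000 = ari * 1000
nnh = 1 / ari

print(f"absolute risk increase = {float(ari):.3f}")
print(f"extra bleeding events per 1,000 treated = {extra_events_per_1000}")
print(f"NNH = {nnh}")
```

A p-value of .04 says the difference is unlikely to be chance alone; the NNH says how large the difference is, which is the number that matters at the bedside.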

Debrief: Why Other Choices Are Incorrect

B: no FDA waiver implication follows from this statistical finding.
C: bleeding difference was statistically significant, so “no statistical difference” is incorrect.
D: “never” overstates risk and ignores individualized clinical tradeoff.

Part 4: Type II Error and Study Power

Question 9

A researcher is evaluating a new agent for the treatment of patent ductus arteriosus (PDA) in very-low-birthweight neonates. The researcher has collected data on 100 neonates with PDA. Of 50 neonates treated with a traditional agent, 13 responded, and of the 50 neonates taking the new agent, 18 responded. A Chi-square test is performed for independence and P = .28. The researcher concludes that there is no difference in the response to the two drugs. Several months later, a large multicenter trial is published that demonstrates efficacy of the new drug.

Of the following, the MOST likely statistical error the researcher has committed is

Debrief: Type II Error From Inadequate Power

Correct Answer: D. type II error

Type II error = false negative: failing to reject a false null hypothesis.
Here, an early small study concluded no difference, but a later large trial showed efficacy.
This pattern strongly suggests insufficient power/sample size in the original study.
A P value of .28 indicates insufficient evidence at this sample size, not proof that no true difference exists.
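The vignette's P value can be reproduced from its counts (13 of 50 vs 18 of 50 responders). The sketch below computes the Pearson chi-square for a 2x2 table without continuity correction, which is an assumption about how the vignette's test was run; it is the version that matches P = .28. For 1 degree of freedom the p-value has a closed form using the complementary error function, so only the standard library is needed. The function name `chi_square_2x2` is an assumption.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: traditional agent, new agent. Columns: responded, did not respond.
stat, p = chi_square_2x2(13, 37, 18, 32)

print(f"chi-square = {stat:.2f}, P = {p:.2f}")
```

The observed response rates differ by 10 percentage points (26% vs 36%), yet with 50 neonates per arm that gap is well within what chance alone could produce.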

Explanatory Figure: Type I vs Type II Errors

Study Decision	Reality: No True Difference	Reality: True Difference
Reject H0	Type I error (alpha)	Correct
Fail to reject H0	Correct	Type II error (beta)

Power = 1 - beta

Why n = 50 Was Not Enough: Power Curve for Q9
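A rough power calculation shows why 50 neonates per group was too few. The sketch below uses the normal approximation for a two-sided comparison of two proportions. It assumes, purely for illustration, that the true response rates equal the rates observed in Q9 (26% and 36%); the true rates are unknown, and the function name `power_two_proportions` and the sample sizes in the loop are also assumptions.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)

    # Standard error under H0 (pooled) and under the assumed alternative.
    p_bar = (p1 + p2) / 2
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)

    diff = abs(p2 - p1)
    # Probability the test statistic lands in either rejection region.
    upper = z.cdf((diff - z_crit * se_null) / se_alt)
    lower = z.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower

# Assumed true response rates: 26% (traditional) vs 36% (new agent).
for n in (50, 100, 200, 350, 500):
    power = power_two_proportions(0.26, 0.36, n)
    print(f"n per group = {n:>3}: power = {power:.2f}")
```

Under these assumptions, power at 50 per group is roughly 20%, so a true difference of this size would be missed about 4 times out of 5, and several hundred neonates per group would be needed to reach the conventional 80%.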

Question 10

Dr. Bone hypothesizes that a new biologic medication, unobtainimab, will be effective for the treatment of pediatric systemic lupus erythematosus (SLE). Dr. Bone’s research team performs a randomized, double-blind study comparing unobtainimab to placebo in adolescents with SLE. The primary outcome is SLE Disease Activity Index (SLEDAI) scores. Twenty patients who have severe organ involvement and for whom cyclophosphamide treatment has been unsuccessful are recruited. Subjects are randomized into 2 groups: 10 patients receive 8 twice-monthly infusions of unobtainimab and 10 patients receive twice-monthly intravenous normal saline as placebo. Subjects are followed for 3 months after completing the course of unobtainimab or placebo. No significant difference in SLEDAI scores is noted. Dr. Bone’s group publishes the findings, concluding that unobtainimab is no more effective than placebo for adolescents with SLE.

One year later, an international collaborative group headed by Dr. Joint publishes a much larger study (N = 893) that examines unobtainimab for the treatment of SLE in adolescents and uses SLEDAI scores as the primary outcome. Patients recently diagnosed with SLE are randomized to receive treatment with either unobtainimab or mycophenolate. This study shows a 28% benefit in overall SLEDAI scores at 6 months in subjects taking unobtainimab compared with the standard treatment group (P = 0.004). This study is published in a highly regarded journal, and the findings are later confirmed in other studies.

Of the following, Dr. Bone’s conclusion that unobtainimab is no more effective than placebo is BEST characterized as

Debrief: Type II Error — Same Pattern, New Twist

Correct Answer: D. a type II error

Same false-negative structure as Q9: small underpowered study (n = 20 total) missed a real effect confirmed later by a much larger trial.
The key challenge here is distinguishing type II error from the two bias options in the answer choices.
Reporting bias and response bias are flaws in how data are collected or disseminated — not errors in statistical inference from an underpowered sample.

Debrief: Distinguishing Biases From Error Types

Response bias: participants give answers they think investigators want.
Reporting bias: selective dissemination/publication of findings.
Neither bias label explains this vignette’s sequence as well as type II error does.

Key Takeaways

Phrase null hypotheses as no-difference statements.
Choose tests by data type, distribution, and independence.
Interpret p-values alongside effect size and clinical context.
Plan power/sample size early to reduce false negatives.