Section 07: Probability & Statistics
State Bayes’ Theorem.
A test has sensitivity 80% and specificity 90%. If prevalence is 10%, calculate PPV.
What’s the difference between sensitivity and PPV?
If PPV is low but NPV is high, what does this tell us about the test?
Contingency tables are a key exam format - expect at least one problem!
A contingency table shows the joint distribution of two categorical variables.
| | \(B\) | \(\bar{B}\) | Total |
|---|---|---|---|
| \(A\) | \(n_{AB}\) | \(n_{A\bar{B}}\) | \(n_A\) |
| \(\bar{A}\) | \(n_{\bar{A}B}\) | \(n_{\bar{A}\bar{B}}\) | \(n_{\bar{A}}\) |
| Total | \(n_B\) | \(n_{\bar{B}}\) | \(n\) |
| Type | Formula | Location in Table |
|---|---|---|
| Marginal | \(P(A)\) | Row total / Grand total |
| Joint | \(P(A \cap B)\) | Cell / Grand total |
| Conditional | \(P(A\|B)\) | Cell / Column total |
Survey of 500 customers about product preference and age:
| | Age < 30 | Age ≥ 30 | Total |
|---|---|---|---|
| Prefers A | 120 | 80 | 200 |
| Prefers B | 130 | 170 | 300 |
| Total | 250 | 250 | 500 |
Calculate:
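As an illustration of the three probability types on this survey table, the following Python sketch computes one marginal, one joint, and one conditional probability. The choice of events (Prefers A, Age < 30) and the helper names are assumptions made for the example, not part of the original exercise.

```python
from fractions import Fraction

# Counts from the survey table (rows: preference, columns: age group).
counts = {
    ("Prefers A", "Age < 30"): 120,
    ("Prefers A", "Age >= 30"): 80,
    ("Prefers B", "Age < 30"): 130,
    ("Prefers B", "Age >= 30"): 170,
}
total = sum(counts.values())  # grand total: 500


def row_total(row):
    """Sum of all cells in one row."""
    return sum(v for (r, _), v in counts.items() if r == row)


def col_total(col):
    """Sum of all cells in one column."""
    return sum(v for (_, c), v in counts.items() if c == col)


# Marginal: row total / grand total
p_a = Fraction(row_total("Prefers A"), total)

# Joint: cell / grand total
p_a_and_young = Fraction(counts[("Prefers A", "Age < 30")], total)

# Conditional: cell / column total (conditioning on the column event)
p_a_given_young = Fraction(counts[("Prefers A", "Age < 30")], col_total("Age < 30"))

print(float(p_a), float(p_a_and_young), float(p_a_given_young))  # 0.4 0.24 0.48
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations.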
Step-by-Step Approach
In a city of 10,000 residents:
Construct the contingency table.
| | Adult | Minor | Total |
|---|---|---|---|
| Employed | 3500 | ? | 4000 |
| Not Employed | ? | ? | 6000 |
| Total | 7000 | 3000 | 10000 |
| | Adult | Minor | Total |
|---|---|---|---|
| Employed | 3500 | 500 | 4000 |
| Not Employed | 3500 | 2500 | 6000 |
| Total | 7000 | 3000 | 10000 |
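The way the missing cells were filled in can be sketched in Python: given one inner cell and the margins, every other cell of a 2×2 table follows by subtraction. The function name `complete_2x2` and its parameter names are hypothetical, chosen only for this example.

```python
def complete_2x2(cell_11, row1_total, col1_total, grand_total):
    """Fill a 2x2 table from one inner cell and the margins by subtraction."""
    cell_12 = row1_total - cell_11            # rest of row 1
    cell_21 = col1_total - cell_11            # rest of column 1
    row2_total = grand_total - row1_total     # margin of row 2
    cell_22 = row2_total - cell_21            # rest of row 2
    table = [[cell_11, cell_12], [cell_21, cell_22]]
    if any(value < 0 for row in table for value in row):
        raise ValueError("inconsistent inputs: a cell came out negative")
    return table


# City example: rows = Employed / Not Employed, columns = Adult / Minor.
city = complete_2x2(cell_11=3500, row1_total=4000, col1_total=7000, grand_total=10000)
print(city)  # [[3500, 500], [3500, 2500]]
```

The negative-cell check is a simple guard against inputs that cannot describe a real table.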
Now we can answer questions like:
A company surveyed 200 customers:
Build the table:
Step 1: Fill in what we know directly
| | Repeat | New | Total |
|---|---|---|---|
| Satisfied | ? | ? | 120 |
| Not Satisfied | ? | ? | 80 |
| Total | 90 | 110 | 200 |
Step 2: Use “Of satisfied, 60% are repeat”
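\(P(\text{Repeat}|\text{Satisfied}) = 0.60\), so \(120 \times 0.60 = 72\) repeat AND satisfied. The remaining cells follow by subtraction: \(120 - 72 = 48\), \(90 - 72 = 18\), and \(80 - 18 = 62\).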
| | Repeat | New | Total |
|---|---|---|---|
| Satisfied | 72 | 48 | 120 |
| Not Satisfied | 18 | 62 | 80 |
| Total | 90 | 110 | 200 |
Verify: All rows and columns sum correctly ✓
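The same construction, including the final verification, might look like this in Python. This is a minimal sketch using the numbers from the tables above; the variable names are assumptions made for the example.

```python
# Known margins and the conditional statement from the survey.
n_total, n_satisfied, n_repeat = 200, 120, 90
p_repeat_given_satisfied = 0.60

# Turn the conditional probability into a joint count: 60% of the 120 satisfied.
sat_repeat = round(n_satisfied * p_repeat_given_satisfied)

# Fill the remaining cells by subtraction from the margins.
sat_new = n_satisfied - sat_repeat
unsat_repeat = n_repeat - sat_repeat
unsat_new = (n_total - n_satisfied) - unsat_repeat

table = {
    "Satisfied": {"Repeat": sat_repeat, "New": sat_new},
    "Not Satisfied": {"Repeat": unsat_repeat, "New": unsat_new},
}

# Verify that rows, columns, and the grand total all agree with the margins.
sums_ok = (
    sum(table["Satisfied"].values()) == n_satisfied
    and sum(table["Not Satisfied"].values()) == n_total - n_satisfied
    and sat_repeat + unsat_repeat == n_repeat
    and sat_new + unsat_new == n_total - n_repeat
)
print(table, sums_ok)
```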
Independence in Tables
Variables A and B are independent if and only if for all cells:
\[P(A \cap B) = P(A) \cdot P(B)\]
Or equivalently: \(\frac{\text{Cell count}}{\text{Total}} = \frac{\text{Row total}}{\text{Total}} \times \frac{\text{Column total}}{\text{Total}}\)
From our customer survey:
| | Repeat | New | Total |
|---|---|---|---|
| Satisfied | 72 | 48 | 120 |
| Not Satisfied | 18 | 62 | 80 |
| Total | 90 | 110 | 200 |
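Test independence for (Satisfied, Repeat): under independence, the expected cell count would be \(\frac{120 \times 90}{200} = 54\), but the observed count is 72.
\(72 \neq 54\), so satisfaction and repeat status are NOT independent.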
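The data suggests:
\(P(\text{Repeat}|\text{Satisfied}) = \frac{72}{120} = 0.60\), while \(P(\text{Repeat}|\text{Not Satisfied}) = \frac{18}{80} = 0.225\).
Satisfied customers are about 2.7 times more likely to be repeat customers!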
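The independence check can be sketched in Python by comparing every observed cell with its expected count, row total × column total / grand total. The function names and the tolerance parameter `tol` are assumptions for this example. It checks the definition on the table itself and is not a statistical test for sample data.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def is_independent(table, tol=1e-9):
    """True if every observed cell equals its expected count (within tol)."""
    expected = expected_counts(table)
    return all(
        abs(obs - exp) <= tol
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Rows: Satisfied / Not Satisfied, columns: Repeat / New.
survey = [[72, 48], [18, 62]]
print(expected_counts(survey))  # [[54.0, 66.0], [36.0, 44.0]]
print(is_independent(survey))   # False

# Ratio of the two conditional probabilities of being a repeat customer.
ratio = (72 / 120) / (18 / 80)
print(round(ratio, 2))  # 2.67
```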
The contingency table method from Session 07-05 is actually using this technique!
Medical testing example:
| | Disease | No Disease | Total |
|---|---|---|---|
| Test + | TP | FP | All + |
| Test − | FN | TN | All − |
| Total | Diseased | Healthy | Population |
Given: Sensitivity = 90%, Specificity = 95%, Prevalence = 2%
For 10,000 people:
| | Disease (200) | No Disease (9800) | Total |
|---|---|---|---|
| Test + | 180 | 490 | 670 |
| Test − | 20 | 9310 | 9330 |
Direct calculations:
- PPV = \(\frac{180}{670} \approx 0.269\)
- NPV = \(\frac{9310}{9330} \approx 0.998\)
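The table-based calculation above can be reproduced with a short Python sketch that builds the four cells from sensitivity, specificity, and prevalence. The function names and the default population of 10,000 are assumptions for this example; the population size cancels out of PPV and NPV.

```python
def diagnostic_table(sensitivity, specificity, prevalence, population=10_000):
    """Build the TP/FP/FN/TN cells for a hypothetical population."""
    diseased = population * prevalence
    healthy = population - diseased
    tp = diseased * sensitivity   # diseased and test positive
    fn = diseased - tp            # diseased but test negative
    tn = healthy * specificity    # healthy and test negative
    fp = healthy - tn             # healthy but test positive
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def ppv(cells):
    """Positive predictive value: TP / all positive tests."""
    return cells["TP"] / (cells["TP"] + cells["FP"])


def npv(cells):
    """Negative predictive value: TN / all negative tests."""
    return cells["TN"] / (cells["TN"] + cells["FN"])


cells = diagnostic_table(sensitivity=0.90, specificity=0.95, prevalence=0.02)
print(round(ppv(cells), 3), round(npv(cells), 3))  # 0.269 0.998
```

Changing `prevalence` while holding the other two inputs fixed shows how strongly PPV depends on how common the disease is.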
A survey of 400 employees found:
Tasks:
a) Construct the contingency table
b) Find \(P(\text{Grad degree}|\text{Full-time})\)
c) Find \(P(\text{Full-time}|\text{Grad degree})\)
d) Are full-time status and graduate degree independent?
A company produces items at two factories. Quality control data:
Tasks:
a) Construct a contingency table
b) An item is randomly selected and found defective. What’s the probability it came from Factory A?
c) What percentage of all items are defective?
Homework
Complete Tasks 07-06 - practice building and reading contingency tables!
Session 07-06 - Contingency Tables | Dr. Nikolai Heinrichs & Dr. Tobias Vlćek | Home