Tasks 07-01 - Descriptive Statistics
Section 07: Probability & Statistics
Problem 1: Measures of Central Tendency (x)
For the dataset: \(15, 22, 18, 25, 22, 19, 22, 28, 17, 22\)
- Calculate the mean.
- Find the median.
- Find the mode.
- Which measure best represents the “typical” value? Why?
Mean: \(\bar{x} = \frac{15+22+18+25+22+19+22+28+17+22}{10} = \frac{210}{10} = 21\)
Sorted: \(15, 17, 18, 19, 22, 22, 22, 22, 25, 28\)
Median (average of the 5th and 6th values, since \(n = 10\)) = \(\frac{22 + 22}{2} = 22\)
Mode = 22 (appears 4 times)
The mode (22) best represents the typical value: it is the most frequent value and coincides with the median. The mean (21) is pulled slightly below it by the low values 15 and 17.
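The three measures can be cross-checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and not part of the original exercise.

```python
from statistics import mean, median, multimode

data = [15, 22, 18, 25, 22, 19, 22, 28, 17, 22]

mean_value = mean(data)      # arithmetic mean
median_value = median(data)  # averages the two middle values when n is even
modes = multimode(data)      # list of every most-frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

`multimode` is used instead of `mode` so that a dataset with several equally frequent values would return all of them.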
Problem 2: Variance and Standard Deviation (x)
For the dataset: \(8, 12, 15, 11, 14\)
- Calculate the mean.
- Calculate the sample variance.
- Calculate the sample standard deviation.
Mean: \(\bar{x} = \frac{8+12+15+11+14}{5} = \frac{60}{5} = 12\)
Variance: Squared deviations from the mean: \((8-12)^2 = 16\), \((12-12)^2 = 0\), \((15-12)^2 = 9\), \((11-12)^2 = 1\), \((14-12)^2 = 4\)
Sum of squared deviations: \(16 + 0 + 9 + 1 + 4 = 30\)
Sample variance: \(s^2 = \frac{30}{5-1} = \frac{30}{4} = 7.5\)
Standard deviation: \(s = \sqrt{7.5} \approx 2.74\)
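The same steps (mean, squared deviations, division by \(n - 1\)) can be written out as a short sketch. The variable names are illustrative.

```python
from math import sqrt

data = [8, 12, 15, 11, 14]
n = len(data)

x_bar = sum(data) / n                             # mean
squared_devs = [(x - x_bar) ** 2 for x in data]   # squared deviations from the mean
sample_var = sum(squared_devs) / (n - 1)          # sample variance divides by n - 1
sample_sd = sqrt(sample_var)                      # sample standard deviation

print(x_bar, squared_devs, sample_var, round(sample_sd, 2))
```

The standard library functions `statistics.variance(data)` and `statistics.stdev(data)` use the same \(n - 1\) divisor.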
Problem 3: Range and IQR (x)
For the dataset: \(42, 55, 63, 48, 71, 59, 45, 67, 52, 58, 61, 49\)
- Find the range.
- Find Q1 (first quartile).
- Find Q3 (third quartile).
- Calculate the interquartile range (IQR).
Sorted data: \(42, 45, 48, 49, 52, 55, 58, 59, 61, 63, 67, 71\)
Range = Max - Min = \(71 - 42 = 29\)
Lower half: \(42, 45, 48, 49, 52, 55\)
Q1 = median of lower half = \(\frac{48 + 49}{2} = 48.5\)
Upper half: \(58, 59, 61, 63, 67, 71\)
Q3 = median of upper half = \(\frac{61 + 63}{2} = 62\)
IQR = Q3 - Q1 = \(62 - 48.5 = 13.5\)
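The median-of-halves rule used above can be sketched as a small helper. The function name `quartiles_by_halves` is hypothetical, and dropping the middle value when \(n\) is odd is an assumption (the exercise only shows even \(n\)); other quartile conventions give slightly different results.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: for odd n the middle value is left out of both halves.
    """
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)

data = [42, 55, 63, 48, 71, 59, 45, 67, 52, 58, 61, 49]
q1, med, q3 = quartiles_by_halves(data)
iqr = q3 - q1
data_range = max(data) - min(data)

print(data_range, q1, q3, iqr)
```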
Problem 4: Outlier Detection (xx)
For the dataset: \(25, 28, 30, 32, 27, 29, 31, 85, 26, 30\)
- Calculate Q1, Q3, and IQR.
- Determine the lower and upper fences for outliers.
- Are there any outliers? If so, which value(s)?
- Recalculate the mean with and without outliers.
Sorted: \(25, 26, 27, 28, 29, 30, 30, 31, 32, 85\)
Lower half: \(25, 26, 27, 28, 29\) → Q1 = 27
Upper half: \(30, 30, 31, 32, 85\) → Q3 = 31
IQR = \(31 - 27 = 4\)
Lower fence = Q1 - 1.5 × IQR = \(27 - 1.5(4) = 27 - 6 = 21\)
Upper fence = Q3 + 1.5 × IQR = \(31 + 1.5(4) = 31 + 6 = 37\)
Values below 21 or above 37 are outliers. 85 is an outlier (85 > 37).
With outlier: \(\bar{x} = \frac{25+26+27+28+29+30+30+31+32+85}{10} = \frac{343}{10} = 34.3\)
Without outlier: \(\bar{x} = \frac{25+26+27+28+29+30+30+31+32}{9} = \frac{258}{9} \approx 28.67\)
The single outlier raises the mean by about 5.63 (from 28.67 to 34.3).
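The 1.5 × IQR rule and the effect on the mean can be reproduced with a few lines. This is a sketch under the same median-of-halves convention; because \(n\) is even here, the two halves do not overlap.

```python
from statistics import median

data = [25, 28, 30, 32, 27, 29, 31, 85, 26, 30]
s = sorted(data)
half = len(s) // 2                     # n = 10, so each half has 5 values

q1, q3 = median(s[:half]), median(s[half:])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [x for x in s if x < lower_fence or x > upper_fence]
kept = [x for x in s if x not in outliers]

mean_with = sum(s) / len(s)
mean_without = sum(kept) / len(kept)

print(outliers, round(mean_with, 2), round(mean_without, 2))
```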
Problem 5: Frequency Distribution (x)
Test scores for 20 students: \(65, 72, 78, 85, 91, 68, 74, 82, 88, 95, 71, 77, 83, 89, 73, 79, 84, 92, 76, 81\)
- Create a frequency table using intervals: 65-74, 75-84, 85-94, 95-100
- Calculate the relative frequency for each interval.
- What percentage of students scored between 75 and 84?
- a) & b) Frequency table:
| Score Range | Tally | Frequency | Relative Frequency |
|---|---|---|---|
| 65-74 | IIII I | 6 | 6/20 = 30% |
| 75-84 | IIII III | 8 | 8/20 = 40% |
| 85-94 | IIII | 5 | 5/20 = 25% |
| 95-100 | I | 1 | 1/20 = 5% |
| Total | | 20 | 100% |
- c) 40% of students scored between 75 and 84.
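Counting scores per interval is easy to get wrong by hand, so a short tally in code is a useful check. The sketch below treats both interval bounds as inclusive, which matches the integer scores in this problem; the names `bins`, `freq`, and `rel_freq` are illustrative.

```python
scores = [65, 72, 78, 85, 91, 68, 74, 82, 88, 95,
          71, 77, 83, 89, 73, 79, 84, 92, 76, 81]
bins = [(65, 74), (75, 84), (85, 94), (95, 100)]

# Absolute frequency: how many scores fall in each closed interval
freq = {}
for lo, hi in bins:
    freq[(lo, hi)] = sum(1 for x in scores if lo <= x <= hi)

# Relative frequency: share of all scores in each interval
rel_freq = {b: count / len(scores) for b, count in freq.items()}

for (lo, hi), count in freq.items():
    print(f"{lo}-{hi}: {count:2d}  {rel_freq[(lo, hi)]:.0%}")
```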
Problem 6: Comparing Datasets (xx)
Two sales teams’ weekly sales (in units):
Team A: \(45, 52, 48, 55, 50\)
Team B: \(30, 70, 45, 60, 45\)
- Calculate the mean for each team.
- Calculate the standard deviation for each team.
- Which team is more consistent? Why?
- Which team would you prefer to manage? Justify your answer.
Team A: \(\bar{x}_A = \frac{45+52+48+55+50}{5} = \frac{250}{5} = 50\)
Team B: \(\bar{x}_B = \frac{30+70+45+60+45}{5} = \frac{250}{5} = 50\)
Team A: Squared deviations: \((45-50)^2=25\), \((52-50)^2=4\), \((48-50)^2=4\), \((55-50)^2=25\), \((50-50)^2=0\)
\(s_A^2 = \frac{25+4+4+25+0}{4} = \frac{58}{4} = 14.5\), \(s_A = \sqrt{14.5} \approx 3.81\)
Team B: Squared deviations: \((30-50)^2=400\), \((70-50)^2=400\), \((45-50)^2=25\), \((60-50)^2=100\), \((45-50)^2=25\)
\(s_B^2 = \frac{400+400+25+100+25}{4} = \frac{950}{4} = 237.5\), \(s_B = \sqrt{237.5} \approx 15.41\)
Team A is more consistent because its standard deviation (3.81) is much lower than Team B’s (15.41).
Answers may vary. Both teams average 50 units per week, but Team A is more predictable and easier to plan around. Team B has higher highs but also lower lows, so its performance is more variable.
Problem 7: Five-Number Summary (xx)
Monthly revenue data (in thousands Euro): \(120, 145, 132, 158, 175, 142, 138, 165, 155, 148, 162, 170\)
- Find the five-number summary (Min, Q1, Median, Q3, Max).
- Calculate the IQR.
- Describe the shape of the distribution based on the five-number summary.
Sorted: \(120, 132, 138, 142, 145, 148, 155, 158, 162, 165, 170, 175\)
Five-Number Summary:
- Minimum: 120
- Q1: median of {120, 132, 138, 142, 145, 148} = \(\frac{138+142}{2} = 140\)
- Median: \(\frac{148+155}{2} = 151.5\)
- Q3: median of {155, 158, 162, 165, 170, 175} = \(\frac{162+165}{2} = 163.5\)
- Maximum: 175
IQR = Q3 - Q1 = \(163.5 - 140 = 23.5\)
Shape analysis:
- Distance from Min to Q1: \(140 - 120 = 20\)
- Distance from Q1 to Median: \(151.5 - 140 = 11.5\)
- Distance from Median to Q3: \(163.5 - 151.5 = 12\)
- Distance from Q3 to Max: \(175 - 163.5 = 11.5\)
The distribution is slightly left-skewed (longer left tail), as the distance from minimum to Q1 is larger than from Q3 to maximum.
Problem 8: Grouped Data (xxx)
Employee salaries (in thousands Euro) at a company are grouped:
| Salary Range | Frequency |
|---|---|
| 30-39 | 8 |
| 40-49 | 15 |
| 50-59 | 22 |
| 60-69 | 12 |
| 70-79 | 3 |
- Estimate the mean salary using midpoints.
- Find the modal class.
- Estimate the median class.
- Calculate the relative frequency for each class.
| Range | Midpoint (m) | Freq (f) | f × m | Rel. Freq |
|---|---|---|---|---|
| 30-39 | 34.5 | 8 | 276 | 8/60 = 13.3% |
| 40-49 | 44.5 | 15 | 667.5 | 15/60 = 25% |
| 50-59 | 54.5 | 22 | 1199 | 22/60 = 36.7% |
| 60-69 | 64.5 | 12 | 774 | 12/60 = 20% |
| 70-79 | 74.5 | 3 | 223.5 | 3/60 = 5% |
| Total | | 60 | 3140 | 100% |
Estimated mean: \(\bar{x} = \frac{3140}{60} = 52.33\) thousand Euro
Modal class: 50-59 (highest frequency of 22)
Median position: \(\frac{60+1}{2} = 30.5\)th value
Cumulative frequencies: 8, 23, 45, 57, 60
The 30.5th value falls in the 50-59 class, because the cumulative frequency first exceeds 30.5 there (it rises from 23 to 45).
Relative frequencies shown in table above.
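The grouped-data estimates follow a mechanical recipe (midpoint × frequency, then cumulative counts), which the sketch below reproduces. The tuple layout `(lower, upper, frequency)` is an illustrative choice, not something prescribed by the exercise.

```python
# Each class as (lower bound, upper bound, frequency)
classes = [(30, 39, 8), (40, 49, 15), (50, 59, 22), (60, 69, 12), (70, 79, 3)]

n = sum(f for _, _, f in classes)

# Estimated mean: weight each class midpoint by its frequency
est_mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n

# Modal class: the class with the highest frequency
modal_class = max(classes, key=lambda c: c[2])[:2]

# Median class: first class whose cumulative frequency reaches (n + 1) / 2
median_pos = (n + 1) / 2
cumulative = 0
median_class = None
for lo, hi, f in classes:
    cumulative += f
    if cumulative >= median_pos:
        median_class = (lo, hi)
        break

print(n, round(est_mean, 2), modal_class, median_class)
```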
Problem 9: Business Application (xx)
A quality control manager measures the diameter of manufactured bolts (in mm):
\(10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.02, 10.00, 9.96, 10.04, 10.01\)
Target diameter: 10.00 mm with tolerance ±0.05 mm
- Calculate the mean diameter.
- Calculate the standard deviation.
- Are all bolts within specification?
- If bolts outside tolerance are rejected, what is the reject rate?
Mean: \(\bar{x} = \frac{10.02+9.98+10.05+9.97+10.01+10.03+9.99+10.02+10.00+9.96+10.04+10.01}{12} = \frac{120.08}{12} \approx 10.007\) mm
Squared deviations from 10.007: \((0.013)^2, (-0.027)^2, (0.043)^2, (-0.037)^2, (0.003)^2, (0.023)^2, (-0.017)^2, (0.013)^2, (-0.007)^2, (-0.047)^2, (0.033)^2, (0.003)^2\)
Sum = \(0.000169 + 0.000729 + 0.001849 + 0.001369 + 0.000009 + 0.000529 + 0.000289 + 0.000169 + 0.000049 + 0.002209 + 0.001089 + 0.000009 = 0.008468\)
\(s^2 = \frac{0.008468}{11} \approx 0.00077\)
\(s = \sqrt{0.00077} \approx 0.028\) mm
Tolerance range: 9.95 to 10.05 mm (limits inclusive)
Check each value: the smallest diameter is 9.96 mm and the largest is 10.05 mm, which sits exactly on the upper limit, so all values lie within 9.95-10.05 mm.
Yes, all bolts are within specification.
Reject rate = 0% (all bolts pass)
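A specification check like this is sensitive to values that sit exactly on a limit (here 10.05 mm). The sketch below uses `decimal.Decimal` so that boundary values are compared exactly; with binary floating point, a comparison such as `abs(10.05 - 10.0) <= 0.05` can misclassify a value on the limit. Treating the limits as inclusive is an assumption consistent with the solution above.

```python
from decimal import Decimal

readings = "10.02 9.98 10.05 9.97 10.01 10.03 9.99 10.02 10.00 9.96 10.04 10.01"
diameters = [Decimal(r) for r in readings.split()]

target = Decimal("10.00")
tolerance = Decimal("0.05")

# A bolt is rejected only if it deviates from target by MORE than the tolerance
rejected = [d for d in diameters if abs(d - target) > tolerance]
reject_rate = len(rejected) / len(diameters)

print("total:", sum(diameters))
print("rejected:", rejected)
print(f"reject rate: {reject_rate:.0%}")
```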
Problem 10: Coefficient of Variation (xx)
Compare the variability of these two datasets using the coefficient of variation:
Dataset X (prices in Euro): \(50, 55, 45, 60, 40\)
Dataset Y (prices in cents): \(5000, 5500, 4500, 6000, 4000\)
- Calculate mean and standard deviation for both datasets.
- Calculate the coefficient of variation (CV = s/mean × 100%) for both.
- Which dataset has more relative variability?
Dataset X:
Mean: \(\bar{x}_X = \frac{50+55+45+60+40}{5} = 50\) Euro
Variance: \(s_X^2 = \frac{(0)^2+(5)^2+(-5)^2+(10)^2+(-10)^2}{4} = \frac{250}{4} = 62.5\)
Std Dev: \(s_X = \sqrt{62.5} \approx 7.91\) Euro
Dataset Y:
Mean: \(\bar{x}_Y = \frac{5000+5500+4500+6000+4000}{5} = 5000\) cents
Variance: \(s_Y^2 = \frac{(0)^2+(500)^2+(-500)^2+(1000)^2+(-1000)^2}{4} = \frac{2500000}{4} = 625000\)
Std Dev: \(s_Y = \sqrt{625000} \approx 790.6\) cents
Coefficient of Variation:
\(CV_X = \frac{7.91}{50} \times 100\% = 15.82\%\)
\(CV_Y = \frac{790.6}{5000} \times 100\% = 15.81\%\)
Both datasets have the same relative variability (about 15.8%). The 0.01 percentage-point difference between the two results comes only from rounding the standard deviations.
This makes sense because Dataset Y is just Dataset X expressed in different units (cents instead of Euros). The CV is unit-free, so it shows they have the same underlying variability.
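The unit-free property can be checked numerically: multiplying every value by 100 scales the mean and the standard deviation by the same factor, so their ratio is unchanged. A minimal sketch, with `cv_percent` as a hypothetical helper name:

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation in percent: sample std dev / mean * 100."""
    return stdev(values) / mean(values) * 100

x_eur = [50, 55, 45, 60, 40]
y_cents = [v * 100 for v in x_eur]   # same prices expressed in cents

cv_x = cv_percent(x_eur)
cv_y = cv_percent(y_cents)

print(round(cv_x, 2), round(cv_y, 2))
```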
Problem 11: Percentiles (xxx)
For the dataset: \(12, 15, 18, 22, 25, 28, 31, 35, 38, 42, 45, 48, 52, 55, 58, 62, 65, 68, 72, 75\)
- Find the 25th percentile (P25).
- Find the 75th percentile (P75).
- Find the 90th percentile (P90).
- If a value is at the 60th percentile, how many values are below it?
Data is sorted, n = 20 values.
P25 position: \(0.25 \times (20+1) = 5.25\)
P25 = 5th value + 0.25 × (6th - 5th) = \(25 + 0.25(28-25) = 25 + 0.75 = 25.75\)
P75 position: \(0.75 \times 21 = 15.75\)
P75 = 15th value + 0.75 × (16th - 15th) = \(58 + 0.75(62-58) = 58 + 3 = 61\)
P90 position: \(0.90 \times 21 = 18.9\)
P90 = 18th value + 0.9 × (19th - 18th) = \(68 + 0.9(72-68) = 68 + 3.6 = 71.6\)
60th percentile means 60% of values are below it.
\(0.60 \times 20 = 12\) values are below the 60th percentile.
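The position rule \(p \times (n + 1)\) with linear interpolation between neighbouring values can be sketched as follows. The function name `percentile_n_plus_1` is hypothetical, and clamping to the smallest or largest value when the position falls outside the data is an assumption the exercise does not cover.

```python
def percentile_n_plus_1(sorted_values, p):
    """Percentile p (0-100) using position p/100 * (n + 1) and linear interpolation."""
    n = len(sorted_values)
    pos = p / 100 * (n + 1)   # 1-based position in the sorted data
    k = int(pos)              # whole part: index of the lower neighbour
    frac = pos - k            # fractional part: how far towards the upper neighbour
    if k < 1:                 # assumption: clamp below the first value
        return sorted_values[0]
    if k >= n:                # assumption: clamp above the last value
        return sorted_values[-1]
    lower = sorted_values[k - 1]
    upper = sorted_values[k]
    return lower + frac * (upper - lower)

data = [12, 15, 18, 22, 25, 28, 31, 35, 38, 42,
        45, 48, 52, 55, 58, 62, 65, 68, 72, 75]

p25 = percentile_n_plus_1(data, 25)
p75 = percentile_n_plus_1(data, 75)
p90 = percentile_n_plus_1(data, 90)

print(round(p25, 2), round(p75, 2), round(p90, 2))
```

Other percentile conventions exist, so statistical software may return slightly different values depending on its default method.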
Problem 12: Comprehensive Analysis (xxxx)
A store tracks daily customer counts for 30 days:
\(42, 58, 65, 38, 71, 45, 52, 67, 55, 48, 63, 72, 44, 59, 68, 51, 56, 74, 41, 62, 49, 57, 69, 46, 54, 70, 43, 60, 66, 50\)
- Calculate all measures of central tendency (mean, median, mode).
- Calculate range, variance, standard deviation, and IQR.
- Construct the five-number summary.
- Identify any outliers using the 1.5 × IQR rule.
- Create a frequency distribution with 5 equal-width classes.
- What can you conclude about the store’s daily customer traffic?
Sorted data: \(38, 41, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 59, 60, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 74\)
Mean: \(\bar{x} = \frac{1695}{30} = 56.5\) customers
Median: Average of 15th and 16th values = \(\frac{56+57}{2} = 56.5\) customers
Mode: No mode (all values appear once)
Range: \(74 - 38 = 36\) customers
Variance: Sum of squared deviations = 3197.5, so \(s^2 = \frac{3197.5}{29} \approx 110.26\)
Standard deviation: \(s = \sqrt{110.26} \approx 10.50\) customers
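IQR (simple method: average the two values on either side of each quartile position):
Q1 (position \(0.25 \times 31 = 7.75\), between the 7th and 8th values): \(\frac{46+48}{2} = 47\)
Q3 (position \(0.75 \times 31 = 23.25\), between the 23rd and 24th values): \(\frac{66+67}{2} = 66.5\)
IQR = \(66.5 - 47 = 19.5\)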
Five-Number Summary:
- Min: 38
- Q1: 47
- Median: 56.5
- Q3: 66.5
- Max: 74
Outlier Detection:
Lower fence = \(47 - 1.5(19.5) = 47 - 29.25 = 17.75\)
Upper fence = \(66.5 + 1.5(19.5) = 66.5 + 29.25 = 95.75\)
All values are between 17.75 and 95.75. No outliers.
Frequency Distribution: Range = 36, class width = 36/5 = 7.2, rounded up to 8
| Class | Frequency | Rel. Freq. |
|---|---|---|
| 38-45 | 6 | 20% |
| 46-53 | 6 | 20% |
| 54-61 | 7 | 23.3% |
| 62-69 | 7 | 23.3% |
| 70-77 | 4 | 13.3% |
| Total | 30 | 100% |
Conclusions:
- Average daily traffic is about 56.5 customers
- Traffic is fairly consistent (CV = 10.50/56.5 ≈ 18.6%)
- The distribution is roughly symmetric (mean = median = 56.5)
- No extreme days (no outliers)
- Most days see between 46 and 69 customers (67% of days)
- The store can plan staffing around 50-65 customers with reasonable confidence
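With 30 values, hand arithmetic for the sum and the squared deviations is error-prone, so it is worth recomputing the summary statistics from the raw data. The sketch below uses only the standard library; the dictionary keys are illustrative.

```python
from statistics import mean, median, variance, stdev

counts = [42, 58, 65, 38, 71, 45, 52, 67, 55, 48,
          63, 72, 44, 59, 68, 51, 56, 74, 41, 62,
          49, 57, 69, 46, 54, 70, 43, 60, 66, 50]

summary = {
    "n": len(counts),
    "sum": sum(counts),
    "mean": mean(counts),
    "median": median(counts),
    "range": max(counts) - min(counts),
    "variance": variance(counts),   # sample variance (divides by n - 1)
    "sd": stdev(counts),            # sample standard deviation
}
summary["cv_percent"] = summary["sd"] / summary["mean"] * 100

for name, value in summary.items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```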