Tasks 07-01 - Descriptive Statistics

Section 07: Probability & Statistics

Problem 1: Measures of Central Tendency (x)

For the dataset: \(15, 22, 18, 25, 22, 19, 22, 28, 17, 22\)

  1. Calculate the mean.
  2. Find the median.
  3. Find the mode.
  4. Which measure best represents the “typical” value? Why?
  1. Mean: \(\bar{x} = \frac{15+22+18+25+22+19+22+28+17+22}{10} = \frac{210}{10} = 21\)

  2. Sorted: \(15, 17, 18, 19, 22, 22, 22, 22, 25, 28\)

    Median = \(\frac{22 + 22}{2} = 22\) (average of the 5th and 6th values)

  3. Mode = 22 (appears 4 times)

  4. The mode (22) best represents the typical value because it is the most frequent value, and it coincides with the median. The mean (21) is pulled slightly lower by the small values 15 and 17.
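
The arithmetic above can be cross-checked in a few lines. This is only a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the task.

```python
from statistics import mean, median, multimode

data = [15, 22, 18, 25, 22, 19, 22, 28, 17, 22]

data_mean = mean(data)        # sum of the values divided by their count
data_median = median(data)    # average of the two middle values, since n is even
data_modes = multimode(data)  # every value tied for the highest frequency

print(f"mean={data_mean}, median={data_median}, modes={data_modes}")
```

`multimode` returns a list, so a dataset with several equally frequent values would report all of them rather than only the first.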

Problem 2: Variance and Standard Deviation (x)

For the dataset: \(8, 12, 15, 11, 14\)

  1. Calculate the mean.
  2. Calculate the sample variance.
  3. Calculate the sample standard deviation.
  1. Mean: \(\bar{x} = \frac{8+12+15+11+14}{5} = \frac{60}{5} = 12\)

  2. Variance: Squared deviations from the mean: \((8-12)^2 = 16\), \((12-12)^2 = 0\), \((15-12)^2 = 9\), \((11-12)^2 = 1\), \((14-12)^2 = 4\)

    Sum of squared deviations: \(16 + 0 + 9 + 1 + 4 = 30\)

    Sample variance: \(s^2 = \frac{30}{5-1} = \frac{30}{4} = 7.5\)

  3. Standard deviation: \(s = \sqrt{7.5} \approx 2.74\)
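
The same three steps (mean, squared deviations, division by \(n - 1\)) can be written out directly. The sketch below assumes plain Python with no external libraries; the variable names are chosen for illustration.

```python
from math import sqrt

data = [8, 12, 15, 11, 14]
n = len(data)

x_bar = sum(data) / n                            # mean
squared_devs = [(x - x_bar) ** 2 for x in data]  # (x_i - mean)^2 for each value
sample_var = sum(squared_devs) / (n - 1)         # divide by n - 1 for a sample
sample_sd = sqrt(sample_var)

print(x_bar, squared_devs, sample_var, round(sample_sd, 2))
```

Dividing by \(n\) instead of \(n - 1\) would give the population variance \(30/5 = 6\), which is why the task asks explicitly for the sample variance.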

Problem 3: Range and IQR (x)

For the dataset: \(42, 55, 63, 48, 71, 59, 45, 67, 52, 58, 61, 49\)

  1. Find the range.
  2. Find Q1 (first quartile).
  3. Find Q3 (third quartile).
  4. Calculate the interquartile range (IQR).

Sorted data: \(42, 45, 48, 49, 52, 55, 58, 59, 61, 63, 67, 71\)

  1. Range = Max - Min = \(71 - 42 = 29\)

  2. Lower half: \(42, 45, 48, 49, 52, 55\)

    Q1 = median of lower half = \(\frac{48 + 49}{2} = 48.5\)

  3. Upper half: \(58, 59, 61, 63, 67, 71\)

    Q3 = median of upper half = \(\frac{61 + 63}{2} = 62\)

  4. IQR = Q3 - Q1 = \(62 - 48.5 = 13.5\)
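
The "median of each half" rule used here can be sketched as a small helper. The function name `quartiles_by_halves` is hypothetical, and the sketch assumes at least two values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. For an odd count the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest `half` values
    upper = ordered[-half:]   # largest `half` values
    return median(lower), median(upper)

data = [42, 55, 63, 48, 71, 59, 45, 67, 52, 58, 61, 49]
data_range = max(data) - min(data)
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

Statistical software often uses interpolation-based quartile definitions, so its output may differ slightly from these hand-calculated values.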

Problem 4: Outlier Detection (xx)

For the dataset: \(25, 28, 30, 32, 27, 29, 31, 85, 26, 30\)

  1. Calculate Q1, Q3, and IQR.
  2. Determine the lower and upper fences for outliers.
  3. Are there any outliers? If so, which value(s)?
  4. Recalculate the mean with and without outliers.

Sorted: \(25, 26, 27, 28, 29, 30, 30, 31, 32, 85\)

  1. Lower half: \(25, 26, 27, 28, 29\) → Q1 = 27

    Upper half: \(30, 30, 31, 32, 85\) → Q3 = 31

    IQR = \(31 - 27 = 4\)

  2. Lower fence = Q1 - 1.5 × IQR = \(27 - 1.5(4) = 27 - 6 = 21\)

    Upper fence = Q3 + 1.5 × IQR = \(31 + 1.5(4) = 31 + 6 = 37\)

  3. Values below 21 or above 37 are outliers. 85 is an outlier (\(85 > 37\)).

  4. With outlier: \(\bar{x} = \frac{25+26+27+28+29+30+30+31+32+85}{10} = \frac{343}{10} = 34.3\)

    Without outlier: \(\bar{x} = \frac{25+26+27+28+29+30+30+31+32}{9} = \frac{258}{9} \approx 28.67\)

    The single outlier raises the mean by \(34.3 - 28.67 = 5.63\).
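
The 1.5 × IQR rule can be sketched as follows. This assumes the same "median of each half" quartile method as the solution; all variable names are illustrative.

```python
from statistics import mean, median

data = [25, 28, 30, 32, 27, 29, 31, 85, 26, 30]
ordered = sorted(data)
half = len(ordered) // 2

q1 = median(ordered[:half])    # median of the lower half
q3 = median(ordered[-half:])   # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values strictly outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
kept = [x for x in data if lower_fence <= x <= upper_fence]

mean_with = mean(data)
mean_without = mean(kept)

print(outliers, round(mean_with, 2), round(mean_without, 2))
```

Comparing `mean_with` and `mean_without` shows how strongly a single extreme value can shift the mean, while Q1 and Q3 are unaffected by how far out that value lies.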

Problem 5: Frequency Distribution (x)

Test scores for 20 students: \(65, 72, 78, 85, 91, 68, 74, 82, 88, 95, 71, 77, 83, 89, 73, 79, 84, 92, 76, 81\)

  1. Create a frequency table using intervals: 65-74, 75-84, 85-94, 95-100
  2. Calculate the relative frequency for each interval.
  3. What percentage of students scored between 75 and 84?
  1. and 2. Frequency table:

| Score Range | Tally | Frequency | Relative Frequency |
|---|---|---|---|
| 65-74 | IIII I | 6 | 6/20 = 30% |
| 75-84 | IIII III | 8 | 8/20 = 40% |
| 85-94 | IIII | 5 | 5/20 = 25% |
| 95-100 | I | 1 | 1/20 = 5% |
| Total | | 20 | 100% |

  3. 40% of students scored between 75 and 84.
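
Counting scores per interval is easy to automate. The sketch below assumes closed intervals (both endpoints included), as in the task; the dictionary layout is just one possible choice.

```python
scores = [65, 72, 78, 85, 91, 68, 74, 82, 88, 95,
          71, 77, 83, 89, 73, 79, 84, 92, 76, 81]
intervals = [(65, 74), (75, 84), (85, 94), (95, 100)]

# Count the scores falling in each closed interval [low, high].
frequency = {
    f"{low}-{high}": sum(low <= s <= high for s in scores)
    for low, high in intervals
}
relative = {label: count / len(scores) for label, count in frequency.items()}

for label, count in frequency.items():
    print(f"{label:>6}  {count:>2}  {relative[label]:.0%}")
```

Because the intervals do not overlap and cover every score, the frequencies add up to the number of students and the relative frequencies add up to 100%.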

Problem 6: Comparing Datasets (xx)

Two sales teams’ weekly sales (in units):

Team A: \(45, 52, 48, 55, 50\)

Team B: \(30, 70, 45, 60, 45\)

  1. Calculate the mean for each team.
  2. Calculate the standard deviation for each team.
  3. Which team is more consistent? Why?
  4. Which team would you prefer to manage? Justify your answer.
  1. Team A: \(\bar{x}_A = \frac{45+52+48+55+50}{5} = \frac{250}{5} = 50\)

    Team B: \(\bar{x}_B = \frac{30+70+45+60+45}{5} = \frac{250}{5} = 50\)

  2. Team A: Squared deviations: \((45-50)^2=25\), \((52-50)^2=4\), \((48-50)^2=4\), \((55-50)^2=25\), \((50-50)^2=0\)

    \(s_A^2 = \frac{25+4+4+25+0}{4} = \frac{58}{4} = 14.5\)

    \(s_A = \sqrt{14.5} \approx 3.81\)

    Team B: Squared deviations: \((30-50)^2=400\), \((70-50)^2=400\), \((45-50)^2=25\), \((60-50)^2=100\), \((45-50)^2=25\)

    \(s_B^2 = \frac{400+400+25+100+25}{4} = \frac{950}{4} = 237.5\)

    \(s_B = \sqrt{237.5} \approx 15.41\)

  3. Team A is more consistent because its standard deviation (3.81) is much lower than Team B’s (15.41).

  4. Answers may vary. Both teams average 50 units per week, so the choice comes down to variability: Team A is more predictable and easier to plan around, while Team B has higher highs but also lower lows.
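
A short sketch of the comparison, assuming Python's standard `statistics` module, where `stdev` is the sample standard deviation. The variable names are illustrative.

```python
from statistics import mean, stdev

team_a = [45, 52, 48, 55, 50]
team_b = [30, 70, 45, 60, 45]

for name, sales in (("A", team_a), ("B", team_b)):
    # stdev() divides by n - 1, matching the hand calculation
    print(f"Team {name}: mean={mean(sales)}, s={stdev(sales):.2f}")

# With equal means, the smaller standard deviation marks the more consistent team.
more_consistent = "A" if stdev(team_a) < stdev(team_b) else "B"
print("More consistent:", more_consistent)
```

Comparing standard deviations directly is reasonable here only because the two means are equal; with different means, a relative measure such as the coefficient of variation would be the safer choice.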

Problem 7: Five-Number Summary (xx)

Monthly revenue data (in thousands Euro): \(120, 145, 132, 158, 175, 142, 138, 165, 155, 148, 162, 170\)

  1. Find the five-number summary (Min, Q1, Median, Q3, Max).
  2. Calculate the IQR.
  3. Describe the shape of the distribution based on the five-number summary.

Sorted: \(120, 132, 138, 142, 145, 148, 155, 158, 162, 165, 170, 175\)

  1. Five-Number Summary:

    • Minimum: 120
    • Q1: median of {120, 132, 138, 142, 145, 148} = \(\frac{138+142}{2} = 140\)
    • Median: \(\frac{148+155}{2} = 151.5\)
    • Q3: median of {155, 158, 162, 165, 170, 175} = \(\frac{162+165}{2} = 163.5\)
    • Maximum: 175
  2. IQR = Q3 - Q1 = \(163.5 - 140 = 23.5\)

  3. Shape analysis:

    • Distance from Min to Q1: \(140 - 120 = 20\)
    • Distance from Q1 to Median: \(151.5 - 140 = 11.5\)
    • Distance from Median to Q3: \(163.5 - 151.5 = 12\)
    • Distance from Q3 to Max: \(175 - 163.5 = 11.5\)

    The distribution is slightly left-skewed (longer left tail), as the distance from minimum to Q1 is larger than from Q3 to maximum.
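
The five-number summary and the four segment widths used in the shape analysis can be sketched like this. It assumes the "median of each half" quartile rule; the dictionary keys are illustrative.

```python
from statistics import median

revenue = [120, 145, 132, 158, 175, 142, 138, 165, 155, 148, 162, 170]
ordered = sorted(revenue)
half = len(ordered) // 2

summary = {
    "min": ordered[0],
    "q1": median(ordered[:half]),
    "median": median(ordered),
    "q3": median(ordered[-half:]),
    "max": ordered[-1],
}
iqr = summary["q3"] - summary["q1"]

# Widths of the four segments between consecutive summary points.
points = list(summary.values())
gaps = [b - a for a, b in zip(points, points[1:])]

print(summary, iqr, gaps)
```

The first entry of `gaps` (minimum to Q1) being clearly larger than the last (Q3 to maximum) is the numerical basis for calling the left tail longer.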

Problem 8: Grouped Data (xxx)

Employee salaries (in thousands Euro) at a company are grouped:

| Salary Range | Frequency |
|---|---|
| 30-39 | 8 |
| 40-49 | 15 |
| 50-59 | 22 |
| 60-69 | 12 |
| 70-79 | 3 |

  1. Estimate the mean salary using midpoints.
  2. Find the modal class.
  3. Estimate the median class.
  4. Calculate the relative frequency for each class.

| Range | Midpoint (m) | Freq (f) | f × m | Rel. Freq |
|---|---|---|---|---|
| 30-39 | 34.5 | 8 | 276 | 8/60 = 13.3% |
| 40-49 | 44.5 | 15 | 667.5 | 15/60 = 25% |
| 50-59 | 54.5 | 22 | 1199 | 22/60 = 36.7% |
| 60-69 | 64.5 | 12 | 774 | 12/60 = 20% |
| 70-79 | 74.5 | 3 | 223.5 | 3/60 = 5% |
| Total | | 60 | 3140 | 100% |

  1. Estimated mean: \(\bar{x} = \frac{3140}{60} = 52.33\) thousand Euro

  2. Modal class: 50-59 (highest frequency of 22)

  3. Median position: \(\frac{60+1}{2} = 30.5\)th value

    Cumulative frequencies: 8, 23, 45, 57, 60

    The 30.5th value falls in the 50-59 class, the first class whose cumulative frequency (45) reaches 30.5.

  4. Relative frequencies shown in table above.
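
The grouped-data estimates can be reproduced with a short sketch. It assumes each class is represented by the midpoint of its stated limits, as in the table; the `classes` structure is a hypothetical layout chosen for this example.

```python
classes = [((30, 39), 8), ((40, 49), 15), ((50, 59), 22),
           ((60, 69), 12), ((70, 79), 3)]

n = sum(freq for _, freq in classes)

# Each class is represented by its midpoint when estimating the mean.
total_fm = sum((low + high) / 2 * freq for (low, high), freq in classes)
estimated_mean = total_fm / n

# Modal class: the class with the highest frequency.
modal_class = max(classes, key=lambda c: c[1])[0]

# Median class: first class whose cumulative frequency reaches (n + 1) / 2.
median_position = (n + 1) / 2
cumulative = 0
median_class = None
for bounds, freq in classes:
    cumulative += freq
    if cumulative >= median_position:
        median_class = bounds
        break

print(round(estimated_mean, 2), modal_class, median_class)
```

The result is an estimate only: the individual salaries inside each class are unknown, so the true mean could differ from the midpoint-based value.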

Problem 9: Business Application (xx)

A quality control manager measures the diameter of manufactured bolts (in mm):

\(10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.02, 10.00, 9.96, 10.04, 10.01\)

Target diameter: 10.00 mm with tolerance ±0.05 mm

  1. Calculate the mean diameter.
  2. Calculate the standard deviation.
  3. Are all bolts within specification?
  4. If bolts outside tolerance are rejected, what is the reject rate?
  1. Mean: \(\bar{x} = \frac{10.02+9.98+10.05+9.97+10.01+10.03+9.99+10.02+10.00+9.96+10.04+10.01}{12}\) \(= \frac{120.08}{12} \approx 10.007\) mm

  2. Squared deviations from the rounded mean 10.007: \((0.013)^2, (-0.027)^2, (0.043)^2, (-0.037)^2, (0.003)^2, (0.023)^2, (-0.017)^2, (0.013)^2, (-0.007)^2, (-0.047)^2, (0.033)^2, (0.003)^2\)

    Sum = \(0.000169 + 0.000729 + 0.001849 + 0.001369 + 0.000009 + 0.000529 + 0.000289 + 0.000169 + 0.000049 + 0.002209 + 0.001089 + 0.000009 = 0.008468\)

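    \(s^2 = \frac{0.008468}{11} \approx 0.00077\)

    \(s = \sqrt{0.00077} \approx 0.028\) mm

  3. Tolerance range: 9.95 to 10.05 mm

    Check each value: the smallest diameter is 9.96 mm and the largest is 10.05 mm, which sits exactly on the upper limit. Treating the limits as inclusive, all bolts are within specification.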

  4. Reject rate = 0% (all bolts pass)
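
A sketch of the tolerance check follows. It assumes the specification limits are inclusive, and it uses `Decimal` for the comparison so that a reading of exactly 10.05 is not misjudged because of binary floating-point rounding. The variable names are illustrative.

```python
from decimal import Decimal
from statistics import mean, stdev

readings = ["10.02", "9.98", "10.05", "9.97", "10.01", "10.03",
            "9.99", "10.02", "10.00", "9.96", "10.04", "10.01"]
diameters = [Decimal(r) for r in readings]

target = Decimal("10.00")
tolerance = Decimal("0.05")

# A bolt is rejected only if it deviates from the target by MORE than the tolerance.
rejected = [d for d in diameters if abs(d - target) > tolerance]
reject_rate = len(rejected) / len(diameters)

# Summary statistics in ordinary floats (sample standard deviation).
values = [float(d) for d in diameters]
mean_mm = mean(values)
sd_mm = stdev(values)

print(f"mean={mean_mm:.3f} mm, s={sd_mm:.3f} mm, reject rate={reject_rate:.0%}")
```

If the limits were instead treated as exclusive, the comparison would become `>=` and the bolt measuring 10.05 mm would be rejected, so the convention matters for boundary values.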

Problem 10: Coefficient of Variation (xx)

Compare the variability of these two datasets using the coefficient of variation:

Dataset X (prices in Euro): \(50, 55, 45, 60, 40\)

Dataset Y (prices in cents): \(5000, 5500, 4500, 6000, 4000\)

  1. Calculate mean and standard deviation for both datasets.
  2. Calculate the coefficient of variation (CV = s/mean × 100%) for both.
  3. Which dataset has more relative variability?
  1. Dataset X:

    Mean: \(\bar{x}_X = \frac{50+55+45+60+40}{5} = 50\) Euro

    Variance: \(s_X^2 = \frac{(0)^2+(5)^2+(-5)^2+(10)^2+(-10)^2}{4} = \frac{250}{4} = 62.5\)

    Std Dev: \(s_X = \sqrt{62.5} \approx 7.91\) Euro

    Dataset Y:

    Mean: \(\bar{x}_Y = \frac{5000+5500+4500+6000+4000}{5} = 5000\) cents

    Variance: \(s_Y^2 = \frac{(0)^2+(500)^2+(-500)^2+(1000)^2+(-1000)^2}{4} = \frac{2500000}{4} = 625000\)

    Std Dev: \(s_Y = \sqrt{625000} \approx 790.6\) cents

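  2. Coefficient of Variation:

    \(CV_X = \frac{7.91}{50} \times 100\% = 15.82\%\)

    \(CV_Y = \frac{790.6}{5000} \times 100\% = 15.81\%\)

  3. Both datasets have the same relative variability (about 15.8%). The 0.01-point gap between the two results comes only from rounding \(s_X\) to 7.91; with unrounded standard deviations both CVs equal 15.81%.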

    This makes sense because Dataset Y is just Dataset X expressed in different units (cents instead of Euros). The CV is unit-free, so it shows they have the same underlying variability.
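
The unit-independence of the CV can be checked numerically. The helper `coefficient_of_variation` below is a hypothetical name for this sketch, which uses the sample standard deviation from Python's `statistics` module.

```python
from math import isclose
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a unit-free ratio)."""
    return stdev(values) / mean(values)

prices_eur = [50, 55, 45, 60, 40]
prices_cents = [100 * p for p in prices_eur]   # same prices, different unit

cv_x = coefficient_of_variation(prices_eur)
cv_y = coefficient_of_variation(prices_cents)

print(f"CV_X = {cv_x:.2%}, CV_Y = {cv_y:.2%}")
print("equal up to floating-point error:", isclose(cv_x, cv_y))
```

Multiplying every value by 100 multiplies both the mean and the standard deviation by 100, so the factor cancels in the ratio.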

Problem 11: Percentiles (xxx)

For the dataset: \(12, 15, 18, 22, 25, 28, 31, 35, 38, 42, 45, 48, 52, 55, 58, 62, 65, 68, 72, 75\)

  1. Find the 25th percentile (P25).
  2. Find the 75th percentile (P75).
  3. Find the 90th percentile (P90).
  4. If a value is at the 60th percentile, how many values are below it?

Data is sorted, n = 20 values.

  1. P25 position: \(0.25 \times (20+1) = 5.25\)

    P25 = 5th value + 0.25 × (6th - 5th) = \(25 + 0.25(28-25) = 25 + 0.75 = 25.75\)

  2. P75 position: \(0.75 \times 21 = 15.75\)

    P75 = 15th value + 0.75 × (16th - 15th) = \(58 + 0.75(62-58) = 58 + 3 = 61\)

  3. P90 position: \(0.90 \times 21 = 18.9\)

    P90 = 18th value + 0.9 × (19th - 18th) = \(68 + 0.9(72-68) = 68 + 3.6 = 71.6\)

  4. 60th percentile means 60% of values are below it. \(0.60 \times 20 = 12\) values are below the 60th percentile.
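
The position rule \(p \times (n+1)\) with linear interpolation, as used in parts 1 to 3, can be sketched as a small function. The name `percentile` and the clamping at the ends are assumptions of this sketch, not part of the task.

```python
def percentile(sorted_values, percent):
    """Percentile via the position rule percent/100 * (n + 1) with linear interpolation.

    `sorted_values` must already be in ascending order. Positions that fall
    outside 1..n are clamped to the smallest or largest value.
    """
    n = len(sorted_values)
    position = percent * (n + 1) / 100   # 1-based, may be fractional
    k = int(position)                    # rank of the value just below the position
    fraction = position - k
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    below, above = sorted_values[k - 1], sorted_values[k]
    return below + fraction * (above - below)

data = [12, 15, 18, 22, 25, 28, 31, 35, 38, 42,
        45, 48, 52, 55, 58, 62, 65, 68, 72, 75]

p25 = percentile(data, 25)
p75 = percentile(data, 75)
p90 = percentile(data, 90)

print(p25, p75, round(p90, 2))
```

Other percentile conventions exist (for example, positions based on \(p \times (n-1) + 1\)), so results from statistical software may differ slightly from this rule.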

Problem 12: Comprehensive Analysis (xxxx)

A store tracks daily customer counts for 30 days:

\(42, 58, 65, 38, 71, 45, 52, 67, 55, 48, 63, 72, 44, 59, 68, 51, 56, 74, 41, 62, 49, 57, 69, 46, 54, 70, 43, 60, 66, 50\)

  1. Calculate all measures of central tendency (mean, median, mode).
  2. Calculate range, variance, standard deviation, and IQR.
  3. Construct the five-number summary.
  4. Identify any outliers using the 1.5 × IQR rule.
  5. Create a frequency distribution with 5 equal-width classes.
  6. What can you conclude about the store’s daily customer traffic?

Sorted data: \(38, 41, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 59, 60, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 74\)

  1. Mean: \(\bar{x} = \frac{1695}{30} = 56.5\) customers

    Median: Average of 15th and 16th values = \(\frac{56+57}{2} = 56.5\) customers

    Mode: No mode (all values appear once)

  2. Range: \(74 - 38 = 36\) customers

    Variance: Sum of squared deviations = 3197.5, so \(s^2 = \frac{3197.5}{29} \approx 110.26\)

    Standard deviation: \(s = \sqrt{110.26} \approx 10.50\) customers

    IQR (simple method, averaging the two values on either side of each quartile position):

    Q1 (position \(0.25 \times 31 = 7.75\), between the 7th and 8th values): \(\frac{46+48}{2} = 47\)

    Q3 (position \(0.75 \times 31 = 23.25\), between the 23rd and 24th values): \(\frac{66+67}{2} = 66.5\)

    IQR = \(66.5 - 47 = 19.5\)

  3. Five-Number Summary:

    • Min: 38
    • Q1: 47
    • Median: 56.5
    • Q3: 66.5
    • Max: 74
  4. Outlier Detection:

    Lower fence = \(47 - 1.5(19.5) = 47 - 29.25 = 17.75\)

    Upper fence = \(66.5 + 1.5(19.5) = 66.5 + 29.25 = 95.75\)

    All values are between 17.75 and 95.75. No outliers.

  5. Frequency Distribution: Range = 36, so class width = 36/5 = 7.2, rounded up to 8

    | Class | Frequency | Rel. Freq. |
    |---|---|---|
    | 38-45 | 6 | 20% |
    | 46-53 | 6 | 20% |
    | 54-61 | 7 | 23.3% |
    | 62-69 | 7 | 23.3% |
    | 70-77 | 4 | 13.3% |
    | Total | 30 | 100% |

  6. Conclusions:

    • Average daily traffic is about 56.5 customers
    • Traffic is fairly consistent (CV = 10.50/56.5 ≈ 18.6%)
    • The distribution is roughly symmetric (mean = median = 56.5)
    • No extreme days (no outliers)
    • Most days see between 46 and 69 customers (67% of days)
    • The store can plan staffing around 50-65 customers with reasonable confidence
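
With 30 values, hand arithmetic is error-prone, so it can help to recompute the key figures from the raw counts. The sketch below assumes Python's standard `statistics` module and five classes of width 8 starting at the minimum; all variable names are illustrative.

```python
from statistics import mean, median, multimode, variance, stdev

counts = [42, 58, 65, 38, 71, 45, 52, 67, 55, 48,
          63, 72, 44, 59, 68, 51, 56, 74, 41, 62,
          49, 57, 69, 46, 54, 70, 43, 60, 66, 50]

n = len(counts)
total = sum(counts)
avg = mean(counts)
mid = median(counts)
modes = multimode(counts)          # contains every value if nothing repeats
has_mode = len(modes) < n

sample_var = variance(counts)      # divides by n - 1
sample_sd = stdev(counts)
cv = sample_sd / avg

# Five classes of width 8 starting at the minimum: 38-45, 46-53, ...
width = 8
start = min(counts)
class_counts = {}
for i in range(5):
    low = start + i * width
    high = low + width - 1
    class_counts[f"{low}-{high}"] = sum(low <= c <= high for c in counts)

print(f"n={n}, sum={total}, mean={avg}, median={mid}, has_mode={has_mode}")
print(f"s^2={sample_var:.2f}, s={sample_sd:.2f}, CV={cv:.1%}")
print(class_counts)
```

Printing the sum and the count separately makes it easy to spot a slip in the mean before it propagates into the variance and the coefficient of variation.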