Chapter 4. Measure of Central Tendency and Dispersion


4.1 Measure of Central Tendency – Mean and Median

4.2 Measure of Dispersion – Standard Deviation

4.3 Covariance and Correlation Coefficient

4.4 Exercise


In the case of quantitative data, the central tendency and dispersion of the data are measured and analyzed.
- Central tendency : mean / median
- Dispersion : variance / standard deviation
The relationship between two quantitative variables is measured using the covariance and the correlation coefficient.

4.1 Measure of Central Tendency – Mean and Median

⭐ Think

The data obtained by taking a sample of 5 middle school students and surveying their weight are as follows:

(Data 4.1) Weights of five middle school students (kg)

63 60 65 55 77

💎 Explore

1) What kind of graph is used to find a representative value of these data?
2) What would be a representative value for the weight of 5 students?

The average (or mean) is a measure of central tendency of quantitative data and is widely used as a representative value for the data. The mean is the sum of all data values divided by the number of data, and it corresponds to the center of gravity of the data. The mean is denoted by $\mu$ (read mu), and the mean of (Data 4.1) is obtained as follows: $$ \text{Mean} = \mu = \frac{63 + 60 + 65 + 55 + 77}{5} = \frac{320}{5} = 64 $$ When $n$ data values are expressed as $x_1 , x_2 , ... , x_n $, the mean is given by the following formula.

$$ \small \mu = \frac{1}{n} \sum_{i=1}^n x_i $$
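As a minimal sketch of the formula above, the mean of (Data 4.1) can be computed in a few lines of Python. The function name `mean` and the variable names are illustrative choices, not part of 『eStatM』.

```python
# Weights (kg) of the five students in (Data 4.1)
weights = [63, 60, 65, 55, 77]

def mean(values):
    """Return the sum of the values divided by how many there are."""
    return sum(values) / len(values)

mu = mean(weights)
print(mu)  # 64.0
```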

In general, the mean is very appropriate as a representative value of the data, but when the data contain a very large or very small value, the mean is greatly affected by this extreme value. In this case, the median can be used. The median is the value in the middle when the data are sorted in order. In (Data 4.1), the number of data is 5, which is an odd number, so the median is the 3rd (= $\frac{n+1}{2}$) value when the data are sorted in ascending order:

(Data 4.1) is sorted in ascending order.

$\qquad$55 60 63 65 77

Median is the 3rd number in these sorted data which is 63.

If the number of data is 6 which is an even number, how do we find the median? In this case, the median of the data is calculated as the average of the 3rd (=$\frac{n}{2}$ ) and 4th (=$\frac{n+2}{2}$ ) of the sorted data.

Generally, a median is expressed as $m$, and, if the number of data is $n$ , it is calculated as follows:

1) Data are sorted in ascending order.
2) Check whether the number of data is an odd number or an even number.
3) If $n$ is odd, $m$ = $ (\frac{n+1}{2})^{th}$ data in sorted data.
$\quad \;$If $n$ is even, $m$ = Average of $(\frac{n}{2})^{th}$ data and $(\frac{n+2}{2})^{th}$ data in sorted data.
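The three steps above can be sketched in Python as follows. The function name `median` is an illustrative choice, and the sixth weight of 70 kg is a hypothetical value added only to show the even case; it is not part of (Data 4.1).

```python
def median(values):
    """Median computed by the three steps: sort, check parity, pick the middle."""
    ordered = sorted(values)               # step 1: sort in ascending order
    n = len(ordered)
    if n % 2 == 1:                         # step 2: is n odd or even?
        return ordered[(n + 1) // 2 - 1]   # step 3 (odd): the (n+1)/2-th value
    lower = ordered[n // 2 - 1]            # step 3 (even): the (n/2)-th value
    upper = ordered[(n + 2) // 2 - 1]      #                and the ((n+2)/2)-th value
    return (lower + upper) / 2

weights = [63, 60, 65, 55, 77]             # (Data 4.1), n = 5 (odd)
print(median(weights))                     # 63

weights_even = weights + [70]              # hypothetical sixth value, n = 6 (even)
print(median(weights_even))                # (63 + 65) / 2 = 64.0
```

The `- 1` in each index only converts the 1-based positions used in the text to Python's 0-based list positions.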

In order to see the overall distribution of the weight data, a stem and leaf plot or a histogram discussed in Chapter 3 can be considered, but a dot graph is more useful. In a dot graph, after obtaining the minimum and maximum values of data, the position of each data is calculated on the horizontal axis, and displayed as a dot.

<Figure 4.1> is a dot graph for (Data 4.1). Each data value is displayed as a dot, positioned in proportion between the minimum value of 55 and the maximum value of 77. The green line is the mean and the red line is the median. In this data, the mean (64) is located slightly to the right of the median (63) because the value 77 lies far to the right of the other four values. That is, the mean is more sensitive to an extreme value than the median.

<Figure 4.1> Dot graph of weight data

If there are lots of data, it is time-consuming and difficult to obtain the mean and median manually as above. Let's find the representative value of the data using 『eStatM』 software.

🎲 Practice 4.1

Using 『eStatM』, draw a dot graph for the weights of 5 students (Data 4.1) and find the mean and median.

Solution

Enter students' weight data in 'Data input'. (You can also copy and paste the data from the e-book)

When data are entered, the number of data, minimum, maximum, mean, and median are calculated immediately. If you click the [Execute] button, a dot graph appears as shown in <Figure 4.1> and the mean and median values are displayed.

<Figure 4.3> Simulation window to see a change in mean and median if you move a point

Bottom graph of <Figure 4.3> is a simulation window. In this simulation window, you can move a point with the mouse to see the changes in the mean and median. For example, if you drag the rightmost point and move it to the right, the mean changes but the median does not. That is, the median is not affected by the extreme points.
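The effect of dragging the rightmost point can also be imitated numerically. The sketch below uses Python's standard `statistics` module; the positions 90 and 120 are hypothetical values chosen only to mimic moving the point 77 further to the right.

```python
from statistics import mean, median

base = [55, 60, 63, 65]            # the four values of (Data 4.1) that stay fixed
results = []
for rightmost in (77, 90, 120):    # 77 is the original value; 90 and 120 are hypothetical
    data = base + [rightmost]
    results.append((rightmost, mean(data), median(data)))

for rightmost, mu, m in results:
    print(f"rightmost={rightmost}: mean={mu}, median={m}")
```

In every case the median stays at 63, while the mean grows as the rightmost value moves away from the rest of the data.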

🎲 Practice 4.2

Using 『eStatM』, let's find the mean and median of the daily minimum temperature ([Practice 3.2]) in Seoul in February (Data 3.2).

(Data 3.2) Daily minimum temperature in February 2021 in Seoul (unit: degrees Celsius)

-2.3 -8.2 -9.4 -7.4 -4.4 4.3 -2.6 5.4 -6.1 -1.5 1.3 0.6 1.0 6.4 -5.2 -7.0 -10.4 -10.6 -7.1 5.5
4.7 0.4 -3.1 -3.0 0.7 0.5 4.3 3.2

Solution

If you select ‘Dot Graph – Mean/Standard Deviation’ from the 『eStatM』 menu using the QR on the left, the data input window as shown in <Figure 4.4> appears.

<Figure 4.4> Temperature data input for a dot graph

When the daily minimum temperature data are entered in 'Data input' (you can copy and paste the data from the e-book), as shown in <Figure 4.4>, it shows immediately that the number of data is 28, the mean is -1.79, the median is -1.90, the minimum is -10.6, and the maximum is 6.4 degrees.

If you click the [Execute] button, a dot graph as shown in <Figure 4.5> appears and the mean ($\mu$) and median ($m$) are displayed. Below this dot graph, a simulation window appears where you can move a point with the mouse and observe the changes in the mean and median.

<Figure 4.5> Dot graph of daily minimum temperature and its simulation window

Looking at this dot graph, it can be seen that there is almost no difference between the mean and the median because there are no extreme values.
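For readers who want to check the summary values without 『eStatM』, the following sketch recomputes them from (Data 3.2) with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

# (Data 3.2) Daily minimum temperature in February 2021 in Seoul (degrees Celsius)
temps = [-2.3, -8.2, -9.4, -7.4, -4.4, 4.3, -2.6, 5.4, -6.1, -1.5,
         1.3, 0.6, 1.0, 6.4, -5.2, -7.0, -10.4, -10.6, -7.1, 5.5,
         4.7, 0.4, -3.1, -3.0, 0.7, 0.5, 4.3, 3.2]

n = len(temps)
mu = round(mean(temps), 2)
m = round(median(temps), 2)

print(n, mu, m, min(temps), max(temps))  # 28 -1.79 -1.9 -10.6 6.4
```

Because the number of data is even (28), the median is the average of the 14th and 15th sorted values, -2.3 and -1.5.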

⏱ Exercise 4.1

The following is data on the length of bicycle-only roads by 25 administrative districts in Seoul as of 2019 ([Exercise 3.1]). Use 『eStatM』 to draw a dot graph, and to find and analyze the representative values of data.

(Data 3.3) Length of bicycle-only roads by 25 administrative districts in Seoul as of 2019 (unit: km)

24 15 23 20 30 24 7 8 7 12 28 27 19 35 41 42 11 8 37 13 20 29 53 93 42

⏱ Exercise 4.2

The following is data on the maximum wind speed of typhoons that passed through Korea in 2020 ([Exercise 3.2]). Use 『eStatM』 to draw a dot graph, and to find and analyze the representative values of data.

(Data 3.4) Maximum wind speed of typhoons that passed through Korea in 2020 (unit: m/sec)

40 22 21 29 19 22 24 45 49 55 24 27 29 35 19 24 35 40 56 24 21 43 18

A. Calculation of mean using frequency table

⭐ Think

Assume that a frequency table of the academic achievement test scores of a middle school class is given as follows:

[Table 4.1] Frequency table of the academic achievement test scores of a middle school

Class	Number of data
60≤ ~ <70	2
70 ~ 80	5
80 ~ 90	10
90 ~ 100	3
Total	20

💎 Explore

How do we find the mean using this frequency table?

When a frequency table is given rather than the raw data, the mean can be obtained approximately as follows using the middle values of each class interval.

First, find the middle value of each class interval. Then, it is assumed that each class contains as many copies of its middle value as its frequency, and the mean is obtained using this approximated data.

[Table 4.2] Approximated data using the middle value of each class interval in the academic achievement test scores

Class	Middle value	Frequency	Approximated data
60≤ ~ <70	65	2	65 65
70 ~ 80	75	5	75 75 75 75 75
80 ~ 90	85	10	85 85 85 85 85 85 85 85 85 85
90 ~ 100	95	3	95 95 95
Total		20

Mean is calculated as follows: $$ \small \begin{align} \text{Mean} &= \frac{65+65+75+75+75+75+75+85+85+85+85+85+85+85+85+85+85+95+95+95}{20} \\ &= \frac{65 \times 2 + 75 \times 5 + 85 \times 10 + 95 \times 3} {20} \\ &= \frac{1640}{20} = 82 \end{align} $$
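The same approximation can be written as a short Python sketch. The list `classes` simply restates [Table 4.1] as (lower bound, upper bound, frequency) triples; the names are illustrative.

```python
# [Table 4.1] as (lower bound, upper bound, frequency)
classes = [(60, 70, 2), (70, 80, 5), (80, 90, 10), (90, 100, 3)]

total = sum(freq for _, _, freq in classes)

# Each class contributes its middle value once per observation in the class
weighted_sum = sum((low + high) / 2 * freq for low, high, freq in classes)

grouped_mean = weighted_sum / total
print(total, weighted_sum, grouped_mean)  # 20 1640.0 82.0
```

Multiplying each middle value by its frequency avoids writing out all 20 approximated values, exactly as in the second line of the calculation above.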

Using ‘Frequency Distribution Polygon - Relative Frequency Comparison’ of 『eStatM』, the approximate mean of the frequency table can be obtained as shown in <Figure 4.6>. After entering the left value of the class interval and ‘Frequency 1’, click the [Execute] button.

<Figure 4.6> Mean calculation using a frequency table

4.2 Measure of Dispersion – Standard Deviation

⭐ Think

The quiz scores (out of 10) of five middle school students are as follows:

(Data 4.2) The quiz scores (out of 10) of five middle school students

6 8 7 4 10

💎 Explore

Is there a way to measure how scattered these data are?

The degree to which data are scattered is called dispersion. A simple measure of dispersion is the range, which is the maximum minus the minimum. $$ \text{Range} = \text{Maximum} - \text{Minimum} $$ In (Data 4.2), the maximum value is 10 and the minimum value is 4, so the range is 6. $$ \text{Range} = 10 - 4 = 6 $$

Since the range is too sensitive to extreme values, a variance or a standard deviation is generally used to measure the dispersion. The variance is obtained by squaring the distance between each data value and the mean, summing the squared distances, and dividing the sum by the number of data. Therefore, when the data are scattered far from the mean, the variance is large, and when the data are clustered around the mean, the variance is small. The variance is denoted by $\sigma^2$ (read as sigma squared).

The mean of the data in (Data 4.2) is as follows: $$ \text{Mean} \quad \mu ~=~ \frac{6+8+7+4+10}{5} ~=~ \frac{35}{5} ~=~ 7 $$

The variance is calculated by squaring the distance from the mean to each data value, summing these squares, and dividing the sum by the number of data. That is, it is the average of squared distances. $$ \begin{align} \text{Variance} \quad \sigma^{2} &~=~ \frac{ (6-7)^2 + (8-7)^2 + (7-7)^2 + (4-7)^2 + (10-7)^2} {5} \\ &~=~ \frac{20}{5} ~=~ 4 \end{align} $$ When $n$ data values are expressed as $x_1 , x_2 , ... , x_n$ and the mean is expressed as $\mu$, the variance can be expressed by the following formula. $$ \begin{align} \text{Variance} \quad \sigma^{2} ~=~ { {1 \over n} {\sum _{i=1} ^{n} (x_{i} - \mu )^{2}} } ~~~~ (n:~\text{number of data}) \\ \end{align} $$

The standard deviation is defined as the square root of the variance and denoted by $\sigma$. The variance is not easy to interpret practically because it is the average of the squared distances, but the standard deviation is the square root of the variance, so it is in the same unit as the data and can be interpreted as a typical distance between each data value and the mean. $$ \text{Standard deviation} \quad \sigma ~=~ \sqrt{\sigma^2} \\ $$ The standard deviation of (Data 4.2) is $\sigma$ = $\sqrt{\sigma^2}$ = $\sqrt{4}$ = 2.
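The variance and standard deviation of (Data 4.2) can be checked with the short sketch below, which follows the formulas of this section step by step. The variable names are illustrative.

```python
import math
from statistics import pvariance, pstdev

scores = [6, 8, 7, 4, 10]          # (Data 4.2) quiz scores

n = len(scores)
mu = sum(scores) / n               # mean: 7.0

# Average of the squared distances from the mean (divide by n)
variance = sum((x - mu) ** 2 for x in scores) / n
sigma = math.sqrt(variance)

print(mu, variance, sigma)         # 7.0 4.0 2.0

# The standard library's "population" functions use the same divide-by-n definition
print(pvariance(scores), pstdev(scores))
```

Note that `statistics.variance` and `statistics.stdev` divide by $n-1$ instead of $n$, so they give slightly larger values than the definition used in this chapter.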

🎲 Practice 4.3

Using 『eStatM』, draw a dot graph for the quiz scores of 5 sample students (Data 4.2) and find the mean and standard deviation.

Solution

Select ‘Dot Graph – Mean / Standard Deviation’ from the 『eStatM』 menu. Then a window like <Figure 4.7> appears.
Enter students' quiz scores in 'Data input'. (You can also copy and paste the material from the e-book)

When the data are entered, the number of data, minimum, maximum, mean, median, variance and standard deviation are calculated. If you click the [Execute] button, a dot graph as shown in <Figure 4.8> appears and the mean, median, standard deviation, and a line of mean $\pm$ standard deviation are displayed.

Using the simulation window below the figure, you can check the change in the standard deviation by moving a data point with the mouse. The standard deviation is also affected by an extreme point.

<Figure 4.8> Dot graph with a line of mean $\pm$ standard deviation

🎲 Practice 4.4

Using 『eStatM』, let's draw a dot graph for the daily minimum temperature ([Practice 3.2]) in Seoul in February (Data 3.2) and find the mean and standard deviation.

Solution

If you select ‘Dot Graph – Mean / Standard Deviation’ from the 『eStatM』 menu that appears using the QR on the left, the data input window as shown in <Figure 4.9> appears.

<Figure 4.9> Temperature data input for a dot graph

When data are entered, the number of data, minimum, maximum, mean, median, variance and standard deviation are calculated. If you click the [Execute] button, a dot graph as shown in <Figure 4.10> appears and the mean, median, standard deviation, and a line of mean $\pm$ standard deviation are displayed.

Using the simulation window below the figure, you can check the change in the standard deviation by moving a point with the mouse. The standard deviation is also affected by an extreme point.

<Figure 4.10> Dot graph of daily minimum temperature and a simulation window

⏱ Exercise 4.3

The following is data on the length of bicycle-only roads by 25 administrative districts in Seoul as of 2019 ([Exercise 3.1]). Use 『eStatM』 to draw a dot graph and to find and analyze the mean and standard deviation of the data.

(Data 3.3) Length of bicycle-only roads by 25 administrative districts in Seoul as of 2019 (unit: km)

24 15 23 20 30 24 7 8 7 12 28 27 19 35 41 42 11 8 37 13 20 29 53 93 42

⏱ Exercise 4.4

The following is data on the maximum wind speed of typhoons that passed through Korea in 2020 ([Exercise 3.2]). Use 『eStatM』 to draw a dot graph and find and analyze the mean and standard deviation of data.

(Data 3.4) Maximum wind speed of typhoons that passed through Korea in 2020 (unit: m/sec)

40 22 21 29 19 22 24 45 49 55 24 27 29 35 19 24 35 40 56 24 21 43 18

A. Calculation of standard deviation using frequency table

⭐ Think

Assume that the frequency table of the academic achievement test scores of a middle school class is given as follows:

[Table 4.3] Frequency table of the academic achievement test scores of a middle school

Class	Number of data
60≤ ~ <70	2
70 ~ 80	5
80 ~ 90	10
90 ~ 100	3
Total	20

💎 Explore

How do we find the standard deviation of the data in this frequency table?

In the previous section, when a frequency table was given rather than the raw data, the mean was approximated using the middle value of each class interval. The standard deviation is calculated in a similar way.

First, find the middle value of each class. Then, it is assumed that each class contains as many copies of its middle value as its frequency, and the mean and standard deviation are obtained using this approximated data.

[Table 4.4] Approximated data using the middle value of each class interval in the academic achievement test scores

Class	Middle value	Frequency	Approximated data
60≤ ~ <70	65	2	65 65
70 ~ 80	75	5	75 75 75 75 75
80 ~ 90	85	10	85 85 85 85 85 85 85 85 85 85
90 ~ 100	95	3	95 95 95
Total		20

Mean is calculated as follows: $$ \text{Mean} ~=~ \frac{65 \times 2 + 75 \times 5 + 85 \times 10 + 95 \times 3} {20} ~=~ \frac{1640}{20} ~=~ 82 $$

The variance and standard deviation are calculated in a similar way. $$ \small \begin{align} &\text{Variance} \qquad \qquad \sigma^2 ~=~ \frac{(65-82)^2 \times 2 + (75-82)^2 \times 5 + (85-82)^2 \times 10 + (95-82)^2 \times 3} {20} \\ &\qquad\qquad\qquad\qquad \;\; ~=~ \frac{1420}{20} ~=~ 71 \\ &\text{Standard deviation} \qquad \sigma ~=~ \sqrt{\sigma^2} ~=~ \sqrt{71} ~=~ 8.43 \end{align} $$
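The calculation above can be reproduced with the following sketch. It uses the middle values and frequencies that appear in the formulas (2, 5, 10 and 3, with a total of 20); the names are illustrative.

```python
import math

# (middle value, frequency) pairs used in the calculation above
classes = [(65, 2), (75, 5), (85, 10), (95, 3)]

total = sum(freq for _, freq in classes)

grouped_mean = sum(mid * freq for mid, freq in classes) / total

# Squared distance of each middle value from the mean, weighted by frequency
grouped_variance = sum((mid - grouped_mean) ** 2 * freq for mid, freq in classes) / total
grouped_sd = math.sqrt(grouped_variance)

print(total, grouped_mean, grouped_variance, round(grouped_sd, 2))  # 20 82.0 71.0 8.43
```

Since every value in a class is replaced by the class middle value, the result is only an approximation of the standard deviation of the raw scores.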

Using ‘Frequency Distribution Polygon - Relative Frequency Comparison’ of 『eStatM』, the approximate mean and standard deviation of the frequency table can be obtained as shown in <Figure 4.11>. After entering the left value of the class interval and ‘Frequency 1’, click the [Execute] button.

<Figure 4.11> Standard deviation calculation using a frequency table

4.3 Covariance and Correlation Coefficient

⭐ Think

The height and weight of 7 male middle school students were investigated as follows:

(Data 4.3) Height and weight of 7 male middle school students

	1	2	3	4	5	6	7
Height	162	164	170	158	175	168	172
Weight	54	60	64	52	65	60	67

💎 Explore

Is there a measure to determine the correlation between two quantitative variables?

Just as the variance is used as a measure of dispersion for one quantitative variable, the following covariance is used for two quantitative variables. When $n$ pairs of x and y data are expressed as $ (x_1 , y_1 ), (x_2 , y_2 ), ... , (x_n , y_n ) $, and the mean is expressed as $ (\mu_x , \mu_y )$, the covariance $\sigma_{xy}$ can be expressed by the following formula. $$ \text{Covariance} \quad \sigma_{xy} ~ =~ \frac{1}{n} \sum _{i=1} ^{n} (x_{i} - \mu_x ) (y_{i} - \mu_y ) \qquad (n:\text{ number of data} ) $$

Covariance is the average of the values obtained by multiplying the x-axis distance and the y-axis distance between each data point and the mean point of the data. Therefore, if there are many points on the upper right and lower left of the mean point, the covariance has a positive value, indicating a positive correlation. If there are many points on the upper left and lower right of the mean point, the covariance has a negative value, indicating a negative correlation. However, since the size of the covariance depends on the units of the data, the following correlation coefficient, denoted by $\rho$, is used as a measure of correlation. Here $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y. $$ \text{Correlation coefficient} \quad \rho ~ =~ \frac{\sigma_{xy}}{\sigma_x \sigma_y} $$

The correlation coefficient is a standardized form of the covariance and can only have values between -1 and +1. When the correlation coefficient is close to +1, the two variables are said to have a strong positive correlation, and when the correlation coefficient is close to -1, they are said to have a strong negative correlation. When the correlation coefficient is close to 0, there is little or no linear correlation between the two variables.
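As a rough illustration of the two formulas, the sketch below computes the covariance and correlation coefficient of (Data 4.3). The function names `covariance` and `correlation` are illustrative, and multiplying the heights by 10 is a hypothetical change of unit used only to show that the covariance depends on the unit while the correlation coefficient does not.

```python
import math

def covariance(xs, ys):
    """Average product of the distances from the mean point (divide by n)."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    # The variance of a variable is its covariance with itself
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

heights = [162, 164, 170, 158, 175, 168, 172]   # (Data 4.3)
weights = [54, 60, 64, 52, 65, 60, 67]

print(round(covariance(heights, weights), 2))    # 27.0
print(round(correlation(heights, weights), 2))   # 0.94

# Hypothetical change of unit: every height multiplied by 10
heights_scaled = [h * 10 for h in heights]
print(round(covariance(heights_scaled, weights), 2))   # 270.0
print(round(correlation(heights_scaled, weights), 2))  # 0.94
```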

🎲 Practice 4.5

Using 『eStatM』, calculate the covariance and correlation coefficient of the height and weight of 7 students (Data 4.3).

Solution

Enter students' height in 'Enter X data' and their weight in 'Enter Y data'. (You can also copy and paste the material from the e-book)

After the data input, click the [Execute] button. Then the number of data, mean, variance, standard deviation, covariance and correlation coefficient are calculated as in <Figure 4.12> and a scatter plot is displayed.

As shown in <Figure 4.12>, the covariance of height and weight in (Data 4.3) is 27 and the correlation coefficient is 0.94, indicating a strong positive correlation.

<Figure 4.13> Scatter plot using height and weight data

If you check the ‘regression line’ under the scatterplot, a regression line that explains the relationship between height and weight is drawn.

If the correlation is strong, a straight line that can explain the relationship between the variables is obtained, which is called a regression line. A detailed explanation of the regression line is covered in university-level statistics.

🎲 Practice 4.6

Using 『eStatM』, place points on the plane and observe how the correlation coefficient and regression line change as you move a point.

Solution

If you choose a correlation coefficient in 『eStatM』, a scatter plot with that correlation coefficient appears as shown in <Figure 4.15>.

<Figure 4.15> Simulation of a correlation coefficient