Calculating Measures of Dispersion in Statistics with Python

Measures of dispersion in statistics provide an indication of the variability or spread of data within a set. In other words, they show how much the data deviates from the mean or central value. These measures are critical because they provide valuable information about the distribution and consistency of data, allowing analysts to better understand the nature and characteristics of a data set.


Dispersion measures

Imagine you have a dataset that represents students' grades in a course. The average of these grades gives you a general idea of overall performance in the course, but measures of dispersion help you understand how much the grades vary around that average. If grades are widely dispersed, it could indicate a variety of factors, such as differences in student preparation, course quality, or subjectivity of the grading method.

Measures of dispersion quantify how far the values in a data set deviate from the average value, and therefore how variable the data is. Some of the most common dispersion measures include:

Variance and Standard Deviation: These measures indicate how dispersed the data is around the mean. Higher values indicate greater dispersion, while lower values indicate that the data clusters more tightly around the mean. They are useful for evaluating the consistency of data.

The Variance (σ²) is calculated as the average of the squared differences between each value (x_i) and the mean (μ), taken over all N values in the data set:

 \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

Where:

  •  \sigma^2 is the variance.
  •  x_i are the individual values in the data set.
  •  \mu is the average of the data set.
  •  N is the total number of values in the data set.

The Standard Deviation (σ) is simply the square root of the variance:

 \sigma = \sqrt{\sigma^2}

So:

 \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
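As a minimal sketch of the two formulas above, the following pure-Python snippet applies them step by step to a small illustrative data set (the values 10 to 50 are chosen here only for easy arithmetic). The variable names are assumptions made for this example.

```python
import math

# Illustrative data set (chosen for easy arithmetic)
data = [10, 20, 30, 40, 50]

n = len(data)                                   # N = 5
mu = sum(data) / n                              # mean = 30.0

# Squared differences from the mean: 400, 100, 0, 100, 400
squared_diffs = [(x - mu) ** 2 for x in data]

variance = sum(squared_diffs) / n               # 1000 / 5 = 200.0
std_dev = math.sqrt(variance)                   # square root of the variance

print("Variance:", variance)
print("Standard Deviation:", std_dev)
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the original data, which often makes it easier to interpret.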

These measures are useful for evaluating how spread out the data is around the average, which gives an indication of how consistent the data is. For example, if we are studying student performance on an exam, a high standard deviation indicates greater variability in student performance.

Range: The range provides a direct indication of data dispersion by showing the distance between the maximum value and the minimum value in the data set. It is useful for getting a general understanding of the distribution of your data and for quickly identifying the interval of values within which your data falls.

The range is the difference between the maximum value (max) and the minimum value (min) in the data set:

 \text{Range} = \text{max} - \text{min}

However, because the range depends only on the two most extreme values, it is strongly affected by outliers.
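To illustrate how a single extreme value changes the range, here is a small sketch. The grades and the helper name `value_range` are hypothetical, introduced only for this example.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical exam grades (illustrative values only)
grades = [62, 68, 70, 74, 76]

# The same grades plus one unusually low score
grades_with_outlier = grades + [5]

print("Range without outlier:", value_range(grades))
print("Range with outlier:", value_range(grades_with_outlier))
```

Adding one low score moves the range from 14 to 71, even though the other five grades are unchanged.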

Mean Absolute Deviation (MAD): The mean absolute deviation is a measure of dispersion that takes into account the average distance between each value and the mean. Because the deviations are not squared, it is less sensitive to outliers than the variance, making it a better choice in some situations. It is particularly useful when you want a measure of dispersion that is less dominated by extreme values and that provides an estimate of the average variability of the data.

It is calculated as the average of the absolute differences between each value and the mean:

 \text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \mu|
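The sketch below implements the MAD formula directly and compares it with the variance when one value is replaced by an extreme one. The function names and both data sets are assumptions made for illustration.

```python
def mean_abs_dev(values):
    """Mean absolute deviation around the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def population_variance(values):
    """Average squared deviation around the mean (divides by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

# Illustrative data, and the same data with the last value made extreme
clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(label, "MAD:", mean_abs_dev(values),
          "variance:", population_variance(values))
```

In this example the MAD grows from 12 to 152, while the variance grows from 200 to 36200, because squaring amplifies the large deviation. The MAD is still affected by the outlier, only less severely.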


Percentiles: Percentiles divide the ordered data into parts. They are useful for understanding the distribution of data and for identifying extreme values or outliers. For example, the 25th percentile is the value below which 25% of the data falls. Percentiles are often used to compare values or identify thresholds of interest within a data set.
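One way percentiles can be used to flag extreme values is through the distance between the 25th and 75th percentiles (the interquartile range). The following is a sketch under stated assumptions: the data is hypothetical, and the 1.5 multiplier is a common rule of thumb rather than something prescribed by this article.

```python
import numpy as np

# Hypothetical sample with one extreme value (illustrative only)
values = np.array([10, 20, 30, 40, 50, 200])

# 25th and 75th percentiles, using NumPy's default linear interpolation
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # spread of the middle 50% of the data

# Rule-of-thumb thresholds (assumed convention: 1.5 times the IQR)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the thresholds are flagged as potential outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Potential outliers:", outliers.tolist())
```

Here only the value 200 falls outside the thresholds, so it is the only one flagged.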

In summary, dispersion measures are critical to understanding the variability and distribution of data. They help analysts gain a more complete and detailed view of data, allowing them to make more accurate predictions, identify trends and make informed decisions.

Calculating dispersion measures with Python

To calculate the dispersion measures described above in Python, you can use the NumPy library and the statistics module from the standard library.

Variance and Standard Deviation:

import numpy as np

# Sample data
data = np.array([10, 20, 30, 40, 50])

# Calculating variance
variance = np.var(data)

# Calculating standard deviation
std_deviation = np.std(data)

print("Variance:", variance)
print("Standard Deviation:", std_deviation)

Running the code, you get:

Variance: 200.0
Standard Deviation: 14.142135623730951
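Note that np.var and np.std divide by N by default, which matches the population formulas given earlier. If the data is a sample drawn from a larger population, a divisor of N - 1 is often used instead. The sketch below shows both conventions side by side; whether the sample version is appropriate depends on your data, so treat this as an illustration rather than a recommendation.

```python
import statistics

import numpy as np

data = np.array([10, 20, 30, 40, 50])

# Population variance: divides by N (NumPy's default, ddof=0)
population_var = np.var(data)

# Sample variance and standard deviation: divide by N - 1 (ddof=1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

# The statistics module exposes the two conventions as separate functions
stdlib_population_var = statistics.pvariance(data.tolist())
stdlib_sample_var = statistics.variance(data.tolist())

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("Sample standard deviation:", sample_std)
```

For these five values the population variance is 200 and the sample variance is 250, so the choice of divisor matters most for small data sets.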

Range:

# Calculating range
range_value = np.max(data) - np.min(data)

print("Range:", range_value)

Running the code, you get:

Range: 40

Mean Absolute Deviation (MAD):

import statistics

# Calculating mean absolute deviation
mad = statistics.mean([abs(x - np.mean(data)) for x in data])

print("Mean Absolute Deviation (MAD):", mad)

Running the code, you get:

Mean Absolute Deviation (MAD): 12.0

Percentiles:

# Calculating percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)  # Median
percentile_75 = np.percentile(data, 75)

print("25th Percentile:", percentile_25)
print("Median (50th Percentile):", percentile_50)
print("75th Percentile:", percentile_75)

Running the code, you get:

25th Percentile: 20.0
Median (50th Percentile): 30.0
75th Percentile: 40.0
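Percentile values depend on the interpolation convention used, so different tools can return different results for the same data. As a hedged illustration, the sketch below computes quartiles of the same five values with statistics.quantiles from the standard library, using both of its methods.

```python
import statistics

data = [10, 20, 30, 40, 50]

# Default method ("exclusive"): treats the data as a sample of a larger population
exclusive_quartiles = statistics.quantiles(data, n=4)

# "inclusive" method: treats the minimum and maximum as the 0th and 100th percentiles
inclusive_quartiles = statistics.quantiles(data, n=4, method="inclusive")

print("Exclusive quartiles:", exclusive_quartiles)
print("Inclusive quartiles:", inclusive_quartiles)
```

For this data the inclusive method returns 20, 30 and 40, the same values as the np.percentile output above, while the default exclusive method returns 15, 30 and 45.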
