Centrality Measurements of a Distribution with Python

Centrality measures, such as the mean, median, and mode, identify the typical value of a data set, providing a reference point for understanding the distribution. These measures work synergistically with measures of dispersion, such as standard deviation and IQR, to quantify the variability around the central value. Considering both of these aspects offers a comprehensive perspective of the distribution, essential for statistical modeling, informed decisions, and the accurate description of data.

Centrality Measurements of a Distribution

Centrality measures identify the central or typical point of a data distribution, that is, the value around which the other observations cluster. Some of the most common centrality measures include:

Arithmetic mean:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

The arithmetic mean is the sum of all values divided by the number of observations. Because every value contributes to it, a single extreme value can shift it noticeably, which makes it sensitive to outliers.

Median:

Calculation: Sort the data and select the value that divides the distribution into two equal parts.

The median is the central value of an ordered data set. It is less sensitive to outliers than the mean and offers a robust measure of central tendency.
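To make the difference in robustness concrete, here is a minimal sketch that adds one extreme value to a small sample and compares how the mean and the median react. The numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sample: nine typical values, then the same sample plus one outlier
values = np.array([10, 12, 13, 15, 16, 18, 19, 21, 22])
with_outlier = np.append(values, 500)

mean_before, median_before = np.mean(values), np.median(values)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"Without outlier: mean={mean_before:.2f}, median={median_before:.2f}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after:.2f}")
```

The mean jumps from about 16.2 to 64.6, while the median only moves from 16 to 17.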

Mode:

Calculation: The most frequent value or values in the distribution.

The mode is the value that occurs most often in a data set. A distribution can have one mode (unimodal) or more than one mode (multimodal).
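As a small illustration of unimodal versus multimodal data, the sketch below uses the standard library's statistics module (Python 3.8 or later). The two lists are hypothetical examples.

```python
from statistics import mode, multimode

# Hypothetical samples: one with a single mode, one with two modes
unimodal = [10, 15, 18, 18, 22, 25]
bimodal = [10, 15, 15, 18, 22, 22, 25]

single_mode = mode(unimodal)    # the single most frequent value
all_modes = multimode(bimodal)  # list of all values tied for most frequent

print("Mode of the unimodal sample:", single_mode)
print("Modes of the bimodal sample:", all_modes)
```

multimode always returns a list, so it is a convenient choice when you do not know in advance how many modes the data has.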

Geometric Mean:

M_G = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}

The geometric mean is useful for data that grows or falls exponentially, such as percentage growth rates. It is defined only for positive values.
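The following sketch shows why the geometric mean suits growth rates. It assumes three hypothetical yearly growth rates, converts them to growth factors, and checks that compounding the geometric mean reproduces the real total growth, which the arithmetic mean does not.

```python
import numpy as np

# Hypothetical yearly growth rates: +10%, +50%, -20%
growth_rates = np.array([0.10, 0.50, -0.20])
factors = 1 + growth_rates  # growth factors: 1.10, 1.50, 0.80

total_growth = np.prod(factors)               # real cumulative growth factor
geo = np.prod(factors) ** (1 / len(factors))  # geometric mean of the factors
arith = np.mean(factors)                      # arithmetic mean of the factors

print(f"Total growth factor:        {total_growth:.4f}")
print(f"Geometric mean compounded:  {geo ** len(factors):.4f}")
print(f"Arithmetic mean compounded: {arith ** len(factors):.4f}")
```

Compounding the arithmetic mean overstates the total growth, because the arithmetic mean is never smaller than the geometric mean for positive values.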

Harmonic Mean:

M_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}

The harmonic mean is dominated by the smaller values in the data and is useful when dealing with reciprocal relationships, for example when averaging rates such as speeds over equal distances.
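A classic case is the average speed over a round trip. The sketch below assumes a hypothetical trip of 120 km each way, driven at 60 km/h out and 40 km/h back, and compares the harmonic mean of the two speeds with the true average computed from total distance and total time.

```python
import numpy as np

# Hypothetical round trip: same distance in each direction, different speeds
speeds = np.array([60.0, 40.0])  # km/h
distance = 120.0                 # km per leg

harmonic = len(speeds) / np.sum(1 / speeds)

# True average speed = total distance / total time
total_time = np.sum(distance / speeds)
true_average = len(speeds) * distance / total_time

print(f"Harmonic mean of speeds: {harmonic:.1f} km/h")
print(f"True average speed:      {true_average:.1f} km/h")
print(f"Arithmetic mean:         {np.mean(speeds):.1f} km/h")
```

Both the harmonic mean and the true average come out at 48 km/h, while the arithmetic mean (50 km/h) overestimates it.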

Root Mean Square:

RMS = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}

Often used to measure quantities in physics or engineering, it is the square root of the mean of the squares of the data.
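A typical engineering use is measuring the effective size of an oscillating signal, whose arithmetic mean is zero. The sketch below assumes a hypothetical sine wave of amplitude 5 sampled over exactly one period, and checks the well-known result that its RMS equals the amplitude divided by the square root of 2.

```python
import numpy as np

# Hypothetical signal: one full period of a sine wave with amplitude 5
amplitude = 5.0
t = np.linspace(0, 1, 1000, endpoint=False)  # one period, end point excluded
signal = amplitude * np.sin(2 * np.pi * t)

signal_mean = np.mean(signal)         # close to zero: positive and negative halves cancel
rms = np.sqrt(np.mean(signal ** 2))   # root mean square

print(f"Mean: {signal_mean:.6f}")
print(f"RMS:  {rms:.4f}")
print(f"Amplitude / sqrt(2): {amplitude / np.sqrt(2):.4f}")
```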

Percentiles:

Calculation: Sort the data and divide it into 100 equal parts. The p-th percentile is the value below which p% of the observations fall. The 50th percentile corresponds to the median, while the 25th and 75th percentiles correspond to the first and third quartiles.
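When a percentile falls between two observations, a rule is needed to pick a value. The sketch below implements one such rule, linear interpolation between the two nearest sorted values, and compares it with np.percentile, whose default method is also linear interpolation. The helper name percentile_linear is hypothetical and written only for this illustration.

```python
import numpy as np

def percentile_linear(sorted_values, p):
    """Return the p-th percentile by linear interpolation between neighbors."""
    rank = (p / 100) * (len(sorted_values) - 1)  # fractional position in the sorted data
    lower = int(np.floor(rank))
    upper = int(np.ceil(rank))
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

sample = np.sort(np.array([10, 15, 18, 22, 25, 30, 35, 40, 50, 18]))

manual = [percentile_linear(sample, p) for p in (25, 50, 75)]
reference = np.percentile(sample, [25, 50, 75])

print("Manual:   ", manual)
print("NumPy:    ", reference)
print("Median:   ", np.median(sample))
```

Other interpolation rules exist and can give slightly different quartiles on small samples, so it is worth checking which one a given tool uses.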

Centrality measures are critical to understanding the point around which the data clusters. The choice of measure depends on the nature of the data and the specific analytical objectives. The arithmetic mean is commonly used, but the median is often preferred when the distribution is affected by outliers.

Python Data Analytics

If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book:

Python Data Analytics 3rd Ed

Fabio Nelli

Python example of calculating centrality measures

Below you will find an example in Python that calculates the centrality measures mentioned above. We will use NumPy for data manipulation and most of the calculations, SciPy for the mode, and Matplotlib for the plots.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mode

# Example data
data = np.array([10, 15, 18, 22, 25, 30, 35, 40, 50, 18])

# Calculation of centrality measures
mean = np.mean(data)
print('Mean: ', mean)
median = np.median(data)
print('Median: ', median)
mode_result = mode(data)
print('Mode: ', mode_result.mode, ', count(', mode_result.count, ')')
geo_mean = np.exp(np.mean(np.log(data)))
print('Geometric Mean: ', geo_mean)
harmonic_mean = len(data) / np.sum(1 / data)
print('Harmonic Mean: ', harmonic_mean)
rms = np.sqrt(np.mean(data**2))
print('RMS: ', rms)
percentiles = np.percentile(data, [25, 50, 75])
print('Percentiles: ', percentiles)

Running the code produces the following values:

Mean:  26.3
Median:  23.5
Mode:  18 , count( 2 )
Geometric Mean:  23.709468878323285
Harmonic Mean:  21.27039179877534
RMS:  28.82186669874108
Percentiles:  [18.   23.5  33.75]
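As a cross-check, the geometric and harmonic means can also be computed with ready-made functions. The sketch below uses geometric_mean and harmonic_mean from Python's standard statistics module (available from Python 3.8) on the same ten values and compares them with the hand-written NumPy formulas. The variable names are chosen for this illustration only.

```python
import statistics
import numpy as np

# Same ten values as in the example above
values = [10, 15, 18, 22, 25, 30, 35, 40, 50, 18]
arr = np.array(values, dtype=float)

# Hand-written formulas, as in the example
geo_numpy = np.exp(np.mean(np.log(arr)))
harm_numpy = len(arr) / np.sum(1 / arr)

# Standard-library equivalents
geo_stdlib = statistics.geometric_mean(values)
harm_stdlib = statistics.harmonic_mean(values)

print("Geometric mean:", geo_numpy, geo_stdlib)
print("Harmonic mean: ", harm_numpy, harm_stdlib)
```

The two approaches should agree up to floating-point rounding.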

We can add some code to graphically represent everything:

plt.hist(data, bins='auto', color='blue', alpha=0.7)
plt.axvline(x=mean, color='red', linestyle='dashed', linewidth=2, label='Arithmetic Mean')
plt.axvline(x=median, color='green', linestyle='dashed', linewidth=2, label='Median')
plt.axvline(x=mode_result.mode, color='purple', linestyle='dashed', linewidth=2, label='Mode')
plt.legend()
plt.title('Data Distribution with Centrality Measures')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

plt.bar(['Mean', 'Median', 'Mode', 'Geom.M.', 'Harm.M.', 'RMS'],
        [mean, median, mode_result.mode, geo_mean, harmonic_mean, rms], color=['red', 'green', 'purple', 'orange', 'pink', 'blue'])
plt.title('Centrality Measures')
plt.ylabel('Values')
plt.show()

plt.boxplot(data, vert=False)
plt.scatter(percentiles, [1, 1, 1], color='red', marker='o', label='Percentiles')
plt.title('Boxplot with Percentiles')
plt.xlabel('Values')
plt.yticks([])
plt.legend()
plt.show()

This code creates three graphs:

  1. A histogram showing the distribution of the data and vertical lines for the mean, median, and mode.
  2. A bar chart comparing the values of the centrality measures.
  3. A boxplot with percentiles displayed as red dots.

Recommended Book:

If you are interested in this topic, I suggest reading this:

Practical Statistics for Data Scientists

The importance of considering centrality and dispersion measures together

Centrality measures and dispersion measures are both essential for describing and understanding a data set. Together, they summarize where the data sit and how widely they spread. Here is why it is important to consider them together:

Overall Data Description: Centrality measures, such as mean, median, and mode, provide a point of reference for identifying the typical or representative value of the data set. Measures of dispersion, such as standard deviation or IQR, quantify the variability around this central point, providing information about the degree of dispersion in the data.

Assessing Variability: Comparing the mean to the standard deviation or IQR helps assess how much the data deviates from the typical central value. A high standard deviation indicates greater variability, while a wider IQR suggests a more dispersed distribution.
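The sketch below computes the two dispersion measures mentioned here for the same ten values used in the Python example. It also computes the ratio of standard deviation to mean (the coefficient of variation), which is one common way, among others, of comparing the spread to the central value. The choice of the sample standard deviation (ddof=1) is an assumption of this example.

```python
import numpy as np

# Same ten values as in the Python example
sample = np.array([10, 15, 18, 22, 25, 30, 35, 40, 50, 18])

mean = np.mean(sample)
std = np.std(sample, ddof=1)             # sample standard deviation
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1                            # interquartile range
cv = std / mean                          # spread relative to the mean

print(f"Mean: {mean:.2f}")
print(f"Standard deviation: {std:.2f}")
print(f"IQR: {iqr:.2f}")
print(f"Coefficient of variation: {cv:.2f}")
```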

Sensitivity to Outliers: Centrality measures such as the mean can be affected by outlier values, while the median is more robust to such influences. Considering both the mean and median together with the dispersion measures allows you to obtain a more complete understanding of the distribution, especially in the presence of anomalous data.

Choice of Statistical Model: When selecting and fitting statistical models, knowledge of centrality and dispersion is critical. Parametric models, for example, may require the assumption of a normal distribution of the data, meaning that the mean and standard deviation are critical parameters.

Forecasting and Decision Making: When making forecasts or decisions based on data, it is essential to understand both the central value and the variability. Centrality provides an estimate of “where” the data are located, while dispersion tells you “how much” they vary around that point.

Quality and Process Control: In quality and industrial process control, centrality and dispersion measures are used to monitor process stability and variability. For example, statistical process control (SPC) uses these measures to evaluate compliance with quality requirements.
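As a rough illustration of how centrality and dispersion combine in process control, the sketch below derives control limits from a baseline sample as the mean plus or minus three standard deviations, then flags new measurements that fall outside them. This is a simplified assumption: the measurements are hypothetical, and real control charts often estimate the process variation in other ways.

```python
import numpy as np

# Hypothetical baseline measurements taken while the process was stable
baseline = np.array([50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.1])

center = np.mean(baseline)         # centrality: the process center line
sigma = np.std(baseline, ddof=1)   # dispersion: sample standard deviation
lower_limit = center - 3 * sigma
upper_limit = center + 3 * sigma

# Hypothetical new measurements to check against the limits
new_measurements = np.array([50.0, 50.2, 51.5, 49.9])
out_of_control = (new_measurements < lower_limit) | (new_measurements > upper_limit)

print(f"Center line: {center:.3f}")
print(f"Control limits: [{lower_limit:.3f}, {upper_limit:.3f}]")
print("Out-of-control points:", new_measurements[out_of_control])
```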

Effective Communication: Presenting both centrality and dispersion in the data provides a more complete and accurate representation of the distribution. This facilitates communication and interpretation of results to a wider audience.

In summary, centrality tells you where the data are located and dispersion tells you how far they spread from that point. Using both kinds of measures together gives a fuller picture of the distribution, allowing for more accurate statistical analysis and more informed decisions.
