
Descriptive statistics vs. Inferential statistics with Python

In this article, we will explore together the differences between descriptive statistics and inferential statistics, highlighting their specific roles in statistical analysis and illustrating how they can be used in a complementary way to obtain a complete understanding of the data.

Descriptive Statistics vs Inferential Statistics

Descriptive statistics focuses on presenting and organizing data in a clear and understandable way. It includes measures of central tendency such as the mean, median, and mode, as well as measures of dispersion such as the standard deviation and range. In essence, descriptive statistics summarizes and describes the main characteristics of a data set without making any inferences beyond what is observed.
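As a minimal sketch of these measures, the snippet below uses Python's standard `statistics` module on a small list of values. The data are made up purely for illustration and do not come from the article.

```python
import statistics

# Hypothetical values, chosen only to illustrate the measures
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)       # arithmetic mean
median_value = statistics.median(values)   # middle of the sorted data
mode_value = statistics.mode(values)       # most frequent value
std_value = statistics.stdev(values)       # sample standard deviation (divides by n - 1)
range_value = max(values) - min(values)    # distance between the extremes

print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
print(f"Mode: {mode_value}")
print(f"Standard deviation: {std_value:.2f}")
print(f"Range: {range_value}")
```

None of these numbers says anything about a wider population; they only summarize the six values at hand.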

On the other hand, inferential statistics goes beyond simply describing data. It relies on using samples to make inferences or predictions about the larger population from which the sample comes. This branch of statistics includes hypothesis tests, interval estimates, and regression. The goal is to make generalizable statements or predictions about the population based on a representative sample.

In short, while descriptive statistics summarizes and organizes data, inferential statistics draws broader conclusions and makes predictions based on those conclusions. Both are crucial in the field of statistics, as they work synergistically to provide a comprehensive understanding of data and to support informed decisions.

Example with Python

Suppose we have a data set representing the heights of a group of students, and we want to explore both descriptive and inferential statistics on that sample using Python code.

import numpy as np
from scipy import stats

# Let's create a data set (height in centimeters)
sample_heights = [160, 165, 170, 155, 175, 180, 162, 168, 172, 158]

# Descriptive Statistics
mean_heights = np.mean(sample_heights)
median_heights = np.median(sample_heights)
# Note: np.std defaults to ddof=0, i.e. the population formula (divides by n)
std_deviation = np.std(sample_heights)

print(f"Mean of heights: {mean_heights:.2f} cm")
print(f"Median of heights: {median_heights:.2f} cm")
print(f"Standard deviation of heights: {std_deviation:.2f} cm")

# Inferential Statistics
# 95% confidence interval for the mean, based on the normal distribution
confidence_level = 0.95
confidence_interval = stats.norm.interval(confidence_level, loc=mean_heights, scale=std_deviation/np.sqrt(len(sample_heights)))

print(f"\nConfidence interval of {confidence_level * 100}% for the mean of the heights: ({confidence_interval[0]:.2f} cm, {confidence_interval[1]:.2f} cm)")

# Example of hypothesis testing (let's assume the average is 170 cm)
mean_hypothesis = 170
t_stat, p_value = stats.ttest_1samp(sample_heights, mean_hypothesis)

print(f"\nHypothesis testing:")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# t-value evaluation (|t| > 2 is only a rough rule of thumb)
if abs(t_stat) > 2:
    print("\nThe t value is significantly different from zero.")
else:
    print("\nThe t value is not significantly different from zero.")

# We compare the p-value with a significance level (for example, 0.05)
level_of_significance = 0.05
if p_value < level_of_significance:
    print("The average of the heights in the sample is significantly different from 170 cm.")
else:
    print("There is insufficient evidence to reject the hypothesis that the average height is 170 cm.")

Running the code produces the results of the two statistical analyses on the height sample:

Mean of heights: 166.50 cm
Median of heights: 166.50 cm
Standard deviation of heights: 7.54 cm

Confidence interval of 95.0% for the mean of the heights: (161.83 cm, 171.17 cm)

Hypothesis testing:
T-statistic: -1.3926
P-value: 0.1972

The t value is not significantly different from zero.
There is insufficient evidence to reject the hypothesis that the average height is 170 cm.

The code example provided uses Python and the NumPy and SciPy libraries to illustrate applying descriptive and inferential statistics to a sample of data. Here is a detailed description of the code:

If you want to explore the topic further and learn more about the world of Data Science with Python, I recommend reading my book:

Python Data Analytics 3rd Ed

Fabio Nelli

  1. Sample Selection:

We define a small sample of ten student heights, in centimeters, as a plain Python list.

sample_heights = [160, 165, 170, 155, 175, 180, 162, 168, 172, 158]

  2. Descriptive statistics:

We calculate the mean, median and standard deviation of the sample.

mean_heights = np.mean(sample_heights)
median_heights = np.median(sample_heights)
std_deviation = np.std(sample_heights)

  3. Printing of Descriptive Measurements:

We print the results of the descriptive measures to get an idea of the main characteristics of the sample.

print(f"Mean of heights: {mean_heights:.2f} cm")
print(f"Median of heights: {median_heights:.2f} cm")
print(f"Standard deviation of heights: {std_deviation:.2f} cm")

  4. Inferential Statistics – Confidence Interval:

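We calculate a 95% confidence interval for the population mean of the heights, using the normal distribution and the standard error of the mean (the standard deviation divided by the square root of the sample size).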

confidence_level = 0.95
confidence_interval = stats.norm.interval(confidence_level, loc=mean_heights, scale=std_deviation/np.sqrt(len(sample_heights)))

  5. Print Confidence Interval:

We print the calculated confidence interval.

print(f"\nConfidence interval of {confidence_level * 100}% for the mean of the heights: ({confidence_interval[0]:.2f} cm, {confidence_interval[1]:.2f} cm)")

  6. Inferential Statistics – Hypothesis Testing:

We perform a hypothesis test to see if the mean of the heights in the sample is significantly different from 170 cm.

mean_hypothesis = 170
t_stat, p_value = stats.ttest_1samp(sample_heights, mean_hypothesis)

  7. Printing Hypothesis Test Results:

We print the t-statistic and the p-value associated with the hypothesis test.

print(f"\nHypothesis testing:")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

  8. T-stat Evaluation:

A larger absolute t-value indicates that the difference between the sample mean and the hypothesized mean is large relative to the variation expected by chance. In other words, it suggests a larger discrepancy between the sample and the null hypothesis. As a rough rule of thumb, we treat an absolute t-value greater than 2 as evidence against our assumption of an average of 170 cm. The exact threshold depends on the sample size: for this sample of ten it is about 2.26 at the 5% significance level.

if abs(t_stat) > 2:
    print("\nThe t value is significantly different from zero.")
else:
    print("\nThe t value is not significantly different from zero.")

  9. Decision Based on P-value:

We compare the p-value with a significance level (0.05) to decide whether to reject or not reject the null hypothesis.

level_of_significance = 0.05
if p_value < level_of_significance:
    print("The average of the heights in the sample is significantly different from 170 cm.")
else:
    print("There is insufficient evidence to reject the hypothesis that the average height is 170 cm.")
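The confidence interval in the walkthrough above relies on the normal distribution and on `np.std`, which by default divides by n. With only ten observations, a common alternative is to use the sample standard deviation (`ddof=1`) together with Student's t distribution with n - 1 degrees of freedom, which gives a somewhat wider interval. The following is a sketch of that alternative, not the calculation behind the output shown earlier; the names `sample_std`, `standard_error`, and `t_interval` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sample_heights = [160, 165, 170, 155, 175, 180, 162, 168, 172, 158]
n = len(sample_heights)

sample_mean = np.mean(sample_heights)
sample_std = np.std(sample_heights, ddof=1)   # sample formula: divide by n - 1
standard_error = sample_std / np.sqrt(n)

# 95% interval from the t distribution with n - 1 degrees of freedom
t_interval = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)

# Normal-based interval with the population formula, as in the walkthrough
z_interval = stats.norm.interval(0.95, loc=sample_mean,
                                 scale=np.std(sample_heights) / np.sqrt(n))

print(f"t-based interval:      ({t_interval[0]:.2f} cm, {t_interval[1]:.2f} cm)")
print(f"Normal-based interval: ({z_interval[0]:.2f} cm, {z_interval[1]:.2f} cm)")
```

The t-based interval is the one that matches the assumptions of the one-sample t-test used in the hypothesis testing step.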

This example demonstrates how to combine descriptive and inferential statistics in Python on a sample of data, providing an overview of the characteristics of the sample and of the inferences we can make about the population from which the sample was drawn.

Suggested Book:

If you are interested in the topic, I suggest reading this book:

Practical Statistics for Data Scientists

Inferential statistics: Evaluation of the p-value and the t-statistic

In the previous example we saw how to evaluate whether a hypothesized mean of 170 cm is consistent with a sample of heights, using two quantities from inferential statistics: the p-value and the t-statistic. Let’s explain this point better.

As regards the p-value, the evaluation is quite simple. First we need to establish a significance level. The significance level, often denoted as α, is the maximum acceptable probability of committing a Type I error, that is, incorrectly rejecting a true null hypothesis. Commonly, 0.05 is used as the significance level, but it can be chosen differently depending on the context.

The p-value, on the other hand, is the probability of obtaining a result at least as extreme as that observed in the sample, assuming that the null hypothesis is true. In simpler terms, it measures how surprising the observed data would be if the null hypothesis were correct. It is not the probability that the null hypothesis itself is true.
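To make the definition concrete, the sketch below recomputes the two-sided p-value for the height sample directly from the t distribution: it is the probability, under the null hypothesis, of a t-statistic at least as far from zero as the one observed. The name `p_manual` is introduced here for illustration; `stats.t.sf` is SciPy's survival function (1 - CDF).

```python
import numpy as np
from scipy import stats

sample_heights = [160, 165, 170, 155, 175, 180, 162, 168, 172, 158]
mean_hypothesis = 170

t_stat, p_value = stats.ttest_1samp(sample_heights, mean_hypothesis)

# Two-sided p-value: probability mass in both tails beyond |t|
degrees_of_freedom = len(sample_heights) - 1
p_manual = 2 * stats.t.sf(abs(t_stat), degrees_of_freedom)

print(f"p-value from ttest_1samp:       {p_value:.4f}")
print(f"p-value from the t distribution: {p_manual:.4f}")
```

Both values agree, which shows that the p-value reported by the test is simply a tail probability of the t distribution.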

In hypothesis testing, you compare the p-value with the significance level. If the p-value is lower than the chosen significance level, the null hypothesis is usually rejected. In other words, a low p-value means that the observed result would be unlikely under the null hypothesis, and so provides evidence against it. If p < α (where α is the significance level), it is usually stated that there is sufficient evidence to reject the null hypothesis. If p ≥ α, you do not have enough evidence to reject the null hypothesis. This does not imply that the null hypothesis is true; there is simply not enough evidence to dismiss it.

In more concrete terms, a significance level of 0.05 indicates that you are willing to make a Type I error with a probability of 5%. If the p-value is less than 0.05, we reject the null hypothesis, stating that there is significant statistical evidence against it. If the p-value is greater than or equal to 0.05, the null hypothesis is not rejected, indicating that there is no significant statistical evidence to reject it.
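The meaning of the 5% Type I error rate can be checked with a small simulation. In the sketch below, the null hypothesis is true by construction (every sample is drawn from a population whose mean really is 170 cm), so every rejection is a Type I error. The population parameters, the seed, and the number of simulations are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_simulations = 2000

rejections = 0
for _ in range(n_simulations):
    # The null hypothesis is true here: the population mean is 170 cm
    sample = rng.normal(loc=170, scale=8, size=10)
    _, p = stats.ttest_1samp(sample, 170)
    if p < alpha:
        rejections += 1   # a true null hypothesis was rejected

type_i_rate = rejections / n_simulations
print(f"Observed Type I error rate: {type_i_rate:.3f}")
```

The observed rate should land close to 0.05, which is what choosing that significance level commits you to.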

In short, comparing the p-value to the significance level helps you make informed decisions about rejecting or not rejecting the null hypothesis based on the available statistical evidence.

The t-statistic (or t-value) measures how far the sample mean deviates from the hypothesized population mean, expressed in units of the standard error of the mean, that is, the sample standard deviation divided by the square root of the sample size.

In essence, the t-value represents how much the sample mean deviates from the hypothesized mean, and a larger t-value in absolute value suggests greater strength of evidence against the null hypothesis. The final evaluation depends on the specific context, the critical value of t and the significance level chosen for the test.
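A minimal sketch of this calculation on the height sample: the t-statistic is the difference between the sample mean and the hypothesized mean divided by the standard error, and the critical value comes from the t distribution with n - 1 degrees of freedom. The names `t_manual` and `t_critical` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sample_heights = [160, 165, 170, 155, 175, 180, 162, 168, 172, 158]
mean_hypothesis = 170
alpha = 0.05
n = len(sample_heights)

# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = np.std(sample_heights, ddof=1) / np.sqrt(n)
t_manual = (np.mean(sample_heights) - mean_hypothesis) / standard_error

# Two-sided critical value for the chosen significance level
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)

print(f"t-statistic: {t_manual:.4f}")
print(f"Critical value: {t_critical:.4f}")
if abs(t_manual) > t_critical:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

Comparing |t| with the critical value leads to the same decision as comparing the p-value with the significance level; the fixed threshold of 2 is only an approximation of this critical value.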

Another example with an incorrect mean assumption

In the previous case, the hypothesis of an average height of 170 cm was consistent with the sample under analysis. Let’s look at another example, where this assumption does not hold. Here I will generate a new sample of 100 heights drawn from a normal distribution with a mean of 165 cm instead of 170 cm, and a standard deviation of 8 cm. With this sample, the test against the 170 cm hypothesis should return a very small p-value, indicating that the sample mean is significantly different from 170 cm.

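import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothesized population mean, as in the previous example
mean_hypothesis = 170

# Generate a sample of 100 heights whose true mean is 165 cm, not 170 cm
heights_sample = np.random.normal(loc=165, scale=8, size=100)

# Perform hypothesis testing
t_stat, p_value = stats.ttest_1samp(heights_sample, mean_hypothesis)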

# Sample distribution plot
plt.figure(figsize=(12, 6))


plt.hist(heights_sample, bins=20, color='blue', alpha=0.7)
plt.axvline(x=np.mean(heights_sample), color='red', linestyle='dashed', linewidth=2, label='Sample mean')
plt.title('Distribution of the Non-compliant Sample')
plt.xlabel('Heights (cm)')
plt.ylabel('Frequency')
plt.legend()


# Show the plot
plt.tight_layout()
plt.show()

level_of_significance = 0.05
print(f"\nHypothesis testing:")
print(f"T-statistic: {t_stat:.4f}")
print("p_value: ", p_value)

# t-value evaluation (|t| > 2 is only a rough rule of thumb)
if abs(t_stat) > 2:
    print("\nThe t value is significantly different from zero.")
else:
    print("\nThe t value is not significantly different from zero.")

# Decision based on p-value
if p_value < level_of_significance:
    print("The average of the heights in the sample is NOT consistent with the hypothesis of 170 cm.")
else:
    print("There is insufficient evidence to reject the hypothesis that the average height is 170 cm.")

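Running the code displays a histogram of the sample with its mean marked by a dashed red line, followed by the t-statistic, the p-value, and the two decisions. Because the sample is drawn at random, the exact values change from run to run.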

Further Considerations between descriptive statistics and inferential statistics

Descriptive analysis and inferential analysis are two distinct approaches in statistics, each with a specific role in interpreting data. When approaching a dataset, descriptive analysis is often the first step. This approach aims to provide a comprehensive overview of the main characteristics of the sample, using measures such as the mean, median, standard deviation and others to describe the distribution and variability of the data.

On the other hand, inferential analysis focuses on formulating general conclusions about the total population based on a representative sample. This type of analysis includes hypothesis tests, confidence intervals, and other techniques that allow you to make inferences about the population based on information collected from the sample.

Furthermore, as we saw in the previous example, descriptive analysis often relies on graphs to visualize the data, whereas in inferential analysis graphs mainly serve to communicate the results of hypothesis tests.

When comparing the two approaches, it is important to note that descriptive analysis provides an initial, understandable view of the data, while inferential analysis adds a level of depth, allowing you to draw broader conclusions about the population. However, inferential analysis carries the risk of errors, such as Type I error and Type II error, and often requires compliance with some fundamental assumptions.
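The Type II error mentioned above can also be illustrated by simulation. In the sketch below, the null hypothesis of 170 cm is false by construction (samples come from a population with a mean of 165 cm), so every failure to reject it is a Type II error. The function `type_ii_rate`, the population parameters, and the seed are assumptions introduced for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)   # fixed seed so the run is repeatable
alpha = 0.05
mean_hypothesis = 170   # null hypothesis, false in this simulation
true_mean = 165         # assumed true population mean

def type_ii_rate(sample_size, n_simulations=2000):
    """Fraction of simulated samples in which the false null is not rejected."""
    samples = rng.normal(loc=true_mean, scale=8,
                         size=(n_simulations, sample_size))
    # One t-test per row (one row = one simulated sample)
    _, p_values = stats.ttest_1samp(samples, mean_hypothesis, axis=1)
    return float(np.mean(p_values >= alpha))

rate_small = type_ii_rate(10)
rate_large = type_ii_rate(100)

print(f"Type II error rate with n = 10:  {rate_small:.3f}")
print(f"Type II error rate with n = 100: {rate_large:.3f}")
```

With only ten observations the test often misses the 5 cm difference, while with one hundred it almost never does, which is why sample size matters when interpreting a non-significant result.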

Both approaches are often used together in a comprehensive statistical analysis. Descriptive analysis helps to understand the structure and characteristics of the dataset, while inferential analysis allows you to make more advanced statements about the target population. The choice between the two depends on the specific objectives of the analysis and the nature of the data available, ensuring a balanced and informative approach to the interpretation of the results.
