The Central Limit Theorem with Python


Statistics is a fundamental discipline for the analysis and interpretation of data. One of the most powerful conceptual tools in statistics is the Central Limit Theorem (CLT). This theorem is crucial to inferential statistics and provides the basis for many statistical analyses applied in a wide range of fields.

The Central Limit Theorem

The Central Limit Theorem is one of the fundamental principles of statistics that describes the behavior of distributions of means of random samples. In essence, the theorem states that, regardless of the shape of the distribution of the starting population, the distribution of the sample means gets closer and closer to a normal (or Gaussian) distribution as the sample size increases. To understand the Central Limit Theorem, it is important to highlight some of its main foundations:

Random Samples: The theorem applies to random samples, that is, sets of observations taken randomly from a population.
Sample Size: The theorem states that as the sample size increases, the distribution of the sample means gets closer and closer to a normal distribution.
Starting Population: The Central Limit Theorem does not require that the starting population follows a normal distribution; in its classical form it only requires that the population has a finite mean and variance. This is a crucial point and makes the theorem extremely powerful in many practical applications.
Mean and Variance: The mean of the sample means is equal to the mean of the starting population, and the variance of the sample means is the variance of the population divided by the sample size.
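The last point is easy to check numerically. The following is only a minimal sketch: the exponential population, the sample size, the number of samples and the seed are all assumptions chosen for illustration, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Assumed skewed population: exponential with scale 2 (mean 2, variance 4)
pop_mean, pop_var = 2.0, 4.0
n = 40                # sample size
num_samples = 20000   # number of independent samples drawn

# Each row is one sample of size n; take the mean of every row
samples = rng.exponential(scale=2.0, size=(num_samples, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
var_of_means = sample_means.var()

print(f"mean of sample means:     {mean_of_means:.3f} (theory: {pop_mean})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {pop_var / n})")
```

The two printed values should land close to the population mean and to the population variance divided by n, even though the population itself is far from normal.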

If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book:

Python Data Analytics 3rd Ed

Fabio Nelli

Practical Applications of the Central Limit Theorem

The Central Limit Theorem has profound practical implications. It allows statisticians to make inferences about the source population even when its distribution is unknown or complex, because the distribution of the sample mean can still be approximated by a normal distribution when the sample is large enough. Typical areas of application include:

Statistical Inference: The Central Limit Theorem justifies the use of the normal distribution in statistical inference procedures, such as the construction of confidence intervals and hypothesis tests, even when the starting population does not follow a normal distribution.
Estimation of the Mean of a Population: When you want to estimate the mean of a population, the theorem allows you to use the normal distribution to approximate the distribution of the sample means, thus simplifying statistical analyses.
Quality Control: In industrial quality control, the Central Limit Theorem is often used to analyze the distributions of means of product or component samples, allowing for more accurate control of manufacturing processes.
Financial Forecasting: In financial forecasting, where variables can be influenced by multiple factors, the theorem provides a basis for treating sample means as approximately normally distributed, thus simplifying risk and return analyses.
Market Analysis: In the field of market analysis, the Central Limit Theorem is used to interpret data collected from random sampling, allowing more accurate predictions to be made about consumer behavior and market trends.
Biology and Medicine: In biological and medical research, where populations may have complex distributions, the theorem facilitates the analysis of sample means, allowing scientists to formulate valid and generalizable conclusions based on samples.
Risk Analysis: In the insurance industry and risk analysis, where it is critical to understand the variability of losses or gains, the Central Limit Theorem is used to model and understand the distribution of sample means.
Operations Research: In optimization problems and operations research, the theorem is used to analyze the behavior of sample means, making it possible to make informed decisions based on sample data.
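To make the idea of normal-based inference on a non-normal population more concrete, here is a small simulation sketch. It assumes an exponential population with mean 1, a sample size of 50 and the usual 1.96 critical value; these choices are illustrative. It counts how often a 95% confidence interval built with the normal approximation actually contains the true mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

true_mean = 1.0     # mean of the assumed exponential population (scale 1)
n = 50              # sample size
num_trials = 10000  # number of simulated samples
z = 1.96            # approximate 97.5th percentile of the standard normal

# One row per simulated sample
samples = rng.exponential(scale=1.0, size=(num_trials, n))
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Normal-approximation confidence interval for every sample
lower = means - z * std_errors
upper = means + z * std_errors

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

With a strongly skewed population the empirical coverage is usually a little below the nominal 95% and moves closer to it as n grows, which is the practical meaning of "sufficiently large sample".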

In summary, the Central Limit Theorem constitutes a solid theoretical basis for the application of many statistical techniques in real situations.

Recommended Book:

If you are interested in this topic, I suggest reading this:

Practical Statistics for Data Scientists

Numerical Example in Python

The example provided simulates the central limit theorem using the roll of a fair six-sided die. The goal is to demonstrate how the distribution of the sample means gets closer and closer to a normal distribution as the sample size (the number of rolls per experiment) increases, regardless of the shape of the original distribution of the data.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Number of experiments to perform
num_experiments = 1000

# Sample sizes to consider (number of rolls per experiment)
num_samples = [10, 30, 50, 100]

# Creation of the figure
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.suptitle('Central Limit Theorem with Dice Rolling', y=1.02)

for i, n in enumerate(num_samples):
    # List to store the averages of the results of each experiment
    means_experiments = []

    # Simulation of experiments
    for _ in range(num_experiments):
        results = np.random.randint(1, 7, n)  # Throwing the dice
        mean_experiments = np.mean(results)  # Calculation of the mean
        means_experiments.append(mean_experiments)

    # Histogram of the means, normalized to a density so that it is
    # on the same scale as the theoretical normal curve
    ax = axes[i // 2, i % 2]
    sns.histplot(means_experiments, kde=True, stat='density', ax=ax, color='skyblue')

    # Add a line for the theoretical normal distribution
    mean_dado = 3.5  # Mean of a fair six-sided die
    std_dev_dado = (35 / 12) ** 0.5  # Standard deviation of a fair six-sided die (variance = 35/12)
    x = np.linspace(1, 6, 100)
    y = stats.norm.pdf(x, mean_dado, std_dev_dado / (n ** 0.5))
    ax.plot(x, y, 'k--', linewidth=2)

    # Labels and titles
    ax.set_title(f'Sample size n = {n}')
    ax.set_xlabel('Mean of roll results')
    ax.set_ylabel('Density')

    # Calculate and print statistical metrics
    mean_experiments_mean = np.mean(means_experiments)
    mean_experiments_std = np.std(means_experiments)
    print(f'n = {n}: mean = {mean_experiments_mean:.2f}, std. dev. = {mean_experiments_std:.2f}')

    # Add text with mean and standard deviation to the graph
    ax.text(0.05, 0.9, f'Mean: {mean_experiments_mean:.2f}', transform=ax.transAxes, fontsize=10)
    ax.text(0.05, 0.8, f'Std. Dev.: {mean_experiments_std:.2f}', transform=ax.transAxes, fontsize=10)

# Adjust the layout and show graphs
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Running the above code gives the following result:

Here is a step-by-step description of the example:

Initial parameters: A fair six-sided die is considered, with the roll results following a uniform distribution.
Number of experiments and sample sizes: 1000 experiments are performed for each of four sample sizes (10, 30, 50 and 100 rolls per experiment), to show how the central limit theorem manifests itself more clearly as the sample size grows.
Simulating Experiments: For each experiment, you roll the die the specified number of times and average the results.
Graphs: Four graphs are created, each corresponding to a different sample size. For each graph, a histogram of the averages of the experiment results is shown, overlaid with a curve representing the expected theoretical normal distribution.
Statistical metrics: For each sample size, the mean and standard deviation of the means of the experiment results are calculated. These metrics are printed on the console and added as text to the respective graphs.

The final goal is to visually illustrate how, by increasing the sample size, the distribution of the means of the results becomes increasingly closer to a normal distribution, in line with the central limit theorem. Furthermore, by including statistical metrics in the graphs, it is possible to observe numerically that the mean of the sample means stays close to 3.5, the mean of a fair die, while their standard deviation shrinks in proportion to one over the square root of the sample size.
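The expected values for the die can be worked out directly: the six faces have mean 3.5 and variance 35/12, so the standard deviation of the mean of n rolls should be about the square root of 35/12 divided by the square root of n. The sketch below compares this with a fresh simulation; the seed and the vectorized layout are my own choices and differ from the loop used in the example above.

```python
import numpy as np

# Exact mean and variance of a fair six-sided die
faces = np.arange(1, 7)
mu = faces.mean()                      # 3.5
sigma2 = np.mean((faces - mu) ** 2)    # 35/12
sigma = np.sqrt(sigma2)

rng = np.random.default_rng(seed=42)   # arbitrary seed, only for reproducibility
num_experiments = 1000

empirical_std = {}
for n in [10, 30, 50, 100]:
    # One row per experiment, n rolls per row
    rolls = rng.integers(1, 7, size=(num_experiments, n))
    means = rolls.mean(axis=1)
    empirical_std[n] = means.std()
    theoretical = sigma / np.sqrt(n)
    print(f"n = {n:3d}  empirical std = {empirical_std[n]:.3f}  theoretical = {theoretical:.3f}")
```

The empirical and theoretical columns should agree to roughly two decimal places, with small differences due to the finite number of experiments.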

Statistical metrics

When applying the Central Limit Theorem in practice, statistical metrics are useful tools for understanding the distribution of sample means, identifying patterns, and interpreting the results of statistical analyses. The theorem is closely related to concepts such as the standard error, the confidence interval, and the margin of error. Let's see how these concepts relate to it:

Standard Error: The Central Limit Theorem establishes that, for sufficiently large samples, the distribution of sample means is approximately normal, regardless of the shape of the distribution of the starting population. The standard error is the standard deviation of this distribution, equal to the population standard deviation divided by the square root of the sample size, and it measures the precision of the sample mean as an estimate of the population mean.
Confidence Interval: The confidence interval is constructed around a point estimate, such as the sample mean, taking the variability of the estimate into account. The Central Limit Theorem justifies the use of the normal distribution to construct these intervals when the sample size is sufficiently large.
Margin of Error: The margin of error is calculated as a critical value (about 1.96 for a 95% confidence level under the normal approximation) multiplied by the standard error. It is the half-width of the confidence interval, so it indicates how far the estimate is expected to lie from the true value at the chosen confidence level.
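The three quantities above can be computed in a few lines. This is a minimal sketch that assumes a single sample of 100 simulated die rolls and a 95% confidence level; the variable names and the seed are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)      # arbitrary seed, only for reproducibility
sample = rng.integers(1, 7, size=100)    # one sample of 100 die rolls

n = sample.size
sample_mean = sample.mean()

# Standard error: sample standard deviation divided by sqrt(n)
standard_error = sample.std(ddof=1) / np.sqrt(n)

# Critical value of the standard normal for a two-sided 95% interval
confidence = 0.95
z = stats.norm.ppf(1 - (1 - confidence) / 2)

# Margin of error and the resulting confidence interval
margin_of_error = z * standard_error
ci = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"sample mean:     {sample_mean:.3f}")
print(f"standard error:  {standard_error:.3f}")
print(f"margin of error: {margin_of_error:.3f}")
print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```

For small samples a Student's t critical value is the more common choice; the normal value is used here because the sample size is fairly large.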

In short, the Central Limit Theorem provides the theoretical context that justifies the use of these measures and concepts in practical situations. These tools are particularly useful when working with sample data and wanting to make inferences about the source population, exploiting the convergence properties of the distribution of sample means to the normal distribution.

Below is a general list of commonly used statistical measures, grouped by category:

Measures of Position:

Mean
Median
Mode
Quantiles (percentiles, deciles, etc.)
Z-Score

Measures of Dispersion:

Standard deviation
Variance
Range (difference between maximum and minimum)
Interquartile Range (IQR)

Graphical Representations:

Box Plot
Histogram
Bar chart
Scatter Plot

Measures of Shape:

Skewness (asymmetry)
Kurtosis (tail heaviness)

Description of Data:

Sample mean
Sample standard deviation
Sample median
Sample percentiles

Measures of Central Tendency:

Weighted mean
Geometric mean
Harmonic mean
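Most of the measures in this list are one-liners in Python. The sketch below uses a small made-up dataset purely for illustration, together with NumPy, SciPy and the standard library `statistics` module.

```python
import statistics
import numpy as np
from scipy import stats

# Small illustrative dataset (made up for this example)
data = np.array([2, 4, 4, 5, 7, 9, 10, 12])

# Measures of position
mean = data.mean()
median = np.median(data)
mode = statistics.mode(data.tolist())
q1, q3 = np.percentile(data, [25, 75])

# Measures of dispersion (ddof=1 gives the sample versions)
variance = data.var(ddof=1)
std_dev = data.std(ddof=1)
data_range = data.max() - data.min()
iqr = q3 - q1

# Z-scores: distance of each value from the mean, in standard deviations
z_scores = (data - mean) / std_dev

# Measures of shape
skewness = stats.skew(data)
excess_kurtosis = stats.kurtosis(data)  # 0 for a normal distribution

# Other means
weighted_mean = np.average(data, weights=np.arange(1, data.size + 1))
geometric_mean = stats.gmean(data)
harmonic_mean = stats.hmean(data)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.3f}, std={std_dev:.3f}, range={data_range}, IQR={iqr}")
print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

Note that `np.percentile` uses linear interpolation by default, so quartiles may differ slightly from those produced by other conventions.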

Some additional concepts that may be relevant in specific contexts include:

Covariance and Correlation: Measure the linear relationship between two variables.
Standard Error: Indicates the precision of a statistical estimate.
Coefficient of Variation: Expresses the standard deviation as a percentage of the mean.
Hypothesis Tests: Used to make decisions based on statistical evidence.
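These additional concepts can be sketched in the same way. The paired values below are made up for illustration, and the hypothesis test is a one-sample t-test on simulated die rolls against the theoretical mean of 3.5; both are assumptions for the sake of the example.

```python
import numpy as np
from scipy import stats

# Made-up paired observations with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Covariance and Pearson correlation between x and y
covariance = np.cov(x, y)[0, 1]
correlation = np.corrcoef(x, y)[0, 1]

# Coefficient of variation of y, as a percentage of its mean
coefficient_of_variation = y.std(ddof=1) / y.mean() * 100

# Standard error and one-sample t-test on simulated die rolls
rng = np.random.default_rng(seed=3)  # arbitrary seed, only for reproducibility
rolls = rng.integers(1, 7, size=200)
standard_error = rolls.std(ddof=1) / np.sqrt(rolls.size)
t_stat, p_value = stats.ttest_1samp(rolls, popmean=3.5)

print(f"covariance = {covariance:.3f}, correlation = {correlation:.4f}")
print(f"coefficient of variation of y = {coefficient_of_variation:.1f}%")
print(f"standard error = {standard_error:.3f}, t = {t_stat:.3f}, p-value = {p_value:.3f}")
```

Because the rolls really come from a fair die, a large p-value is the expected outcome most of the time, but any single run can still produce a small one by chance.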

These are just a few examples, and the wide range of statistical measures reflects the complexity of data analysis and distributions. The choice of tools often depends on the nature of the data and the objectives of the statistical analysis.

Conclusions

The Central Limit Theorem is a milestone in statistical theory. Because it establishes the approximate normality of sample means for sufficiently large samples, it makes it possible to apply numerous statistical methods in many real-world situations. Understanding this theorem is critical for anyone involved in analyzing data and formulating conclusions based on random sampling.