The Cumulative Distribution Function (CDF) is a mathematical function that gives the probability that a random variable is less than or equal to a certain value. In other words, the CDF summarizes the entire probability distribution of a random variable. In Python, you can work with the CDF through libraries like NumPy, SciPy or statsmodels. SciPy, for instance, provides methods to calculate the CDF for different probability distributions, such as the normal, binomial and Poisson distributions.
The Cumulative Distribution Function (CDF)
The Cumulative Distribution Function (CDF) is a fundamental concept in statistics and probability theory. It represents the probability that a random variable \( X \) takes on a value less than or equal to a certain number \( x \). Formally, the CDF of a random variable \( X \) is defined as:
\[ F_X(x) = P(X \le x) \]
Properties of the CDF
Non-decreasing: The CDF \( F_X \) is a non-decreasing function. This means that if \( x_1 \le x_2 \), then \( F_X(x_1) \le F_X(x_2) \).
Limits: \( \lim_{x \to -\infty} F_X(x) = 0 \) and \( \lim_{x \to +\infty} F_X(x) = 1 \).
Right continuity: The CDF is continuous from the right, which means that for each point \( x_0 \), \( \lim_{x \to x_0^+} F_X(x) = F_X(x_0) \).
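The three properties above can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the grid, the tolerance, and the choice of the standard normal and a Binomial(5, 0.5) distribution are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.stats import binom, norm

# Evaluate the standard normal CDF on an increasing grid of points
x = np.linspace(-10, 10, 2001)
F = norm.cdf(x)

# Non-decreasing: consecutive differences are never negative
# (a tiny tolerance absorbs floating-point rounding)
is_non_decreasing = bool(np.all(np.diff(F) >= -1e-12))

# Limits: F is close to 0 far to the left and close to 1 far to the right
left_value, right_value = F[0], F[-1]

# Right continuity, illustrated with a discrete variable:
# Binomial(n=5, p=0.5) has a jump at k = 2
eps = 1e-9
at_jump = binom.cdf(2, 5, 0.5)
just_right = binom.cdf(2 + eps, 5, 0.5)  # same value as at the jump
just_left = binom.cdf(2 - eps, 5, 0.5)   # still on the lower step

print(is_non_decreasing, left_value, right_value)
print(at_jump, just_right, just_left)
```

Approaching the jump from the right gives the value at the jump itself, while approaching from the left gives the previous step, which is exactly what right continuity describes.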
Graphical Interpretation
Graphically, the CDF is a curve that rises from 0 (as \( x \to -\infty \)) to 1 (as \( x \to +\infty \)). For a continuous random variable, the CDF will be a continuous function, while for a discrete variable the CDF will have “jumps” at the points where the random variable takes on specific values.
CDF for Discrete and Continuous Variables
Discrete Variables: For a discrete random variable, the CDF is a step function. For example, if \( X \) is a variable that takes on values \( x_i \) with probability \( p_i \), then the CDF is given by:
\[ F_X(x) = \sum_{x_i \le x} p_i \]
Continuous Variables: For a continuous random variable, the CDF is a continuous function. For example, if \( X \) has a probability density \( f_X \), then the CDF is the integral of the density function:
\[ F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt \]
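To make the discrete sum and the continuous integral described above concrete, here is a small sketch. The fair six-sided die and the evaluation point x = 1 are assumptions chosen purely for illustration, and scipy.integrate.quad is used as one possible way to integrate the density numerically.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Discrete case: a fair six-sided die, each face has probability 1/6
faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
die_cdf = np.cumsum(pmf)  # running sum of p_i over all x_i <= x

# P(X <= 3) is the CDF at the third face
p_le_3 = die_cdf[2]

# Continuous case: integrate the standard normal density up to x = 1
x = 1.0
integral, _ = integrate.quad(norm.pdf, -np.inf, x)

# The numerical integral should agree with SciPy's closed-form CDF
print(p_le_3)
print(integral, norm.cdf(x))
```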
Use of the CDF
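Calculating Probabilities: The CDF can be used to calculate the probability that \( X \) falls in an interval between \( a \) and \( b \): \( P(a < X \le b) = F_X(b) - F_X(a) \). For a continuous random variable this is also the probability of the closed interval \( [a, b] \), since single points have probability zero.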
Description of the Distribution: The CDF provides a complete description of the distribution of the random variable \( X \).
Simulation and Random Number Generation: The CDF is used to transform uniform random numbers into random numbers with a specific distribution through the inversion method.
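The first and third uses listed above can be sketched together. This example assumes an exponential distribution with a hypothetical rate of 2.0 and an arbitrary interval; it computes an interval probability from the CDF and then checks it against samples produced by the inversion method.

```python
import numpy as np
from scipy.stats import expon

rate = 2.0          # hypothetical rate parameter, chosen for illustration
a, b = 0.5, 1.5     # hypothetical interval endpoints

# Interval probability from the CDF: P(a < X <= b) = F(b) - F(a)
# (scipy parameterizes the exponential by scale = 1 / rate)
p_interval = expon.cdf(b, scale=1 / rate) - expon.cdf(a, scale=1 / rate)

# Inversion method: draw u ~ Uniform(0, 1) and solve u = 1 - exp(-rate * x)
rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)
samples = -np.log1p(-u) / rate

# The fraction of samples in (a, b] should be close to p_interval
p_simulated = np.mean((samples > a) & (samples <= b))

print(p_interval, p_simulated)
```

The inversion works here because the exponential CDF has a closed-form inverse; for distributions without one, a numerical inverse would be needed instead.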
Some examples
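Normal Distribution: The CDF of a normal random variable with mean \( \mu \) and standard deviation \( \sigma \) can be expressed through the error function (erf):
\[ F(x) = \frac{1}{2}\left[ 1 + \operatorname{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right] \]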
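Exponential Distribution: The CDF of an exponential random variable with a rate parameter \( \lambda \) is:
\[ F(x) = 1 - e^{-\lambda x} \quad \text{for } x \ge 0, \qquad F(x) = 0 \quad \text{for } x < 0 \]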
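Both examples can be written directly with Python's standard math module: the normal CDF through math.erf and the exponential CDF as 1 - exp(-rate * x). The helper names normal_cdf and exponential_cdf below are hypothetical, and the comparison against scipy.stats is only a sanity check.

```python
import math
from scipy.stats import expon, norm

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def exponential_cdf(x, lam):
    """Exponential CDF with rate lam; zero for negative x."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

# Compare the hand-written formulas with SciPy's implementations
# (the parameter values are arbitrary illustration choices)
normal_ours = normal_cdf(1.2, mu=0.5, sigma=2.0)
normal_scipy = norm.cdf(1.2, loc=0.5, scale=2.0)

expo_ours = exponential_cdf(0.7, lam=3.0)
expo_scipy = expon.cdf(0.7, scale=1 / 3.0)  # scipy uses scale = 1 / rate

print(normal_ours, normal_scipy)
print(expo_ours, expo_scipy)
```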
The Cumulative Distribution Function is a powerful tool for understanding the behavior of random variables and for performing probability calculations. Its importance lies in its ability to completely describe the distribution of a random variable and to facilitate the calculation of the probabilities associated with intervals of values.
Calculate the Cumulative Distribution Function (CDF) in Python
In Python, there are several libraries that allow you to work with the Cumulative Distribution Function (CDF). The two most common libraries are numpy and scipy. In this guide, we will see how to use both to calculate and display the CDF of a random variable, and how to use statsmodels to estimate the CDF from empirical data.
Using numpy and matplotlib with a Discrete Variable
To calculate the CDF of a discrete random variable using numpy, we can simply accumulate the probabilities of the individual values with a cumulative sum. Then, we can use matplotlib to visualize the CDF. Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Generate random discrete data
np.random.seed(0)
data = np.random.randint(0, 10, size=100)
# Calculate relative frequencies
values, counts = np.unique(data, return_counts=True)
probabilities = counts / counts.sum()
# Calculate the CDF
cdf = np.cumsum(probabilities)
# Plot the CDF
plt.step(values, cdf, where='post')
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF of a Discrete Variable')
plt.grid(True)
plt.show()
By executing this you will obtain the following graphic representation:
This code generates and displays the Cumulative Distribution Function (CDF) for a set of random discrete data between 0 and 9. Since the data is generated with np.random.randint(0, 10, size=100), we expect an approximately uniform distribution across the values 0 to 9. However, because the sample is relatively small (100 data points), there may be noticeable variation in the relative frequencies of each value. Increasing the sample size (the size argument, currently 100) makes the relative frequencies more uniform and the steps of the CDF more regular.
Let’s see together what we did in the code. First we import the necessary libraries such as numpy and matplotlib.pyplot. numpy is a library for numerical calculation, while matplotlib.pyplot is used to create graphs. Then we set the seed of the random number generator to 0 to make the results reproducible. We generate 100 random integers between 0 and 9 (inclusive) using np.random.randint.
np.random.seed(0)
data = np.random.randint(0, 10, size=100)
We use np.unique to find the unique values in data and count how many times each value appears. values contains the unique values and counts contains the number of occurrences of each value. We calculate the relative frequencies, which serve as estimated probabilities, by dividing each count by the total sum of the counts (counts.sum()).
values, counts = np.unique(data, return_counts=True)
probabilities = counts / counts.sum()
We use np.cumsum to calculate the cumulative sum of probabilities. This operation provides us with the CDF (Cumulative Distribution Function), which represents the probability that a random variable takes on a value less than or equal to a certain value.
cdf = np.cumsum(probabilities)
We use plt.step to create a step plot of the CDF. The where='post' parameter makes each step start at its point in values and hold that height until the next point, which matches the right-continuous shape of a CDF. We set the x- and y-axis labels using plt.xlabel and plt.ylabel. We set the title of the graph with plt.title. We activate the graph grid with plt.grid(True). Finally, we show the graph with plt.show.
plt.step(values, cdf, where='post')
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF of a Discrete Variable')
plt.grid(True)
plt.show()
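To see how close the estimated CDF is to the one we expect in theory, we can compare it with the CDF of the discrete uniform distribution on 0 to 9. This is a sketch that assumes scipy.stats.randint as the theoretical reference; the name max_gap is only illustrative.

```python
import numpy as np
from scipy.stats import randint

# Same data as in the example above
np.random.seed(0)
data = np.random.randint(0, 10, size=100)

# Empirical CDF from relative frequencies
values, counts = np.unique(data, return_counts=True)
empirical_cdf = np.cumsum(counts / counts.sum())

# Theoretical CDF of the discrete uniform distribution on 0..9;
# scipy's randint follows the same convention as np.random.randint
# (low is inclusive, high is exclusive)
theoretical_cdf = randint.cdf(values, 0, 10)

# Largest vertical distance between the two CDFs at the observed values
max_gap = np.max(np.abs(empirical_cdf - theoretical_cdf))
print(max_gap)
```

With a larger size, this gap should generally shrink, in line with the remark above about increasing the sample size.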
Using scipy with a Continuous Variable
For a continuous random variable, we can use the normal distribution as an example. The scipy library provides ready-to-use functions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters of the normal distribution
mu, sigma = 0, 1
# Generate random data
data = np.random.normal(mu, sigma, 1000)
# Create a range of values
x = np.linspace(min(data), max(data), 1000)
# Calculate the CDF
cdf = norm.cdf(x, mu, sigma)
# Visualize the CDF
plt.plot(x, cdf)
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF of the Normal Distribution')
plt.grid(True)
plt.show()
Running the code, you will obtain the following graphical representation.
The Cumulative Distribution Function (CDF) for a normal distribution with mean 0 and standard deviation 1 is displayed. The CDF of a normal distribution is an S-shaped (sigmoid) curve. It starts near 0, increases slowly at first, speeds up in the middle (around the mean), and then slows down again as it approaches 1. The CDF approaches 1 as \( x \) increases, which reflects the fact that the cumulative probability of a normal random variable approaches 1 for very large values of \( x \). For the standard normal distribution, the CDF at the point \( x = 0 \) is 0.5, since half of the probability lies to the left of the mean (0) and the other half to the right. The slope of the curve around the mean reflects the standard deviation; a larger standard deviation would make the transition more gradual.
The CDF is useful for understanding the cumulative probability associated with a certain value. In summary, the output of the code shows a graphical representation of the cumulative probability of a random variable that follows a standard normal distribution, providing a clear view of how values are distributed and accumulated.
Let’s see the code used together.
data = np.random.normal(mu, sigma, 1000)
In this line, we are generating 1000 random data points from a normal distribution with a mean of 0 and a standard deviation of 1. This data represents a continuous random variable that follows the standard normal distribution. Here it is used only to determine the range of values over which the CDF is plotted.
x = np.linspace(min(data), max(data), 1000)
Here, we are creating an array of 1000 equally spaced values spanning the range between the minimum and maximum of the generated data. This range will be used to calculate and display the CDF.
cdf = norm.cdf(x, mu, sigma)
In this line, we are calculating the CDF for the standard normal distribution at the points specified by the array x. scipy’s norm.cdf function calculates the cumulative probability up to each value in x. Finally, we add the lines relating to the visualization with matplotlib.
plt.plot(x, cdf)
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF of the Normal Distribution')
plt.grid(True)
plt.show()
These lines of code create a graph of the CDF:
- plt.plot(x, cdf): Plots the CDF against the values of x.
- plt.xlabel('Value'): Labels the x axis as “Value”.
- plt.ylabel('CDF'): Labels the y axis as “CDF”.
- plt.title('CDF of the Normal Distribution'): Sets the title of the graph.
- plt.grid(True): Adds a grid to the graph for better readability.
- plt.show(): Shows the graph.
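Besides plotting, norm.cdf can also be evaluated at individual points, which is a quick way to confirm the observations made about the graph. The sketch below uses only norm.cdf with its loc and scale parameters; the comparison with a standard deviation of 3 is an arbitrary illustration.

```python
from scipy.stats import norm

# At the mean of the standard normal, half of the probability is to the left
p_at_mean = norm.cdf(0, loc=0, scale=1)

# Probability of falling within one standard deviation of the mean
within_one_sigma = norm.cdf(1) - norm.cdf(-1)

# Effect of the standard deviation: with a larger sigma the curve rises
# more gradually, so the CDF at x = 1 stays closer to 0.5
cdf_sigma_1 = norm.cdf(1, loc=0, scale=1)
cdf_sigma_3 = norm.cdf(1, loc=0, scale=3)

print(p_at_mean)
print(within_one_sigma)
print(cdf_sigma_1, cdf_sigma_3)
```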
Calculation of the CDF with statsmodels from Empirical Data
If you have a dataset and want to calculate the empirical CDF, you can use the ECDF function from the statsmodels package.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
# Generate random data
data = np.random.normal(0, 1, 1000)
# Calculate the empirical CDF
ecdf = ECDF(data)
# Visualize the empirical CDF
plt.step(ecdf.x, ecdf.y, where='post')
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('Empirical CDF')
plt.grid(True)
plt.show()
Executing the code, you obtain the following graphical representation.
The Empirical Cumulative Distribution Function (ECDF) is displayed for a set of data sampled from a standard normal distribution. The ECDF appears as a series of steps increasing from 0 to 1. Each step in the graph represents the inclusion of a new data point. The height of a jump depends on the number of identical data points at that value; in a sample of continuous data, where repeated values practically never occur, all jumps have the same height \( 1/n \), where \( n \) is the number of data points. The ECDF approximates the theoretical CDF of the standard normal distribution. With a sufficient number of samples, the ECDF should be very close to the theoretical CDF, showing a similar S-curve.
Let’s see the code used together. We generate 1000 random data points from a normal distribution with mean 0 and standard deviation 1. This data represents a continuous random variable that follows the standard normal distribution.
data = np.random.normal(0, 1, 1000)
Then we calculate the ECDF using the ECDF function from the statsmodels library. The ECDF is a function that estimates the CDF of an empirical distribution, i.e. of the observed data.
ecdf = ECDF(data)
Finally we visualize the result via matplotlib, as seen in the previous examples.
plt.step(ecdf.x, ecdf.y, where='post')
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('Empirical CDF')
plt.grid(True)
plt.show()
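The two claims made above, that every jump has height 1/n and that the ECDF stays close to the theoretical CDF, can be checked with a short sketch. To keep the example free of plotting, it computes the ECDF with NumPy alone (np.sort and np.searchsorted) rather than with statsmodels; this is a swapped-in equivalent for illustration, and the helper name ecdf_at is hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Sample from a standard normal distribution (seeded for reproducibility)
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)
n = data.size

sorted_data = np.sort(data)

def ecdf_at(x):
    """Fraction of data points less than or equal to x."""
    return np.searchsorted(sorted_data, x, side="right") / n

# ECDF values at the sorted data points: 1/n, 2/n, ..., 1
ecdf_values = ecdf_at(sorted_data)
jumps = np.diff(ecdf_values)  # every jump is 1/n when there are no ties

# Largest distance between the ECDF and the theoretical CDF on a grid
grid = np.linspace(-4, 4, 801)
max_gap = np.max(np.abs(ecdf_at(grid) - norm.cdf(grid)))

print(jumps.min(), jumps.max(), 1 / n)
print(max_gap)
```

Increasing the number of samples should generally reduce max_gap, which is the sense in which the ECDF converges to the theoretical CDF.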