
Longitudinal data in statistics and study techniques with Python


Longitudinal data in statistics refers to observations collected on the same study unit (for example, an individual, a family, a company) repeatedly over time. In other words, instead of collecting data from different study units at one point in time, you follow the same units over time to analyze the variations and changes that occur within each unit. In this article we will discover what they are and which study techniques to apply using Python as an analysis tool.

Longitudinal Data

Longitudinal data refers to data collected through repeated observations on a set of study units over time. These observations can be collected at regular or irregular intervals over time and are used to study changes over time, developmental processes, causal relationships, and more.

Here are some examples of longitudinal data: repeated medical measurements collected during a clinical follow-up, annual household panel surveys, yearly financial statements of the same companies, and students' test scores tracked across school years.

In all these examples, the main goal is to analyze how variables change over time and what factors influence those changes. Longitudinal data provide a unique perspective that allows you to explore temporal dynamics and gain a more complete understanding of the phenomena studied.

An example of Longitudinal Data in Python

To better understand the nature of longitudinal data, we can implement a simple example in Python using the pandas module, which is commonly used to manipulate and analyze tabular data. Suppose we have a dataset representing the height of a group of children measured at six-month intervals over two and a half years.

import pandas as pd

# Creating a DataFrame with longitudinal data
data = {
    'ID': [1, 2, 3, 4, 5],            # Children's IDs
    'Age': [3, 3.5, 4, 4.5, 5],        # Children's ages (years)
    'Height_0m': [90, 92, 88, 95, 91],   # Initial height (cm)
    'Height_6m': [93, 95, 90, 98, 94],   # Height at 6 months (cm)
    'Height_1y': [96, 98, 93, 101, 97],  # Height at 1 year (cm)
    'Height_1.5y': [100, 102, 97, 105, 101],  # Height at 1.5 years (cm)
    'Height_2y': [102, 104, 99, 107, 103],    # Height at 2 years (cm)
    'Height_2.5y': [104, 106, 101, 109, 105]  # Height at 2.5 years (cm)
}

df = pd.DataFrame(data)

# Display the DataFrame
df

This code creates a DataFrame with columns for the children’s ID, their age, and their height measured at six-month intervals over two and a half years. Each row represents a child and each column a measurement occasion. For example, the Height_0m column holds the children’s initial height, Height_6m their height after six months, and so on.

With this DataFrame, you can perform various longitudinal analyses, such as viewing growth trends over time, calculating children’s average growth over the years, and analyzing correlations between height and age.
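As a sketch of the first of these analyses (assuming the DataFrame and column names defined above), we can reshape the table into long format and compute each child's average growth per six-month interval:

```python
import pandas as pd

# Same longitudinal dataset as above (ages omitted for brevity)
data = {
    'ID': [1, 2, 3, 4, 5],
    'Height_0m': [90, 92, 88, 95, 91],
    'Height_6m': [93, 95, 90, 98, 94],
    'Height_1y': [96, 98, 93, 101, 97],
    'Height_1.5y': [100, 102, 97, 105, 101],
    'Height_2y': [102, 104, 99, 107, 103],
    'Height_2.5y': [104, 106, 101, 109, 105]
}
df = pd.DataFrame(data)

# Wide to long format: one row per child per measurement occasion
long_df = df.melt(id_vars='ID', var_name='Occasion', value_name='Height')

# Average growth (cm) per six-month interval for each child
height_cols = [c for c in df.columns if c.startswith('Height')]
growth = df[height_cols].diff(axis=1).mean(axis=1)
print(growth)
```

The long format produced by melt is the shape that most longitudinal modeling tools (including the mixed models shown later in this article) expect as input.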

Longitudinal study designs

Longitudinal study designs refer to the design of a study that involves observing a set of study units over time, in order to collect repeated data on these units over specific periods. These study designs are used to study changes over time and to better understand developmental processes, causal relationships, and other phenomena that may vary over time.

There are several types of longitudinal study designs, including trend studies (drawing different samples from the same population at successive points in time), cohort studies (following a group of units that share a defining characteristic), and panel studies (repeatedly measuring the very same sample of units over time).

Longitudinal data offer numerous advantages, including the ability to evaluate changes over time, analyze developmental processes, and identify causal relationships. However, they also present unique challenges, such as managing loss-to-follow-up rates and controlling variability over time. For this reason, analyzing longitudinal data often requires specialized statistical techniques, such as mixed models or generalized estimating equation (GEE) models.

Mixed Models vs. Generalized Estimating Equation Models

Mixed models and generalized estimating equation (GEE) models are both statistical methods used to analyze longitudinal data or data that has inherent correlation between observations. However, they have slightly different approaches.

Mixed Effects Models:
Mixed models are a type of statistical model that takes into account the hierarchical structure of longitudinal data. These models incorporate both fixed effects and random effects. Fixed effects are parameters that are assumed to be constant across the population and are estimated directly from the model. Random effects, on the other hand, are considered sampled from a probability distribution and are used to capture variation across study units. In other words, mixed models treat study units as random samples from a larger population. Mixed models are often used to analyze longitudinal data with a hierarchical structure, such as data in which observations are grouped within individuals or other clusters.

Generalized estimating equation (GEE) models:
GEE models, on the other hand, focus on the analysis of group means and provide parameter estimates that are consistent even when the correlation structure between observations is not specified correctly. These models consider only fixed effects and do not incorporate random effects. An advantage of GEE models is that they are robust to misspecification of the correlation structure between observations. GEE models are used when you want to make inference about group means and are not interested in estimating random effects.

In summary, mixed models are more appropriate when we want to take into account variability between study units and make inference on random effects, while GEE models are more appropriate when we want to make inference on group means and maintain greater flexibility in specifying the correlation structure. Both types of models are useful tools for analyzing longitudinal data, and the choice between the two depends on the specific research questions and the characteristics of the data.

Python example of Mixed Effects Models

To exemplify the use of mixed models on a longitudinal dataset, we will create a simple dataset that represents repeated measurements of a group of individuals over time. Let’s imagine we have a dataset that contains the heights of children measured at different ages.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}

df = pd.DataFrame(data)
df

Here is an example of what the dataset might look like:

Each row represents an observation of an individual at a given time (age) with his or her respective height recorded.

Now, we’ll use Python along with the statsmodels module to run a mixed effects model on this data. Make sure you have installed the statsmodels module before running the code.

# Definition of the mixed effects model
model = smf.mixedlm("Height ~ Age", df, groups=df["Individual"])

# Model training
result = model.fit()

# Printing the model results
print(result.summary())

In this example, the formula "Height ~ Age" specifies a fixed effect of age on height, while groups=df["Individual"] adds a random intercept for each individual; the fit() method then estimates the model (by default via REML), and summary() reports the results.

Running the code, you get the following result:

        Mixed Linear Model Regression Results
=====================================================
Model:            MixedLM Dependent Variable: Height 
No. Observations: 6       Method:             REML   
No. Groups:       2       Scale:              0.5556 
Min. group size:  3       Log-Likelihood:     -7.7385
Max. group size:  3       Converged:          Yes    
Mean group size:  3.0                                
-----------------------------------------------------
           Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept  73.307    1.156 63.389 0.000 71.040 75.574
Age         5.228    0.188 27.738 0.000  4.859  5.597
Group Var   1.051    2.746                           
=====================================================

The result obtained from the summary provides you with a complete overview of the results of your mixed model, allowing you to evaluate the effect of the independent variables on changes in height over time, while simultaneously controlling for the individual effects of the groups (individuals) in the dataset. Here’s how to interpret the various elements:

Model and dependent variable: The model used is a mixed linear model (MixedLM). The dependent (or response) variable is height (Height).

Number of observations and groups: the model was fit on 6 observations divided into 2 groups (the individuals), each contributing exactly 3 measurements.

Estimation method: The method used to estimate model parameters is Restricted Maximum Likelihood (REML), which is a common method for estimating parameters in mixed models.
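To make the difference concrete, the same small model can be refit with both estimation criteria; this sketch reuses the dataset above and compares the two log-likelihoods:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}
df = pd.DataFrame(data)

model = smf.mixedlm("Height ~ Age", df, groups=df["Individual"])

# REML (the default): variance estimates corrected for the
# degrees of freedom spent on the fixed effects
result_reml = model.fit(reml=True)

# Full maximum likelihood: required, for example, when comparing
# models with different fixed effects via likelihood-ratio tests
result_ml = model.fit(reml=False)

print("REML log-likelihood:", result_reml.llf)
print("ML log-likelihood:  ", result_ml.llf)
```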

Model scale: The model scale, which represents the residual variance not explained by the model, is 0.5556.

Log-Likelihood and Convergence: The Log-Likelihood value is -7.7385, which provides a measure of the adequacy of the model. The model has converged, which indicates that the parameter estimation process has been successfully completed.

Estimated parameters: the intercept (73.307) is the estimated height at age zero, the Age coefficient (5.228) is the estimated average growth in centimeters per year of age, and Group Var (1.051) is the estimated variance of the random intercepts across individuals.

Confidence interval: For each estimated coefficient, 95% confidence intervals are provided (0.025 to 0.975).
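The same intervals can also be retrieved programmatically with the results object's conf_int() method; a sketch refitting the model from above:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}
df = pd.DataFrame(data)

result = smf.mixedlm("Height ~ Age", df, groups=df["Individual"]).fit()

# 95% confidence intervals for every estimated parameter
ci = result.conf_int(alpha=0.05)
ci.columns = ['2.5%', '97.5%']
print(ci)
```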

Now let’s also add some graphical representations to visually understand what we are studying.

import matplotlib.pyplot as plt
import seaborn as sns

# Data visualization: scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Age', y='Height', hue='Individual', palette='Set1', s=100)
plt.title('Scatter plot of height measurements by age and individual')
plt.xlabel('Age')
plt.ylabel('Height')
plt.legend(title='Individual')
plt.grid(True)
plt.show()

Running the code, you obtain the following scatter plot of the measurements taken (the longitudinal data).

We will now display the plot of residuals. The residuals plot is a useful tool for evaluating the goodness of fit of the model and identifying any patterns or violations of the model assumptions. By examining the residuals, which are the differences between the observed values and those predicted by the model, we can evaluate whether the model adequately captures the variation in the data and whether there are any residual structures that have not been modeled correctly.

A residual plot is mainly used to check that the residuals are centered around zero with roughly constant spread (homoscedasticity), to reveal nonlinear patterns the model has failed to capture, and to spot outliers or influential observations.

# Visualization of model results: residual plot
plt.figure(figsize=(8, 6))
sns.residplot(x=result.fittedvalues, y=result.resid, lowess=True, scatter_kws={'alpha': 0.5})
plt.title('Residual plot of the mixed effects model')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.grid(True)
plt.show()

Running the code gives you the residual plot of the mixed effects model.

Finally, we will graphically report the results of the model, i.e. the coefficients obtained from the mixed model. The coefficient plot shows the estimated values of the mixed model coefficients, along with the confidence intervals. This graph is useful for evaluating the estimated effect of each independent variable and for comparing their impacts on the outcome. It can also be useful for identifying variables that have a significant effect on the outcome versus those that do not. Furthermore, by comparing the estimated coefficients with their confidence intervals, we can determine whether a coefficient is statistically significant.

# Visualization of model results: coefficient plot with 95% confidence intervals
ci = result.conf_int()
errors = (ci.iloc[:, 1] - ci.iloc[:, 0]) / 2

plt.figure(figsize=(8, 6))
plt.bar(result.params.index, result.params.values, yerr=errors, capsize=5)
plt.title('Coefficients of the mixed effects model')
plt.xlabel('Coefficient')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

Running the code produces the plot of the mixed model coefficients.

This is just a very basic example of how to use mixed models on longitudinal data using Python. Mixed models can be extended further to include other variables as covariates and can be adapted to meet the specific needs of your dataset and research question.
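As a sketch of such an extension (the Gender covariate, the extra children, and the random slope below are hypothetical additions, not part of the dataset used above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extended dataset: five children, four visits each
data = {
    'Individual': [1]*4 + [2]*4 + [3]*4 + [4]*4 + [5]*4,
    'Age': [2, 4, 6, 8] * 5,
    'Gender': ['M']*4 + ['F']*4 + ['M']*4 + ['F']*4 + ['M']*4,
    'Height': [90, 100, 110, 120, 91, 100, 109, 118,
               89, 100, 111, 122, 91, 101, 111, 121,
               89, 98, 108, 118]
}
df = pd.DataFrame(data)

# Gender enters as an additional fixed-effect covariate, while
# re_formula="~Age" adds a random slope for Age on top of the random
# intercept, letting each child grow at their own rate
model = smf.mixedlm("Height ~ Age + Gender", df,
                    groups=df["Individual"], re_formula="~Age")
result = model.fit()
print(result.summary())
```

With so little data the random-slope variance may be estimated near zero (statsmodels will warn about this), but with a realistic sample the same specification applies unchanged.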

Python Example of Generalized Estimating Equations (GEE) Models

We will create an example of how to use Generalized Estimating Equation (GEE) Models on a longitudinal dataset. For the example, let’s say we have a dataset that contains repeated blood pressure measurements of a group of patients over time.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Creating the DataFrame with the data
data = {
    'Patient': [1, 1, 1, 2, 2, 2],
    'Time': [1, 2, 3, 1, 2, 3],
    'Blood_Pressure': [120, 118, 115, 130, 128, 125],
    'Treatment': ['A', 'A', 'A', 'B', 'B', 'B']
}

df = pd.DataFrame(data)
df

Here is an example of what the dataset might look like:

Each row represents an observation of a patient at a given point in time (time) with their respective blood pressure and treatment received recorded. Now, we will use Python together with the statsmodels module to run a GEE model on this data. Make sure you have installed the statsmodels module before running the code.

# Definition of the GEE model
model = sm.GEE.from_formula("Blood_Pressure ~ Treatment", groups="Patient", data=df)

# Model training
result = model.fit()

# Print the model results
print(result.summary())

In this example, the formula "Blood_Pressure ~ Treatment" models blood pressure as a function of the treatment received, while groups="Patient" tells the model which observations belong to the same patient; by default a Gaussian family with an independence working correlation structure is used.

Running the code, you will get the following result:

                              GEE Regression Results                              
===================================================================================
Dep. Variable:              Blood_Pressure   No. Observations:                    6
Model:                                 GEE   No. clusters:                        2
Method:                        Generalized   Min. cluster size:                   3
                      Estimating Equations   Max. cluster size:                   3
Family:                           Gaussian   Mean cluster size:                 3.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Tue, 26 Mar 2024   Scale:                           6.333
Covariance type:                    robust   Time:                         09:21:33
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          117.6667   4.74e-15   2.48e+16      0.000     117.667     117.667
Treatment[T.B]      10.0000    6.7e-15   1.49e+15      0.000      10.000      10.000
==============================================================================
Skew:                         -0.2391   Kurtosis:                      -1.5000
Centered skew:                -0.2391   Centered kurtosis:             -1.5000
==============================================================================

One way to graph the results of a GEE model would be via a bar graph showing the mean blood pressure for each treatment, along with confidence intervals. This allows us to easily view the average differences in blood pressure between different treatments.

import matplotlib.pyplot as plt

# Calculate blood pressure averages for each treatment
mean_pressure = df.groupby('Treatment')['Blood_Pressure'].mean()

# Calculate standard errors for each treatment
std_error = df.groupby('Treatment')['Blood_Pressure'].std() / (df.groupby('Treatment')['Blood_Pressure'].count() ** 0.5)

# Plotting
plt.figure(figsize=(8, 6))
mean_pressure.plot(kind='bar', yerr=std_error, capsize=5, color=['blue', 'green'], alpha=0.7)
plt.title('Average Blood Pressure by Treatment')
plt.xlabel('Treatment')
plt.ylabel('Blood Pressure')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()

Executing this will give you the following graph:

This is just a very basic example of how to use GEE models on longitudinal data using Python. GEE models can be extended further to include other variables as covariates and specify different correlation structures between observations.

Longitudinal data study

Analyzing longitudinal data involves a series of steps and the use of different statistical parameters to understand and interpret the dynamics of the data over time: the attrition rate, fixed and random effects, analysis of covariance (ANCOVA), and growth models.

The calculation of these parameters is essential for a correct interpretation of longitudinal data and for a more in-depth understanding of developmental processes, risk and protective factors, and the effectiveness of interventions over time. It also helps mitigate potential sources of bias and makes the analysis more accurate and reliable. We will look at them one by one in the rest of the article with some simple examples.

The Attrition Rate

Attrition rate, also known as loss-to-follow-up rate or dropout rate, refers to the percentage of participants in a longitudinal study who are no longer available or cannot be followed over time. This can happen for a variety of reasons, including refusal to continue participation, loss of contact with participants, relocation, or death.

Attrition rate is an important factor to consider when analyzing longitudinal data, as it can influence the validity and reliability of conclusions drawn from the study. A high attrition rate can lead to problems with sample representativeness, bias in results, and reduction in statistical power.

To manage attrition, researchers often adopt several strategies, such as improving data collection methods, offering incentives for participants to stay in the project, maintaining a good relationship with participants over time, and appropriately analyzing the missing data.

In general, it is important to carefully monitor the attrition rate and consider its implications when interpreting longitudinal study results.

Here is an example of how to calculate the attrition rate in Python using a pandas DataFrame:

import pandas as pd

# Creating a dummy DataFrame with the data
data = {
    'Individual': [1, 2, 3, 4, 5],
    'Number of Observations': [5, 4, 3, 2, 1]  # Number of observations for each individual
}

df = pd.DataFrame(data)

# Number of planned observations per individual (five waves in this design)
planned_per_individual = 5

# Calculating the total number of expected observations
total_individuals = len(df)
expected_observations = planned_per_individual * total_individuals

# Calculating the total number of observations actually collected
total_observations = df['Number of Observations'].sum()

# Calculating the number of missing observations
missing_observations = expected_observations - total_observations

# Calculating the attrition rate
attrition_rate = (missing_observations / expected_observations) * 100

print("Expected observations:", expected_observations)
print("Total observations collected:", total_observations)
print("Number of missing observations:", missing_observations)
print("Attrition rate:", attrition_rate, "%")

In this example, we have a DataFrame that contains the number of observations actually collected for each individual, out of five planned measurement waves. The attrition rate is calculated as the percentage of planned observations that are missing. Finally, we print the results.

Expected observations: 25
Total observations collected: 15
Number of missing observations: 10
Attrition rate: 40.0 %

There are several classic ways to visualize the attrition rate. One of the most common ways is to use a bar chart or pie chart to show the proportion of missing observations out of total observations. Here is an example of how you can visualize attrition rate using a pie chart in Python.

import matplotlib.pyplot as plt

# Creating a list of labels for the pie chart
labels = ['Completed observations', 'Missing observations']

# Creating a list of values for the pie chart
sizes = [total_observations, missing_observations]

# Creation of the pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Attrition rate')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

This code will create a pie chart showing the proportion of completed and missing observations in your dataset.

Fixed effects and Random effects

Fixed effects and random effects are two fundamental concepts in mixed models used to analyze longitudinal data or data with hierarchical structure. These terms refer to how the variables in the model are considered and the nature of their relationship to the units of study.

Fixed effects:
Fixed effects are parameters that are assumed to be constant across the population and are estimated directly from the model. These effects represent the average effect of an independent variable on a dependent variable. For example, in a model that studies the effect of a treatment on an outcome, the treatment fixed effect would represent the average difference in the outcome between the treated group and the control group.

Random Effects:
Random effects are considered to be sampled from a probability distribution and are used to capture variation across study units. These effects represent individual deviations from the population average. For example, if we are studying the impact of a treatment on different individuals, random effects would capture individual differences in response to treatment, which cannot be explained by the independent variables in the model alone.

In short, fixed effects are parameters that are assumed to be constant across the population and focus on the average effects of the independent variables, while random effects capture individual variation across study units and provide information about how these units differ from the population average. Both effects are important to consider when analyzing longitudinal data to understand both average effects and individual variation over time.

Suppose we have a dataset representing the heights of children measured at different ages. We will use a mixed model to analyze this data, with fixed effects for age and random effects for each individual.

import pandas as pd
import statsmodels.api as sm

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}

df = pd.DataFrame(data)

# Defining the mixed effects model with fixed effects for age and random effects for individual
model = sm.MixedLM.from_formula("Height ~ Age", groups="Individual", data=df)
result = model.fit()

# Printing the model results
print(result.summary())

In this example, we are using the statsmodels MixedLM module to create a mixed model. The formula specified in the from_formula function indicates that we are modeling height as a function of age, with a fixed effect for age and a random effect for the individual. The Individual variable is specified as group to capture random effects specific to each individual. Finally, the model results are printed using the summary() method. This will give us information about the estimated coefficients, significance tests, and other model parameters.

         Mixed Linear Model Regression Results
========================================================
Model:             MixedLM  Dependent Variable:  Height 
No. Observations:  6        Method:              REML   
No. Groups:        2        Scale:               0.5556 
Min. group size:   3        Log-Likelihood:      -7.7385
Max. group size:   3        Converged:           Yes    
Mean group size:   3.0                                  
--------------------------------------------------------
              Coef.  Std.Err.   z    P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept     73.307    1.156 63.389 0.000 71.040 75.574
Age            5.228    0.188 27.738 0.000  4.859  5.597
Individual Var 1.051    2.746                           
========================================================

Now let’s also implement a graphical representation.

import seaborn as sns
import matplotlib.pyplot as plt

# Creation of scatter plot with regression lines
sns.lmplot(x='Age', y='Height', data=df, hue='Individual', ci=None, scatter_kws={"s": 100})
plt.title('Fixed and random effects')
plt.xlabel('Age')
plt.ylabel('Height')
plt.show()

In this code, we are using Seaborn’s sns.lmplot to create a scatterplot with regression lines. Each individual is represented by a different color on the graph.

The regression lines show the fixed effects, while the scatter of points around the lines reflects the random effects. This provides a visual representation of how children’s height varies with age, taking into account both average effects and individual differences.

Analysis of Covariance (ANCOVA)

In the context of longitudinal data, analysis of covariance (ANCOVA) can be extended to take into account the longitudinal structure of the data and to evaluate differences between groups on a continuous dependent variable, while simultaneously controlling for the effect of continuous variables (covariates) on more points over time. This approach is often called longitudinal ANCOVA or time-varying ANCOVA.

Longitudinal ANCOVA considers the variability observed between participants over time and attempts to isolate the effects of interest, controlling for initial or pre-existing differences between groups and other variables that might influence the dependent variable over time.

The main procedure of longitudinal ANCOVA involves the specification of a model that incorporates the effects of groups of interest, control variables, and time. This model can be implemented using multivariate linear regression techniques, such as mixed effects models or generalized linear models.
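A minimal sketch of such a specification, using a mixed effects model with a random intercept per individual (the dataset, group labels, and baseline covariate below are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal ANCOVA data: two groups, three visits each
data = {
    'Individual': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Group':      ['A'] * 6 + ['B'] * 6,
    'Time':       [0, 1, 2] * 4,
    'Baseline':   [50, 50, 50, 55, 55, 55, 48, 48, 48, 53, 53, 53],
    'Outcome':    [52, 55, 58, 57, 60, 62, 50, 56, 61, 55, 62, 67]
}
df = pd.DataFrame(data)

# Group and time effects, adjusted for the baseline covariate; the
# random intercept per individual models the repeated-measures structure
model = smf.mixedlm("Outcome ~ Group + Time + Baseline", df,
                    groups=df["Individual"])
result = model.fit()
print(result.summary())
```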

Some key points to consider when analyzing covariance in longitudinal data include controlling for baseline differences between groups, choosing covariates measured before or independently of the treatment, modeling the within-subject correlation of repeated measurements, and handling missing data due to dropout.

Understanding and managing these considerations are critical to obtaining valid and interpretable results in the analysis of covariance in longitudinal data.

Suppose we have a dataset similar to the one used in the previous examples, where we measured the height of children at different ages, and we want to evaluate whether age affects the height of children while controlling for a covariate, for example gender.

import pandas as pd
import statsmodels.api as sm

# Creating the DataFrame with the data
data = {
    'Individual': [1, 1, 1, 1, 2, 2, 2, 2],
    'Age': [2, 4, 6, 8, 3, 5, 7, 9],
    'Height': [85, 95, 105, 111, 88, 98, 110, 123],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F']
}

df = pd.DataFrame(data)

# Defining the ANCOVA model
model = sm.OLS.from_formula("Height ~ Age + Gender", data=df)
result = model.fit()

# Printing the model results
print(result.summary())

In this example, we are using the statsmodels module to perform an ANCOVA analysis. In the model formula, we are specifying that we want to model height as a function of age and gender as a covariate. The fit() method will train the model and return the results.

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Height   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.966
Method:                 Least Squares   F-statistic:                     99.27
Date:                Tue, 26 Mar 2024   Prob (F-statistic):           9.46e-05
Time:                        09:46:30   Log-Likelihood:                -16.380
No. Observations:                   8   AIC:                             38.76
Df Residuals:                       5   BIC:                             39.00
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      74.0000      2.543     29.095      0.000      67.462      80.538
Gender[T.M]    -0.6250      1.718     -0.364      0.731      -5.042       3.792
Age             5.1250      0.375     13.667      0.000       4.161       6.089
==============================================================================
Omnibus:                        0.297   Durbin-Watson:                   1.169
Prob(Omnibus):                  0.862   Jarque-Bera (JB):                0.367
Skew:                          -0.320   Prob(JB):                        0.832
Kurtosis:                       2.168   Cond. No.                         19.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

With this code, we will get ANCOVA results that will show whether age and gender have a significant effect on children’s height, controlling for each other.

An example of how to display ANCOVA results is to use a scatter plot with separate regression lines for each level of the categorical variable (for example, gender) and a different color for each level.

import seaborn as sns
import matplotlib.pyplot as plt

# Creation of scatter plot with regression lines for each gender level
sns.lmplot(x='Age', y='Height', hue='Gender', data=df, ci=None)
plt.title('ANCOVA: Height as a function of Age and Gender')
plt.xlabel('Age')
plt.ylabel('Height')
plt.show()

In this code, we are using Seaborn’s lmplot function to create a scatterplot with separate regression lines for each gender level. This will allow us to graphically visualize how height varies as a function of age, controlling for gender. The hue parameter is used to specify the categorical variable (gender) we want to separate and color the regression lines.

The Growth Model

A growth model is a type of statistical model used to describe and analyze the change of a variable over time. These models are commonly employed in longitudinal research, in which data are collected at multiple points in time for the same individual, group, or study unit.

Growth models are often used to examine and understand the processes of development, change and learning over time. They can be applied to a wide range of phenomena, including cognitive development, physical growth, behavioral changes, and many other areas of interest.

There are several types of growth models, including linear growth models (a constant rate of change), nonlinear models such as quadratic or exponential growth curves, and latent growth curve models estimated within a structural equation or mixed-model framework.

In summary, growth models are useful tools for understanding processes of change over time and for testing hypotheses regarding the factors that influence such changes. The choice of model depends on the nature of the data and the specific objectives of the research.
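As a simple illustration, a linear growth model (a fixed average growth rate plus a random intercept per child) can be fit with the same mixedlm machinery used earlier in this article; the reading scores below are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical reading scores for three children over four waves
data = {
    'Child': [1]*4 + [2]*4 + [3]*4,
    'Wave':  [0, 1, 2, 3] * 3,
    'Score': [20, 25, 31, 36, 18, 22, 27, 33, 24, 30, 35, 41]
}
df = pd.DataFrame(data)

# Linear growth model: Wave estimates the average rate of change,
# while the random intercept lets each child start at their own level
model = smf.mixedlm("Score ~ Wave", df, groups=df["Child"])
result = model.fit()

print("Average starting level:", round(result.params['Intercept'], 2))
print("Average growth per wave:", round(result.params['Wave'], 2))
```

Replacing "~ Wave" with, say, "~ Wave + I(Wave**2)" would turn this into a quadratic growth model.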

Other fundamental parameters in the study of longitudinal data

There are other key parameters that are used in the study of longitudinal data to provide a complete and in-depth understanding of the processes under investigation, such as the intraclass correlation coefficient (ICC), the within-subject and between-subject variance components, the autocorrelation of repeated measurements, and effect sizes describing change over time.

These are just a few examples of important parameters when studying longitudinal data. The choice of parameters to use depends on the objective of the study, the nature of the data, and the specific research questions.
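As one concrete example, the intraclass correlation coefficient (ICC), i.e. the share of total variance attributable to differences between study units, can be derived from a fitted mixed model; a sketch reusing the height dataset from earlier:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = {
    'Individual': [1, 1, 1, 2, 2, 2],
    'Age': [2, 4, 6, 3, 5, 7],
    'Height': [85, 95, 105, 88, 98, 110]
}
df = pd.DataFrame(data)

result = smf.mixedlm("Height ~ Age", df, groups=df["Individual"]).fit()

# Between-individual variance (random intercept) and residual variance
between_var = float(result.cov_re.iloc[0, 0])
residual_var = result.scale

# ICC: proportion of total variance due to differences between individuals
icc = between_var / (between_var + residual_var)
print("ICC:", round(icc, 3))
```

An ICC close to 1 means most of the variability lies between individuals rather than within them, which is itself an argument for modeling such data with mixed models.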
