Sampling Methods in Python


Sampling is a fundamental process in research and statistics, allowing meaningful conclusions to be drawn from a representative subset of a larger population. In this article, we will review the concept of sampling and the main methods used to select representative samples. Through practical examples in Python code and theoretical considerations, we will illustrate the importance of careful sample selection and the applications of different sampling methods.

Sampling

Sampling is the process of selecting a representative subset of a larger population to conduct a statistical analysis or research. Rather than collecting data from an entire population, which may be expensive, challenging, or even impossible, researchers select a sample of individuals or items that reflect characteristics of the population as a whole.

Sampling is fundamental in statistics because it allows us to make inferences about the larger population without the need to collect data from all its members. However, it is important that the sample is selected randomly or representatively to avoid bias and ensure that the conclusions drawn are valid for the entire population.

The Sample and the Population

The terms sample and population are used often, so it is important to clarify them before moving forward: in statistics they are two very distinct concepts.

The population represents the entire group of individuals, objects, events or other entities that we want to study or analyze. It is the complete and defined set of elements about which we intend to make statistical inferences. For example, if you are studying the height of students in a school, the population would be all students in that school.

A sample is a subset selected from the population. It is a group of individuals drawn from the population who are used to conduct a statistical analysis. The goal of sampling is to obtain a representative group of the population, so that valid inferences can be made about the entire population. For example, if you want to study the height of students in a school, a sample might consist of 100 students randomly selected from the school’s complete list of students.

In short, the main difference between population and sample is that the population represents the entire group of interest, while the sample is a selected subset from the population that is studied or analyzed to make inferences about the population itself.

But how do you know if the sample is representative of the population?

Determining whether a sample is truly representative of a population is critical to ensuring that conclusions drawn from the sample analysis are valid for the entire population. There are several methods and criteria that can be used to assess whether a sample is representative:

  • Randomness in Sampling: The sample must be randomly selected from the population. This means that each individual or element of the population has a known and equal probability of being selected to be part of the sample. Using random sampling techniques, such as simple random sampling or systematic sampling, can help ensure randomness in sample selection.
  • Sample Size: The sample should be large enough to capture the variability present in the population. A sample that is too small may not be representative of the diversity of the population, while a larger sample offers a better chance of representativeness.
  • Representativeness of Characteristics: The sample should reflect key characteristics of the population. For example, if the population is made up of 60% women and 40% men, the sample should proportionately reflect this gender split.
  • Absence of Bias in Sampling: It is important to avoid bias in the sampling process that could influence the results. For example, convenience sampling, where individuals are selected based on their availability or accessibility, could introduce a bias into the composition of the sample.
  • Statistical Evaluation: Statistical analyses can be used to evaluate whether the sample is representative of the population, for example by comparing the demographic characteristics or other key variables of the sample with those of the population.
  • Comparison with Existing Literature: When possible, comparing the characteristics of the sample with those of other research or studies previously conducted on the same population can help evaluate the representativeness of the sample.

Overall, a combination of methods and criteria can be used to assess whether a sample is truly representative of a population. It is important to pay attention to these aspects during study design and data analysis in order to ensure the reliability and validity of the conclusions drawn from the sample analysis.
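As a rough illustration of the "Statistical Evaluation" criterion above, the sketch below compares the gender split and the mean age of a sample with those of the population it was drawn from. The toy population, the column names, the sample size and the seed are all assumptions made for this example and do not come from a real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (an arbitrary choice) for reproducibility

# Hypothetical population of 1,000 individuals: about 60% women and 40% men
toy_population = pd.DataFrame({
    'Gender': rng.choice(['Female', 'Male'], size=1000, p=[0.6, 0.4]),
    'Age': rng.integers(18, 70, size=1000),
})

# Simple random sample of 200 individuals
toy_sample = toy_population.sample(n=200, random_state=42)

# Side-by-side comparison of the category proportions
comparison = pd.DataFrame({
    'population': toy_population['Gender'].value_counts(normalize=True),
    'sample': toy_sample['Gender'].value_counts(normalize=True),
})
comparison['abs_diff'] = (comparison['population'] - comparison['sample']).abs()
print(comparison)

# Comparison of a numeric characteristic
print('Mean age (population):', toy_population['Age'].mean())
print('Mean age (sample):', toy_sample['Age'].mean())
```

Small differences between the two columns suggest that the sample mirrors the population on that characteristic. Large differences are a warning sign, although they can still arise by chance in small samples.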

Sampling Methods

Sampling methods are designed to ensure that the sample is representative of the population of interest, and the choice of method depends on the nature of the research, available resources and other relevant factors. Accurate sampling is essential to ensure that the results obtained from the studies are reliable and generalizable to the reference population.

There are several sampling methods, and the choice of method depends on the objective of the study and the characteristics of the population. Here are some of the main sampling methods:

  • Simple random sampling
  • Systematic sampling
  • Stratified sampling
  • Cluster sampling
  • Convenience sampling
  • Quota sampling

The choice of method depends on the nature of the research and the resources available. It is important to select a method that minimizes the risk of bias and provides results that are representative of the population of interest. Let’s now see the different sampling methods with simple examples implemented in Python.

Let’s implement some sampling methods in Python: definition of a test population

In order to implement sampling methods with Python, we first need to create a population that resembles a real one as closely as possible. To do this, we will generate random values describing the characteristics of each element of the population. A real population would usually be far larger, but to keep the example simple we will limit ourselves to 100 subjects. In Python, a convenient way to hold such a population is a pandas DataFrame.

import numpy as np
import pandas as pd

# Creation of a sample population
population = pd.DataFrame({
    'ID': range(1, 101),  # Unique identifiers for individuals
    'Age': np.random.randint(18, 70, size=100),  # Random age from 18 to 69 (the upper bound is exclusive)
    'Gender': np.random.choice(['Male', 'Female'], size=100),  # Random gender
    'Income': np.random.normal(50000, 10000, size=100)  # Random income with mean 50000 and standard deviation 10000
})

population

The population variable is a DataFrame representing an imaginary population of 100 individuals, each with a unique identifier, a random age, a random gender, and a random income. This is just a simplified example of a population that could be used for statistical analysis or simulation purposes. Running the previous code fragment produces a DataFrame of 100 individuals whose ages are drawn uniformly from the integers 18 to 69 (the upper bound passed to np.random.randint is exclusive) and whose income is normally distributed. Because no random seed is set, the values will vary from run to run. In my case I obtained the following result.

Sampling methods - population dataframe

We can also visualize the population with a histogram that shows how it is distributed across the various age groups:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram plot to visualize the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(population['Age'], bins=10, kde=True, color='skyblue')
plt.title('Age Distribution in the Population')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

Running the code produces a graph similar to the following (the population is random and varies from run to run):

Sampling methods - population distribution

Now that we have a population to sample, we can move on to analyze the various sampling methods with a specific example for each of them.

Simple Random Sampling

Simple random sampling is one of the simplest and most fundamental sampling methods in statistics. It consists of randomly selecting a sample of individuals from the population without any type of stratification or subdivision. In other words, every individual in the population has the same probability of being selected to be part of the sample. This method is widely used because it is easy to implement and provides an unbiased estimate of population characteristics.

# Simple Random Sampling
def simple_random_sampling(population, n):
    return population.sample(n)

# Example of using simple random sampling to select 10 individuals
random_sample = simple_random_sampling(population, 10)
print("Simple Random Sampling:")
print(random_sample)

Running the code you get the following result:

Simple Random Sampling:
    ID  Age  Gender        Income
40  41   30    Male  38263.946005
17  18   55  Female  37423.210786
72  73   67  Female  47493.329812
77  78   39    Male  41936.878168
61  62   20  Female  61902.900527
18  19   27    Male  30213.753163
24  25   68    Male  45616.179535
62  63   27    Male  61020.302192
65  66   36  Female  42912.616841
63  64   26    Male  47544.117333

A good way to judge how well the sample matches the population is to plot the two together. Superimposing the sample's distribution on the population's reveals a lot at a glance. The following code does this.

import matplotlib.pyplot as plt
import seaborn as sns

# Overlay histogram plot to visualize the age distribution in the population and the random sample
plt.figure(figsize=(10, 6))

# stat='density' puts both histograms on the same scale and matches the y-axis label
# Age distribution in the population
sns.histplot(population['Age'], bins=10, kde=True, stat='density', color='skyblue', label='Population')

# Age distribution in the random sample
sns.histplot(random_sample['Age'], bins=10, kde=True, stat='density', color='salmon', label='Random Sample')

plt.title('Age Distribution: Population vs Random Sample')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()

plt.show()

Running it you get a graph similar to the following:

Sampling methods - random simple sampling
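To make the claim about unbiased estimates more concrete, here is a minimal sketch that repeats simple random sampling many times and looks at how the sample mean behaves. The stand-in population, the helper sample_means, the number of repetitions and the seeds are assumptions introduced for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the article's population (only the Age column)
ages = pd.Series(rng.integers(18, 70, size=100), name='Age')

def sample_means(values, n, repetitions=2000, seed=0):
    """Means of `repetitions` simple random samples of size n (without replacement)."""
    local_rng = np.random.default_rng(seed)
    data = values.to_numpy()
    return np.array([
        local_rng.choice(data, size=n, replace=False).mean()
        for _ in range(repetitions)
    ])

for n in (5, 10, 50):
    means = sample_means(ages, n)
    print(f"n={n:>2}: average of sample means = {means.mean():.2f}, "
          f"spread (std) = {means.std():.2f}")

print(f"Population mean = {ages.mean():.2f}")
```

The average of the sample means should stay close to the population mean for every sample size, while the spread of the individual estimates shrinks as n grows. This is why a sample of only 10 individuals can look quite different from the population in a single run.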

Stratified Sampling

Stratified sampling is a sampling technique that divides the population into homogeneous groups called “strata”, and then selects a random sample from each of these strata. This method is used when the population has heterogeneity in some key characteristics, and the objective is to ensure that the sample accurately reflects this heterogeneity.

Initially, the population is divided into homogeneous groups or strata based on a characteristic or variable of interest. For example, if we are studying the income of individuals, we can divide the population into strata based on income groups such as low, medium, and high. After defining the strata, a random sample is selected from each of them. It is important that the selection of samples within each stratum occurs randomly, to ensure the representativeness of the overall sample. Once samples from each stratum are selected, they are combined to form the overall stratified sample.

Stratified sampling is useful when the population has significant variation in the characteristics of interest and when you want to ensure that the sample accurately reflects this diversity. It is particularly effective at reducing estimated variance and improving the accuracy of statistical estimates compared to simple random sampling, especially when strata are homogeneous within but heterogeneous between them.

For example, if we are conducting a study on the job satisfaction of a company’s employees, we might divide the population into strata based on seniority level (e.g., new hires, long-term employees, managers) and then select a random sample from each of these groups to form a stratified sample.

Let’s now return to our example population and see how to apply stratified sampling in Python. In this case we will use age to identify the strata; for simplicity, each stratum is a single exact age rather than an age range.

# Stratified Sampling by Age
def stratified_sampling(population, n, stratification):
    sample = pd.DataFrame()
    for value, proportion in stratification.items():
        stratum = population[population['Age'] == value]
        if stratum.empty:  # Check if there are individuals in the given stratum
            print("There are no individuals in the population with age ", value)
            print("Please modify the population or change the age.")
            return sample
        else:
            # Check if there are enough individuals in the given stratum
            if len(stratum) < int(n * proportion):
                print("There are not enough individuals in the population with age ", value)
                return sample
            else:
                stratum_sample = stratum.sample(int(n * proportion))
                sample = pd.concat([sample, stratum_sample])
    return sample

# Example of using stratified sampling by age to select 10 individuals
age_stratification = {20: 0.2, 35: 0.3, 40: 0.2, 50: 0.1, 60: 0.2}  # Proportions for each age stratum 
stratified_age_sample = stratified_sampling(population, 10, age_stratification)
print("\nStratified Sampling by Age:")
print(stratified_age_sample)

Running the code gives the following result:

Stratified Sampling by Age:
    ID  Age  Gender        Income
96  97   20  Female  53476.104206
61  62   20  Female  61902.900527
74  75   35  Female  39775.759287
93  94   35    Male  40767.701355
27  28   35  Female  59656.807053
25  26   40    Male  54748.113219
59  60   40    Male  38965.755561
11  12   50    Male  47480.883850
52  53   60  Female  41993.969218
69  70   60    Male  49140.256943

When using stratified sampling, the main goal is to ensure that the sample accurately reflects the distribution of key characteristics present in the population. While directly viewing a stratified sample relative to the entire population may not always be useful, there are other visualizations that can be helpful in analyzing and interpreting the results of stratified sampling.

For example, the simplest option is to extract from the population the strata used in the sampling and compare them directly with the sample, so that like is compared with like.

import matplotlib.pyplot as plt

# Calculate proportions in the population
population_proportions = {age: len(population[population['Age'] == age]) / len(population) for age in age_stratification.keys()}

# Calculate proportions in the sample
sample_proportions = {age: len(stratified_age_sample[stratified_age_sample['Age'] == age]) / len(stratified_age_sample) for age in age_stratification.keys()}

# Ages and proportion values
ages = list(age_stratification.keys())
population_props = list(population_proportions.values())
sample_props = list(sample_proportions.values())

# Bar width
bar_width = 0.35

# Bar positions on the plot
positions = range(len(ages))

# Create the bar plot
plt.figure(figsize=(10, 6))
plt.bar(positions, population_props, bar_width, label='Population', color='skyblue')
plt.bar([p + bar_width for p in positions], sample_props, bar_width, label='Sample', color='salmon')

# Labels and title
plt.xlabel('Age')
plt.ylabel('Proportion')
plt.title('Comparison of Proportions between Population and Sample by Age')
plt.xticks([p + bar_width / 2 for p in positions], ages)
plt.legend()

plt.show()

Running the code gives:

Sampling methods - stratified sampling
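The example above treats single ages as strata. When strata are ranges, as in the low, medium and high income groups mentioned earlier, one possible approach is to bin the variable and draw the same fraction from every bin (proportional allocation). The sketch below is an illustration under assumptions: the stand-in population, the AgeBand column, the bin edges and the 20% fraction are all choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for the article's population
pop = pd.DataFrame({
    'ID': range(1, 101),
    'Age': rng.integers(18, 70, size=100),
})

# Define the strata as age bands (the bin edges are an arbitrary choice)
pop['AgeBand'] = pd.cut(
    pop['Age'],
    bins=[17, 29, 44, 59, 69],
    labels=['18-29', '30-44', '45-59', '60-69'],
)

# Proportional allocation: draw the same fraction at random from every stratum
banded_sample = pop.groupby('AgeBand', observed=True).sample(frac=0.2, random_state=1)

# The band proportions in the sample should be close to those in the population
print(pop['AgeBand'].value_counts(normalize=True).sort_index())
print(banded_sample['AgeBand'].value_counts(normalize=True).sort_index())
```

Because every stratum contributes the same fraction of its members, the sample keeps roughly the same age-band proportions as the population, apart from rounding within each band.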

Systematic Sampling

Systematic sampling is a sampling technique in which individuals from the population are selected at regular intervals, using a systematic process. This method involves arranging the population in an ordered list and selecting every kth element from this list, where k is the so-called “sampling interval”.

The sampling interval, denoted k, represents the number of population elements between each selection. For example, if we have a population of 1000 individuals and we choose a sampling interval of 10, we will select every 10th individual to be part of the sample. The population is sorted based on a characteristic of interest. This characteristic can be any variable that allows you to assign an order to individuals in the population, such as a unique identifier or a numeric characteristic. After defining the sampling interval and sorting the population, the sample is selected by selecting every kth element from the sorted list. For example, if the sampling interval is 10, the first, eleventh, twenty-first, and so on will be selected until the desired sample is completed.

Systematic sampling is often used when the population is already sorted or when it is difficult to obtain a complete random sample. This technique is relatively simple to implement and can be efficient when the population is large and the distribution of individuals is uniform. However, it is important to note that systematic sampling can lead to the potential introduction of bias if the population order follows a pattern that does not accurately represent variation in the characteristic of interest. Therefore, it is advisable to perform a critical analysis of the results obtained through systematic sampling.

Once again, we implement this sampling method on our population using Python code.

# Systematic Sampling
def systematic_sampling(population, n):
    k = len(population) // n  # Sampling interval
    start = np.random.randint(0, k)  # Random starting point in [0, k)
    # Take every k-th position and keep only the first n: the range can yield
    # more than n positions when len(population) is not a multiple of n
    sampled_indices = np.arange(start, len(population), step=k)[:n]
    return population.iloc[sampled_indices]

# Example of using systematic sampling to select 15 individuals
systematic_sample = systematic_sampling(population, 15)
print("\nSystematic Sampling:")
print(systematic_sample)

Running the code we will get a result similar to the following:

Systematic Sampling:
    ID  Age  Gender        Income
3    4   65  Female  47967.665387
9   10   39    Male  65416.036760
15  16   24  Female  43344.890257
21  22   54    Male  46390.444983
27  28   35  Female  59656.807053
33  34   28    Male  45416.342831
39  40   31    Male  37712.392099
45  46   51    Male  54424.359157
51  52   63    Male  50520.745717
57  58   62    Male  48836.335807
63  64   26    Male  47544.117333
69  70   60    Male  49140.256943
75  76   57  Female  52001.682813
81  82   19    Male  51900.536724
87  88   21    Male  47984.967190

In this case, graphically it is possible to follow the same approach we did with simple random sampling by directly comparing the sample with the reference population. The code will then be the same:

import matplotlib.pyplot as plt
import seaborn as sns

# Overlay histogram plot to visualize the age distribution in the population and in the systematic sample
plt.figure(figsize=(10, 6))

# stat='density' puts both histograms on the same scale and matches the y-axis label
# Age distribution in the population
sns.histplot(population['Age'], bins=10, kde=True, stat='density', color='skyblue', label='Population')

# Age distribution in the systematic sample
sns.histplot(systematic_sample['Age'], bins=10, kde=True, stat='density', color='salmon', label='Systematic Sample')

plt.title('Age Distribution: Population vs Systematic Sample')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()

plt.show()

Running this will give you a result similar to the following:

Sampling methods - systematic sampling
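The warning about ordered populations that follow a pattern can be made concrete with a small, deliberately artificial example. In the sketch below, the daily sales figures, the weekly cycle and the helper systematic are assumptions invented for the illustration.

```python
import pandas as pd

# Hypothetical ordered list: 10 weeks of daily sales, with weekends much busier
days = pd.DataFrame({
    'day': range(70),
    'sales': [100, 100, 100, 100, 100, 300, 300] * 10,
})

def systematic(frame, k, start):
    """Select every k-th row, beginning at position `start`."""
    return frame.iloc[start::k]

true_mean = days['sales'].mean()

# k = 7 coincides with the weekly cycle: every selected day is the same weekday
weekday_only = systematic(days, k=7, start=0)
weekend_only = systematic(days, k=7, start=5)

# k = 5 does not line up with the cycle, so weekdays and weekends both appear
mixed = systematic(days, k=5, start=0)

print(f"True mean:             {true_mean:.2f}")
print(f"k=7, start=0 (biased): {weekday_only['sales'].mean():.2f}")
print(f"k=7, start=5 (biased): {weekend_only['sales'].mean():.2f}")
print(f"k=5, start=0:          {mixed['sales'].mean():.2f}")
```

When the sampling interval matches the period of the pattern, the sample sees only one phase of the cycle and the estimate is badly off in either direction. An interval that does not share the period spreads the selections across the whole cycle.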

Cluster Sampling

Cluster sampling is a sampling technique in which the population is divided into groups, called “clusters”, and a subset of these clusters is randomly selected to form the sample. This technique is useful when the population is naturally organized into groups and when it is not practical or possible to individually select elements of the population.

Clusters are naturally occurring groups of individuals within the population; ideally, each cluster resembles the population as a whole. They can be defined based on geographic, social, or other factors that reflect the natural structure of the population. For example, if we are studying primary education in a certain geographic area, the clusters could be the schools in that area. After defining the clusters, a subset of them is randomly selected to form the sample. This selection is done using a random sampling technique, such as simple random sampling or systematic sampling. Once clusters are selected, additional sampling can be performed within each cluster to select specific individuals or elements to include in the sample. This can be done using other sampling techniques, such as simple or stratified random sampling.

Cluster sampling is particularly useful when the population is large and dispersed over a large geographic area or when it is expensive or difficult to individually select elements of the population. This technique allows you to simplify the sampling process by focusing on selecting representative groups of the population rather than individuals. However, it is important to keep in mind that cluster sampling may lead to lower precision than other sampling techniques, as individuals within each cluster may not be fully representative of the overall population. Therefore, it is important to carefully evaluate the trade-offs between efficiency and accuracy when using this sampling technique.

We use this sampling method on our population:

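# Cluster Sampling
def cluster_sampling(population, n, num_clusters):
    # Split the row positions into num_clusters contiguous blocks (the "clusters")
    blocks = np.array_split(np.arange(len(population)), num_clusters)
    clusters = [population.iloc[block] for block in blocks]
    # Draw one individual at random from each cluster
    sample = pd.concat([cluster.sample(1) for cluster in clusters], axis=0)
    # n only caps the size of the result: at most num_clusters rows are returned
    return sample.head(n)

# Example of using cluster sampling: with 5 clusters, 5 individuals are returned even though n=10
cluster_sample = cluster_sampling(population, 10, 5)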
print("\nCluster Sampling:")
print(cluster_sample)

Running the code gives the following result:

Cluster Sampling:
     ID  Age  Gender        Income
6     7   37    Male  47812.674648
25   26   40    Male  54748.113219
41   42   44    Male  50473.603034
78   79   53  Female  57504.398435
99  100   66    Male  68183.816243

Here too we follow the same graphical approach as before:

Sampling methods - cluster sampling
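The function above draws one individual from each block of the population. For comparison, here is a minimal sketch of one-stage cluster sampling as described in the text, where whole clusters are chosen at random and every member of a chosen cluster enters the sample. The pupils table, the School column and the helper one_stage_cluster_sampling are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical population of 100 pupils spread over 10 schools (the clusters)
pupils = pd.DataFrame({
    'ID': range(1, 101),
    'School': np.repeat([f'School_{i}' for i in range(1, 11)], 10),
    'Score': rng.normal(70, 10, size=100).round(1),
})

def one_stage_cluster_sampling(frame, cluster_col, num_selected, rng):
    """Randomly pick whole clusters and keep every member of the chosen ones."""
    clusters = list(frame[cluster_col].unique())
    chosen = rng.choice(clusters, size=num_selected, replace=False)
    return frame[frame[cluster_col].isin(chosen)]

school_sample = one_stage_cluster_sampling(pupils, 'School', num_selected=3, rng=rng)
print(school_sample['School'].value_counts())
```

Here the randomness applies to the schools rather than to the pupils, so the sample contains all pupils of three schools. A second stage could then sample pupils within each selected school, as the text describes.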

Quota Sampling

Quota sampling is a non-probability sampling technique in which the population is divided into groups, called “quotas”, based on certain characteristics of interest. Subsequently, individuals are selected from each quota until a predetermined number per quota is reached. This method is used to ensure that the sample reflects the specified proportions of the characteristics of interest present in the population, but does not guarantee randomness in the selection of individuals.

Quotas are subdivisions of the population based on specific demographic or socio-economic characteristics, such as age, gender, education level, income, etc. These quotas are chosen to reflect the desired proportions of each characteristic within the population. After defining the quotas, individuals are nonrandomly selected from each quota until the predetermined number of individuals per quota is reached. The selection of individuals can be done in various ways, for example using contact lists, street interviews or telephone calls. Once individuals are selected from each quota, they are combined to form the overall sample. Because the sample is built to reflect the desired proportions of the characteristics of interest, it is hoped that it is representative of the overall population.

Quota sampling is often used when it is not possible to use probability sampling techniques, such as simple random sampling, and when it is necessary to ensure that the sample reflects certain characteristics of the population. However, it is important to note that quota sampling can introduce bias if the quotas are not carefully chosen, and because the selection of individuals within each quota is not random. Therefore, it is crucial to pay attention to the design and implementation of quota sampling to ensure the representativeness and reliability of the resulting sample.

Let’s see it applied on our population:

# Quota Sampling
def quota_sampling(population, quotas):
    sample = pd.DataFrame()
    for attribute, value in quotas.items():
        subset = population[population[attribute] == value]
        sample = pd.concat([sample, subset.sample(frac=0.5)])  # Example: selecting 50% of cases for each quota
    return sample

# Example of using quota sampling: half of the males and half of the 30-year-olds are selected
selected_quotas = {'Gender': 'Male', 'Age': 30}  # Example of selected quotas
quota_sample = quota_sampling(population, selected_quotas)
print("\nQuota Sampling:")
print(quota_sample)

Running the code gives the following result:

Quota Sampling:
    ID  Age  Gender        Income
77  78   39    Male  41936.878168
11  12   50    Male  47480.883850
0    1   61    Male  84271.341233
57  58   62    Male  48836.335807
19  20   22    Male  49512.773535
30  31   48    Male  27854.862920
76  77   49    Male  43139.035090
20  21   26    Male  45206.345011
58  59   33    Male  56785.855448
21  22   54    Male  46390.444983
22  23   44    Male  40192.708853
2    3   67    Male  61312.382194
24  25   68    Male  45616.179535
69  70   60    Male  49140.256943
29  30   33    Male  44880.553683
71  72   47    Male  44980.757012
9   10   39    Male  65416.036760
47  48   50    Male  65845.099781
6    7   37    Male  47812.674648
33  34   28    Male  45416.342831
45  46   51    Male  54424.359157
48  49   37    Male  51681.971617
87  88   21    Male  47984.967190
56  57   43    Male  41276.614158
43  44   23    Male  55617.835136
82  83   30  Female  44631.651892

Even in this case you can follow the same graphical approach.

Sampling methods - quota sampling
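The text describes quota sampling as filling each quota with whoever is reached first, up to a fixed number per group. A minimal sketch of that idea follows. The respondents table, its arrival order, the helper quota_by_arrival and the quota sizes are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical stream of respondents, in the order in which they were reached
respondents = pd.DataFrame({
    'ID': range(1, 101),
    'Gender': rng.choice(['Male', 'Female'], size=100),
})

def quota_by_arrival(frame, column, quotas):
    """Accept respondents in arrival order until each group's quota is full."""
    counts = {group: 0 for group in quotas}
    accepted = []
    for idx, group in frame[column].items():
        if group in counts and counts[group] < quotas[group]:
            counts[group] += 1
            accepted.append(idx)
        if counts == quotas:  # every quota is filled, so recruiting stops
            break
    return frame.loc[accepted]

# Quotas of 6 men and 9 women, for a total of 15 individuals
arrival_sample = quota_by_arrival(respondents, 'Gender', {'Male': 6, 'Female': 9})
print(arrival_sample['Gender'].value_counts())
```

The group sizes are fixed in advance, but within each group the individuals are simply the first ones encountered. This is where the selection bias mentioned above can creep in.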

Opportunistic or Convenience Sampling

Opportunistic sampling, also known as convenience sampling, is a non-probability sampling technique in which individuals are selected based on their availability and accessibility. In this type of sampling, there is no concern for ensuring a random or statistical representation of the population, but rather individuals are selected who are easily accessible or available for study.

Individuals are selected based on their availability and accessibility to the researcher or organization that is conducting the study. This can include people who are easy to contact, reach, or involve in the research process. Since the selection of individuals occurs based on their availability and accessibility, opportunistic sampling does not follow a random process. As a result, results obtained from this type of sampling may not be representative of the larger population. Opportunistic sampling is often used when it is difficult or expensive to perform random sampling or when it is not possible to obtain a representative sample of the population. This technique is common in situations where researchers have access to only a small portion of the population or when studying specific groups of individuals within a larger population.

Opportunistic sampling can be useful when you want to gain preliminary information about a topic or when you want to quickly explore a phenomenon without engaging in more rigorous sampling. However, it is important to note that the results obtained from this type of sampling may be influenced by uncontrolled variables and may not be generalizable to the broader population. Therefore, it is critical to interpret the results of opportunistic sampling with caution and consider its limitations in the context of the analysis conducted.

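# Convenience Sampling
# Note: sample() draws at random, so it only simulates "whoever happened to be available";
# in a real study the selection would follow availability, not chance
def convenience_sampling(population, n):
    return population.sample(n)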

# Example of using convenience sampling to select 12 individuals
convenience_sample = convenience_sampling(population, 12)
print("\nConvenience Sampling:")
print(convenience_sample)

Running the code gives the following result:

Convenience Sampling:
    ID  Age  Gender        Income
40  41   30    Male  38263.946005
18  19   27    Male  30213.753163
57  58   62    Male  48836.335807
3    4   65  Female  47967.665387
20  21   26    Male  45206.345011
17  18   55  Female  37423.210786
23  24   69  Female  56542.547788
67  68   49  Female  49888.602757
48  49   37    Male  51681.971617
87  88   21    Male  47984.967190
69  70   60    Male  49140.256943
91  92   51  Female  57087.236066

Graphically we can have a situation like the following:

Sampling methods - convenience sampling
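Since real convenience sampling follows accessibility rather than chance, a sketch closer to the description above is to take the first individuals of a list whose order reflects how easy they are to reach. In the example below, the people table, the assumption that younger individuals are easier to reach, and the helper convenience_by_position are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical population, sorted by age to mimic a list in which the most
# accessible individuals (here, the youngest) come first
people = pd.DataFrame({
    'ID': range(1, 101),
    'Age': np.sort(rng.integers(18, 70, size=100)),
})

def convenience_by_position(frame, n):
    """Take the first n rows: whoever is easiest to reach, with no randomness."""
    return frame.head(n)

easy_sample = convenience_by_position(people, 12)
print('Mean age (population):', people['Age'].mean())
print('Mean age (convenience sample):', easy_sample['Age'].mean())
```

The mean age of the convenience sample falls well below that of the population, which illustrates why results from this kind of sampling should not be generalized without caution.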
