Sampling is a fundamental process in research and statistics, allowing meaningful conclusions to be drawn from a representative subset of a larger population. In this article, we will review the concept of sampling and the main methods used to select representative samples. Through practical examples in Python code and theoretical considerations, we will illustrate the importance of careful sample selection and the applications of different sampling methods.
Sampling
Sampling is the process of selecting a representative subset of a larger population to conduct a statistical analysis or research. Rather than collecting data from an entire population, which may be expensive, challenging, or even impossible, researchers select a sample of individuals or items that reflect characteristics of the population as a whole.
Sampling is fundamental in statistics because it allows us to make inferences about the larger population without the need to collect data from all its members. However, it is important that the sample is selected randomly or representatively to avoid bias and ensure that the conclusions drawn are valid for the entire population.
The Sample and the Population
The terms sample and population are used often, so it is important to clarify them before moving forward: in statistics, they are two very distinct concepts.
The population represents the entire group of individuals, objects, events or other entities that we want to study or analyze. It is the complete and defined set of elements about which we intend to make statistical inferences. For example, if you are studying the height of students in a school, the population would be all students in that school.
A sample is a subset selected from the population. It is a group of individuals drawn from the population who are used to conduct a statistical analysis. The goal of sampling is to obtain a representative group of the population, so that valid inferences can be made about the entire population. For example, if you want to study the height of students in a school, a sample might consist of 100 students randomly selected from the school’s complete list of students.
In short, the main difference between population and sample is that the population represents the entire group of interest, while the sample is a selected subset from the population that is studied or analyzed to make inferences about the population itself.
But how do you know if the sample is representative of the population?
Determining whether a sample is truly representative of a population is critical to ensuring that conclusions drawn from the sample analysis are valid for the entire population. There are several methods and criteria that can be used to assess whether a sample is representative:
- Randomness in Sampling: The sample must be randomly selected from the population. This means that each individual or element of the population has a known and equal probability of being selected to be part of the sample. Using random sampling techniques, such as simple random sampling or systematic sampling, can help ensure randomness in sample selection.
- Sample Size: The sample should be large enough to capture the variability present in the population. A sample that is too small may not be representative of the diversity of the population, while a larger sample offers a better chance of representativeness.
- Representativeness of Characteristics: The sample should reflect key characteristics of the population. For example, if the population is made up of 60% women and 40% men, the sample should proportionately reflect this gender split.
- Absence of Bias in Sampling: It is important to avoid bias in the sampling process that could influence the results. For example, convenience sampling, where individuals are selected based on their availability or accessibility, could introduce a bias into the composition of the sample.
- Statistical Evaluation: Statistical analyses can be used to evaluate whether the sample is representative of the population, for example by comparing the demographic characteristics or other key variables of the sample with those of the population.
- Comparison with Existing Literature: When possible, comparing the characteristics of the sample with those of other research or studies previously conducted on the same population can help evaluate the representativeness of the sample.
Overall, a combination of methods and criteria can be used to assess whether a sample is truly representative of a population. It is important to pay attention to these aspects during study design and data analysis in order to ensure the reliability and validity of the conclusions drawn from the sample analysis.
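The statistical evaluation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the population (`population_demo`, 60% women and 40% men), the sample size of 200, and the seed are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 60% women, 40% men, plus a numeric age column
rng = np.random.default_rng(42)
population_demo = pd.DataFrame({
    "Gender": ["Female"] * 600 + ["Male"] * 400,
    "Age": rng.integers(18, 70, size=1000),
})

# Simple random sample of 200 individuals
sample_demo = population_demo.sample(n=200, random_state=42)

# Compare category shares between population and sample
comparison = pd.DataFrame({
    "population": population_demo["Gender"].value_counts(normalize=True),
    "sample": sample_demo["Gender"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)

# Compare a numeric summary as well
print("Mean age (population):", population_demo["Age"].mean())
print("Mean age (sample):", sample_demo["Age"].mean())
```

Small differences in the `abs_diff` column and between the two mean ages suggest that the sample mirrors the population on these variables; large gaps are a warning sign.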
Sampling Methods
Sampling methods are designed to ensure that the sample is representative of the population of interest, and the choice of method depends on the nature of the research, available resources and other relevant factors. Accurate sampling is essential to ensure that the results obtained from the studies are reliable and generalizable to the reference population.
The main sampling methods are:
- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling
- Convenience sampling
- Quota sampling
The choice of method depends on the nature of the research and the resources available. It is important to select a method that minimizes the risk of bias and provides results that are representative of the population of interest. Let’s now see the different sampling methods with simple examples implemented in Python.
Let’s implement some sampling methods in Python: definition of a test population
To implement sampling methods in Python, we first need a population that simulates a real one as closely as possible. We will generate random values that describe the characteristics of each element of the population. A real population would usually be far larger, but to keep the example simple we will limit ourselves to 100 subjects. In Python, a convenient way to hold such a population is a pandas DataFrame.
import numpy as np
import pandas as pd
# Creation of a sample population
population = pd.DataFrame({
    'ID': range(1, 101),  # Unique identifiers for individuals
    'Age': np.random.randint(18, 70, size=100),  # Random integer age from 18 to 69 (the upper bound 70 is excluded)
    'Gender': np.random.choice(['Male', 'Female'], size=100),  # Random gender
    'Income': np.random.normal(50000, 10000, size=100)  # Random income with mean 50000 and standard deviation 10000
})
population
The population variable is a DataFrame that represents an imaginary population of 100 individuals, each with a unique identifier, a random age, a random gender, and a random income. This is just a simplified example of a population that could be used for statistical analysis or simulation purposes. Running the previous code fragment produces a DataFrame with 100 individuals whose ages are drawn uniformly from the integers 18 to 69 (np.random.randint excludes the upper bound of 70) and whose income is normally distributed. The values will vary from run to run. In my case I obtained the following result.
We can also visualize the population with a histogram to see how it is distributed across the age groups:
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram plot to visualize the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(population['Age'], bins=10, kde=True, color='skyblue')
plt.title('Age Distribution in the Population')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
By running you will get a graph similar to the following (the population is random and varies from run to run):
Now that we have a population to sample, we can move on to analyze the various sampling methods with a specific example for each of them.
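One practical note before moving on: because the population is generated randomly, the samples shown in the following sections will not match yours exactly. A minimal sketch for making runs repeatable is to build the population from a seeded generator. The helper name `make_population` and the seed value are assumptions made for this illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

def make_population(size=100, seed=0):
    """Build a synthetic population; `seed` makes the draw repeatable."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "ID": range(1, size + 1),
        "Age": rng.integers(18, 70, size=size),        # ages 18..69 (upper bound excluded)
        "Gender": rng.choice(["Male", "Female"], size=size),
        "Income": rng.normal(50000, 10000, size=size),
    })

pop_a = make_population(seed=0)
pop_b = make_population(seed=0)
print(pop_a.equals(pop_b))  # True: same seed, same population
```

The same idea applies to the sampling step: pandas' `DataFrame.sample` accepts a `random_state` argument that fixes which rows are drawn.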
Simple Random Sampling
Simple random sampling is one of the simplest and most fundamental sampling methods in statistics. It consists of randomly selecting a sample of individuals from the population without any type of stratification or subdivision. In other words, every individual in the population has the same probability of being selected to be part of the sample. This method is widely used because it is easy to implement and provides an unbiased estimate of population characteristics.
# Simple Random Sampling
def simple_random_sampling(population, n):
    # Draw n rows without replacement; every row has the same chance of selection
    return population.sample(n)
# Example of using simple random sampling to select 10 individuals
random_sample = simple_random_sampling(population, 10)
print("Simple Random Sampling:")
print(random_sample)
Running the code you get the following result:
Simple Random Sampling:
ID Age Gender Income
40 41 30 Male 38263.946005
17 18 55 Female 37423.210786
72 73 67 Female 47493.329812
77 78 39 Male 41936.878168
61 62 20 Female 61902.900527
18 19 27 Male 30213.753163
24 25 68 Male 45616.179535
62 63 27 Male 61020.302192
65 66 36 Female 42912.616841
63 64 26 Male 47544.117333
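The claim that simple random sampling gives an unbiased estimate can be checked with a small simulation: any single sample mean may be off, but the average over many repeated samples should land close to the population mean. The sketch below uses a made-up income population and an arbitrary seed; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
incomes = rng.normal(50000, 10000, size=100)   # hypothetical population of 100 incomes
true_mean = incomes.mean()

# Draw many simple random samples of size 10 (without replacement)
# and record the mean of each one
sample_means = np.array([
    rng.choice(incomes, size=10, replace=False).mean()
    for _ in range(5000)
])

print("Population mean:        ", round(true_mean, 2))
print("Average of sample means:", round(sample_means.mean(), 2))
print("Spread of sample means: ", round(sample_means.std(), 2))
```

The spread shows how much an individual sample of 10 can deviate, which is why a single small sample should be read with care even when the method itself is unbiased.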
A good way to see how good the sample is compared to the population is through graphics. By superimposing the sample on the population, many things can be understood. We then implement the necessary code.
import matplotlib.pyplot as plt
import seaborn as sns
# Overlay histogram plot to visualize the age distribution in the population and the random sample
plt.figure(figsize=(10, 6))
# Age distribution in the population (stat='density' normalizes the bars so that groups of different size are comparable)
sns.histplot(population['Age'], bins=10, kde=True, stat='density', color='skyblue', label='Population')
# Age distribution in the random sample
sns.histplot(random_sample['Age'], bins=10, kde=True, stat='density', color='salmon', label='Random Sample')
plt.title('Age Distribution: Population vs Random Sample')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()
plt.show()
Running it you get a graph similar to the following:
Stratified Sampling
Stratified sampling is a sampling technique that divides the population into homogeneous groups called “strata”, and then selects a random sample from each of these strata. This method is used when the population has heterogeneity in some key characteristics, and the objective is to ensure that the sample accurately reflects this heterogeneity.
Initially, the population is divided into homogeneous groups or strata based on a characteristic or variable of interest. For example, if we are studying the income of individuals, we can divide the population into strata based on income groups such as low, medium, and high. After defining the strata, a random sample is selected from each of them. It is important that the selection of samples within each stratum occurs randomly, to ensure the representativeness of the overall sample. Once samples from each stratum are selected, they are combined to form the overall stratified sample.
Stratified sampling is useful when the population has significant variation in the characteristics of interest and when you want to ensure that the sample accurately reflects this diversity. It is particularly effective at reducing the variance of estimates and improving their precision compared to simple random sampling, especially when the strata are homogeneous internally but differ from one another.
For example, if we are conducting a study on the job satisfaction of a company’s employees, we might divide the population into strata based on seniority level (e.g., new hires, long-term employees, managers) and then select a random sample from each of these groups to form a stratified sample.
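The variance reduction described above can be illustrated with a simulation. The sketch below is built on assumptions: a made-up population of satisfaction scores in three strata with clearly different means, a total sample size of 50, and proportional allocation (each stratum contributes in proportion to its size).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population with three strata whose means differ strongly
strata = {
    "new_hires": rng.normal(40, 5, size=300),
    "long_term": rng.normal(60, 5, size=500),
    "managers": rng.normal(80, 5, size=200),
}
everyone = np.concatenate(list(strata.values()))
n = 50  # total sample size

def srs_mean():
    # Simple random sample drawn from the whole population
    return rng.choice(everyone, size=n, replace=False).mean()

def stratified_mean():
    # Proportional allocation: 15, 25 and 10 draws from the three strata
    parts = [
        rng.choice(values, size=round(n * len(values) / len(everyone)), replace=False)
        for values in strata.values()
    ]
    return np.concatenate(parts).mean()

srs = np.array([srs_mean() for _ in range(2000)])
strat = np.array([stratified_mean() for _ in range(2000)])
print("Std. dev. of SRS estimates:       ", round(srs.std(), 3))
print("Std. dev. of stratified estimates:", round(strat.std(), 3))
```

Both estimators target the same population mean, but the stratified estimates vary much less from sample to sample because the differences between strata no longer contribute to the sampling error.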
Let’s now move on to our example population and see how to apply stratified sampling in Python. In this case we will use age to define the strata, treating each selected age value as its own stratum.
# Stratified Sampling by Age
def stratified_sampling(population, n, stratification):
    sample = pd.DataFrame()
    for value, proportion in stratification.items():
        stratum = population[population['Age'] == value]
        stratum_size = round(n * proportion)  # Number of individuals to draw from this stratum
        if stratum.empty:  # Check if there are individuals in the given stratum
            print("There are no individuals in the population with age ", value)
            print("Please modify the population or change the age.")
            return sample
        # Check if there are enough individuals in the given stratum
        if len(stratum) < stratum_size:
            print("There are not enough individuals in the population with age ", value)
            return sample
        stratum_sample = stratum.sample(stratum_size)
        sample = pd.concat([sample, stratum_sample])
    return sample
# Example of using stratified sampling by age to select 10 individuals
age_stratification = {20: 0.2, 35: 0.3, 40: 0.2, 50: 0.1, 60: 0.2}  # Proportions for each age stratum
stratified_age_sample = stratified_sampling(population, 10, age_stratification)
print("\nStratified Sampling by Age:")
print(stratified_age_sample)
Running the code gives the following result:
Stratified Sampling by Age:
ID Age Gender Income
96 97 20 Female 53476.104206
61 62 20 Female 61902.900527
74 75 35 Female 39775.759287
93 94 35 Male 40767.701355
27 28 35 Female 59656.807053
25 26 40 Male 54748.113219
59 60 40 Male 38965.755561
11 12 50 Male 47480.883850
52 53 60 Female 41993.969218
69 70 60 Male 49140.256943
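The function above treats each exact age as a stratum, which can easily leave a stratum empty in a population of only 100 people. A common alternative is to group ages into bands and let pandas draw the same fraction from every band. The following is a sketch, not the article's method: the band boundaries, the 20% fraction, and the names `people` and `banded_sample` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
people = pd.DataFrame({
    "ID": range(1, 101),
    "Age": rng.integers(18, 70, size=100),
})

# Hypothetical age bands used as strata (the cut points are an assumption)
people["AgeBand"] = pd.cut(
    people["Age"], bins=[17, 29, 44, 59, 69],
    labels=["18-29", "30-44", "45-59", "60-69"],
)

# Draw 20% of each band, so the sample mirrors the population's band shares
banded_sample = people.groupby("AgeBand", observed=True).sample(frac=0.2, random_state=3)

print(people["AgeBand"].value_counts(normalize=True).sort_index())
print(banded_sample["AgeBand"].value_counts(normalize=True).sort_index())
```

Because every band contributes the same fraction of its members, the two printed distributions should be close to each other, up to rounding within each band.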
When using stratified sampling, the main goal is to ensure that the sample accurately reflects the distribution of key characteristics present in the population. While directly viewing a stratified sample relative to the entire population may not always be useful, there are other visualizations that can be helpful in analyzing and interpreting the results of stratified sampling.
The simplest approach is to take the strata used in the sampling, compute their proportions in the population, and compare them directly with the proportions in the sample, so that like is compared with like.
import matplotlib.pyplot as plt
# Calculate proportions in the population
population_proportions = {age: len(population[population['Age'] == age]) / len(population) for age in age_stratification.keys()}
# Calculate proportions in the sample
sample_proportions = {age: len(stratified_age_sample[stratified_age_sample['Age'] == age]) / len(stratified_age_sample) for age in age_stratification.keys()}
# Ages and proportion values
ages = list(age_stratification.keys())
population_props = list(population_proportions.values())
sample_props = list(sample_proportions.values())
# Bar width
bar_width = 0.35
# Bar positions on the plot
positions = range(len(ages))
# Create the bar plot
plt.figure(figsize=(10, 6))
plt.bar(positions, population_props, bar_width, label='Population', color='skyblue')
plt.bar([p + bar_width for p in positions], sample_props, bar_width, label='Sample', color='salmon')
# Labels and title
plt.xlabel('Age')
plt.ylabel('Proportion')
plt.title('Comparison of Proportions between Population and Sample by Age')
plt.xticks([p + bar_width / 2 for p in positions], ages)
plt.legend()
plt.show()
Running the code produces the following chart:
Systematic Sampling
Systematic sampling is a sampling technique in which individuals from the population are selected at regular intervals, using a systematic process. This method involves arranging the population into an ordered list and selecting every kth element from this list, where k is the so-called “sampling interval”.
The sampling interval, denoted k, represents the number of population elements between each selection. For example, if we have a population of 1000 individuals and we choose a sampling interval of 10, we will select every 10th individual to be part of the sample. The population is sorted based on a characteristic of interest. This characteristic can be any variable that allows you to assign an order to individuals in the population, such as a unique identifier or a numeric characteristic. After defining the sampling interval and sorting the population, the sample is selected by selecting every kth element from the sorted list. For example, if the sampling interval is 10, the first, eleventh, twenty-first, and so on will be selected until the desired sample is completed.
Systematic sampling is often used when the population is already sorted or when it is difficult to obtain a complete random sample. This technique is relatively simple to implement and can be efficient when the population is large and the distribution of individuals is uniform. However, it is important to note that systematic sampling can lead to the potential introduction of bias if the population order follows a pattern that does not accurately represent variation in the characteristic of interest. Therefore, it is advisable to perform a critical analysis of the results obtained through systematic sampling.
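The bias risk described above is easiest to see when the sampling interval matches a pattern in the ordering. The sketch below uses a made-up series of daily sales in which weekends sell twice as much as weekdays; the numbers are assumptions chosen to make the effect obvious.

```python
import numpy as np

# Hypothetical daily sales over 52 weeks: days 5 and 6 of each week (the weekend) sell more
days = np.arange(364)
sales = np.where(days % 7 >= 5, 200.0, 100.0)

k = 7  # sampling interval equal to the period of the pattern
for start in (0, 5):
    picked = sales[start::k]          # every k-th day, beginning at `start`
    print(f"start={start}: sample mean = {picked.mean():.1f}")
# start=0: sample mean = 100.0  (only weekdays are ever selected)
# start=5: sample mean = 200.0  (only weekend days are ever selected)

print(f"true mean = {sales.mean():.1f}")
# true mean = 128.6
```

With an interval of 7 the sample always falls on the same day of the week, so no choice of starting point gives an estimate near the true mean. Shuffling the list or choosing an interval that does not line up with the pattern avoids this.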
Also in this case we implement this sampling method on our population using Python code.
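# Systematic Sampling
def systematic_sampling(population, n):
    k = len(population) // n  # Sampling interval: take one individual every k rows
    start = np.random.randint(0, k)  # Random starting point in [0, k)
    # Every k-th position from start to the end of the population; when len(population)
    # is not a multiple of n, this yields slightly more than n rows
    sampled_indices = np.arange(start, len(population), step=k)
    return population.iloc[sampled_indices]
# Example of using systematic sampling with a target of 15 individuals
systematic_sample = systematic_sampling(population, 15)
print("\nSystematic Sampling:")
print(systematic_sample)
Running the code gives a result similar to the following. Because 100 is not a multiple of 15, the interval is k = 100 // 15 = 6, so the function returns 16 or 17 individuals (depending on the starting point) rather than exactly 15: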
Systematic Sampling:
ID Age Gender Income
3 4 65 Female 47967.665387
9 10 39 Male 65416.036760
15 16 24 Female 43344.890257
21 22 54 Male 46390.444983
27 28 35 Female 59656.807053
33 34 28 Male 45416.342831
39 40 31 Male 37712.392099
45 46 51 Male 54424.359157
51 52 63 Male 50520.745717
57 58 62 Male 48836.335807
63 64 26 Male 47544.117333
69 70 60 Male 49140.256943
75 76 57 Female 52001.682813
81 82 19 Male 51900.536724
87 88 21 Male 47984.967190
93 94 35 Male 40767.701355
99 100 66 Male 68183.816243
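The output above contains 17 rows even though 15 were requested, because the function keeps stepping by k until it reaches the end of the population. If exactly n rows are needed, one option is to compute the n positions directly. This is a sketch under assumptions: the function name `systematic_sampling_exact` and its `rng` parameter are hypothetical, and it requires n to be no larger than the number of rows.

```python
import numpy as np
import pandas as pd

def systematic_sampling_exact(frame, n, rng=None):
    """Return exactly n rows, spaced k = len(frame) // n positions apart."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(frame) // n                    # sampling interval (needs n <= len(frame))
    start = rng.integers(0, k)             # random start in [0, k)
    positions = start + k * np.arange(n)   # n evenly spaced row positions
    return frame.iloc[positions]

demo = pd.DataFrame({"ID": range(1, 101)})
picked = systematic_sampling_exact(demo, 15, rng=np.random.default_rng(0))
print(len(picked))                            # 15
print(picked["ID"].diff().dropna().unique())  # a single gap value of 6
```

The trade-off is that the last `len(frame) - k * n` rows (here, rows 91 to 100) can never be selected, so the selection probabilities are no longer equal for everyone. Whether that matters depends on how the list is ordered.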
In this case, graphically it is possible to follow the same approach we did with simple random sampling by directly comparing the sample with the reference population. The code will then be the same:
import matplotlib.pyplot as plt
import seaborn as sns
# Overlay histogram plot to visualize the age distribution in the population and in the systematic sample
plt.figure(figsize=(10, 6))
# Age distribution in the population (stat='density' normalizes the bars so that groups of different size are comparable)
sns.histplot(population['Age'], bins=10, kde=True, stat='density', color='skyblue', label='Population')
# Age distribution in the systematic sample
sns.histplot(systematic_sample['Age'], bins=10, kde=True, stat='density', color='salmon', label='Systematic Sample')
plt.title('Age Distribution: Population vs Systematic Sample')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()
plt.show()
Running this will give you a result similar to the following:
Cluster Sampling
Cluster sampling is a sampling technique in which the population is divided into groups, called “clusters”, and a subset of these clusters is randomly selected to form the sample. This technique is useful when the population is naturally organized into groups and when it is not practical or possible to individually select elements of the population.
Clusters are naturally occurring groups of individuals within the population. They can be defined based on geographic, social, or other factors that reflect the natural structure of the population. Ideally, each cluster is internally heterogeneous, so that it resembles the population in miniature. For example, if we are studying primary education in a certain geographic area, the clusters could be the schools in that area. After defining the clusters, a subset of them is randomly selected to form the sample. This selection is done using a random sampling technique, such as simple random sampling or systematic sampling. Once clusters are selected, additional sampling can be performed within each cluster to select specific individuals or elements to include in the sample. This can be done using other sampling techniques, such as simple random or stratified sampling.
Cluster sampling is particularly useful when the population is large and dispersed over a large geographic area or when it is expensive or difficult to individually select elements of the population. This technique allows you to simplify the sampling process by focusing on selecting groups rather than individuals. However, it is important to keep in mind that cluster sampling may lead to lower precision than other sampling techniques, because individuals within the same cluster tend to resemble one another, so a handful of selected clusters may not be fully representative of the overall population. Therefore, it is important to carefully evaluate the trade-offs between efficiency and accuracy when using this sampling technique.
We use this sampling method on our population:
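# Cluster Sampling
def cluster_sampling(population, n, num_clusters):
    # Split the row positions into num_clusters contiguous blocks of (nearly) equal size
    blocks = np.array_split(np.arange(len(population)), num_clusters)
    clusters = [population.iloc[block] for block in blocks]
    # Draw one random individual from each cluster, then keep at most n of them
    sample = pd.concat([cluster.sample(1) for cluster in clusters], axis=0)
    return sample.head(n)
# Example of using cluster sampling: with 5 clusters and one draw per cluster,
# 5 individuals are returned even though n is set to 10
cluster_sample = cluster_sampling(population, 10, 5)
print("\nCluster Sampling:")
print(cluster_sample)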
Running the code gives the following result:
Cluster Sampling:
ID Age Gender Income
6 7 37 Male 47812.674648
25 26 40 Male 54748.113219
41 42 44 Male 50473.603034
78 79 53 Female 57504.398435
99 100 66 Male 68183.816243
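The function above draws a single individual from each block of consecutive rows. One-stage cluster sampling as described in the text works the other way round: a few whole clusters are chosen at random and every member of the chosen clusters enters the sample. The sketch below illustrates that variant on a made-up population of pupils grouped by school; the column names and the helper `one_stage_cluster_sample` are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical population: 100 pupils spread over 10 schools (the clusters)
pupils = pd.DataFrame({
    "ID": range(1, 101),
    "School": np.repeat([f"S{i}" for i in range(1, 11)], 10),
    "Score": rng.normal(70, 10, size=100),
})

def one_stage_cluster_sample(frame, cluster_col, num_clusters, rng):
    """Randomly pick whole clusters and keep every member of the picked ones."""
    all_clusters = frame[cluster_col].unique()
    chosen = rng.choice(all_clusters, size=num_clusters, replace=False)
    return frame[frame[cluster_col].isin(chosen)]

school_sample = one_stage_cluster_sample(pupils, "School", 3, rng)
print(sorted(school_sample["School"].unique()))
print(len(school_sample))  # 30: three schools of ten pupils each
```

A two-stage design would add a second random draw inside each chosen school, for example with `groupby("School").sample(n=5)`, instead of keeping all of its pupils.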
Here too, the same graphical approach as before can be followed:
Quota Sampling
Quota sampling is a non-probability sampling technique in which the population is divided into groups, called “quotas”, based on certain characteristics of interest. Subsequently, individuals are selected from each quota until a predetermined number per quota is reached. This method is used to ensure that the sample reflects the specified proportions of the characteristics of interest present in the population, but does not guarantee randomness in the selection of individuals.
Quotas are subdivisions of the population based on specific demographic or socio-economic characteristics, such as age, gender, education level, or income. The quotas are set to reflect the desired proportions of each characteristic within the population. After defining the quotas, individuals are selected nonrandomly from each quota until the predetermined number of individuals per quota is reached. The selection of individuals can be done in various ways, for example using contact lists, street interviews, or telephone calls. Once individuals are selected from each quota, they are combined to form the overall sample. Because the quotas match the desired proportions of the characteristics of interest, it is hoped that the sample is representative of the overall population, although nothing guarantees this for characteristics that were not used to define the quotas.
Quota sampling is often used when it is not possible to use probability sampling techniques, such as simple random sampling, and when it is necessary to ensure that the sample reflects certain characteristics of the population. However, it is important to note that quota sampling can introduce bias if quotas are not carefully selected or if the selection of individuals within each quota is not random. Therefore, it is crucial to pay attention to the design and implementation of quota sampling to ensure the representativeness and reliability of the resulting sample.
Let’s see it applied on our population:
# Quota Sampling
def quota_sampling(population, quotas):
    sample = pd.DataFrame()
    for attribute, value in quotas.items():
        subset = population[population[attribute] == value]
        sample = pd.concat([sample, subset.sample(frac=0.5)])  # Example: selecting 50% of cases for each quota
    return sample
# Example of using quota sampling: half of the males plus half of the 30-year-olds.
# The sample size depends on the group sizes, and an individual who matches both
# quotas (a 30-year-old male) can be drawn twice
selected_quotas = {'Gender': 'Male', 'Age': 30}  # Example of selected quotas
quota_sample = quota_sampling(population, selected_quotas)
print("\nQuota Sampling:")
print(quota_sample)
Running the code gives the following result:
Quota Sampling:
ID Age Gender Income
77 78 39 Male 41936.878168
11 12 50 Male 47480.883850
0 1 61 Male 84271.341233
57 58 62 Male 48836.335807
19 20 22 Male 49512.773535
30 31 48 Male 27854.862920
76 77 49 Male 43139.035090
20 21 26 Male 45206.345011
58 59 33 Male 56785.855448
21 22 54 Male 46390.444983
22 23 44 Male 40192.708853
2 3 67 Male 61312.382194
24 25 68 Male 45616.179535
69 70 60 Male 49140.256943
29 30 33 Male 44880.553683
71 72 47 Male 44980.757012
9 10 39 Male 65416.036760
47 48 50 Male 65845.099781
6 7 37 Male 47812.674648
33 34 28 Male 45416.342831
45 46 51 Male 54424.359157
48 49 37 Male 51681.971617
87 88 21 Male 47984.967190
56 57 43 Male 41276.614158
43 44 23 Male 55617.835136
82 83 30 Female 44631.651892
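The function above selects a random half of each group, whereas quota sampling as described in the text fills a fixed number of places per group without random selection. A minimal sketch of that behaviour, assuming respondents are simply taken in the order they are encountered, could look like this. The data and the helper `fill_quotas` are hypothetical.

```python
import pandas as pd

# Hypothetical stream of respondents in the order they were encountered
respondents = pd.DataFrame({
    "ID": range(1, 13),
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male",
               "Male", "Female", "Female", "Male", "Female", "Female"],
})

def fill_quotas(frame, column, quotas):
    """Take respondents in encounter order until each group's quota is full."""
    remaining = dict(quotas)           # places still open per group
    kept = []
    for idx, group in frame[column].items():
        if remaining.get(group, 0) > 0:
            kept.append(idx)
            remaining[group] -= 1
    return frame.loc[kept]

quota_demo = fill_quotas(respondents, "Gender", {"Male": 3, "Female": 3})
print(quota_demo)
```

The result contains exactly three men and three women, but which individuals are included depends entirely on the encounter order, which is where bias can enter.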
Even in this case you can follow the same graphical approach.
Opportunistic or Convenience Sampling
Opportunistic sampling, also known as convenience sampling, is a non-probability sampling technique in which individuals are selected based on their availability and accessibility. In this type of sampling, there is no concern for ensuring a random or statistical representation of the population, but rather individuals are selected who are easily accessible or available for study.
Individuals are selected based on their availability and accessibility to the researcher or organization that is conducting the study. This can include people who are easy to contact, reach, or involve in the research process. Since the selection of individuals occurs based on their availability and accessibility, opportunistic sampling does not follow a random process. As a result, results obtained from this type of sampling may not be representative of the larger population. Opportunistic sampling is often used when it is difficult or expensive to perform random sampling or when it is not possible to obtain a representative sample of the population. This technique is common in situations where researchers have access to only a small portion of the population or when studying specific groups of individuals within a larger population.
Opportunistic sampling can be useful when you want to gain preliminary information about a topic or when you want to quickly explore a phenomenon without engaging in more rigorous sampling. However, it is important to note that the results obtained from this type of sampling may be influenced by uncontrolled variables and may not be generalizable to the broader population. Therefore, it is critical to interpret the results of opportunistic sampling with caution and consider its limitations in the context of the analysis conducted.
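# Convenience Sampling
def convenience_sampling(population, n):
    # Note: population.sample(n) is a random draw, identical to simple random sampling.
    # It is used here only as a stand-in; a real convenience sample would take whoever
    # is easiest to reach (for example, the first n rows via population.head(n))
    return population.sample(n)
# Example of using convenience sampling to select 12 individuals
convenience_sample = convenience_sampling(population, 12)
print("\nConvenience Sampling:")
print(convenience_sample)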
Running the code gives the following result:
Convenience Sampling:
ID Age Gender Income
40 41 30 Male 38263.946005
18 19 27 Male 30213.753163
57 58 62 Male 48836.335807
3 4 65 Female 47967.665387
20 21 26 Male 45206.345011
17 18 55 Female 37423.210786
23 24 69 Female 56542.547788
67 68 49 Female 49888.602757
48 49 37 Male 51681.971617
87 88 21 Male 47984.967190
69 70 60 Male 49140.256943
91 92 51 Female 57087.236066
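Since the function above draws rows at random, its output behaves like a simple random sample and does not show the bias that convenience sampling can introduce. The sketch below makes that bias visible under a deliberately artificial assumption: the population is ordered by how easy each person is to reach, and easier-to-reach people happen to be younger.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical population sorted by accessibility; in this made-up setup,
# the people who are easiest to reach are also the youngest
ages = np.sort(rng.integers(18, 70, size=100))
reachable = pd.DataFrame({"ID": range(1, 101), "Age": ages})

convenience_demo = reachable.head(12)                   # the 12 most accessible people
random_demo = reachable.sample(n=12, random_state=11)   # a random draw, for contrast

print("Population mean age: ", reachable["Age"].mean())
print("Convenience mean age:", convenience_demo["Age"].mean())
print("Random mean age:     ", random_demo["Age"].mean())
```

The convenience sample contains only the youngest individuals, so its mean age falls well below the population mean, while the random draw has no such systematic pull in either direction.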
Graphically we can have a situation like the following: