
How to generate specific datasets for clustering with Scikit-learn


Scikit-learn, one of the most popular machine learning libraries in Python, offers several functions for generating datasets suitable for clustering purposes. These functions allow you to create synthetic datasets, artificially generated with the specific goal of performing clustering operations and evaluating the performance of clustering algorithms.

The datasets that can be generated by Scikit-learn

The Scikit-learn library provides a series of functions that allow you to generate, simply and automatically, datasets suitable for clustering studies. Each function generates a distribution of points with particular characteristics, producing natural clusters with shapes specific to each case. Here is a list of these functions:

- make_blobs: isotropic Gaussian blobs
- make_moons: two interleaving semicircles
- make_circles: concentric circles
- make_gaussian_quantiles: concentric quantiles of a Gaussian distribution
- make_s_curve: a three-dimensional "S"-shaped curve
- make_swiss_roll: a three-dimensional "Swiss roll" surface

These datasets can be used to test clustering algorithms or to run clustering experiments in controlled environments. An important note is that some of these functions generate labeled datasets, i.e. they also return the cluster membership labels (the returned y value). Let’s see how the functions are divided:

Labeled datasets:

- make_blobs
- make_moons
- make_circles
- make_gaussian_quantiles

Unlabeled datasets (they return a continuous position value instead of class labels):

- make_s_curve
- make_swiss_roll

This allows you to choose the right function based on your clustering or classification needs.
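To see this division in practice, here is a quick sketch (not part of the original article's code) comparing the return values of a labeled function and an unlabeled one:

```python
import numpy as np
from sklearn.datasets import make_blobs, make_s_curve

# Labeled: make_blobs returns integer class labels in y
X, y = make_blobs(n_samples=100, centers=3, random_state=0)
print(np.unique(y))  # the cluster membership labels: [0 1 2]

# Unlabeled: make_s_curve returns a continuous position value t,
# not class labels
X3d, t = make_s_curve(n_samples=100, random_state=0)
print(t.dtype)       # a float value per point, no classes
```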

The generation of synthetic datasets for clustering with scikit-learn

Now let’s see how we can generate these synthetic datasets in code. Their implementation is really simple and consists of a single function call. Let’s write the code:

from sklearn.datasets import make_blobs

# Generate a dataset of 1000 points distributed in 5 clusters
# with standard deviation (cluster_std) set to 1.0
X, y = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)

Each cluster has a standard deviation (variability) of the points equal to 1.0. The generated points can be represented in a scatter plot where each color represents the class to which the point belongs.

Here is an explanation of the parameters used in the make_blobs function:

- n_samples: the total number of points to generate
- centers: the number of clusters (or, alternatively, the fixed cluster centers)
- cluster_std: the standard deviation of the points within each cluster
- random_state: the seed that makes the generation reproducible

To visualize the distribution of the dataset and the clusters built in it, we can use the matplotlib library. So let’s write the following code:

import matplotlib.pyplot as plt

# Plot the generated points
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title('Dataset of Isotropic Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Running the code we get the following representation:

The datasets generated by Scikit-learn for clustering were created with the goal of providing users with versatile tools to explore and understand how clustering algorithms work. These datasets are useful in different contexts.

Isotropic Blob Dataset

We have just seen how Scikit-learn’s make_blobs function generates a synthetic dataset composed of clusters of points, commonly called “blobs”. These blobs are randomly distributed in the feature space according to an isotropic Gaussian distribution, meaning that the variance is the same in all directions.

This type of dataset is useful for testing and evaluating clustering algorithms, as it offers precise control over cluster parameters. It is particularly useful for evaluating clustering algorithms that require spherical or isotropic clusters.
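As a sketch of this use case (assuming scikit-learn's KMeans with largely default settings), we can run K-means on the blobs generated above and compare the recovered clusters with the true labels:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Well-separated isotropic blobs: an easy case for K-means
X, y = make_blobs(n_samples=1000, centers=5, cluster_std=1.0,
                  random_state=42)

# Fit K-means with the known number of clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true labels
print(f"ARI: {adjusted_rand_score(y, labels):.3f}")
```

Since the true labels y are available, label-based metrics such as the Adjusted Rand Index can quantify how well the algorithm recovered the generated clusters.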

Moon-Shaped Dataset

Scikit-learn’s make_moons function generates a synthetic dataset composed of two interleaving semicircles. Here are the main characteristics of a dataset generated with make_moons:

- Non-linear shape: the two half-moons interlock and cannot be separated by a straight line
- Noise: the noise parameter adds Gaussian noise to the points
- Binary labels: each point is labeled 0 or 1 according to the semicircle it belongs to

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title('Moon-shaped dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Running the code, we obtain the following dataset:

Compared to other datasets generated with functions such as make_blobs, make_circles or make_s_curve, the dataset generated with make_moons has a more complex, non-linear shape. It is particularly useful for testing clustering algorithms in scenarios where clusters have non-standard or non-linear shapes. For example, make_moons is useful for testing algorithms that need to identify clusters that cannot be separated by simple hyperplanes, a case where a k-means-based algorithm, which draws linear boundaries between clusters, tends to fail.
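A minimal sketch of this comparison, contrasting K-means with density-based DBSCAN (the eps value of 0.2 and the lower noise level are assumptions chosen for this illustration, not general recommendations):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)

# K-means assumes convex clusters and tends to cut each moon in half
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density of the points and recovers the two arcs
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```

DBSCAN should score much higher than K-means here, precisely because the moons are not linearly separable.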

Circle-shaped Dataset

The dataset generated by Scikit-learn’s make_circles function is designed to simulate a set of data that has a circular or ring-shaped structure. This dataset has the following characteristics:

Circular structure: The dataset contains points distributed in concentric circles, similar to a target. This circular structure makes it useful for testing clustering algorithms that need to identify groups of data arranged in a circular or ring-shaped pattern.

Noise: The make_circles function allows you to add a controlled level of noise to the generated data. This helps make the dataset more realistic and suitable for testing the effectiveness of clustering algorithms in the presence of noise in the data.

factor parameter: The factor parameter allows you to adjust the distance between the generated concentric circles. This allows you to create datasets with different levels of separation between data groups, allowing you to test the effectiveness of clustering algorithms in contexts with different spatial distributions of data.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate the circle-shaped dataset
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)

# Visualize the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title("Circle-shaped Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

Here is an explanation of the parameters used in the make_circles function:

- n_samples: the total number of points to generate
- noise: the standard deviation of the Gaussian noise added to the points
- factor: the ratio between the radius of the inner circle and that of the outer circle (here 0.5)

The dataset generated by make_circles is commonly used to test and evaluate clustering algorithms that are capable of identifying and distinguishing groups of data arranged in a circular or ring-shaped manner. For example, algorithms such as agglomerative clustering, K-means or DBSCAN can be tested on this dataset to evaluate their ability to identify and distinguish the concentric circles or ring-shaped structure of the data.

Compared to other datasets generated by similar functions, such as make_blobs or make_moons, the dataset generated by make_circles has a different structure and can be used to test clustering algorithms on circular or ring-shaped data patterns, rather than on linear or globular datasets.
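As an illustrative sketch, spectral clustering with a nearest-neighbors affinity (the n_neighbors value and the lower noise level are assumptions for this example) can separate the two rings, something K-means cannot do with a linear boundary:

```python
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two concentric rings; no straight line can separate them
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5,
                    random_state=42)

# Spectral clustering on a nearest-neighbors graph follows the
# connectivity of each ring rather than its geometric center
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=10, random_state=42)
labels = model.fit_predict(X)

print("ARI:", adjusted_rand_score(y, labels))
```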

Gaussian Quantiles Dataset

The dataset generated by make_gaussian_quantiles is composed of samples distributed according to a multivariate Gaussian distribution. This means that the points in the dataset follow a normal distribution in multiple dimensions. The main characteristics of this dataset include:

- Gaussian distribution: the points are drawn from a single multivariate normal distribution
- Concentric classes: the classes correspond to quantiles of the distribution, so they form nested, roughly concentric regions
- Controlled overlap: adjacent classes share boundaries, producing clusters that are not clearly separated

from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt

# Generate a dataset with 2 Gaussian clusters
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2)

# Plot the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k')
plt.title('Gaussian Quantiles Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Let’s analyze the parameters used to control the generation of the dataset via the make_gaussian_quantiles function:

- n_samples: the total number of samples to generate
- n_features: the number of features (dimensions) of each sample
- n_classes: the number of classes (quantile regions) into which the samples are divided

So, in the context of the example, make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2) indicates that we are generating a dataset with 1000 samples, each with two features, divided into two distinct clusters.

This type of dataset is often used in clustering to test algorithms and evaluate their performance. Compared to other datasets generated by similar make_ functions in Scikit-learn, such as make_blobs or make_moons, the dataset generated by make_gaussian_quantiles may be more suitable when you want to test clustering algorithms that are effective on data with Gaussian distributions or when you want to create a dataset with more complex and overlapping clusters. For example, in situations where clusters are not clearly separated, this feature can generate clusters with controlled overlaps.
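A quick way to see this concentric structure (a sketch, not part of the article's code) is to measure the mean distance of each class from the origin: since the classes are quantiles of a single Gaussian, the mean radius grows with the class index:

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Three classes = three concentric quantile "shells" of one Gaussian
X, y = make_gaussian_quantiles(n_samples=3000, n_features=2,
                               n_classes=3, random_state=42)

# Distance of each point from the origin (the Gaussian's mean)
radii = np.linalg.norm(X, axis=1)

for k in range(3):
    print(f"class {k}: mean radius = {radii[y == k].mean():.2f}")
```

The printed mean radii increase from class 0 (innermost) to class 2 (outermost), confirming the nested structure.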

S-Curve Dataset (3D and Unlabeled)

The dataset generated by make_s_curve is a set of synthetic data that follows an “S”-shaped curve in three-dimensional space. This dataset is mainly characterized by the following properties:

- Three-dimensional structure: each point has three coordinates and lies on an “S”-shaped surface
- Continuous value: instead of class labels, the function returns a continuous value t indicating the position of each point along the curve
- Noise: the noise parameter adds Gaussian noise to the coordinates

from sklearn.datasets import make_s_curve
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate the "S" shaped dataset with 1000 points
X, t = make_s_curve(n_samples=1000, noise=0.1, random_state=42)

# Plot the dataset
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap='viridis')
ax.set_title('S-Curve Dataset')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

This type of dataset can be used in clustering to test algorithms that can deal with data with nonlinear structure. Unlike datasets generated by make_ functions such as make_blobs or make_moons, make_s_curve offers a more complex and nonlinear structure, which can be useful for evaluating the ability of clustering algorithms to detect and handle such complexity. For example, density-based clustering algorithms such as DBSCAN could be tested on this type of dataset to see how they perform against nonlinear clusters, while nonlinear dimensionality reduction techniques such as t-SNE can be used to project the curve into lower dimensions.
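Since make_s_curve does not return class labels, one possible trick (an assumption of this sketch, not something prescribed by scikit-learn) is to bin the continuous value t into pseudo-classes, so that label-based evaluation metrics can still be applied:

```python
import numpy as np
from sklearn.datasets import make_s_curve

X, t = make_s_curve(n_samples=1000, noise=0.05, random_state=42)

# t is a continuous position along the curve; split it at its
# 1/3 and 2/3 quantiles to obtain three balanced pseudo-classes
bins = np.quantile(t, [1 / 3, 2 / 3])
pseudo_labels = np.digitize(t, bins)

print(np.bincount(pseudo_labels))  # three roughly equal groups
```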

Swiss Roll Dataset (3D and Unlabeled)

The dataset generated by make_swiss_roll is a synthetic representation of a three-dimensional Swiss roll, which is a common shape used for testing dimensionality reduction and clustering algorithms. This type of dataset has the following main characteristics:

- Three-dimensional structure: the points lie on a rolled-up two-dimensional surface embedded in 3D space
- Continuous value: together with the coordinates, the function returns a continuous value indicating the position of each point along the roll
- Noise: the noise parameter adds Gaussian noise to the coordinates

from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate the Swiss roll dataset with 1000 samples
X, color = make_swiss_roll(n_samples=1000, noise=0.1)

# Extract coordinates
x = X[:, 0]
y = X[:, 1]
z = X[:, 2]

# Plot the dataset
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c=color, cmap=plt.cm.viridis, s=50)
ax.set_title('Swiss Roll Dataset')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

In this example, make_swiss_roll generates a 3D “Swiss roll” dataset with 1000 samples. The noise parameter controls the amount of noise added to the data.

Next, the data is visualized using matplotlib with a 3D visualization. Each point in the dataset has three coordinates (X, Y, Z) and is colored based on a color value extracted from the dataset itself, which can be useful for representing additional information such as class labels or other.

This type of dataset is often used in clustering to test algorithms on data with a complex, nonlinear structure. Compared to other datasets generated by similar make_ functions in Scikit-learn, such as make_blobs or make_moons, the dataset generated by make_swiss_roll has a more intricate, three-dimensional structure, making it suitable for evaluating the effectiveness of clustering algorithms on data with more complex and non-linear shapes. Furthermore, it can also be used to test dimensionality reduction algorithms, as it offers a three-dimensional representation that can be projected into lower-dimensional spaces.
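As a sketch of the dimensionality reduction use case (using scikit-learn's Isomap; the n_neighbors value is an assumption for this example), the roll can be "unrolled" into two dimensions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# Isomap approximates distances measured along the rolled-up surface,
# flattening the 3D roll into a 2D embedding
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(embedding.shape)  # (1000, 2)
```

The 2D embedding can then be plotted with the same color value used above to verify that the position along the roll is preserved.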

3D labeled datasets

The unlabeled datasets (make_s_curve and make_swiss_roll) are inherently three-dimensional. Scikit-learn’s make_blobs, make_moons, make_circles, make_gaussian_quantiles and similar functions, on the other hand, are primarily designed to generate two-dimensional (2D) datasets for data visualization and analysis purposes. However, nothing prevents you from using some of these functions to generate datasets in more than two dimensions (3D or higher), although the representation and visualization of the data becomes more complex.

Among the labeled functions, 3D datasets can be generated using make_blobs and make_gaussian_quantiles by specifying the appropriate number of features via the n_features parameter (make_moons and make_circles are limited to two dimensions). For example, setting n_features=3 will generate data with three dimensions.

Here’s an example of how you might use make_blobs to generate a 3D dataset:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate a dataset with 3 features (3D)
X, y = make_blobs(n_samples=1000, n_features=3, centers=5, random_state=42)

# Plot the dataset in 3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', edgecolors='k')
ax.set_title('Example of 3D dataset with make_blobs')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
plt.show()