
Examples of using Naive Bayes Classifiers with Scikit-Learn in Python


The Naive Bayes algorithm is a probabilistic classifier based on Bayes’ theorem. It is often used for classification problems, where the goal is to assign a class or category to a data sample based on its features. The “naive” part of the name derives from the assumption of conditional independence between the features, which simplifies the calculation of probabilities.
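In formulas, given a sample with features x1, …, xn, the classifier picks the class y that maximizes the posterior probability; under the independence assumption this posterior factorizes as follows (a schematic formulation of the standard rule, independent of any specific implementation):

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y), \qquad \hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)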


Naive Bayes Classifiers in Scikit-Learn

In Python, you can use the scikit-learn library to implement the Naive Bayes classifier. scikit-learn offers three main types of Naive Bayes classifiers:

  1. Gaussian Naive Bayes: This is suitable for continuous data where the features are assumed to follow a Gaussian (normal) distribution.
  2. Multinomial Naive Bayes: This is mainly used for discrete data such as texts. It is often applied in text analysis, for example for document classification.
  3. Bernoulli Naive Bayes: This is similar to the multinomial variant, but is used for binary data, such as the presence or absence of certain words in a document.

Gaussian Naive Bayes

Gaussian Naive Bayes is a variant of the Naive Bayes classifier that assumes the features of the dataset follow a normal (Gaussian) distribution. As with every Naive Bayes model, it relies on Bayes’ theorem and on the conditional independence of the features given the class label; in this particular case the distribution of each feature given the class is assumed to be Gaussian, which further simplifies the calculation of the conditional probabilities.
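Concretely, for each class y and for each feature i, the model estimates a mean \mu_{y,i} and a variance \sigma_{y,i}^2 from the training samples of that class, and evaluates the likelihood of a value x_i with the Gaussian density (schematic formula):

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right)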

Here is an example of how to use the Gaussian Naive Bayes classifier:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

# Let's generate a sample dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)

# We divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Evaluation of model accuracy
accuracy = gnb.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')

# Let's create a meshgrid to visualize the decision boundary
h = .02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# We predict the classes for each point in the meshgrid
Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# We visualize the points of the dataset and the decision boundary
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k', marker='o')
plt.title('Gaussian Naive Bayes - Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Let’s analyze the various parts of the code together.

  1. Generating the example dataset:
   X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)

We use scikit-learn’s make_classification function to generate an example dataset with 100 samples, 2 informative features, 0 redundant features, and 1 cluster per class.

  2. Division of the dataset into training and test sets:
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We divide the generated dataset into training (80%) and testing (20%) sets.

  3. Training the Gaussian Naive Bayes model:
   gnb = GaussianNB()
   gnb.fit(X_train, y_train)

We create an instance of the Gaussian Naive Bayes model (GaussianNB) and train it on the training set using the fit method.

  4. Evaluation of model accuracy:
   accuracy = gnb.score(X_test, y_test)
   print(f'Model accuracy: {accuracy:.2f}')

We calculate the accuracy of the model on the test data using the score method.

  5. Creating a meshgrid for viewing the decision boundary:
   h = .02
   x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
   y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
   xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Let’s create a meshgrid to cover the feature space.

  6. Prediction of classes for each point in the meshgrid:
   Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
   Z = Z.reshape(xx.shape)

We apply the prediction of the Gaussian Naive Bayes model to each point of the meshgrid.

  7. Viewing results:
   plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.3)
   plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k', marker='o')
   plt.title('Gaussian Naive Bayes - Decision Boundary')
   plt.xlabel('Feature 1')
   plt.ylabel('Feature 2')
   plt.show()

Finally, running the code, you will obtain the following accuracy value:

Model accuracy: 0.90

And then the following graph will be displayed.

The plot shows the points of the dataset and the decision boundary generated by the model. The points are colored according to their class, while the colored areas represent the decision regions predicted by the Gaussian Naive Bayes model.

This example illustrates the training and evaluation process of a Gaussian Naive Bayes classifier, as well as the visualization of the resulting decision boundary.
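As a side note, if you prefer a smoother view than the hard class labels, you could plot the probability predicted for one of the two classes over the same meshgrid. The snippet below is only a sketch of this variant, reusing the gnb, xx, yy, X and y objects defined above:

# Probability of class 1 for each point of the meshgrid (instead of the predicted label)
Z_proba = gnb.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z_proba = Z_proba.reshape(xx.shape)

plt.contourf(xx, yy, Z_proba, cmap=plt.cm.RdYlBu, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k', marker='o')
plt.title('Gaussian Naive Bayes - Class 1 Probability')
plt.show()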

Advantages and considerations:

  1. Simplicity and speed: The model is easy to implement, fast to train, and works reasonably well even with relatively few training samples.
  2. Probabilistic output: The model returns class probabilities, not just hard labels.

Limitations:

  1. Independence assumption: The conditional independence of the features rarely holds exactly in real-world data.
  2. Gaussian assumption: If the features clearly do not follow a normal distribution, the estimated likelihoods, and therefore the predictions, can be inaccurate.

In summary, Gaussian Naive Bayes is an efficient and easy-to-implement classifier, particularly useful when the assumptions on the distribution of features are reasonably valid for the dataset under consideration. However, it is important to consider the limitations and carefully evaluate its suitability for the particular classification problem you are addressing.

Multinomial Naive Bayes

Multinomial Naive Bayes is a variant of the Naive Bayes classifier specifically designed for discrete data or frequency counts, typically used in text classification tasks. This algorithm assumes that the features follow a multinomial distribution, making it particularly suitable for data such as word counts in documents.

  1. Bayes theorem:
    Multinomial Naive Bayes uses Bayes’ Theorem to calculate the posterior probabilities of classes given feature counts.
  2. Assumption of conditional independence:
    Like the standard Naive Bayes classifier, Multinomial Naive Bayes assumes conditional independence of features given the class label. This assumption simplifies the calculation of conditional probabilities.
  3. Multinomial distribution:
    It is assumed that the distribution of features, represented by the counts, follows a multinomial distribution. This means that the model takes into account the frequencies of each word or feature within a class (a schematic estimate of these probabilities is sketched right after this list).
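In practice, for each class y the model estimates a smoothed probability for every feature (word) i from the training counts. Schematically, following the standard formulation (where alpha is the smoothing parameter, 1.0 by default in scikit-learn’s MultinomialNB):

\hat{\theta}_{y,i} = \frac{N_{y,i} + \alpha}{N_y + \alpha\, n}

where N_{y,i} is the total count of feature i over the samples of class y, N_y is the total count of all features in class y, and n is the number of features.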

Typical applications:

  1. Classification of texts:
    Multinomial Naive Bayes is widely used in text classification, such as document categorization, spam filtering, sentiment analysis, and more.
  2. Data representation:
    It is commonly used when features can be represented as frequency counts, such as the number of occurrences of each word in a document.

Let’s look at an example of how to apply this algorithm:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample data: Documents and their binary labels
documents = [
    "This is a technology article about artificial intelligence.",
    "A recipe for a delicious chocolate cake.",
    "Tips for effective time management.",
    "Latest fashion trends for the season.",
    "Overview of sustainable energy sources.",
    "How to improve your programming skills.",
    "Healthy lifestyle habits for longevity.",
    "Travel guide to exotic destinations.",
]

# We assign binary labels indicating whether each document is about technology (1) or not (0)
labels = [1, 0, 0, 0, 1, 1, 0, 0]

# Creating a word count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Division of the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Training the Multinomial Naive Bayes model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluation of model accuracy
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred, zero_division=1))

Let’s analyze the various parts of the code:

  1. Sample data and labels:
   documents = [ ... ]  # the eight example documents shown above
   labels = [1, 0, 0, 0, 1, 1, 0, 0]

We define eight short documents and assign a binary label to each one: 1 if the document is about technology, 0 otherwise.

  2. Creating the word count matrix:
   vectorizer = CountVectorizer()
   X = vectorizer.fit_transform(documents)

CountVectorizer builds the vocabulary of the corpus and transforms each document into a vector of word counts, producing the feature matrix X.

  3. Division of the dataset into training and test sets:
   X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

We divide the dataset into training (80%) and test (20%) sets.

  4. Training the Multinomial Naive Bayes model:
   mnb = MultinomialNB()
   mnb.fit(X_train, y_train)

We create an instance of the Multinomial Naive Bayes model (MultinomialNB) and train it on the training set using the fit method.

  5. Evaluation of model accuracy and classification report:
   y_pred = mnb.predict(X_test)
   accuracy = accuracy_score(y_test, y_pred)
   print(f'Model Accuracy: {accuracy:.2f}')
   print("Classification report:")
   print(classification_report(y_test, y_pred, zero_division=1))

We predict the labels of the test set, calculate the accuracy with accuracy_score, and print a classification report. The zero_division=1 argument avoids warnings for classes that receive no predictions.
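If you want to inspect what the model actually receives as input, you can look at the vocabulary built by CountVectorizer and at the resulting count matrix. This is just a quick exploratory sketch reusing the vectorizer and X objects from the example above; get_feature_names_out is available in recent versions of scikit-learn (older versions expose get_feature_names instead):

# Words that make up the vocabulary of the corpus
print(vectorizer.get_feature_names_out())

# Word count matrix: one row per document, one column per word
print(X.toarray())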

Running the code, you get the following result; note that with only eight documents and a 20% test split, the test set contains just two samples, so the reported metrics are highly unstable:

Model Accuracy: 0.50
Classification report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.75      0.50      0.33         2
weighted avg       0.75      0.50      0.33         2

Bernoulli Naive Bayes

Bernoulli Naive Bayes is another variant of the Naive Bayes classifier which is based on the Bernoulli distribution of features. This variant is often used when dealing with binary data, where features can be present (1) or absent (0).
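Note that scikit-learn’s BernoulliNB exposes a binarize parameter (0.0 by default) that thresholds the input features into 0/1 values before fitting; this is why the example below can be fed non-binary data generated with make_classification. A minimal sketch of how you could set the threshold explicitly (the value 0.5 here is just an illustrative choice):

from sklearn.naive_bayes import BernoulliNB

# Features greater than 0.5 are treated as "present" (1), the others as "absent" (0)
bnb_custom = BernoulliNB(binarize=0.5)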

Fundamental principles:

  1. Bayes theorem: Like the other variants, Bernoulli Naive Bayes uses Bayes’ Theorem to compute the posterior probability of each class, assuming conditional independence of the features given the class label.
  2. Bernoulli distribution: Each feature is treated as a binary variable (present/absent), and for each class the model estimates the probability that the feature is present.

Typical applications:

  1. Classification of texts with binary bag-of-words features, where only the presence or absence of a word matters, such as spam filtering.

Example with Python code:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Let's generate a balanced example dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, n_clusters_per_class=1, random_state=42)

# Division of the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Evaluation of model accuracy
y_pred = bnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Confusion matrix visualization
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()

Running the code, you will get the following result:

Model Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

A confusion matrix is also displayed to better visualize the classification results of the model.
