The Naive Bayes algorithm is a probabilistic classifier based on Bayes’ theorem. It is often used for classification problems, where the goal is to assign a class or category to a data sample based on its features. The “naive” part of the name derives from the assumption of conditional independence between features, which greatly simplifies the calculation of probabilities.
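In formula terms, for a class y and a feature vector (x_1, …, x_n), Bayes’ theorem and the independence assumption combine as follows (a standard formulation of the classifier, written here in LaTeX notation):

P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \quad \text{(naive independence assumption)}

\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

The denominator is the same for every class, so the predicted class is simply the one that maximizes the numerator.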
Naive Bayes Classifiers in Scikit-Learn
In Python, you can use the scikit-learn library to implement the Naive Bayes classifier. Scikit-learn offers mainly three types of Naive Bayes classifiers, all available in the sklearn.naive_bayes module (see the short import sketch after this list):
- Gaussian Naive Bayes: This is suitable for continuous data where the features are assumed to follow a Gaussian (normal) distribution.
- Multinomial Naive Bayes: This is mainly used for discrete data such as texts. It is often applied in text analysis, for example for document classification.
- Bernoulli Naive Bayes: This is similar to multinomial, but is used for binary data, such as the presence or absence of certain words in a document.
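All three classifiers share the same fit/predict interface, so switching between them usually only requires changing the class you instantiate. A minimal import sketch (the class names are the actual scikit-learn ones; the comments summarize the list above):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
gnb = GaussianNB()      # continuous features, Gaussian likelihood per class
mnb = MultinomialNB()   # count features, e.g. word counts from CountVectorizer
bnb = BernoulliNB()     # binary (0/1) features, e.g. word presence/absence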
Gaussian Naive Bayes
Gaussian Naive Bayes is a variant of the Naive Bayes classifier that assumes the features of the dataset follow a normal (Gaussian) distribution. As with every Naive Bayes variant, the classifier applies Bayes’ theorem and exploits the conditional independence of the features given the class label; here, in addition, the distribution of each feature given the class is assumed to be Gaussian, which further simplifies the calculation of the conditional probabilities.
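Concretely, for each class y and feature x_i, GaussianNB estimates a per-class mean μ_{y,i} and variance σ²_{y,i} from the training data and evaluates the likelihood with the Gaussian density (this is the standard formulation, also used in the scikit-learn documentation):

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)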
Here is an example of how to use the Gaussian Naive Bayes classifier:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
# Let's generate a sample dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
# We divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Evaluation of model accuracy
accuracy = gnb.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')
# Let's create a meshgrid to visualize the decision boundary
h = .02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# We predict the classes for each point in the meshgrid
Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# We visualize the points of the dataset and the decision boundary
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k', marker='o')
plt.title('Gaussian Naive Bayes - Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Let’s analyze the various parts of the code together.
- Generating the example dataset:
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
We use scikit-learn’s make_classification function to generate an example dataset with 100 samples, 2 informative features, 0 redundant features, and 1 cluster per class.
- Division of the dataset into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We divide the generated dataset into training (80%) and testing (20%) sets.
- Training the Gaussian Naive Bayes model:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
We create an instance of the Gaussian Naive Bayes model (GaussianNB) and train it on the training set using the fit method.
- Evaluation of model accuracy:
accuracy = gnb.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')
We calculate the accuracy of the model on the test data using the score method, which returns the fraction of correctly classified test samples.
- Creating a meshgrid for viewing the decision boundary:
h = .02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Let’s create a meshgrid to cover the feature space.
- Prediction of classes for each point in the meshgrid:
Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
We apply the prediction of the Gaussian Naive Bayes model to each point of the meshgrid.
- Viewing results:
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k', marker='o')
plt.title('Gaussian Naive Bayes - Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Finally, running the code you will obtain the following accuracy value:
Model accuracy: 0.90
And then the following graph will be displayed.
The plot shows the points of the dataset and the decision boundary generated by the model. The points are colored according to their class, while the colored regions represent the decision boundary predicted by the Gaussian Naive Bayes model.
This example illustrates the training and evaluation process of a Gaussian Naive Bayes classifier, as well as the visualization of the resulting decision boundary.
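Besides score, the trained model also exposes predict and predict_proba, which are useful for inspecting individual predictions. A short sketch, reusing the gnb and X_test objects from the example above:

# Predicted class labels for the first three test samples
print(gnb.predict(X_test[:3]))
# Estimated probability of each class, one row per sample
print(gnb.predict_proba(X_test[:3]))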
Advantages and considerations:
- Simplicity: Gaussian Naive Bayes is known for its simplicity and ease of implementation.
- Good performance with data well approximated by the Gaussian: Works particularly well when the normal distribution assumption is reasonably accurate for the dataset features.
Limitations:
- Simplifying assumptions: The conditional independence assumption and the Gaussian distribution may be oversimplified for complex datasets or those with highly correlated features.
- Sensitive to outliers: Being based on the normal distribution, it is sensitive to outliers in features.
In summary, Gaussian Naive Bayes is an efficient and easy-to-implement classifier, particularly useful when the assumptions on the distribution of features are reasonably valid for the dataset under consideration. However, it is important to consider the limitations and carefully evaluate its suitability for the particular classification problem you are addressing.
Multinomial Naive Bayes
Multinomial Naive Bayes is a variant of the Naive Bayes classifier specifically designed for discrete data or frequency counts, typically used in text classification tasks. This algorithm assumes that the features follow a multinomial distribution, making it particularly suitable for data such as word counts in documents.
Fundamental principles:
- Bayes’ theorem: Multinomial Naive Bayes uses Bayes’ theorem to calculate the posterior probabilities of the classes given the feature counts.
- Assumption of conditional independence: Like the standard Naive Bayes classifier, Multinomial Naive Bayes assumes conditional independence of features given the class label. This assumption simplifies the calculation of conditional probabilities.
- Multinomial distribution: The distribution of features, represented by the counts, is assumed to follow a multinomial distribution. This means that the model takes into account the frequency of each word or feature within a class (see the formula sketch after this list).
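In practice, the per-class feature probabilities are estimated from the training counts with additive (Laplace/Lidstone) smoothing, controlled by the alpha parameter of MultinomialNB (default 1.0). Using the notation of the scikit-learn documentation:

\hat{\theta}_{y,i} = \frac{N_{y,i} + \alpha}{N_y + \alpha n}

where N_{y,i} is the number of times feature i appears in the training samples of class y, N_y is the total count of all features for class y, and n is the number of features.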
Typical applications:
- Classification of texts: Multinomial Naive Bayes is widely used in text classification, such as document categorization, spam filtering, sentiment analysis, and more.
- Data representation: It is commonly used when features can be represented as frequency counts, such as the number of occurrences of each word in a document.
Let’s see an example of how to apply this algorithm.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample data: Documents and their binary labels
documents = [
"This is a technology article about artificial intelligence.",
"A recipe for a delicious chocolate cake.",
"Tips for effective time management.",
"Latest fashion trends for the season.",
"Overview of sustainable energy sources.",
"How to improve your programming skills.",
"Healthy lifestyle habits for longevity.",
"Travel guide to exotic destinations.",
]
# We assign binary labels to indicate whether the document is technological or not
labels = [1, 0, 0, 0, 1, 1, 0, 0]
# Creating a word count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
# Division of the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
# Training the Multinomial Naive Bayes model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
# Evaluation of model accuracy
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')
# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred, zero_division=1))
Let’s analyze the various parts of the code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
- Import of libraries:
- CountVectorizer: A scikit-learn method that converts a collection of text documents into an array of word counts.
- MultinomialNB: The scikit-learn implementation of the multinomial Naive Bayes classifier.
- train_test_split: A scikit-learn function that splits the dataset into training and test sets.
- accuracy_score: A scikit-learn function to calculate the accuracy of the model.
- classification_report: A scikit-learn function that generates a detailed report of classification metrics.
documents = [
"This is a technology article about artificial intelligence.",
"A recipe for a delicious chocolate cake.",
"Tips for effective time management.",
"Latest fashion trends for the season.",
"Overview of sustainable energy sources.",
"How to improve your programming skills.",
"Healthy lifestyle habits for longevity.",
"Travel guide to exotic destinations.",
]
labels = [1, 0, 0, 0, 1, 1, 0, 0]
- Data definition:
- documents: A list of text documents representing various topics.
- labels: A list of binary labels (1 for technology documents, 0 for others).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
- Creating the word count matrix:
- CountVectorizer: A scikit-learn method that converts text documents into an array of word counts.
- fit_transform: Train the vectorizer and transform the documents into a matrix of word counts (see the quick sketch below).
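To see what the count matrix actually contains, you can inspect the learned vocabulary and the matrix itself. A quick sketch, reusing the vectorizer and X objects defined above:

# The learned vocabulary: each word is mapped to a column index
print(vectorizer.get_feature_names_out())
# X is a sparse matrix with one row per document and one column per word
print(X.shape)
# Word counts for the first document, as a dense array
print(X.toarray()[0])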
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
- Division of the dataset:
- train_test_split: A scikit-learn function that splits the dataset into training and test sets.
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
- Model Training:
- MultinomialNB: Scikit-learn’s multinomial Naive Bayes classifier.
- fit: Train the model on the training set.
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')
- Model evaluation:
- predict: Predicts the labels for the test set.
- accuracy_score: Calculates the accuracy of the model by comparing the predictions to the test labels.
- Finally, the model accuracy is printed.
print("Classification report:")
print(classification_report(y_test, y_pred, zero_division=1))
- Classification report:
- classification_report: Generates a detailed report of classification metrics (precision, recall, F1-score).
- zero_division=1: Controls what happens when a metric involves a division by zero (for example, when a class is never predicted); here the affected values are set to 1. It can be adjusted according to needs.
Executing the code, you get the following result (note that with only eight documents the test set contains just two samples, so these metrics are very noisy):
Model accuracy: 0.50
Classification report:
precision recall f1-score support
0 0.50 1.00 0.67 1
1 1.00 0.00 0.00 1
accuracy 0.50 2
macro avg 0.75 0.50 0.33 2
weighted avg 0.75 0.50 0.33 2
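Once the model is trained, any new text must be transformed with the same vectorizer (using transform, not fit_transform) before calling predict. A short sketch, reusing vectorizer and mnb from the example above (the sample sentence is purely illustrative):

# Transform a new document with the vocabulary learned during training
new_docs = ["An article about machine learning and artificial intelligence."]
X_new = vectorizer.transform(new_docs)
# Predict its label (1 = technology, 0 = other in this example)
print(mnb.predict(X_new))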
Bernoulli Naive Bayes
Bernoulli Naive Bayes is another variant of the Naive Bayes classifier which is based on the Bernoulli distribution of features. This variant is often used when dealing with binary data, where features can be present (1) or absent (0).
Fundamental principles:
- Bayes’ theorem: Bernoulli Naive Bayes applies Bayes’ theorem to calculate the posterior probabilities of the classes given the binary feature vector.
- Assumption of conditional independence: Like other Naive Bayes classifiers, Bernoulli Naive Bayes assumes conditional independence of features given the class label. This simplifies the calculation of conditional probabilities.
- Bernoulli distribution: The distribution of each feature is assumed to be of Bernoulli type, i.e. the features are binary (0 or 1) (see the formula sketch after this list).
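Under this assumption, each feature contributes to the likelihood according to whether it is present or absent; in the form used by the scikit-learn documentation:

P(x_i \mid y) = P(i \mid y)\, x_i + \bigl(1 - P(i \mid y)\bigr)\,(1 - x_i)

where P(i | y) is the estimated probability that feature i is present in class y. Unlike Multinomial Naive Bayes, the absence of a feature (x_i = 0) explicitly contributes to the probability.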
Typical applications:
- Binary data: It is commonly used when features are represented by binary data, such as the presence or absence of words in a document.
- Classification of texts: It can be applied to text classification problems, especially when using binary representations of words (e.g., recording only whether each word appears rather than how many times), as sketched below.
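For instance, word counts can be turned into presence/absence indicators by passing binary=True to CountVectorizer; a minimal sketch (the two sample sentences are purely illustrative, and the full example below uses synthetic numeric data instead):

from sklearn.feature_extraction.text import CountVectorizer

# binary=True records only whether a word appears (1) or not (0), not how many times
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform([
    "offer offer offer click now",
    "meeting agenda for monday",
])
print(X_binary.toarray())  # 0/1 matrix, suitable as input to BernoulliNB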
Example with Python code:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Let's generate a balanced example dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, n_clusters_per_class=1, random_state=42)
# Division of the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training the Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
# Evaluation of model accuracy
y_pred = bnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Confusion matrix visualization
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
xticklabels=["Negative", "Positive"],
yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
Note that make_classification generates continuous features; BernoulliNB binarizes them internally before fitting (by default at a threshold of 0, controlled by its binarize parameter). Executing the code, you will get the following result:
Model Accuracy: 1.00
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 10
accuracy 1.00 20
macro avg 1.00 1.00 1.00 20
weighted avg 1.00 1.00 1.00 20
The confusion matrix is also displayed as a heatmap, which makes it easier to see how the model’s predictions are distributed across the two classes.