The C4.5 algorithm
The C4.5 algorithm is a widely used machine learning algorithm for building decision trees. Decision trees in general are predictive models that can be used for both classification and regression problems; C4.5 itself builds classification trees. It is an improved version of the ID3 (Iterative Dichotomiser 3) algorithm. Both were developed by Ross Quinlan, and C4.5 was published in 1993. Here’s how the C4.5 algorithm works and how to use it in Python.
Operation of the C4.5 Algorithm:
The main goal of C4.5 is to build a decision tree that classifies examples based on the characteristics (predictor variables) of the training data. Here is an overview of the main steps of the C4.5 algorithm:
- Predictor Variable Selection: The C4.5 algorithm begins by selecting the predictor variable (feature) that offers the best split of the training data with respect to the target variable (the class to be predicted). This selection is based on the gain ratio, an entropy-based measure that divides the information gain of a split by its split information. (The Gini index is the criterion used by CART, a different decision tree algorithm.)
- Data Splitting: Once the predictor variable is selected, the C4.5 algorithm splits the training data based on the values of the predictor variable. For continuous variables, optimal cut points are identified.
- Recursion: The process of selecting the predictor variable and splitting the data is repeated recursively for each subset of data created by the previous split. The algorithm continues to build the tree until a stopping criterion is met, such as all examples in a node belonging to the same class or too few examples remaining to split further.
- Creation of leaf nodes: When a node is reached where all examples belong to the same class, or where no further useful split is possible, a leaf node is created that represents the prediction (the most common class among the examples at that node).
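The steps above can be sketched in a few dozen lines of plain Python. This is a simplified illustration, not Quinlan's implementation: the function names (`entropy`, `gain_ratio`, `candidate_splits`, `build_tree`, `predict`), the `max_depth` parameter, and the toy weather data are assumptions made for this example. Cut points are taken at midpoints between sorted values, and pruning and missing-value handling are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(groups, labels):
    """Information gain of a split divided by its split information.

    `groups` holds one list of labels per branch of the split.
    """
    groups = [g for g in groups if g]
    n = len(labels)
    weights = [len(g) / n for g in groups]
    gain = entropy(labels) - sum(w * entropy(g) for w, g in zip(weights, groups))
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info if split_info > 0 else 0.0

def candidate_splits(rows, feature):
    """Yield (threshold, partition) pairs; partition maps branch key -> row indices."""
    values = [row[feature] for row in rows]
    if all(isinstance(v, (int, float)) for v in values):
        # Continuous feature: try a binary split between each pair of neighbours.
        distinct = sorted(set(values))
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            part = {"<=": [], ">": []}
            for i, v in enumerate(values):
                part["<=" if v <= threshold else ">"].append(i)
            yield threshold, part
    else:
        # Categorical feature: one branch per observed value.
        part = {}
        for i, v in enumerate(values):
            part.setdefault(v, []).append(i)
        if len(part) > 1:
            yield None, part

def build_tree(rows, labels, features, depth=0, max_depth=5):
    """Recursively grow a tree; leaves are class labels, inner nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    best = None
    for feature in features:
        for threshold, part in candidate_splits(rows, feature):
            groups = [[labels[i] for i in idx] for idx in part.values()]
            score = gain_ratio(groups, labels)
            if best is None or score > best[0]:
                best = (score, feature, threshold, part)
    if best is None or best[0] <= 0:
        return majority  # no split improves purity: make a leaf
    _, feature, threshold, part = best
    children = {
        key: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                        features, depth + 1, max_depth)
        for key, idx in part.items()
    }
    return {"feature": feature, "threshold": threshold,
            "children": children, "default": majority}

def predict(tree, row):
    """Walk from the root to a leaf; unseen categories fall back to the majority class."""
    while isinstance(tree, dict):
        value = row[tree["feature"]]
        if tree["threshold"] is None:
            key = value
        else:
            key = "<=" if value <= tree["threshold"] else ">"
        tree = tree["children"].get(key, tree["default"])
    return tree

# Toy data (made up for this example): one categorical and one continuous feature.
rows = [
    {"outlook": "sunny", "humidity": 85},
    {"outlook": "sunny", "humidity": 90},
    {"outlook": "overcast", "humidity": 78},
    {"outlook": "rain", "humidity": 96},
    {"outlook": "rain", "humidity": 80},
    {"outlook": "sunny", "humidity": 70},
]
labels = ["no", "no", "yes", "no", "yes", "yes"]

tree = build_tree(rows, labels, ["outlook", "humidity"])
print(tree["feature"], tree["threshold"])  # humidity 82.5
print(predict(tree, {"outlook": "rain", "humidity": 75}))  # yes
```

On this toy data the humidity cut at 82.5 separates the two classes perfectly, so its gain ratio is 1.0 and it wins over the three-way split on outlook. Dividing by split information is what keeps a feature with many distinct values from being favoured simply because it produces many small branches.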
A bit of history
The C4.5 algorithm was developed by Ross Quinlan and is one of the milestones in the history of machine learning. Its roots go back to Quinlan’s work on ID3, which he described in the 1986 paper “Induction of Decision Trees”. C4.5 itself was presented in his 1993 book “C4.5: Programs for Machine Learning”.
Here are some important milestones in the history of the C4.5 algorithm:
- ID3 (Iterative Dichotomiser 3): Before C4.5, Quinlan had developed a predecessor called ID3. ID3 builds decision trees using the concept of entropy and selects the best splitting variable based on information gain. C4.5 was created as an improvement on ID3.
- C4.5: C4.5 introduced many improvements over ID3. For example, C4.5 can handle data with continuous variables, while ID3 only dealt with categorical variables. C4.5 also introduced the “gain ratio” for split variable selection, which normalizes the information gain of a split by its split information and so reduces the bias toward variables with many categories. It also added handling of missing values and pruning of the finished tree. These improvements have made it one of the most widely used and effective decision tree algorithms.
- Success and usage: C4.5 has achieved great success in the machine learning community and has found applications in a wide range of fields, including medicine, finance, engineering, and more. Its ease of interpretation and ability to handle both categorical and continuous predictors have made it a valuable tool for data analysts and machine learning experts.
- Subsequent development: C4.5 influenced many later decision tree algorithms. Tree-based ensemble methods such as Random Forest and Gradient Boosting rely on the same idea of building decision trees, although they typically use CART-style trees rather than C4.5 itself, and they have further expanded the field of machine learning.
The importance of C4.5 in the history of machine learning cannot be overstated. It laid the foundation for numerous subsequent developments in decision trees and contributed significantly to the spread and use of machine learning in a wide range of practical applications.
The C4.5 algorithm with Python
Scikit-learn, one of the most popular libraries for machine learning in Python, does not include a specific implementation of the C4.5 algorithm. Its decision trees are based on an optimized version of CART and use the Gini index as the default criterion for variable selection. You can set the criterion to entropy so that splits are chosen by information gain, which brings the behavior closer to C4.5. It is still not the same algorithm, since scikit-learn does not use the gain ratio and always produces binary splits.
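As a minimal sketch of that setting, the snippet below trains a scikit-learn `DecisionTreeClassifier` with `criterion="entropy"`. The Iris dataset, the 70/30 split, `max_depth=3` and the fixed `random_state` are choices made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in classification dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# criterion="entropy" chooses splits by information gain. This is still
# scikit-learn's CART-style binary tree, not a C4.5 implementation.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Limiting the depth is a simple stand-in for the pruning step that C4.5 applies after growing the tree. Scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter, which is a different pruning method from the one C4.5 uses.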
If you have a specific need to work with the C4.5 algorithm in Python, you can look for third-party libraries or implementations that support it. However, please note that Quinlan’s original C4.5 code is copyrighted software, so check the license terms of any implementation you use. C4.5 has also been superseded by C5.0 (its successor) and other variants of decision tree algorithms.
On GitHub there are open-source Python implementations of the C4.5 algorithm, which R. Quinlan originally developed.
Another option is “C5.0 for Python,” which implements a version compatible with C5.0 rather than C4.5 itself. C5.0 has additional features over C4.5 and is often used for commercial purposes. You can find more information about the C5.0 library for Python on GitHub or on other websites that provide the implementation.
Remember that when using any library or algorithm implementation, it is important to respect copyrights and patents, if applicable.