Multiclass Classification with Auto-Tuning CatBoost
CatBoost with HyperOpt becomes a perfect tool for multiclass classification…
In this article we will work with the Date Fruit dataset from Kaggle. It covers 7 types of date fruit described by 34 features, and our task is to classify each fruit into one of the 7 classes. You can read more about the dataset at the link below.
After going through the above link, we understand that the external appearance of a date fruit strongly reflects its type. The 34 features, which cover morphology, shape and color, were extracted from 898 images of the fruit.
Techniques ranging from Logistic Regression to ANN have previously been applied to this dataset, achieving 91.0% and 92.2% performance, respectively. Stacking the models gave a maximum performance of 92.8%.
We will use CatBoost, tune its hyperparameters with HyperOpt, and check whether we can match or beat the previous results.
Let us now import the dataset and do some analysis.
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, confusion_matrix,precision_score
import catboost as ctb
from hyperopt import hp
from hyperopt import fmin, tpe, STATUS_OK, STATUS_FAIL, Trials

!pip install openpyxl # to read excel file

#reading the Kaggle dataset
data_path = "../input/date-fruit-datasets/Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx"
data=pd.read_excel(data_path)
df = data.copy()

#To understand the data distribution
df.describe()

#To understand the data types
df.dtypes

#To check for nulls in data
df.isnull().sum()

#checking class distribution
df['Class'].value_counts()*100/df.shape[0]
Running the above code in a Kaggle Notebook imports the Date Fruit data into a pandas DataFrame (df). We then run a few commands to analyze the data. The observations are:
- After describing the data, we see that the mean and median (50th percentile) of most features are very close. This suggests that those features are roughly symmetric; on its own it does not prove that they are normally distributed.
- On checking the data types, we see that all variables except the target (Class) are numeric.
- There are no missing values in the dataset.
- The seven classes of date fruit are not uniformly distributed. The percentage share of each class is shown in the image below.
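The mean-versus-median observation can be checked more directly with a skewness statistic. The sketch below uses synthetic stand-in columns (not the Date Fruit features) to show how a symmetric and a right-skewed column differ; the column names and distributions are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Date Fruit features): one symmetric
# column and one right-skewed column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=10.0, scale=2.0, size=5000),
    "right_skewed": rng.exponential(scale=2.0, size=5000),
})

# Mean close to median and skewness close to 0 both point to symmetry.
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

On the real data, the same summary could be produced with `df.drop(columns='Class').skew()`. Tree-based boosting models do not require normally distributed inputs, so this check is informational rather than a prerequisite.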
With the above analysis, and knowing that Logistic Regression and ANN were previously used on this dataset (as mentioned in the dataset description on Kaggle), we turn to another powerful family of models.
The boosting ones
Some of the common boosting algorithms that can be used here are XGBoost, LightGBM and CatBoost. A comparison of these algorithms is given in the article below.
From the above article we can conclude that LightGBM comes out ahead of CatBoost and XGBoost in both performance and execution speed. CatBoost is next in speed, followed by XGBoost. That said, the performance of the three algorithms is almost the same.
This time we will choose CatBoost (the newest and least explored of the lot) and try to set a new benchmark.
Now, let us label encode the classes and create the train and test datasets.
#label encoding the Class feature
le = LabelEncoder()
df['label'] = le.fit_transform(df.Class.values)

#Creating train and test datasets by defining dependent and independent features
X,Y=df.drop(['Class','label'],axis=1),df['label']
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True)
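Because the classes are not uniformly distributed, one option worth considering is a stratified split, which keeps the class proportions the same in the train and test sets. This is not what the article's code does. The sketch below is a variation on toy labels, and `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for the encoded 'label' column.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=y_demo,  # preserve the 60/30/10 class ratio in both splits
)

# Count how many test samples fall into each class.
test_counts = np.bincount(y_te, minlength=3)
print(test_counts)
```

Applied to the article's data, this would amount to adding `stratify=Y` to the existing `train_test_split` call. The resulting metrics would then differ slightly from the ones reported here.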
As we are using CatBoost, let's explore the hyperparameters that can be tuned; they are listed at the link below.
From the above link we can see that there are quite a lot of hyperparameters, each with a wide range of values. Some of the hyperparameters and their ranges are defined in the code below.
#define parameter range
learning_rate=np.linspace(0.01,0.1,10)
max_depth=np.arange(2, 18, 2)
colsample_bylevel=np.arange(0.3, 0.8, 0.1)
iterations=np.arange(50, 1000, 50)
l2_leaf_reg=np.arange(0,10)
bagging_temperature=np.arange(0,100,10)

#define the categorical features if any in the dataset for catboost to handle
#(np.object was removed in NumPy 1.24; the builtin object is equivalent)
categorical_features_indices = np.where(X_train.dtypes == object)[0]
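To get a feel for how large this space is, the sketch below counts the combinations for just three of the ranges above (learning_rate, max_depth and iterations). The `grid` dictionary and `n_combinations` are names introduced here for illustration only.

```python
import numpy as np

# The same three ranges as defined above.
learning_rate = np.linspace(0.01, 0.1, 10)
max_depth = np.arange(2, 18, 2)
iterations = np.arange(50, 1000, 50)

grid = {
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "iterations": iterations,
}

# An exhaustive grid search would have to train one model per combination.
n_combinations = int(np.prod([len(values) for values in grid.values()]))
print(n_combinations)  # 10 * 8 * 19 = 1520
```

Every additional hyperparameter multiplies this count, which is why a guided search that evaluates only a handful of combinations is attractive.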
Manually tuning all the hyperparameters to find the values that give the best model is very difficult. For this reason we use an automatic hyperparameter tuning package called Hyperopt. Hyperopt is a Python package for distributed asynchronous hyperparameter optimization. Details of Hyperopt can be found at the link below.
Hyperopt has three main components:
1. Objective function:
The objective function is the task at hand. It might evaluate a simple algebraic expression, run a simple if statement that results in an action, or train a machine learning model whose hyperparameters we want to optimize.
2. Search space:
This is the space of values the objective function can use. In our case we call it the parameter space. Here we define the range of values each hyperparameter can take, as in the code above.
3. Minimization function:
This is the function that minimizes the loss returned by the objective function over the search space.
Defining the right loss function for each objective is key to getting the best combination of hyperparameters. The code for all the above steps to tune the hyperparameters of CatBoost is shown below.
#Define parameter space, fit conditions
ctb_clf_params = {
    'learning_rate': hp.choice('learning_rate', learning_rate),
    'max_depth': hp.choice('max_depth', max_depth),
    #'colsample_bylevel': hp.choice('colsample_bylevel', colsample_bylevel),
    'iterations': hp.choice('iterations', iterations),
    #'l2_leaf_reg': hp.choice('l2_leaf_reg', l2_leaf_reg),
    #'bagging_temperature': hp.choice('bagging_temperature', bagging_temperature),
    'loss_function': 'MultiClass',
}
ctb_fit_params = {
    'early_stopping_rounds': 5,
    'verbose': False,
    'cat_features': categorical_features_indices
}
ctb_para = dict()
ctb_para['clf_params'] = ctb_clf_params
ctb_para['fit_params'] = ctb_fit_params
In the above code we define the parameter space along with the fit conditions. (Some of the parameter spaces are commented out because tuning them decreased performance, so their default values are used. They are left in only to show how they would be written.) The fit conditions control how the model is fitted.
#define Hyperopt class
class HYPOpt(object):
    def __init__(self, x_train, x_test, y_train, y_test):
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test

    def process(self, fn_name, space, trials, algo, max_evals):
        fn = getattr(self, fn_name)
        try:
            print('entering fmin')
            result = fmin(fn=fn, space=space, algo=algo, max_evals=max_evals, trials=trials) #----1
        except Exception as e:
            return {'status': STATUS_FAIL,
                    'exception': str(e)}
        return result

    def ctb_clf(self, para): #---- 2
        # the key must match ctb_para['clf_params'] defined earlier
        clf = ctb.CatBoostClassifier(**para['clf_params'])
        print('ctb initialized')
        return self.train_clf(clf, para)

    def train_clf(self, clf, para): #----- 3
        print('fitting model')
        clf.fit(self.x_train, self.y_train,
                eval_set=[(self.x_train, self.y_train), (self.x_test, self.y_test)],
                **para['fit_params'])
        print('model fitted')
        pred = clf.predict(self.x_test)
        # f1_score is imported directly from sklearn.metrics at the top
        f1 = f1_score(self.y_test, pred, average='micro')
        f1 = f1*(-1) #---- 4
        print(f1)
        return {'loss': f1, 'status': STATUS_OK}
#1- Defining inbuilt Hyperopt minimization function
#2- Objective function
#3- Function to fit the model
#4- Loss function (negative F1 used as loss)
In the above code, we create a Hyperopt class and define an objective function. Here the objective is to build a CatBoost model. We use the function ctb_clf to define the model and train_clf to train it.
We also define a loss function, using the F1 score as the metric. Hyperopt's built-in fmin method always minimizes the loss, so we return the negative F1 score to fmin, which in turn maximizes F1.
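A small sketch of the loss used here, on made-up labels. It shows the negation that turns a score to maximize into a loss to minimize. It also shows that, for single-label multiclass problems, micro-averaged F1 coincides with accuracy. The label lists are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels for a 3-class problem (6 of 8 correct).
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

# Hyperopt minimizes, so the score is negated before being returned as 'loss'.
loss = -micro_f1

print(micro_f1, accuracy, loss)  # 0.75 0.75 -0.75
```

Since micro F1 behaves like accuracy here, a macro-averaged F1 (`average='macro'`) may be a reasonable alternative loss when performance on the smaller classes matters. That would be a change from what the article's code does.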
#define objective and find the best hyperparameters
obj = HYPOpt(X_train, X_test, y_train, y_test)
ctb_opt = obj.process(fn_name='ctb_clf', space=ctb_para, trials=Trials(), algo=tpe.suggest, max_evals=5)
In the above code, we instantiate the class and call the process function to start the hyperparameter optimization. Five CatBoost models (max_evals=5) are trained with different combinations of parameters from the parameter space ctb_para. The parameters producing the minimum loss are returned by the process function and stored in ctb_opt, as shown in figure 2.
The above output holds the indices of the chosen values within each hyperparameter range, not the values themselves. We need to store the actual values, so we look them up and keep them in a dictionary, as shown in the code below.
#Save best parameters in a dictionary
best_param={}
best_param['learning_rate']=learning_rate[ctb_opt['learning_rate']]
best_param['iterations']=iterations[ctb_opt['iterations']]
best_param['max_depth']=max_depth[ctb_opt['max_depth']]
The optimal parameters chosen by Hyperopt look something like this:
We use the best parameters from Hyperopt to build our final CatBoost model and test it on the test data.
#build model with the best hyperparameters
model=ctb.CatBoostClassifier(iterations=best_param['iterations'],
                             depth=best_param['max_depth'],
                             learning_rate=best_param['learning_rate'],
                             loss_function='MultiClass',
                             random_seed=42
                             )
model.fit(X_train,y_train,cat_features=categorical_features_indices, eval_set=None, plot=True)
With the best model built, we need to verify the results on the test data. We will look at several metrics, from the confusion matrix to the train and test accuracy, to check for overfitting. We will also look at micro, macro and weighted precision, recall and F1 score to understand how the model performs across classes.
The code below checks the accuracy on the train and test data and prints the confusion matrix.
train_pred = model.predict(X_train)
train_acc = accuracy_score(y_train,train_pred)
print('Train Accuracy: ', train_acc)
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test,test_pred)
print('Test Accuracy:', test_acc)
y_pred=test_pred
confusion = confusion_matrix(y_test, test_pred)
print('Confusion Matrix\n')
print(confusion)
We can check the micro, macro and weighted precision, recall and F1 score with the code below.
print('Micro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='micro')))
print('Micro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='micro')))
print('Micro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='micro')))
print('Macro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='macro')))
print('Macro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='macro')))
print('Macro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='macro')))
print('Weighted Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='weighted')))
print('Weighted Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='weighted')))
print('Weighted F1-score: {:.2f}'.format(f1_score(y_test, y_pred, average='weighted')))
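To see how the three averaging modes differ, the sketch below derives them by hand from a toy confusion matrix. The matrix values are invented for illustration, and the layout follows the scikit-learn convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Toy 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 0, 5]])

tp = np.diag(cm).astype(float)   # correctly classified samples per class
support = cm.sum(axis=1)         # number of true samples per class

precision = tp / cm.sum(axis=0)  # column sums = samples predicted as each class
recall = tp / support            # row sums = samples that truly belong to each class
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by support
micro_f1 = tp.sum() / cm.sum()                  # pooled counts; equals accuracy here

print(precision, recall, f1)
print(macro_f1, weighted_f1, micro_f1)
```

Macro averaging treats a small class the same as a large one, so it is the mode most sensitive to weak performance on rare classes. Weighted and micro averages are dominated by the larger classes.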
The output of all the code is shown in the following figures.
From the above figures we can see that the test accuracy is equal to the best result previously reported for this dataset. The micro and weighted metrics exceed the best results.
Summary
In this article we implemented CatBoost on a Kaggle dataset. Some of the key points we learnt are:
- CatBoost is a very powerful model. It is particularly useful when the independent features include categorical variables.
- CatBoost has a large number of parameters to tune.
- HyperOpt is an efficient and effective way to find the best hyperparameters for almost all ML models.
The link to the Kaggle notebook with the full code is below: