Multiclass Classification with Auto-Tuning CatBoost

NandaKishore Joshi
8 min read · May 26, 2022

CatBoost with HyperOpt becomes a perfect tool for multiclass classification…

Image from Kaggle

In this article we will work on the Kaggle Date Fruit dataset. The dataset covers 7 types of date fruit described by 34 features, and our task is to classify each fruit into one of these 7 classes. You can read more about the dataset in the link below

After going through the above link, we understand that the external appearance of a date fruit greatly influences its type. The 34 features, covering morphology, shape and colour, were extracted from 898 images of the fruit.

Different techniques, from Logistic Regression to ANN, have previously been used on this dataset. The performance achieved with these two methods was 91.0% and 92.2%, respectively, and stacking the models yielded a maximum performance of 92.8%.

We will use CatBoost, tune its hyperparameters with HyperOpt, and check whether we can match or improve on the previous results.

Let us now import the dataset and do some analysis.

import numpy as np 
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, confusion_matrix,precision_score

import catboost as ctb
from hyperopt import hp
from hyperopt import fmin, tpe, STATUS_OK, STATUS_FAIL, Trials
!pip install openpyxl  # to read excel files

#reading the Kaggle dataset
data_path = "../input/date-fruit-datasets/Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx"
data = pd.read_excel(data_path)
df = data.copy()

#To understand the data distribution
df.describe()
#To understand the data types
df.dtypes
#To check for nulls in data
df.isnull().sum()
#checking class distribution
df['Class'].value_counts()*100/df.shape[0]

By running the above code in the Kaggle notebook, we import the Date Fruit data into a pandas DataFrame (df). We then run a few commands to analyze the data; the observations are:

  1. Describing the data shows that the mean and median (50th percentile) are very close for most features, which suggests the feature distributions are roughly symmetric.
  2. Checking the datatypes shows that, except for the target variable (Class), all variables are numeric.
  3. There are no missing values in the dataset.
  4. The seven date fruit classes are not uniformly distributed. The distribution percentage of each class is shown in the image below, and a short sketch to reproduce these checks follows Figure 1.
Figure 1 : Class distribution (in %)
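These observations can be reproduced with a couple of pandas calls. The sketch below is illustrative only (matplotlib is assumed to be available in the Kaggle environment; it is not imported in the original notebook):

import matplotlib.pyplot as plt

#skewness close to 0 supports the "mean close to median" observation
print(df.skew(numeric_only=True).sort_values())

#class balance in percent, as plotted in Figure 1
class_pct = df['Class'].value_counts() * 100 / df.shape[0]
class_pct.plot(kind='bar', title='Class distribution (in %)')
plt.show()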

With the above analysis, and knowing that Logistic Regression and ANN have already been applied to this dataset (as mentioned in the dataset description on Kaggle), we are left to try another powerful family of models.

The Boosting ones

Some of the common boosting algorithms that can be used here are XGBoost, LightGBM and CatBoost. A comparison of these algorithms is explained in the article below

From the above article we can conclude that LightGBM edges out CatBoost and XGBoost, mainly in execution speed; CatBoost is the next fastest, followed by XGBoost, while the predictive performance of the three algorithms is almost the same.

This time we will choose CatBoost (the newest and least explored of the lot) and try to set a new benchmark.

Now, let us label-encode the classes and create the train and test datasets

#label encoding the Class feature
le = LabelEncoder()
df['label'] = le.fit_transform(df.Class.values)

#Creating train and test datasets by defining dependent and independent features
X, Y = df.drop(['Class', 'label'], axis=1), df['label']
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True)
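Since the classes are imbalanced (Figure 1), an optional variation is to stratify the split so that each class keeps roughly the same share in the train and test sets. This is not what the original notebook does; it is only a hedged alternative:

#optional: stratified split to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True,
                                                    stratify=Y)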

As we are using CatBoost, let's explore the set of hyperparameters that can be tuned, described in the link below

From the above link we can see that there are quite a lot of hyperparameters with wide ranges of values. Some of these hyperparameters and their value ranges are defined in the code below

#define parameter range
learning_rate=np.linspace(0.01,0.1,10)
max_depth=np.arange(2, 18, 2)
colsample_bylevel=np.arange(0.3, 0.8, 0.1)
iterations=np.arange(50, 1000, 50)
l2_leaf_reg=np.arange(0,10)
bagging_temperature=np.arange(0,100,10)
#define the categorical features, if any, in the dataset for CatBoost to handle
categorical_features_indices = np.where(X_train.dtypes == object)[0]

Manually tuning all the hyperparameters and finding the right values to build the best model is very difficult. For this reason we use an automatic hyperparameter tuning package called Hyperopt, a Distributed Asynchronous Hyperparameter Optimization library written in Python. Details of Hyperopt are given in the link below.

Hyperopt has three main components:

  1. Objective function :

The objective function is the task at hand. It might be solving a simple linear algebra equation, a simple if statement resulting in an action, or training a machine learning model whose hyperparameters we want to find.

2. Search space :

This is the space of values the objective function can explore. In our case we call it the parameter space; it defines the range of values each hyperparameter can take, as in the code piece above.

3. Minimization function :

This is the function that minimizes the loss generated by the objective function over the parameter space (search space).
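Before applying these pieces to CatBoost, the following minimal, self-contained sketch shows how the three components fit together on a toy problem (purely illustrative, not part of the original notebook): minimizing (x - 3)^2 over x in [-5, 5].

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

def objective(x):                 # 1. objective function
    return {'loss': (x - 3) ** 2, 'status': STATUS_OK}

space = hp.uniform('x', -5, 5)    # 2. search space

trials = Trials()
best = fmin(fn=objective,         # 3. minimization function
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)
print(best)                       # a value close to {'x': 3.0}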

Defining the right loss function for each objective is key to getting the best combination of hyperparameters. The code for all the above steps to tune the hyperparameters for CatBoost is shown below

#Define the parameter space and fit conditions
ctb_clf_params = {
    'learning_rate': hp.choice('learning_rate', learning_rate),
    'max_depth': hp.choice('max_depth', max_depth),
    #'colsample_bylevel': hp.choice('colsample_bylevel', colsample_bylevel),
    'iterations': hp.choice('iterations', iterations),
    #'l2_leaf_reg': hp.choice('l2_leaf_reg', l2_leaf_reg),
    #'bagging_temperature': hp.choice('bagging_temperature', bagging_temperature),
    'loss_function': 'MultiClass',
}
ctb_fit_params = {
    'early_stopping_rounds': 5,
    'verbose': False,
    'cat_features': categorical_features_indices
}
ctb_para = dict()
ctb_para['clf_params'] = ctb_clf_params
ctb_para['fit_params'] = ctb_fit_params

In the above code we define the parameter space along with the fit conditions. (Some of the parameters are commented out because tuning them decreased performance, so their default values are used; they are kept only to give an idea of a fuller implementation.) The fit conditions define how the model is fitted.

#define the Hyperopt class
class HYPOpt(object):

    def __init__(self, x_train, x_test, y_train, y_test):
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test

    def process(self, fn_name, space, trials, algo, max_evals):
        fn = getattr(self, fn_name)
        try:
            print('entering fmin')
            result = fmin(fn=fn, space=space, algo=algo, max_evals=max_evals, trials=trials) #----1
        except Exception as e:
            return {'status': STATUS_FAIL,
                    'exception': str(e)}
        return result

    def ctb_clf(self, para): #---- 2
        clf = ctb.CatBoostClassifier(**para['clf_params'])
        print('ctb initialized')
        return self.train_clf(clf, para)

    def train_clf(self, clf, para): #----- 3
        print('fitting model')
        clf.fit(self.x_train, self.y_train,
                eval_set=[(self.x_train, self.y_train), (self.x_test, self.y_test)],
                **para['fit_params'])
        print('model fitted')
        pred = clf.predict(self.x_test)
        f1 = f1_score(self.y_test, pred, average='micro')
        f1 = f1 * (-1) #---- 4
        print(f1)
        return {'loss': f1, 'status': STATUS_OK}

#1- Defining inbuilt Hyperopt minimization function

#2- Objective function

#3- Function to fit the model

#4- Loss function (negative F1 used as loss)

In the above code, we create a Hyperopt class and define an objective function. Here the objective is to build a CatBoost model: the ctb_clf function defines the model and train_clf trains it.

We also define a loss function, using F1 as the metric. Hyperopt always tries to minimize the loss with its built-in fmin method, so we pass the negative F1 into fmin in order to maximize the F1 score.

#define objective and find the best hyperparameters

obj = HYPOpt(X_train, X_test, y_train, y_test)
ctb_opt = obj.process(fn_name='ctb_clf', space=ctb_para, trials=Trials(), algo=tpe.suggest, max_evals=5)

In the above code, we instantiate the class and call the process function to start the hyperparameter optimization. Five runs (max_evals=5) of CatBoost are evaluated with different combinations of parameters from the parameter space ctb_para. The parameters producing the minimum loss are stored as the output of the process function in ctb_opt, as shown in Figure 2.

Figure 2 : Indexes of Optimal parameters
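The Trials object passed to process also records every evaluation, so the search itself can be inspected afterwards. A small sketch, assuming the Trials object is kept in a named variable instead of being created inline as above:

trials = Trials()
ctb_opt = obj.process(fn_name='ctb_clf', space=ctb_para, trials=trials,
                      algo=tpe.suggest, max_evals=5)

#losses recorded for the five evaluations (negative micro F1 in our setup)
print([r['loss'] for r in trials.results if r.get('status') == STATUS_OK])
#the best (lowest-loss) trial
print(trials.best_trial['result'])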

The output in Figure 2 contains the indexes of the chosen values within each hyperparameter range, not the values themselves, so we need to map them back and store the actual values. A dictionary is used for this, as shown in the code below

#Save the best parameters in a dictionary
best_param = {}
best_param['learning_rate'] = learning_rate[ctb_opt['learning_rate']]
best_param['iterations'] = iterations[ctb_opt['iterations']]
best_param['max_depth'] = max_depth[ctb_opt['max_depth']]
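As an alternative to the manual dictionary above, Hyperopt also ships a space_eval helper that maps the returned indexes back to the actual values in one call (a hedged sketch, equivalent in intent to the code above):

from hyperopt import space_eval

#map the index-based result of fmin back to actual hyperparameter values
best_values = space_eval(ctb_para, ctb_opt)
print(best_values['clf_params'])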

The optimal parameters chosen by Hyperopt look something like this

Figure 3 : Actual value of Optimal parameters

With the best parameters from Hyperopt in hand, we build our final CatBoost model and test it on the test data.

#build the model with the best hyperparameters
model = ctb.CatBoostClassifier(iterations=best_param['iterations'],
                               depth=best_param['max_depth'],
                               learning_rate=best_param['learning_rate'],
                               loss_function='MultiClass',
                               random_seed=42)
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=None, plot=True)
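Optionally, once the final model is fitted, CatBoost's built-in feature importances can show which of the 34 features drive the predictions. This is a small sketch and not part of the original notebook:

#rank features by CatBoost's default importance scores
importances = pd.Series(model.get_feature_importance(), index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))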

With the best model built, we need to verify the results on the test data. We will compare the train and test accuracy and inspect the confusion matrix to check for overfitting. We will also look at the micro, macro and weighted precision, recall and F1 score to understand the model's performance on each class.

The code below checks the accuracy on the train and test data and computes the confusion matrix

train_pred = model.predict(X_train)
train_acc = accuracy_score(y_train,train_pred)
print('Train Accuracy: ', train_acc)

test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test,test_pred)
print('Test Accuracy:', test_acc)
y_pred=test_pred
confusion = confusion_matrix(y_test, test_pred)
print('Confusion Matrix\n')
print(confusion)

We can check the micro, macro and weighted precision, recall and F1 score with the help of the code below

print('Micro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='micro')))
print('Micro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='micro')))
print('Micro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='micro')))

print('Macro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='macro')))
print('Macro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='macro')))
print('Macro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='macro')))

print('Weighted Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='weighted')))
print('Weighted Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='weighted')))
print('Weighted F1-score: {:.2f}'.format(f1_score(y_test, y_pred, average='weighted')))
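For a per-class view of the same metrics, sklearn's classification_report prints precision, recall and F1 for each of the seven classes in one call. A short sketch (the class names come from the LabelEncoder fitted earlier):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=le.classes_))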

The outputs of the above code are shown in the following figures

Figure 4 : Train and Test Accuracy
Figure 5 : Confusion matrix
Figure 6 : Micro, Macro, Weighted metrics

From the above figures we can see that the test accuracy equals the best previously reported result for this dataset, and the micro and weighted metrics exceed it.

Summary

In this article we applied CatBoost to a Kaggle dataset. Some of the key points we learnt are:

  1. CatBoost is a very powerful model. It is especially useful when the independent features contain categorical variables.
  2. CatBoost has a large number of hyperparameters to tune.
  3. HyperOpt is an efficient and effective way to find the best hyperparameters for almost all ML models.

The link to the Kaggle notebook with the full code is below:
