Category: Catboost

  • Creating a Custom Loss Function For Machine Learning Models

While standard machine learning libraries provide a vast array of loss functions out of the box, sometimes we need to create our own custom loss function. In this blog post, I’ll go over a simple example and create a custom loss function in Catboost.

    First we will create the data for training.

    # Importing libraries
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from catboost import CatBoostRegressor, Pool
    from sklearn.datasets import fetch_california_housing

    raw_data = fetch_california_housing()

data = pd.concat([pd.DataFrame(raw_data['data'], columns=raw_data['feature_names']),
                  pd.Series(raw_data['target'], name='target')], axis=1)

    features = [i for i in data.columns.tolist() if i != 'target']

Since the objective is not to create the best model possible, we won’t be doing any feature engineering. Let’s use catboost and create a model using a standard loss function.

    model = CatBoostRegressor(loss_function='RMSE', n_estimators=100, eval_metric='RMSE')

    cb_pool = Pool(data=data[features], label=data['target'], feature_names=features)

    model.fit(cb_pool)

    predictions = model.predict(cb_pool)

    mean_squared_error(y_true=data['target'], y_pred=predictions)

Upon evaluating the model, we find that the mean squared error is 0.15. The model is definitely overfitting, since we are evaluating on the training data, but that’s not a concern for this tutorial.

But what if you don’t want to use RMSE as a loss function, and instead want to use something like this –

    loss = \frac{\sum (y - \hat{y})^{4}}{n}

    Then how do you create a loss function in catboost?

    For this, you’ll need to calculate the first derivative and the second derivative of the loss function with respect to \hat{y}.

    Using the chain rule, the first derivative is

\frac{\partial (y-\hat{y})^4}{\partial \hat{y}} = \frac{\partial (y-\hat{y})^4}{\partial (y-\hat{y})} \cdot \frac{\partial (y-\hat{y})}{\partial \hat{y}} = 4(y - \hat{y})^{3} \cdot (-1) = -4(y -\hat{y})^{3}

And similarly, using the chain rule, the second derivative comes out to be 12(y-\hat{y})^{2}.
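
If you’d rather not do the calculus by hand, a quick sanity check with sympy (assuming it is installed) confirms both derivatives:

import sympy as sp

y, y_hat = sp.symbols('y y_hat')
loss = (y - y_hat)**4

print(sp.diff(loss, y_hat))     # -4*(y - y_hat)**3
print(sp.diff(loss, y_hat, 2))  # 12*(y - y_hat)**2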

    The catboost template for a custom objective is as follows –

    class UserDefinedObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            """
            Computes first and second derivative of the loss function 
            with respect to the predicted value for each object.
    
            Parameters
            ----------
            approxes : indexed container of floats
                Current predictions for each object.
    
            targets : indexed container of floats
                Target values you provided with the dataset.
    
        weights : indexed container of floats, optional (default=None)
            Instance weights for each object.
    
            Returns
            -------
                der1 : list-like object of float
                der2 : list-like object of float
    
            """
            pass
    

Using this template, we can write the custom objective –

class CustomLossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            error = targets[index] - approxes[index]
            # First and second derivatives of (y - y_hat)^4 w.r.t. y_hat
            der1 = -4 * error**3
            der2 = 12 * error**2

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result

Now let’s use this custom loss in our model –

    model = CatBoostRegressor(loss_function=CustomLossObjective(), n_estimators=100, eval_metric='RMSE')
    model.fit(cb_pool)

    predictions = model.predict(cb_pool)
    mean_squared_error(y_true=data['target'], y_pred=predictions)

Using this loss, we see that the mean squared error is 0.735. This is clearly inferior to using RMSE, but as mentioned before, the objective of this blog post is not to build the best model but to showcase how one can create a custom loss objective in catboost.
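
The same pattern generalises to other even powers of the error. Here is a minimal sketch (the PowerLossObjective name and the p parameter are my own, not part of catboost) that keeps the same derivative convention as the class above; p=4 reproduces it exactly –

class PowerLossObjective(object):
    """Hypothetical generalisation: loss = (y - y_hat)**p for an even integer p."""
    def __init__(self, p=4):
        self.p = p

    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for index in range(len(targets)):
            error = targets[index] - approxes[index]
            # First and second derivatives of error**p w.r.t. y_hat
            der1 = -self.p * error ** (self.p - 1)
            der2 = self.p * (self.p - 1) * error ** (self.p - 2)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result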

  • Create a Machine Learning Demo Using Gradio

    Let’s build a Machine Learning Demo Using Gradio. We will be using the Titanic Dataset as an example.

Link to the Google Colab notebook.

    First, we will install and load the libraries.

    !pip install -q gradio
    !pip install -q catboost
    
import pandas as pd
import numpy as np
import gradio as gr
import seaborn as sns
    
    sns.set_theme()

    Then let’s load the data and analyse the data types.

    df = sns.load_dataset("titanic")
    df.dtypes
    
    >>>
    survived          int64
    pclass            int64
    sex              object
    age             float64
    sibsp             int64
    parch             int64
    fare            float64
    embarked         object
    class          category
    who              object
    adult_male         bool
    deck           category
    embark_town      object
    alive            object
    alone              bool
    dtype: object

We can see that the target is survived; the alive column encodes the same information. Also, we will not be using all the features to create the model: since we want to demo the model as a live predictor, we don’t want to overburden the user with inputs.

# Identifying features; we are keeping very few since each one becomes an input in the gradio demo
features = ['pclass', 'sex', 'age', 'fare', 'embarked']
target = 'survived'

Then we fill the missing values. For age, we will use the median, and for embarked we will use the most common value, which is "S".

# Filling missing values
df['age'] = df['age'].fillna(df['age'].median())   # median age
df['embarked'] = df['embarked'].fillna("S")        # most common port

Now let’s build the model. No tuning, as getting the highest accuracy or f1-score is not the objective here.

    from catboost import CatBoostClassifier
    clf = CatBoostClassifier()
    
    # Creating features and target
    X = df[features]
    y = df[target]
    
    clf.fit(X,y, cat_features=['pclass', 'sex', 'embarked'])
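
As a quick sanity check (computed on the training data itself, so the number will be optimistic), we can look at the training accuracy:

from sklearn.metrics import accuracy_score

train_preds = clf.predict(X)
print(f"Training accuracy: {accuracy_score(y, train_preds):.3f}")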

Then we write the function which takes in the inputs and returns whether the passenger would’ve survived or not.

def predict(pclass:int = 3,
            sex:str = "male",
            age:float = 30,
            fare:float = 100,
            embarked:str = "S"):
  # Build a single-row DataFrame so each feature keeps its proper dtype
  input_df = pd.DataFrame([[pclass, sex, age, fare, embarked]], columns=features)
  survived = clf.predict(input_df)[0]
  if survived == 1:
    return "The passenger survived"
  else:
    return "The passenger did not survive"

Now for the gradio demo, we want to take in these inputs with different gradio components, pass them to the prediction function, and display the output. I go into much more detail on how this is done in the YouTube video, but the code snippet has comments which will help in case you don’t want to watch the explainer video.

    with gr.Blocks() as demo:
      # Keeping the three categorical feature input in the same row
      with gr.Row() as row1:
        pclass = gr.Dropdown(choices=[1,2,3], label= "pclass")
        sex = gr.Dropdown(choices =["male", "female"], label = "sex")
        embarked = gr.Dropdown(choices =["C", "Q", "S"], label = "embarked")
      # Creating slider for the two numerical inputs and also defining the limits for both
  age = gr.Slider(1, 100, label = "age", interactive = True)
  fare = gr.Slider(10, 600, label = "fare", interactive = True)

  submit = gr.Button(value = 'Predict')

  # Showing the output
  output = gr.Textbox(label = "Whether the passenger survived?", interactive = False)

  # Defining what happens when the user clicks the submit button
  submit.click(predict, inputs = [pclass, sex, age, fare, embarked], outputs = [output])

demo.launch(share = False, debug = False)

    Then you’ll get an output like this, where you’re free to play around with the features and see what the ML model’s output will be.
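
By default share = False keeps the demo local. If you want to let others try it without deploying anything, Gradio can generate a temporary public link:

demo.launch(share = True)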

    Let me know in case you want to build something else with Gradio or want me to cover any ML topic in the comments below.

  • Using Custom Eval Metric with Catboost

    Catboost offers a multitude of evaluation metrics. You can read all about them here, but often you want to use a custom evaluation metric.

For example, in this ongoing Kaggle competition, the evaluation metric is Balanced Log Loss. Such a metric is not supported by catboost out of the box; you can’t simply write this and expect it to work –

    from catboost import CatBoostClassifier
    model = CatBoostClassifier(eval_metric="BalancedLogLoss")
    model.fit(X,y)

This will give you an error. What you need to do instead is define a custom eval metric class, the template for which is pretty simple.

    class UserDefinedMetric(object):
        def is_max_optimal(self):
            # Returns whether great values of metric are better
            pass
    
        def evaluate(self, approxes, target, weight):
            # approxes is a list of indexed containers
            # (containers with only __len__ and __getitem__ defined),
            # one container per approx dimension.
            # Each container contains floats.
            # weight is a one dimensional indexed container.
            # target is a one dimensional indexed container.
    
            # weight parameter can be None.
            # Returns pair (error, weights sum)
            pass
    
        def get_final_error(self, error, weight):
            # Returns final value of metric based on error and weight
            pass
    
    

There are three parts to this class.

1. is_max_optimal – Return True if greater values of the metric are better (like accuracy etc.); otherwise return False.
2. evaluate – Here lies the meat of your code, where you actually compute the metric you want. Remember that approxes is a list of containers, one per approx dimension, so for binary classification you take approxes[0] – and note that it holds raw scores, not probabilities.
3. get_final_error – Here you can just return the error, or modify it first, for example by taking the log or square root.
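
For reference, the version of balanced log loss implemented below weights each sample’s log loss term inversely to its class frequency –

loss = \frac{1}{n}\sum_{i=1}^{n} w_i \ell_i, \qquad w_i = \begin{cases} n_0 / n_1 & \text{if } y_i = 1 \\ n_1 / n_0 & \text{if } y_i = 0 \end{cases}

where \ell_i = -(y_i \log p_i + (1 - y_i) \log(1 - p_i)) and n_0, n_1 are the class counts.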

    Below you will find the code for Balanced Log Loss as an eval metric.

    
import numpy as np

class BalancedLogLoss:
    def get_final_error(self, error, weight):
        return error

    def is_max_optimal(self):
        # Lower balanced log loss is better
        return False

    def evaluate(self, approxes, target, weight):
        y_true = np.asarray(target).astype(int)
        # approxes[0] contains raw scores (log-odds), so convert to probabilities
        raw_score = np.asarray(approxes[0], dtype=float)
        y_pred = 1 / (1 + np.exp(-raw_score))

        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        individual_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

        # Weight each sample inversely to its class frequency
        class_weights = np.where(y_true == 1,
                                 np.sum(y_true == 0) / np.sum(y_true == 1),
                                 np.sum(y_true == 1) / np.sum(y_true == 0))
        weighted_loss = individual_loss * class_weights

        balanced_logloss = np.mean(weighted_loss)

        return balanced_logloss, 0.0

Then you can simply call this in your grid search or randomised search like this –

model = CatBoostClassifier(verbose = False, eval_metric = BalancedLogLoss())
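
For example, assuming you already have a train/validation split (X_train, y_train, X_val and y_val are placeholders here), the custom metric is tracked on the eval set during training just like a built-in one –

model = CatBoostClassifier(verbose = False, eval_metric = BalancedLogLoss())
model.fit(X_train, y_train, eval_set = (X_val, y_val))

# Per-iteration metric values are available afterwards
print(model.get_evals_result())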

Write in the comments below if you have any questions related to custom eval metrics in Catboost or any other ML framework.