Category: Catboost

  • Creating a Custom Loss Function For Machine Learning Models

While standard machine learning libraries provide a vast array of loss functions out of the box, sometimes we need to create our own custom loss function. In this blog post, I’ll go over a simple example and create a custom loss function in Catboost.

    First we will create the data for training.

    # Importing libraries
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from catboost import CatBoostRegressor, Pool
    from sklearn.datasets import fetch_california_housing

    raw_data = fetch_california_housing()

data = pd.concat([pd.DataFrame(raw_data['data'], columns=raw_data['feature_names']),
                  pd.Series(raw_data['target'], name='target')], axis=1)

    features = [i for i in data.columns.tolist() if i != 'target']

Since the objective is not to create the best model possible, we won’t be doing any feature engineering. Let’s use catboost and create a model using a standard loss function.

    model = CatBoostRegressor(loss_function='RMSE', n_estimators=100, eval_metric='RMSE')

    cb_pool = Pool(data=data[features], label=data['target'], feature_names=features)

    model.fit(cb_pool)

    predictions = model.predict(cb_pool)

    mean_squared_error(y_true=data['target'], y_pred=predictions)

Upon evaluating the model, we find that the mean squared error is 0.15. The model is definitely overfitting, since we are evaluating on the training data, but that’s not a concern for this tutorial.

But what if you don’t want to use RMSE as a loss function, and instead want to use something like this –

    loss = \frac{\sum (y - \hat{y})^{4}}{n}

    Then how do you create a loss function in catboost?

    For this, you’ll need to calculate the first derivative and the second derivative of the loss function with respect to \hat{y}.

    Using the chain rule, the first derivative is

\frac{\partial (y-\hat{y})^4}{\partial \hat{y}} = \frac{\partial (y-\hat{y})^4}{\partial (y-\hat{y})} \cdot \frac{\partial (y-\hat{y})}{\partial \hat{y}} = 4(y - \hat{y})^{3} \cdot (-1) = -4(y -\hat{y})^{3}

And similarly, using the chain rule, the second derivative comes out to be 12(y-\hat{y})^{2}.
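
If you’d rather not do the calculus by hand, a quick sanity check with sympy (assuming it is installed) confirms both derivatives:

import sympy as sp

y, y_hat = sp.symbols('y y_hat')
loss = (y - y_hat)**4

print(sp.diff(loss, y_hat))     # -4*(y - y_hat)**3
print(sp.diff(loss, y_hat, 2))  # 12*(y - y_hat)**2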

    The catboost template for a custom objective is as follows –

    class UserDefinedObjective(object):
        def calc_ders_range(self, approxes, targets, weights):
            """
            Computes first and second derivative of the loss function 
            with respect to the predicted value for each object.
    
            Parameters
            ----------
            approxes : indexed container of floats
                Current predictions for each object.
    
            targets : indexed container of floats
                Target values you provided with the dataset.
    
        weights : indexed container of floats, optional (default=None)
            Instance weights for each object.
    
            Returns
            -------
                der1 : list-like object of float
                der2 : list-like object of float
    
            """
            pass
    

Using this template, we can write the custom objective –

class CustomLossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            error = targets[index] - approxes[index]
            # First and second derivatives of (y - y_hat)^4 w.r.t. y_hat
            der1 = -4 * error**3
            der2 = 12 * error**2

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result

Now let’s use this custom loss in our model –

    model = CatBoostRegressor(loss_function=CustomLossObjective(), n_estimators=100, eval_metric='RMSE')
    model.fit(cb_pool)

    predictions = model.predict(cb_pool)
    mean_squared_error(y_true=data['target'], y_pred=predictions)

Using this loss, we see that the mean squared error is 0.735. This is clearly inferior to using RMSE, but as mentioned before, the objective of this blog post is not to build the best model but to showcase how one can create a custom loss objective in catboost.
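
The same pattern generalises to other even powers of the error. Here is a minimal sketch (the PowerLossObjective name and the p parameter are my own, not part of catboost) that keeps the same derivative convention as the class above; p=4 reproduces it exactly –

class PowerLossObjective(object):
    """Hypothetical generalisation: loss = (y - y_hat)**p for an even integer p."""
    def __init__(self, p=4):
        self.p = p

    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for index in range(len(targets)):
            error = targets[index] - approxes[index]
            # First and second derivatives of error**p w.r.t. y_hat
            der1 = -self.p * error ** (self.p - 1)
            der2 = self.p * (self.p - 1) * error ** (self.p - 2)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result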

  • Create a Machine Learning Demo Using Gradio

    Let’s build a Machine Learning Demo Using Gradio. We will be using the Titanic Dataset as an example.

Link to the Google Colab notebook.

    First, we will install and load the libraries.

    !pip install -q gradio
    !pip install -q catboost
    
import pandas as pd
import numpy as np
import gradio as gr
import seaborn as sns
    
    sns.set_theme()

    Then let’s load the data and analyse the data types.

    df = sns.load_dataset("titanic")
    df.dtypes
    
    >>>
    survived          int64
    pclass            int64
    sex              object
    age             float64
    sibsp             int64
    parch             int64
    fare            float64
    embarked         object
    class          category
    who              object
    adult_male         bool
    deck           category
    embark_town      object
    alive            object
    alone              bool
    dtype: object

We can see that the target is survived; the alive column encodes the same information. Also, we will not be using all the features to create the model: since we want to demo the model as a live predictor, we don’t want to overburden the user with inputs.

# Identifying features; we are keeping very few since each one becomes an input in the gradio demo
features = ['pclass', 'sex', 'age', 'fare', 'embarked']
target = 'survived'

Then we fill the missing values. For age, we will use the median, and for embarked we will use the most common value, which is "S".

# Filling missing values
df['age'] = df['age'].fillna(df['age'].median())   # median age
df['embarked'] = df['embarked'].fillna("S")        # most common port

Now let’s build the model. No tuning, as getting the highest accuracy or f1-score is not the objective here.

    from catboost import CatBoostClassifier
    clf = CatBoostClassifier()
    
    # Creating features and target
    X = df[features]
    y = df[target]
    
    clf.fit(X,y, cat_features=['pclass', 'sex', 'embarked'])
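
As a quick sanity check (computed on the training data itself, so the number will be optimistic), we can look at the training accuracy:

from sklearn.metrics import accuracy_score

train_preds = clf.predict(X)
print(f"Training accuracy: {accuracy_score(y, train_preds):.3f}")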

Then we write the function which takes in the inputs and returns whether the passenger would’ve survived or not.

def predict(pclass:int = 3,
            sex:str = "male",
            age:float = 30,
            fare:float = 100,
            embarked:str = "S"):
  # Build a single-row DataFrame so each feature keeps its proper dtype
  input_df = pd.DataFrame([[pclass, sex, age, fare, embarked]], columns=features)
  survived = clf.predict(input_df)[0]
  if survived == 1:
    return "The passenger survived"
  else:
    return "The passenger did not survive"

Now for the gradio demo, we want to take in these inputs with different gradio components, pass them to the prediction function, and display the output. I go into much more detail on how this is done in the YouTube video, but the code snippet has comments which will help in case you don’t want to watch the explainer video.

    with gr.Blocks() as demo:
      # Keeping the three categorical feature input in the same row
      with gr.Row() as row1:
        pclass = gr.Dropdown(choices=[1,2,3], label= "pclass")
        sex = gr.Dropdown(choices =["male", "female"], label = "sex")
        embarked = gr.Dropdown(choices =["C", "Q", "S"], label = "embarked")
      # Creating slider for the two numerical inputs and also defining the limits for both
  age = gr.Slider(1, 100, label = "age", interactive = True)
  fare = gr.Slider(10, 600, label = "fare", interactive = True)

  submit = gr.Button(value = 'Predict')

  # Showing the output
  output = gr.Textbox(label = "Whether the passenger survived?", interactive = False)

  # Defining what happens when the user clicks the submit button
  submit.click(predict, inputs = [pclass, sex, age, fare, embarked], outputs = [output])

demo.launch(share = False, debug = False)

    Then you’ll get an output like this, where you’re free to play around with the features and see what the ML model’s output will be.
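
By default share = False keeps the demo local. If you want to let others try it without deploying anything, Gradio can generate a temporary public link:

demo.launch(share = True)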

    Let me know in case you want to build something else with Gradio or want me to cover any ML topic in the comments below.

  • Using Custom Eval Metric with Catboost

    Catboost offers a multitude of evaluation metrics. You can read all about them here, but often you want to use a custom evaluation metric.

For example, in this ongoing Kaggle competition, the evaluation metric is Balanced Log Loss. Such a metric is not supported by catboost out of the box; you can’t simply write this and expect it to work –

    from catboost import CatBoostClassifier
    model = CatBoostClassifier(eval_metric="BalancedLogLoss")
    model.fit(X,y)

This will give you an error. What you need to do instead is define a custom eval metric class, the template for which is pretty simple.

    class UserDefinedMetric(object):
        def is_max_optimal(self):
            # Returns whether great values of metric are better
            pass
    
        def evaluate(self, approxes, target, weight):
            # approxes is a list of indexed containers
            # (containers with only __len__ and __getitem__ defined),
            # one container per approx dimension.
            # Each container contains floats.
            # weight is a one dimensional indexed container.
            # target is a one dimensional indexed container.
    
            # weight parameter can be None.
            # Returns pair (error, weights sum)
            pass
    
        def get_final_error(self, error, weight):
            # Returns final value of metric based on error and weight
            pass
    
    

There are three parts to this class.

1. is_max_optimal – Return True if greater values of the metric are better (like accuracy etc.); otherwise return False.
2. evaluate – Here lies the meat of your code, where you actually compute the metric you want. Remember that approxes is a list of containers, one per approx dimension, so for binary classification you take approxes[0] – and note that it holds raw scores, not probabilities.
3. get_final_error – Here you can just return the error, or modify it first, for example by taking the log or square root.
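
For reference, the version of balanced log loss implemented below weights each sample’s log loss term inversely to its class frequency –

loss = \frac{1}{n}\sum_{i=1}^{n} w_i \ell_i, \qquad w_i = \begin{cases} n_0 / n_1 & \text{if } y_i = 1 \\ n_1 / n_0 & \text{if } y_i = 0 \end{cases}

where \ell_i = -(y_i \log p_i + (1 - y_i) \log(1 - p_i)) and n_0, n_1 are the class counts.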

    Below you will find the code for Balanced Log Loss as an eval metric.

    
import numpy as np

class BalancedLogLoss:
    def get_final_error(self, error, weight):
        return error

    def is_max_optimal(self):
        # Lower balanced log loss is better
        return False

    def evaluate(self, approxes, target, weight):
        y_true = np.asarray(target).astype(int)
        # approxes[0] contains raw scores (log-odds), so convert to probabilities
        raw_score = np.asarray(approxes[0], dtype=float)
        y_pred = 1 / (1 + np.exp(-raw_score))

        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        individual_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

        # Weight each sample inversely to its class frequency
        class_weights = np.where(y_true == 1,
                                 np.sum(y_true == 0) / np.sum(y_true == 1),
                                 np.sum(y_true == 1) / np.sum(y_true == 0))
        weighted_loss = individual_loss * class_weights

        balanced_logloss = np.mean(weighted_loss)

        return balanced_logloss, 0.0

Then you can simply call this in your grid search or randomised search like this –

model = CatBoostClassifier(verbose = False, eval_metric = BalancedLogLoss())
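
For example, assuming you already have a train/validation split (X_train, y_train, X_val and y_val are placeholders here), the custom metric is tracked on the eval set during training just like a built-in one –

model = CatBoostClassifier(verbose = False, eval_metric = BalancedLogLoss())
model.fit(X_train, y_train, eval_set = (X_val, y_val))

# Per-iteration metric values are available afterwards
print(model.get_evals_result())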

Write in the comments below if you have any questions related to custom eval metrics in Catboost or any other ML framework.