Category: Gradient Boosting

Custom Objective Function in XGBoost

In the previous post, we covered how you can create a custom loss function in CatBoost, but you might not be using CatBoost, so how can you do the same if you’re using XGBoost to train your models? In this post, I’ll walk through an example using the famous Titanic dataset, where we’ll recreate the LogLoss function and compare the results with the standard implementation in the library.

First, we have to set up the data.

import numpy as np 
import seaborn as sns
import pandas as pd
import xgboost as xgb
from sklearn.metrics import log_loss

data = sns.load_dataset('titanic')

Then some data cleaning and setting up the training dataset. The goal is not to get the best model but to demonstrate the custom loss function, so not much feature engineering is being done.

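# Assign the result back; chained inplace fillna may not modify `data` in recent pandas versions
data['embarked'] = data['embarked'].fillna('S')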

X,y = data[[c for c in data.columns if c not in  \
            ['survived', 'alive', 'deck', 'embark_town']]], \
      data['survived']

cat_columns = ['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'class',
       'who', 'adult_male', 'alone']

X = pd.get_dummies(X, columns=cat_columns, drop_first=True)

Let’s say there was no built-in loss function like LogLoss. How would you define it as an objective function yourself?

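$LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i}) \right)$

You’ll have to calculate the first and second derivatives of each sample’s term, $L_{i} = -\left( y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i}) \right)$, with respect to its prediction $\hat{y}_{i}$ (the constant factor $1/N$ is dropped):

$\Large \frac{\partial L_{i}}{\partial \hat{y}_{i}} = -\frac{y_{i}}{\hat{y}_{i}} + \frac{1-y_{i}}{1-\hat{y}_{i}}$

$\Large \frac{\partial^2 L_{i}}{\partial \hat{y}_{i}^2} = \frac{y_{i}}{\hat{y}_{i}^{2}} + \frac{1-y_{i}}{(1-\hat{y}_{i})^{2}}$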
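Before wiring the derivatives into a model, it can help to sanity-check them numerically. The snippet below is a minimal sketch that compares the two formulas above against central finite differences on a few made-up labels and predictions. The function names and the sample values are my own and purely illustrative.

```python
import numpy as np

def sample_log_loss(y, y_hat):
    # Per-sample log loss (no 1/N factor)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def first_derivative(y, y_hat):
    # Analytical first derivative with respect to the prediction
    return -y / y_hat + (1 - y) / (1 - y_hat)

def second_derivative(y, y_hat):
    # Analytical second derivative with respect to the prediction
    return y / y_hat**2 + (1 - y) / (1 - y_hat)**2

# Made-up labels and predictions, kept away from 0 and 1
y = np.array([0.0, 1.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.4, 0.9])
h = 1e-5  # finite-difference step size

# Central differences of the loss and of the first derivative
numeric_grad = (sample_log_loss(y, y_hat + h) - sample_log_loss(y, y_hat - h)) / (2 * h)
numeric_hess = (first_derivative(y, y_hat + h) - first_derivative(y, y_hat - h)) / (2 * h)

grad_ok = np.allclose(numeric_grad, first_derivative(y, y_hat), rtol=1e-5)
hess_ok = np.allclose(numeric_hess, second_derivative(y, y_hat), rtol=1e-5)
print(grad_ok, hess_ok)
```

If either check fails for your own loss function, the derivative is the first place to look before blaming the training code.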

Now we will write these up as Python functions and create a function that returns the gradient and hessian (second derivative) values. XGBoost calls a custom objective with two arguments: the current predictions first, then the training matrix (a DMatrix).

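def log_loss_derivative(y_pred, dtrain):
    # First derivative of the per-sample log loss with respect to the prediction
    y = dtrain.get_label()
    return (-y / y_pred) + ((1 - y) / (1 - y_pred))

def log_loss_second_derivative(y_pred, dtrain):
    # Second derivative of the per-sample log loss with respect to the prediction
    y = dtrain.get_label()
    return (y / np.power(y_pred, 2)) + ((1 - y) / np.power(1 - y_pred, 2))

def custom_log_loss(predt, dtrain):
    # Keep predictions strictly inside (0, 1) so neither derivative divides by zero
    y_pred = np.clip(predt, a_min=1e-5, a_max=1 - 1e-5)
    grad = log_loss_derivative(y_pred=y_pred, dtrain=dtrain)
    hess = log_loss_second_derivative(y_pred=y_pred, dtrain=dtrain)
    return grad, hess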

We clip the predictions to avoid division by zero errors. Now let’s train.

import xgboost as xgb

dtrain = xgb.DMatrix(data=X, label=y)

model = xgb.train({'tree_method': 'hist', 'seed': 1994},
           dtrain=dtrain,
           num_boost_round=10,
           obj=custom_log_loss)

log_loss(y_pred=np.clip(model.predict(dtrain), a_max=1, a_min=0), y_true=y)
>>>0.24912

Comparison with the standard implementation.

clf = xgb.XGBClassifier(n_estimators = 10, **{'tree_method': 'hist', 'seed': 1994})
clf.fit(X,y)

log_loss(y_pred=np.clip(clf.predict_proba(X)[:,1], a_max=1, a_min=0), y_true=y)

>>>0.2861

As we can see, our implementation of the LogLoss gives a log loss (0.249) in the same range as the standard implementation (0.286), though the two are not identical. Of course, you should use the standard implementation when it’s available, but in case you want to use a custom loss function, you now know how to do so.
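Part of the gap probably comes from the fact that XGBoost’s built-in `binary:logistic` objective works on raw margins (log-odds) rather than on probabilities, whereas our version treats the raw output as a probability and clips it. As a minimal sketch, assuming you’d rather follow that convention, the objective can be written in terms of the margin, where the gradient simplifies to `p - y` and the hessian to `p * (1 - p)`. The helper names `sigmoid`, `logistic_grad_hess` and `logistic_obj` are my own, not part of the library.

```python
import numpy as np

def sigmoid(margin):
    # Map raw margins (log-odds) to probabilities
    return 1.0 / (1.0 + np.exp(-margin))

def logistic_grad_hess(margin, y):
    # Derivatives of the per-sample log loss with respect to the raw margin
    p = sigmoid(margin)
    grad = p - y
    hess = p * (1.0 - p)
    return grad, hess

def logistic_obj(predt, dtrain):
    # Same signature as custom_log_loss above; dtrain is expected to be an xgb.DMatrix
    return logistic_grad_hess(predt, dtrain.get_label())

# Small check on made-up margins and labels
margin = np.array([0.0, 2.0, -1.0])
y = np.array([1.0, 1.0, 0.0])
grad, hess = logistic_grad_hess(margin, y)
print(grad, hess)
```

If you pass `logistic_obj` as `obj` to `xgb.train`, `model.predict` should return raw margins, so you’d apply `sigmoid` to them before calling `log_loss`.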

January 21, 2024

5 Essential Boosting Parameters You Should Be Tuning

Here are the 5 essential hyper-parameters that you should always be tuning when building any boosting model, whether you’re using XGBoost, LightGBM or even CatBoost.

1. n_estimators – Strictly speaking, this is the number of boosting rounds rather than the number of trees. With a tree-based booster, each round fits a tree to the negative gradient of some loss function, so if you set this number to 5, boosting runs for 5 rounds (for multi-class problems, XGBoost fits one tree per class in each round).
2. max_depth – The maximum depth of each tree. The higher this number, the stronger each learner is in the model and the more your model can overfit, so it’s pretty important to tune.
3. learning_rate – Again a very important param. The higher it is, the faster your algorithm will converge towards a minimum of the loss, but too high and it might overshoot the minimum; too low and it might not reach it within the given number of boosting rounds.
4. subsample – The fraction of the training data to be used in each boosting round. If you use 0.5, XGBoost will randomly sample half of your training data in each boosting iteration before growing the tree. Important if you want to control overfitting.
5. colsample_bytree – The fraction of columns to use when growing a tree. Again, if set to 0.5, XGBoost will randomly sample half of your features to grow the tree in each boosting round. Again very important to control overfitting.
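To make the list above concrete, here is a minimal sketch of how you might draw random candidates for these five parameters before running a search. The function name `sample_boosting_params` and all the ranges are my own illustrative assumptions, not recommended defaults; adjust them to your data.

```python
import numpy as np

def sample_boosting_params(rng):
    # Draw one random candidate for the five parameters (ranges are illustrative)
    return {
        "n_estimators": int(rng.integers(50, 1001)),    # 50 to 1000 boosting rounds
        "max_depth": int(rng.integers(2, 11)),          # tree depth from 2 to 10
        "learning_rate": float(10 ** rng.uniform(-3, np.log10(0.3))),  # log-uniform, 0.001 to 0.3
        "subsample": float(rng.uniform(0.5, 1.0)),      # fraction of rows per round
        "colsample_bytree": float(rng.uniform(0.5, 1.0)),  # fraction of columns per tree
    }

rng = np.random.default_rng(1994)
candidates = [sample_boosting_params(rng) for _ in range(5)]
for params in candidates:
    print(params)
```

Each dictionary could then be unpacked into a model, for example `xgb.XGBClassifier(**params)`, and scored with cross-validation to pick the best candidate.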
In another post I’ll be going over another 5 essential hyper-parameters that you should be tuning.

January 26, 2023

Weight of Evidence Encoding

So today I was participating in this Kaggle competition and the data had too many categorical variables. One way to build a model with too many categorical variables is to use a model like CatBoost and let it deal with encoding categorical variables. But I wanted to ensemble my results with an XGBoost model, so I had to encode them. Using the weight of evidence encoding, I got a solution that ranked in the top 10 when submitted. I have made the notebook public; you can go here and see it.

So what is weight of evidence?

To put it simply –

$woe = \ln\left(\frac{\mathrm{perc\_negatives}}{\mathrm{perc\_positives}}\right) = \ln\left(\frac{\mathrm{neg\_group} / \mathrm{total\_neg}}{\mathrm{pos\_group} / \mathrm{total\_pos}}\right)$
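Here is a minimal sketch of that formula in pandas, assuming a binary target coded as 0 (negative) and 1 (positive). The function name `woe_table`, the toy data and the optional `eps` smoothing term are my own additions for illustration. With `eps=0`, a category that has no positives or no negatives would produce an infinite value, so you may want a small positive `eps` on real data.

```python
import numpy as np
import pandas as pd

def woe_table(categories, target, eps=0.0):
    # Count negatives (0) and positives (1) per category
    counts = pd.crosstab(categories, target).reindex(columns=[0, 1], fill_value=0)
    neg = counts[0] + eps
    pos = counts[1] + eps
    # Share of all negatives / positives that fall into each category
    perc_neg = neg / neg.sum()
    perc_pos = pos / pos.sum()
    return np.log(perc_neg / perc_pos)

# Toy data: 'a' has 1 positive and 2 negatives, 'b' has 3 positives and 2 negatives
df = pd.DataFrame({
    "cat": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "target": [1, 0, 0, 1, 1, 1, 0, 0],
})

table = woe_table(df["cat"], df["target"])
# Replace each category by its weight of evidence
df["cat_woe"] = df["cat"].map(table)
print(table)
```

To avoid leaking the target, the table should be computed on the training data only and then mapped onto the validation and test sets.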

I’ve gone through an example explaining the weight of evidence in the YouTube video below.

January 18, 2023
