categorical encoding – ML EXPLAINED

If you want to use ML models on categorical variables, you have to encode them. The most common approach is one-hot encoding. But if you have many categorical variables, or variables with many categories, one-hot encoding leaves you with a very wide and very sparse matrix.

Well, there are ways to tackle this, and I’ll be talking about one such way – Leave One Out Encoding.
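To make the sparsity problem concrete, here is a minimal sketch on made-up data. The column name, the 200 category labels and the row count are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1,000 rows, one column with 200 distinct categories
rng = np.random.default_rng(0)
categories = [f"cat_{i}" for i in range(200)]
# Repeat every category 5 times so all 200 are present, then shuffle
values = rng.permutation(np.tile(categories, 5))
df = pd.DataFrame({"category": values})

# One-hot encode: one output column per distinct category
one_hot = pd.get_dummies(df["category"])

n_rows, n_cols = one_hot.shape
# Each row has exactly one non-zero entry, so the density is 1 / n_cols
density = one_hot.to_numpy().sum() / (n_rows * n_cols)

print(one_hot.shape)  # (1000, 200)
print(density)        # 0.005
```

A single column turns into 200 columns, and 99.5% of the entries are zero. A target-based encoding such as the one below keeps it at one numeric column.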

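Leave-One-Out (LOO) comes from cross-validation: the data is split into a training set and a test set, where the test set contains a single sample and the training set contains all the remaining samples. This is repeated for each sample in the dataset, and in cross-validation the model is trained and evaluated once per split. Leave One Out Encoding applies the same splitting to the encoding step. In each split, you take the mean of the target over the training rows that share the held-out row’s category and assign it to the held-out row as its encoded value, so a row’s own target never contributes to its own encoding. The pros and cons below are those of LOO as a cross-validation scheme, which is where the encoding gets its idea from.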

Pros:

Utilizes all available data: LOO ensures that each sample in the dataset is used as both a training and test instance. This maximizes the utilization of the available data and provides a more accurate estimate of model performance.
Low bias: Since each training set contains all but one sample, the model is trained on almost the entire dataset. This reduces the bias introduced by other cross-validation techniques that use smaller training sets.
Suitable for small datasets: LOO is particularly useful for small datasets where splitting the data into multiple folds might result in inadequate training data for model fitting.
Unbiased estimator: LOO estimates tend to have lower bias compared to other cross-validation techniques, as the model is evaluated on independent samples.

Cons:

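High computational cost: LOO requires training and evaluating the model as many times as there are samples in the dataset, making it computationally expensive, especially for large datasets. For the encoding itself no model is trained per split, only a per-category mean is computed, so the cost there is much lower.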
Variance and instability: LOO estimates can have high variance due to the high dependence between the training sets. Small changes in the data can lead to significant changes in the estimated performance. Thus, LOO estimates can be less stable than estimates obtained from other cross-validation methods.
Overfitting risk: LOO can be prone to overfitting, as the model is trained on almost the entire dataset. This can result in overly optimistic performance estimates if the model is too complex or the dataset is noisy.
Imbalanced class issues: If the dataset is imbalanced, LOO can lead to biased estimates, as each training set will typically contain a majority of samples from the majority class.

Let’s walk through an example.

import pandas as pd
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Example data
data = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'target': [1, 2, 3, 4, 5, 6, 7, 8]
})

# Create new column for leave-one-out encoded feature
data['category_loo_encoded'] = np.nan

Here we create a dummy dataset with a categorical variable and a numerical target, plus an empty column that will hold the encoded values.

# Leave-One-Out Encoding
loo = LeaveOneOut()

for train_index, test_index in loo.split(data):
    train, held_out = data.iloc[train_index], data.iloc[test_index]

    # Mean target of the other rows that share the held-out row's category
    same_category = train['category'] == held_out['category'].iloc[0]
    mean_target = train.loc[same_category, 'target'].mean()

    # Assign the encoded value (split indices are positional, so map them to labels)
    data.loc[data.index[test_index], 'category_loo_encoded'] = mean_target

# Display the result
print(data)

category	target	category_loo_encoded
A	1	2
A	2	1
B	3	4.5
B	4	4
B	5	3.5
C	6	7.5
C	7	7
C	8	6.5
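The loop above is easy to follow but does one pass per row. The same values can be obtained without a loop: for each row, subtract its own target from the category sum and divide by the category count minus one. Here is a minimal sketch of that shortcut on the same data. The column name category_loo_vectorized is my own choice, not something from a library.

```python
import pandas as pd

data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Per-row sum and count of the target within the row's category
grouped = data.groupby("category")["target"]
category_sum = grouped.transform("sum")
category_count = grouped.transform("count")

# Mean of the *other* rows in the same category.
# A category with a single row gives 0 / 0 = NaN and would need a fallback
# (for example the global target mean), which is not handled here.
data["category_loo_vectorized"] = (category_sum - data["target"]) / (category_count - 1)

print(data)
```

For category B, for example, the sum is 12 and the count is 3, so the row with target 3 gets (12 - 3) / 2 = 4.5, which matches the table above.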

There are also libraries that can do this for you, such as category_encoders. The advantage is that its LeaveOneOutEncoder takes parameters like sigma, which adds Gaussian noise to the encoded training values and reduces overfitting.

Here is the Python snippet on the same data.

import category_encoders as ce
# Create an instance of LeaveOneOutEncoder
encoder = ce.LeaveOneOutEncoder(cols=['category'])

# Perform leave-one-out encoding
data_encoded = encoder.fit_transform(data['category'], data['target'])

# Add the encoded values to the original dataframe under their own column name
# (merging on the index would leave two clashing 'category' columns)
data['category_ce_encoded'] = data_encoded['category']

# Display the result
print(data)

Here you can see that category_encoders gives the same values as the manual loop.

Thanks for reading, and let me know in the comments if you have any questions regarding Leave One Out Encoding.
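To get a feel for what noise of the sigma kind does to the encoded column, here is a minimal sketch in plain NumPy and pandas. It assumes multiplicative Gaussian noise centred on 1 with standard deviation sigma. The value 0.05, the seed and the column name are made up for illustration, and the exact noise scheme inside category_encoders may differ from this sketch.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Plain leave-one-out encoding: mean of the other rows in the same category
grouped = data.groupby("category")["target"]
loo_clean = (grouped.transform("sum") - data["target"]) / (grouped.transform("count") - 1)

# Assumption: noise is multiplicative and drawn from N(1, sigma)
sigma = 0.05                      # hypothetical noise level
rng = np.random.default_rng(42)   # fixed seed so the sketch is repeatable
noise = rng.normal(loc=1.0, scale=sigma, size=len(data))

data["category_loo_noisy"] = loo_clean * noise
print(data)
```

The noisy values stay close to the clean ones but no longer map one-to-one onto the target, which makes it harder for a model to read the target straight out of the encoded feature. Noise like this is meant for the training data only.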


Leave one out encoding – Encode your categorical variables to the target

Weight of Evidence Encoding

So what is weight of evidence?

$\text{woe} = \ln\left(\frac{\text{percnegatives}}{\text{percpositives}}\right) = \ln\left(\frac{\text{neggroup} / \text{totalneg}}{\text{posgroup} / \text{totalpos}}\right)$
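Here is a minimal sketch of that formula on made-up data with a binary target. The dataframe and the variable names are assumptions that simply mirror the terms in the formula.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a binary target (1 = positive, 0 = negative)
df = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "target":   [1, 0, 0, 0, 1, 1, 1, 0],
})

# Totals over the whole dataset
total_pos = (df["target"] == 1).sum()
total_neg = (df["target"] == 0).sum()

# Counts of negatives and positives inside each category
counts = pd.crosstab(df["category"], df["target"])
neg_group = counts[0]
pos_group = counts[1]

# woe = ln((neg_group / total_neg) / (pos_group / total_pos))
# A category with no positives or no negatives would give an infinite value
# and needs smoothing, which is not handled in this sketch.
woe = np.log((neg_group / total_neg) / (pos_group / total_pos))
print(woe)
```

Category A holds 3 of the 4 negatives and 1 of the 4 positives, so its value is ln(3), about 1.10. Category B is the mirror image and gets about -1.10.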