Tag: categorical encoding

  • Leave one out encoding – Encode your categorical variables to the target

    To use ML models on categorical variables, you have to encode them. The most common approach is one-hot encoding. But if you have many categorical variables with many categories, one-hot encoding leaves you with a very wide and very sparse matrix.
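
    To make the sparsity concrete, here is a minimal sketch on a made-up column (the name `city` and the 200 labels are assumptions for illustration only). It one-hot encodes the column and measures what fraction of the resulting matrix is non-zero.

```python
import pandas as pd

# Hypothetical toy column: 1,000 rows cycling through 200 distinct labels
city = pd.Series([f"cat_{i % 200}" for i in range(1000)], name="city")

# One column per distinct label
one_hot = pd.get_dummies(city)

# Each row has exactly one non-zero entry, so density = 1 / number of columns
density = one_hot.to_numpy().mean()
print(one_hot.shape, density)
```

    A single feature turns into 200 columns, and only one cell in every 200 carries information.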

    Well there are ways you can tackle this, and I’ll be talking about one such way – Leave One Out Encoding.

    Leave-One-Out (LOO) encoding borrows its idea from leave-one-out cross-validation, a technique that splits the data into a training set and a test set, where the test set contains a single sample and the training set contains the remaining samples. The split is repeated once for each sample in the dataset. For encoding, no model is trained: in each split, you take the mean of the target over the training rows that share the held-out row's category, and assign that mean to the held-out row as its encoded value.

    Pros (of leave-one-out as a cross-validation scheme):

    1. Utilizes all available data: LOO ensures that each sample in the dataset is used as both a training and test instance. This maximizes the utilization of the available data and provides a more accurate estimate of model performance.
    2. Low bias: Since each training set contains all but one sample, the model is trained on almost the entire dataset. This reduces the bias introduced by other cross-validation techniques that use smaller training sets.
    3. Suitable for small datasets: LOO is particularly useful for small datasets where splitting the data into multiple folds might result in inadequate training data for model fitting.
    4. Unbiased estimator: LOO estimates tend to have lower bias compared to other cross-validation techniques, as the model is evaluated on independent samples.

    Cons (of leave-one-out as a cross-validation scheme):

    1. High computational cost: LOO requires training and evaluating the model as many times as there are samples in the dataset, making it computationally expensive, especially for large datasets.
    2. Variance and instability: LOO estimates can have high variance due to the high dependence between the training sets. Small changes in the data can lead to significant changes in the estimated performance. Thus, LOO estimates can be less stable than estimates obtained from other cross-validation methods.
    3. Overfitting risk: LOO can be prone to overfitting, as the model is trained on almost the entire dataset. This can result in overly optimistic performance estimates if the model is too complex or the dataset is noisy.
    4. Imbalanced class issues: If the dataset is imbalanced, LOO can lead to biased estimates, as each training set will typically contain a majority of samples from the majority class.

    Let’s walk through an example.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    
    # Example data
    data = pd.DataFrame({
        'category': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'target': [1, 2, 3, 4, 5, 6, 7, 8]
    })
    
    # Create new column for leave-one-out encoded feature
    data['category_loo_encoded'] = np.nan
    

    Here we create a dummy dataset with a categorical variable and a numerical target.

    # Leave-One-Out Encoding
    loo = LeaveOneOut()
    
    for train_index, test_index in loo.split(data):
        X_train, X_test = data.iloc[train_index], data.iloc[test_index]
        
        # Calculate mean excluding the current row
        mean_target = X_train.loc[X_train['category'] == X_test['category'].values[0], 'target'].mean()
        
        # Assign leave-one-out encoded value
        # (test_index is positional, so map it to the row label before using .loc)
        data.loc[data.index[test_index], 'category_loo_encoded'] = mean_target
    
    # Display the result
    print(data)
    
    category  target  category_loo_encoded
    A         1       2
    A         2       1
    B         3       4.5
    B         4       4
    B         5       3.5
    C         6       7.5
    C         7       7
    C         8       6.5
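
    The loop above re-splits the data once per row, which gets slow on large datasets. The same values can be computed in one pass with a groupby, since the mean of the other rows in a category is (category sum - own target) / (category count - 1). This is a minimal sketch on the same toy data, not the only way to do it.

```python
import pandas as pd

data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Per-row sum and count of the target within the row's own category
grouped = data.groupby("category")["target"]
cat_sum = grouped.transform("sum")
cat_count = grouped.transform("count")

# Mean of the *other* rows in the same category.
# A category with a single row gives 0 / 0 = NaN, just like the loop version.
loo_vectorized = (cat_sum - data["target"]) / (cat_count - 1)
print(loo_vectorized.tolist())
```

    Categories that appear only once have no other rows to average, so they need a fallback such as the global target mean.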

    There are also libraries that can do this for you, such as category_encoders. One advantage is that its LeaveOneOutEncoder exposes parameters like sigma, which adds noise to the encoded training values and reduces overfitting.

    Here is the Python snippet on the same data.

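    import category_encoders as ce

    # Create an instance of LeaveOneOutEncoder
    encoder = ce.LeaveOneOutEncoder(cols=['category'])

    # Perform leave-one-out encoding (the target is needed to compute the means)
    data_encoded = encoder.fit_transform(data[['category']], data['target'])

    # The encoder returns the encoded values under the original column name,
    # so rename it before merging to avoid a clash with data['category']
    data_encoded = data_encoded.rename(columns={'category': 'category_ce_encoded'})

    # Merge the encoded data with the original dataframe
    data = data.merge(data_encoded, how='left', left_index=True, right_index=True)

    # Display the result
    print(data)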
    

    Here you can see we get the same result if we use category encoders as well.
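
    To give a feel for what a noise parameter like sigma does, here is a rough sketch in plain pandas and NumPy. It assumes the noise is multiplicative Gaussian jitter centred on 1; that is an assumption made for illustration, and the library's internals may differ. The value `sigma = 0.05` and the seed are arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Plain leave-one-out encoding: mean of the other rows in the same category
grouped = data.groupby("category")["target"]
loo = (grouped.transform("sum") - data["target"]) / (grouped.transform("count") - 1)

# Assumed noise model: multiply each encoded value by a draw from N(1, sigma)
sigma = 0.05  # hypothetical noise level
rng = np.random.default_rng(42)
loo_noisy = loo * rng.normal(loc=1.0, scale=sigma, size=len(loo))

print(pd.DataFrame({"loo": loo, "loo_noisy": loo_noisy}))
```

    The jitter is applied to training rows only. It keeps the model from recovering each row's exact target from the encoded value and the category sum.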

    Thanks for reading, and let me know in the comments if you have any questions regarding Leave One Out Encoding.

  • Weight of Evidence Encoding

    So today I was participating in a Kaggle competition, and the data had too many categorical variables. One way to build a model with many categorical variables is to use a model like CatBoost and let it deal with encoding them. But I wanted to ensemble my results with an XGBoost model, so I had to encode them myself. Using weight of evidence encoding, I got a top 10 solution when submitted. I have made the notebook public; you can go here and see it.

    So what is weight of evidence?

    To put it simply –

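    \mathrm{WoE} = \ln\left(\frac{\text{perc}_{\text{negatives}}}{\text{perc}_{\text{positives}}}\right) = \ln\left(\frac{\frac{\text{neg}_{\text{group}}}{\text{total}_{\text{neg}}}}{\frac{\text{pos}_{\text{group}}}{\text{total}_{\text{pos}}}}\right)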
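
    Here is a small sketch of the formula in pandas. The toy data is made up for illustration, with 1 as the positive class and 0 as the negative class. For each category it counts positives and negatives, divides by the overall totals, and takes the log of the ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: binary target, 1 = positive, 0 = negative
df = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "target":   [1,   1,   1,   0,   1,   0,   0,   0,   0,   1],
})

# Positives and negatives inside each category
pos_group = df.groupby("category")["target"].sum()
neg_group = df.groupby("category")["target"].count() - pos_group

# Positives and negatives in the whole dataset
total_pos = df["target"].sum()
total_neg = len(df) - total_pos

# WoE = ln( (neg_group / total_neg) / (pos_group / total_pos) )
# Note: a category with zero positives or zero negatives gives +/- infinity here.
woe = np.log((neg_group / total_neg) / (pos_group / total_pos))

# Replace each category label with its WoE value
df["category_woe"] = df["category"].map(woe)
print(woe)
```

    Category A has 3 positives and 1 negative out of 5 each overall, so its WoE is ln((1/5) / (3/5)) = ln(1/3), which is negative. Category B has more negatives than positives and gets ln(2), which is positive.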

    I’ve gone through an example explaining the weight of evidence in the youtube video below.