Category: ML

  • Leave one out encoding – Encode your categorical variables to the target

    If you want to use ML models on categorical variables, you have to encode them. The most common approach is one-hot encoding. But if you have many categorical variables with many categories each, one-hot encoding leaves you with a very wide, sparse matrix.

    There are ways to tackle this, and I’ll be talking about one of them: Leave One Out Encoding.

    Leave-one-out encoding borrows its name from Leave-One-Out (LOO) cross-validation, a technique that splits the data into a training set and a test set containing a single sample, and repeats this once for each sample in the dataset. The encoder applies the same idea to the target: in each split, you take the mean of the target over the training rows that share the test row's category, and assign that mean to the test row as its encoded value. Because the row's own target is left out of the mean, the encoded feature does not directly leak that row's label.

    Pros of the leave-one-out scheme:

    1. Utilizes all available data: LOO ensures that each sample in the dataset is used as both a training and test instance. This maximizes the utilization of the available data and provides a more accurate estimate of model performance.
    2. Low bias: Since each training set contains all but one sample, the model is trained on almost the entire dataset. This reduces the bias introduced by other cross-validation techniques that use smaller training sets.
    3. Suitable for small datasets: LOO is particularly useful for small datasets where splitting the data into multiple folds might result in inadequate training data for model fitting.
    4. Unbiased estimator: LOO estimates tend to have lower bias compared to other cross-validation techniques, as the model is evaluated on independent samples.

    Cons of the leave-one-out scheme:

    1. High computational cost: LOO requires training and evaluating the model as many times as there are samples in the dataset, making it computationally expensive, especially for large datasets.
    2. Variance and instability: LOO estimates can have high variance due to the high dependence between the training sets. Small changes in the data can lead to significant changes in the estimated performance. Thus, LOO estimates can be less stable than estimates obtained from other cross-validation methods.
    3. Overfitting risk: LOO can be prone to overfitting, as the model is trained on almost the entire dataset. This can result in overly optimistic performance estimates if the model is too complex or the dataset is noisy.
    4. Imbalanced class issues: If the dataset is imbalanced, LOO can lead to biased estimates, as each training set will typically contain a majority of samples from the majority class.

    Let’s walk through an example.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    
    # Example data
    data = pd.DataFrame({
        'category': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'target': [1, 2, 3, 4, 5, 6, 7, 8]
    })
    
    # Create new column for leave-one-out encoded feature
    data['category_loo_encoded'] = np.nan
    

    Here we create a dummy dataset with a categorical variable and a numerical target.

    # Leave-One-Out Encoding
    loo = LeaveOneOut()
    
    for train_index, test_index in loo.split(data):
        X_train, X_test = data.iloc[train_index], data.iloc[test_index]
        
        # Calculate mean excluding the current row
        mean_target = X_train.loc[X_train['category'] == X_test['category'].values[0], 'target'].mean()
        
        # Assign leave-one-out encoded value
        data.loc[test_index, 'category_loo_encoded'] = mean_target
    
    # Display the result
    print(data)
    
    category  target  category_loo_encoded
    A         1       2
    A         2       1
    B         3       4.5
    B         4       4
    B         5       3.5
    C         6       7.5
    C         7       7
    C         8       6.5
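
    The loop above is easy to follow but refits nothing and only needs per-category sums and counts, so the same numbers can be computed in one pass. The sketch below is a dependency-free illustration of that shortcut, not the implementation used by any library; the helper name loo_encode and the choice to return NaN for a category with a single row are assumptions made for this example.

```python
from collections import defaultdict

categories = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
targets = [1, 2, 3, 4, 5, 6, 7, 8]

# Per-category sum and count of the target
sums = defaultdict(float)
counts = defaultdict(int)
for c, y in zip(categories, targets):
    sums[c] += y
    counts[c] += 1

def loo_encode(category, target):
    """Mean target of the other rows in the same category (hypothetical helper)."""
    n = counts[category]
    if n < 2:
        # Assumption: no other rows to average over, so return NaN
        return float('nan')
    return (sums[category] - target) / (n - 1)

encoded = [loo_encode(c, y) for c, y in zip(categories, targets)]
print(encoded)
```

    Subtracting the row's own target from the category sum and dividing by one less than the category count gives exactly the mean of the remaining rows, which is what the explicit LeaveOneOut loop computes.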

    There are also libraries that can do this for you, such as category_encoders. The advantage is that you get extra parameters like sigma, which adds noise to the encoded training values and reduces overfitting.

    Here is the Python snippet on the same data.

    import category_encoders as ce

    # Create an instance of LeaveOneOutEncoder
    encoder = ce.LeaveOneOutEncoder(cols=['category'])

    # Perform leave-one-out encoding; passing the target to fit_transform
    # lets the encoder leave each row's own target out of its category mean
    data_encoded = encoder.fit_transform(data[['category']], data['target'])

    # Add the encoded values to the original dataframe under their own column name
    # (merging as-is would clash with the existing 'category' column)
    data['category_ce_encoded'] = data_encoded['category']

    # Display the result
    print(data)
    

    Here you can see we get the same result if we use category encoders as well.

    Thanks for reading and let me know in the comments in case you’ve any questions regarding Leave One Out Encoding.

  • Time Series Forecasting with Python Part 3 – Identifying Trends in Data

    While doing time series forecasting it is very important to analyse if your data has some trends, seasonality or periodicity in it. To identify if a time series has seasonality there are several techniques you can use.

    We will be using the following dummy data to see how we can test for seasonal trends in our data.

    sales = np.array([100, 120, 130, 150, 110, 130, 140, 160, 120, 140, 150, 170])
    
    quarters = ['Q1 2018', 'Q2 2018', 'Q3 2018', 'Q4 2018',
                'Q1 2019', 'Q2 2019', 'Q3 2019', 'Q4 2019',
                'Q1 2020', 'Q2 2020', 'Q3 2020', 'Q4 2020']
    
    1. Visual inspection – Just by looking at the plot of the time series, you can identify that there are visible patterns in it.

    Plotting this series shows that sales grow from Q1 through Q4 of each year and then drop back in Q1 of the following year, with each year starting higher than the last.

    2. Autocorrelation Function (ACF) – Autocorrelation refers to the correlation of a series with itself at different time lags. In other words, it quantifies the similarity or relationship between a data point and its preceding or lagged observations. The ACF helps identify any repeating patterns or dependencies within the time series data.

    In the ACF plot, if we see spikes at regular lag intervals, it indicates seasonality. We can take the help of plot_acf from the statsmodels package.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.graphics.tsaplots import plot_acf
    
    # Generate ACF plot
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_acf(sales, lags=11, ax=ax)  # Set lags to the number of quarters (12) minus 1
    
    plt.title('Autocorrelation Function (ACF) Plot')
    plt.xlabel('Lag')
    plt.ylabel('Autocorrelation')
    plt.show()
    

    Here we can clearly see a spike at lag 4, confirming what we already know: there is a yearly (four-quarter) seasonality in the time series data.
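
    To make the numbers behind the plot concrete, here is a small sketch that computes the sample autocorrelation of the same sales series by hand. It assumes the usual estimator (lagged products of deviations from the mean, divided by the total sum of squared deviations); the function name autocorrelation is made up for this example.

```python
sales = [100, 120, 130, 150, 110, 130, 140, 160, 120, 140, 150, 170]

def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return numerator / denominator

# Autocorrelation for lags 0 through 8
acf = [autocorrelation(sales, lag) for lag in range(9)]
for lag, value in enumerate(acf):
    print(f"lag {lag}: {value:+.3f}")
```

    Among the non-zero lags, lag 4 has the largest value (about 0.55), which is the spike that points to a four-quarter seasonal pattern.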

    3. Decomposition –

    Decomposition is a technique used to break down a time series into its individual components: trend, seasonality, and residual (also known as error or noise). The decomposition process allows us to isolate and analyze these components separately, providing insights into the underlying patterns and variations within the time series data.

    There are two commonly used types of decomposition:

    1. Additive Decomposition: In additive decomposition, the time series is assumed to be the sum of its components. It is expressed as:
       Y(t) = Trend(t) + Seasonality(t) + Residual(t)
       The additive decomposition assumes that the magnitude of the seasonal fluctuations remains constant throughout the time series.
    2. Multiplicative Decomposition: In multiplicative decomposition, the time series is assumed to be the product of its components. It is expressed as:
       Y(t) = Trend(t) * Seasonality(t) * Residual(t)
       Multiplicative decomposition assumes that the seasonal fluctuations grow or shrink proportionally with the trend.

    Again we will be using the statsmodels package to perform seasonal decomposition.

    from statsmodels.tsa.seasonal import seasonal_decompose
    
    # Create a pandas Series with a quarterly frequency
    index = pd.date_range(start='2018-01-01', periods=len(sales), freq='Q')
    series = pd.Series(sales, index=index)
    
    # Perform seasonal decomposition
    decomposition = seasonal_decompose(series, model='additive')
    
    # Extract the components
    trend = decomposition.trend
    seasonality = decomposition.seasonal
    residuals = decomposition.resid
    
    # Plot the components
    plt.figure(figsize=(10, 8))
    plt.subplot(411)
    plt.plot(series, label='Original')
    plt.legend(loc='best')
    plt.subplot(412)
    plt.plot(trend, label='Trend')
    plt.legend(loc='best')
    plt.subplot(413)
    plt.plot(seasonality, label='Seasonality')
    plt.legend(loc='best')
    plt.subplot(414)
    plt.plot(residuals, label='Residuals')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()
    

    In this dummy example, we can clearly see via this decomposition that there is an upwards trend in the data along with a quarterly seasonality.
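
    For intuition about where those components come from, the sketch below works through a classical additive decomposition of the same series by hand. It is a simplified illustration under stated assumptions (a centered moving average with half weights at the two ends for the trend, and per-quarter averages of the detrended values for the seasonal part), and is not guaranteed to match statsmodels in every detail. The helper name centered_moving_average is made up for this example.

```python
sales = [100, 120, 130, 150, 110, 130, 140, 160, 120, 140, 150, 170]
period = 4  # quarterly data, one seasonal cycle per year

def centered_moving_average(values, period):
    """Trend estimate for an even period (hypothetical helper)."""
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        # Half weight on the two end points keeps the average centered on t
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

trend = centered_moving_average(sales, period)

# Seasonal effect: average detrended value for each quarter position (0 = Q1)
detrended = {q: [] for q in range(period)}
for t, (value, level) in enumerate(zip(sales, trend)):
    if level is not None:
        detrended[t % period].append(value - level)
seasonal = {q: sum(v) / len(v) for q, v in detrended.items()}

# Center the seasonal effects so that they sum to zero over one cycle
offset = sum(seasonal.values()) / period
seasonal = {q: s - offset for q, s in seasonal.items()}

print(trend)
print(seasonal)
```

    The trend rises steadily by 2.5 per quarter, and the seasonal effects run from about -21 in Q1 to about +21 in Q4, which matches the upward trend and quarterly pattern described above. The first and last two trend values are undefined because the moving average needs two neighbours on each side.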

    There are a couple more tests left to explore, but we will pick those up in the next part where we will continue to explore this seasonality and trends in time series data.

  • 10 Decision Tree Questions Every Data Scientist Needs to Know

    You may or may not be asked such questions in an interview, but these kinds of questions often come up in screening tests that use MCQs.

  • Using Custom Eval Metric with Catboost

    Catboost offers a multitude of evaluation metrics. You can read all about them here, but often you want to use a custom evaluation metric.

    For example in this ongoing Kaggle competition, the evaluation metric is Balanced Log Loss. Such a metric is not supported by catboost. By this I mean that you can’t simply write this and expect it to work.

    from catboost import CatBoostClassifier
    model = CatBoostClassifier(eval_metric="BalancedLogLoss")
    model.fit(X,y)

    This will give you an error. What you need to define is a custom eval metric class, and the template for it is pretty simple.

    class UserDefinedMetric(object):
        def is_max_optimal(self):
            # Returns whether great values of metric are better
            pass
    
        def evaluate(self, approxes, target, weight):
            # approxes is a list of indexed containers
            # (containers with only __len__ and __getitem__ defined),
            # one container per approx dimension.
            # Each container contains floats.
            # weight is a one dimensional indexed container.
            # target is a one dimensional indexed container.
    
            # weight parameter can be None.
            # Returns pair (error, weights sum)
            pass
    
        def get_final_error(self, error, weight):
            # Returns final value of metric based on error and weight
            pass
    
    

    The class has three parts.

    1. get_final_error – Here you can just return the error, or if you want to modify the error, like taking the log or square root, you can do so.
    2. is_max_optimal – Here you return True if greater is better, as with accuracy, otherwise return False.
    3. evaluate – Here lies the meat of your code, where you actually write the metric you want. Remember that approxes holds the model's raw predictions, one container per dimension, so for binary classification you take approxes[0]. These are raw scores (log-odds) rather than probabilities, so pass them through a sigmoid before using them in a probability-based metric.

    Below you will find the code for Balanced Log Loss as an eval metric.

    
    class BalancedLogLoss:
        def get_final_error(self, error, weight):
            return error
    
        def is_max_optimal(self):
            return False
    
        def evaluate(self, approxes, target, weight):
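            y_true = np.asarray(target).astype(int)
            # approxes[0] holds raw scores (log-odds), so convert them to probabilities
            y_pred = 1.0 / (1.0 + np.exp(-np.asarray(approxes[0], dtype=float)))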
            
            y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
            individual_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
            
            class_weights = np.where(y_true == 1, np.sum(y_true == 0) / np.sum(y_true == 1), np.sum(y_true == 1) / np.sum(y_true == 0))
            weighted_loss = individual_loss * class_weights
            
            balanced_logloss = np.mean(weighted_loss)
            
            return balanced_logloss, 0.0

    Then you can simply call this in your grid search or randomised search like this –

    model = CatBoostClassifier(verbose=False, eval_metric=BalancedLogLoss())

    Write in the comments below if you’ve any questions related to custom eval metrics in Catboost or any ML framework.

  • Balanced Log Loss, Metric for imbalanced classification problems

    We all know about LogLoss which is the main loss function when it comes to binary classification problems. The formula is given below –


    \[LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]

    • \(N\) is the total number of samples.
    • \(y_i\) is the true label of sample \(i\) (0 or 1).
    • \(p_i\) is the predicted probability of sample \(i\) belonging to class 1.

    Imbalanced Log Loss:
    The imbalanced log loss accounts for class imbalance by introducing class weights. It can be defined as:


    \[ImbalancedLogLoss = -\frac{1}{N} \sum_{i=1}^{N} w_i \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]

    • \(N\) is the total number of samples.
    • \(y_i\) is the true label of sample \(i\) (0 or 1).
    • \(p_i\) is the predicted probability of sample \(i\) belonging to class 1.
    • \(w_i\) is the weight assigned to sample \(i\) based on its class label. For example, if class 0 has fewer samples than class 1, the class 0 samples can be given a weight equal to the ratio of class 1 samples to class 0 samples.
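
    As a quick numeric check of the two formulas, here is a small dependency-free sketch. It assumes one particular weighting, the ratio of the other class's count to the sample's own class count, and it assumes both classes are present; the function names log_loss and weighted_log_loss are made up for this example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Plain log loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def weighted_log_loss(y_true, p_pred, eps=1e-15):
    """Log loss with class weights (assumed weighting: other-class count / own-class count)."""
    n1 = sum(y_true)
    n0 = len(y_true) - n1
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        w = n0 / n1 if y == 1 else n1 / n0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Three negatives predicted well, one positive predicted badly
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.2]

plain = log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred)
print(plain, weighted)
```

    The weighted version comes out much larger here because the single badly predicted positive counts three times as much, while each well predicted negative counts only a third. On perfectly balanced data every weight is 1 and the two losses coincide.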

    Here is the python code that you can call to evaluate in Catboost –

    class BalancedLogLoss:
        def get_final_error(self, error, weight):
            return error
    
        def is_max_optimal(self):
            return False
    
        def evaluate(self, approxes, target, weight):
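            y_true = np.asarray(target).astype(int)
            # approxes[0] holds raw scores (log-odds), so convert them to probabilities
            y_pred = 1.0 / (1.0 + np.exp(-np.asarray(approxes[0], dtype=float)))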
            
            y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
            individual_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
            
            class_weights = np.where(y_true == 1, np.sum(y_true == 0) / np.sum(y_true == 1), np.sum(y_true == 1) / np.sum(y_true == 0))
            weighted_loss = individual_loss * class_weights
            
            balanced_logloss = np.mean(weighted_loss)
            
            return balanced_logloss, 0.0

    Advantages of Imbalanced LogLoss –

    1. Handles Class Imbalance: The imbalanced log loss takes into account the class distribution and assigns appropriate weights to each class. This allows the model to effectively handle imbalanced datasets, where one class may have significantly fewer samples than the other. By assigning higher weights to the minority class, the model focuses more on correctly classifying the minority class, reducing the impact of class imbalance.
    2. Improves Model Performance: By incorporating class weights in the loss function, the imbalanced log loss guides the model to optimize its predictions specifically for imbalanced datasets. This can lead to improved model performance, as the model becomes more sensitive to the minority class and learns to make better predictions for both classes.
    3. Flexible Weighting Strategies: The imbalanced log loss allows flexibility in assigning weights to different classes. Various weighting strategies can be used based on the characteristics of the dataset and the specific problem at hand. For example, weights can be inversely proportional to class frequencies or can be set manually based on domain knowledge. This flexibility enables the model to adapt to different levels of class imbalance and prioritize the correct classification of the minority class accordingly.
    4. Evaluation Metric Consistency: When using the imbalanced log loss as both the training loss and evaluation metric, it ensures consistency in model optimization and evaluation. By optimizing the model to minimize the imbalanced log loss during training, the model’s performance is directly aligned with the evaluation metric, providing a fair assessment of the model’s effectiveness in handling class imbalance.

    In conclusion, if you have an imbalanced class problem, you can try this eval metric in your models as well.


  • MSE vs MSLE, When to use what metric?

    MSLE (Mean Squared Logarithmic Error) and MSE (Mean Squared Error) are both loss functions that you can use in regression problems. But when should you use what metric?

    Mean Squared Error (MSE):

    MSE is a good choice when your target has a normal or normal-like distribution. Because it squares the errors, it is sensitive to outliers, and such a target has few of them.

    An example is below –

    In this case using MSE as your loss function makes much more sense than MSLE.

    Mean Squared Logarithmic Error (MSLE):

    • MSLE measures the average squared logarithmic difference between the predicted and actual values.
    • MSLE measures relative rather than absolute error: because of the logarithmic transformation, the same absolute error counts for less when the actual value is large.
    • It is less sensitive to outliers than MSE since the logarithmic transformation compresses the error values.
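
    A small sketch makes the difference concrete. It assumes the common definition of MSLE, the mean of (log(1 + actual) - log(1 + predicted))^2 over all samples, and non-negative inputs; the function names mse and msle are made up for this example.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    """Mean squared logarithmic error; log1p(x) is log(1 + x)."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

# The same 50% over-prediction at two very different scales
small_true, small_pred = [100.0], [150.0]
large_true, large_pred = [10000.0], [15000.0]

print(mse(small_true, small_pred), mse(large_true, large_pred))
print(msle(small_true, small_pred), msle(large_true, large_pred))
```

    MSE grows by a factor of 10,000 between the two cases, while MSLE stays at roughly 0.16 in both, because it only sees the ratio between prediction and actual value.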

    An example where you can use MSLE –

    Here the target grows roughly exponentially, so MSE will be dominated by the largest values and MSLE is the better metric. Keep in mind that not every library offers MSLE as a training objective; where it is available only as an evaluation metric, a common workaround is to train with MSE on the log-transformed target.

    In general, the choice between MSLE and MSE depends on the nature of the problem, the distribution of errors, and the desired behavior of the model. It’s often a good idea to experiment with both and evaluate their performance using appropriate evaluation metrics before finalizing the choice.

  • The Making Of A Manager – An Insightful Read into people who are starting their journey as a leader

    Julie Zhuo’s The Making Of A Manager does one thing and excels at it: driving home the point that being a manager is more than assigning tasks to your team members and making sure that they adhere to the set deadlines.

    A manager is instead someone who is responsible for making sure that both the collective and individual goals are achieved. It is someone who understands the bigger picture and at times can also take the harder decisions. Being a manager also involves being straight in your feedback, and it comes with a learning curve where you first have to manage yourself.

    The decisions a manager takes about how to handle the growth of the team, or whom to bring into it, will dictate the direction that the team takes. A good manager will take a mediocre team and, together with the team, make it into an outstanding one, whereas a bad manager, or someone in the wrong place at the wrong time, will take a high-performing team and make it into a mediocre one.

    The lessons in the book, like striking a balance between micromanaging and being too distant from the team, are nothing new; they are things we already know. The beauty of the book is in how it lays emphasis on these points and adds clarity to them with personal anecdotes.

    If you want to buy the book, you can do so by visiting this link. It also supports the blog.

    Rating: 4 out of 5.
  • MCC Score – The only ML metric you need

    The title might be a bit of a clickbait, but MCC (Matthews Correlation Coefficient) is a critical ML metric that every Data Scientist must know.

    Metrics like the F1 score focus on only one class and its performance, but if you want a balanced model then you should be optimising your model on MCC score rather than on Accuracy or F1-score.

    Let us take an example of –

    y_true = [1,1,1,1,0,0,1, 0,1]
    y_pred = [1,1,1,1,1,1,1, 0,0]

    If we calculate the F1-score then it is ~0.77, but the MCC score is ~0.19, meaning that even though the model is very good at classifying the positive class, it is not very good at the negative class.

    If we look at the formula for MCC –

    \[MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]

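    It should be clear that MCC gives equal weight to both TP and TN, and that FP and FN both pull it down. Since it is a correlation coefficient, its value ranges from -1 to 1: 1 is a perfect prediction, 0 is no better than random guessing, and -1 is total disagreement.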
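
    The numbers quoted for the example can be reproduced by counting the four confusion-matrix cells and plugging them into the formula. This is a plain Python sketch for illustration; in practice a library implementation of these metrics would normally be used instead.

```python
import math

y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0]

# Count the four confusion-matrix cells
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

f1 = 2 * tp / (2 * tp + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(tp, tn, fp, fn)
print(round(f1, 2), round(mcc, 2))
```

    With 5 true positives but only 1 true negative out of 3 actual negatives, F1 stays high because it never looks at TN, while MCC drops because the negative class is mostly misclassified.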

  • K-Nearest Neighbour Algorithm Explained

    KNN (K-Nearest Neighbours) is a supervised learning algorithm which uses the nearest neighbours to classify a new data point.

    The tricky part is selecting the optimal k for the model.

    sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

    As you can see, weights is 'uniform' by default and n_neighbors is 5 by default. Large values of k smooth the decision boundary but can blur the distinction between classes, while a very small value of k is unreliable and easily affected by outliers.

    You can pick the optimal value of the k by tuning the hyperparameter using GridSearchCV.

    Then there is the value of p, which is 2 by default, meaning that it uses the Euclidean distance; you can set it to 1 to use the Manhattan distance. This is the distance it uses to choose the nearest points for classification.
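
    To show what k and p actually do, here is a tiny from-scratch sketch of the idea: measure the Minkowski distance from the query to every training point, keep the k closest, and take a majority vote (the equivalent of uniform weights). The helper names minkowski and knn_predict and the toy points are made up for this example, and it is not how scikit-learn implements the search internally.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(points, labels, query, k=3, p=2):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(points)),
                   key=lambda i: minkowski(points[i], query, p))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance
print(knn_predict(points, labels, (1, 1), k=3))
print(knn_predict(points, labels, (5, 4), k=3, p=1))
```

    Between (0, 0) and (3, 4) the Manhattan distance is 7 and the Euclidean distance is 5, so changing p can change which points count as nearest. An odd k is a common choice for two classes because it avoids tied votes.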

    Let’s code this in python-

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    
    # Load the features and labels in a single call
    X, y = load_iris(return_X_y=True)
    
    # Defining the search grid
    param_grid = {'n_neighbors': np.arange(3, 10, 1),
                  'p': [1, 2, 3]}
    
    grid_search = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, scoring='accuracy', cv=3)
    
    grid_search.fit(X, y)
    
    print(grid_search.best_params_)
    # Output: {'n_neighbors': 4, 'p': 2}
    print(grid_search.best_score_)
    # Output: 0.9866666666666667
    

    Hope this post cleared how you can use KNN in your machine learning problems, and if you want me to write about any ML topic, just drop a comment below.