Category: Metrics

  • Cohen’s Kappa and its use in ML

    Suppose you’re building a classification model on an imbalanced dataset and you want a measure beyond accuracy, F1-score, and ROC-AUC to be confident in your results. What else can you use? One answer is Cohen’s Kappa.

    Cohen’s Kappa is a statistical measure that quantifies the level of agreement between two annotators or, in the context of ML, the agreement between the model’s predictions and the true labels. It accounts for the possibility of agreement occurring by chance, providing a more nuanced evaluation than traditional accuracy metrics.

    The Formula:
    The formula for Cohen’s Kappa is –

    \kappa = \frac{p_{0} - p_{e}}{1-p_{e}}

    Where p_{0} is the observed agreement between the model’s predictions and the true labels, and p_{e} is the agreement expected by chance (computed from the marginal label frequencies).
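
    To make p_{0} and p_{e} concrete before jumping into a full model, here is a minimal sketch that computes kappa by hand from a hypothetical 2x2 confusion matrix (the counts are made up purely for illustration):

    import numpy as np

    # Hypothetical confusion matrix: rows = true labels, columns = predicted labels
    #               pred ham   pred spam
    cm = np.array([[40,        10],     # true ham
                   [ 5,        45]])    # true spam
    n = cm.sum()

    p_0 = np.trace(cm) / n                                  # observed agreement: (40 + 45) / 100 = 0.85
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2    # agreement expected by chance from the marginals = 0.5
    kappa = (p_0 - p_e) / (1 - p_e)
    print(round(kappa, 2))  # 0.7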

    Now let’s take a fuller example: a binary classification scenario where you’re building a spam email classifier. The task is to distinguish between spam and non-spam (ham) emails, and we’ll use a simple logistic regression model.

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, cohen_kappa_score

    # Sample data for spam and non-spam emails
    data = [
    ("Get rich quick! Claim your prize now!", "spam"),
    ("Meeting at 3 pm in the conference room.", "ham"),
    ("Exclusive offer for you!", "spam"),
    ("Reminder: Project deadline tomorrow.", "ham"),
    # ... more data ...
    ]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
    [text for text, label in data],
    [label for text, label in data],
    test_size=0.2,
    random_state=42
    )

    # Vectorize the text data
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train a logistic regression classifier
    classifier = LogisticRegression()
    classifier.fit(X_train_vec, y_train)

    # Make predictions on the test set
    y_pred = classifier.predict(X_test_vec)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    kappa_score = cohen_kappa_score(y_test, y_pred)

    # Print the results
    print(f"Accuracy: {accuracy}")
    print(f"Confusion Matrix:\n{conf_matrix}")
    print(f"Cohen's Kappa: {kappa_score}")

    After this, you get a kappa of 1, which means the model agrees perfectly with the true labels and none of that agreement can be attributed to chance. Be aware that this is an ideal scenario.

    The other extreme is a score of 0, meaning the model’s performance is no better than random chance; in other words, your features don’t capture any meaningful patterns in the data.

    In the context of model evaluation:

    • Kappa scores closer to 1 indicate a high level of agreement and are generally considered desirable.
    • Kappa scores around 0 or below suggest poor agreement, and the model’s predictions might not be reliable.

    It’s essential to interpret Cohen’s Kappa alongside other evaluation metrics, such as accuracy, precision, recall, and the confusion matrix, to comprehensively understand the model’s performance. Additionally, the interpretation of Kappa may vary depending on the specific problem and the level of difficulty in the classification task.

  • MCC Score – The only ML metric you need

    The title might be a bit of clickbait, but MCC (Matthews Correlation Coefficient) is a critical ML metric that every Data Scientist must know.

    Metrics like the F1 score focus on only one class (the positive class) and its performance, but if you want a model that is balanced across both classes, you should be optimising on the MCC score rather than on Accuracy or F1-score.

    Let us take an example of –

    y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1]
    y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0]

    If we calculate the F1-score, it comes out to ~0.77, but the MCC score is only ~0.19, meaning that even though the model is very good at classifying the positive class, it is poor at classifying the negative class.

    If we look at the formula for MCC –

    MCC = \frac{(TP \times TN)-(FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

    It should be clear that MCC gives equal weight to both TP and TN. Since it is a correlation coefficient, its value ranges from -1 to 1.
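
    To verify these numbers yourself, here is a quick sketch using scikit-learn on the same arrays:

    from sklearn.metrics import f1_score, matthews_corrcoef

    y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1]
    y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0]

    # For this example: TP = 5, TN = 1, FP = 2, FN = 1
    print(round(f1_score(y_true, y_pred), 2))           # ~0.77
    print(round(matthews_corrcoef(y_true, y_pred), 2))  # ~0.19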

  • ML Metrics | Top N Accuracy Explained

    This metric is usually used in multiclass classification problems.
    Each multiclass model outputs a probability score for every class it was trained on, but you usually keep only the highest one (via np.argmax). What if you instead took the top n classes and gave the model credit whenever the correct class appears among those n predictions?
    That is top n accuracy: it gives the model more chances to be right.

    Let’s take an example.

    Suppose you built a model that predicts 3 classes and you want to find the top 2 accuracy of your model.
    You pass the predicted probability array and the true values to the metric, and if the correct class is among the top 2 predictions, the model gets credit for being right.

    import numpy as np
    from sklearn.metrics import top_k_accuracy_score

    y_true = [0, 1, 1, 2, 2]
    y_pred = [[0.25, 0.20, 0.30],  # here 0 is in the top 2
              [0.30, 0.35, 0.50],  # here 1 is in the top 2
              [0.20, 0.40, 0.45],  # here 1 is in the top 2
              [0.50, 0.10, 0.20],  # here 2 is in the top 2
              [0.10, 0.40, 0.20]]  # here 2 is in the top 2
    print(top_k_accuracy_score(y_true, y_pred, k=2))  # 1.0
    

    It is 1.0, because the correct class was always in our top 2 predictions. In fact, if you look closely, the correct class was always the model’s second-highest prediction, so if we take regular accuracy, i.e. set k = 1 in top_k_accuracy_score(y_true, y_pred, k=1), the answer is 0.
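
    Under the hood, the same result is easy to reproduce with plain numpy; here is a minimal sketch using the same y_true and y_pred as above:

    import numpy as np

    y_true = np.array([0, 1, 1, 2, 2])
    y_pred = np.array([[0.25, 0.20, 0.30],
                       [0.30, 0.35, 0.50],
                       [0.20, 0.40, 0.45],
                       [0.50, 0.10, 0.20],
                       [0.10, 0.40, 0.20]])

    k = 2
    top_k = np.argsort(y_pred, axis=1)[:, -k:]                  # indices of the k highest scores per row
    hits = [label in row for label, row in zip(y_true, top_k)]  # was the true class among them?
    print(np.mean(hits))  # 1.0 -- every true class is in the top 2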

    Hopefully, this explains what top N accuracy is, and if you want me to cover any ML topic, write in the comments below. Thanks for reading.

  • sMAPE vs MAPE vs RMSE, when to use which regression metric

    I was going through Kaggle competitions when this competition caught my eye, especially its evaluation metric. The usual metrics for forecasting or regression problems are either RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(A_{t} - F_{t})^{2}} or MAPE = \frac{100}{n}\sum_{t=1}^{n}\frac{\left|A_{t}-F_{t}\right|}{\left|A_{t}\right|}, but sMAPE is different.

    sMAPE (Symmetric Mean Absolute Percentage Error) is a metric used to evaluate the accuracy of a forecast model. It is calculated as the average of the absolute percentage differences between the forecasted and actual values, where each percentage is computed relative to the average of the absolute actual and forecasted values (rather than the actual value alone, as in MAPE). Mathematically, it can be expressed as:

    sMAPE = \frac{100}{n}\sum_{t=1}^{n}\frac{\left|F_{t} - A_{t} \right|}{(\left|A_{t} \right| + \left|F_{t} \right|)/2}

    So when should you use which metric?

    • RMSE – When you want to penalize large outlier errors in your prediction model, RMSE is the metric of choice as it penalizes large errors more than smaller ones.
    • MAPE – When you want every error to be weighted equally in relative (percentage) terms rather than having large errors dominate, MAPE makes sense to use.
    • sMAPE – is typically used when the forecasted values and the actual values are both positive, and when the forecasts and actuals are of similar magnitudes. It is symmetric in that it treats over-forecasting and under-forecasting the same.

    It is important to note that MAPE is undefined when any actual value is 0 (division by zero), and sMAPE runs into the same problem when an actual value and its forecast are both 0.
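
    As a quick reference, here is a minimal numpy sketch of all three metrics (the actual and forecast values below are made up purely for illustration; MAPE and sMAPE are returned as percentages):

    import numpy as np

    def rmse(actual, forecast):
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return np.sqrt(np.mean((actual - forecast) ** 2))

    def mape(actual, forecast):
        # undefined if any actual value is 0
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return 100 * np.mean(np.abs(actual - forecast) / np.abs(actual))

    def smape(actual, forecast):
        # undefined if an actual value and its forecast are both 0
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return 100 * np.mean(np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2))

    actual   = [100, 200, 300]
    forecast = [110, 180, 330]

    print(rmse(actual, forecast))   # ~21.6
    print(mape(actual, forecast))   # ~10.0
    print(smape(actual, forecast))  # ~9.9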

  • Macro vs micro averages, how are they calculated?

    In this post, I’ll go over macro and micro averages, specifically for precision and recall.

    What are macro and micro averages?

    A macro average computes the metric independently for each class and then takes the unweighted mean, giving equal weight to each class, whereas a micro average aggregates the contributions of all classes before computing the metric, thereby taking class imbalance into account.

    When should you use macro vs micro averages?

    If you suspect class imbalance in your data, then the micro average should be preferred over the macro average.

    How are they different?

    Let’s take an example scenario from the scikit-learn documentation.

    from sklearn.metrics import precision_score

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]

    print(precision_score(y_true, y_pred, average='macro'))  # 0.22...
    print(precision_score(y_true, y_pred, average='micro'))  # 0.33...
    

    You can see that the precision score is different for macro calculation vs micro calculation.

    Let’s break down the calculation using the confusion matrix.

    A quick recap: for a binary classification problem, the precision formula is –

    Precision = \frac{TP}{TP+FP}

    For multi-class problems, the micro and macro formulas can be written as –

    Precision_{micro} = \frac{\sum TP_{i}}{\sum TP_{i}+\sum FP_{i}}

    Precision_{macro} = \frac{\sum Precision_{i}}{n}

    where i runs over the classes and n is the number of classes.

    So in the above example, with total TP = 2 and total FP = 4, the micro precision is 2/(2+4) = 2/6 ≈ 0.33.

    Similarly, the precision for each class individually is

    P(0) = 2/3 ≈ 0.66, P(1) = 0, P(2) = 0

    So the macro precision is (0.66 + 0 + 0)/3 ≈ 0.22.
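
    To double-check these numbers programmatically, here is a minimal sketch that derives the same values from the confusion matrix:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]

    cm = confusion_matrix(y_true, y_pred)    # rows = true class, columns = predicted class
    tp = np.diag(cm)                         # per-class true positives: [2, 0, 0]
    fp = cm.sum(axis=0) - tp                 # per-class false positives: [1, 2, 1]

    per_class = tp / (tp + fp)               # [0.66..., 0.0, 0.0] (every class gets predicted at least once here)
    print(per_class.mean())                  # macro precision ~0.22
    print(tp.sum() / (tp.sum() + fp.sum()))  # micro precision ~0.33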

    That is how micro and macro averages differ. Hopefully this article cleared up macro vs micro averages in ML metrics for you.

  • When to use the F2 or F0.5 score? (F-beta score)

    Whenever we come across an imbalanced class problem, the metric of choice is often the F1 score rather than accuracy. A quick reminder that the F1 score is the harmonic mean of precision and recall.

    Precision measures how accurate your ML model’s positive predictions are, i.e. what fraction of the predicted positives are actually positive.

    Recall is a measure of the model’s ability to correctly identify the positive class.

    So the F1 score is a balanced measure of both recall and precision. But what if you want to prioritize reducing false positives or reducing false negatives? That is where F-beta comes in: a generalized metric in which a parameter beta controls the relative weight given to precision and recall.
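
    For reference, the general F-beta formula is –

    F_{\beta} = (1+\beta^{2}) \cdot \frac{Precision \cdot Recall}{(\beta^{2} \cdot Precision) + Recall}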

    This enables you to choose an appropriate beta value for the task at hand. If you want to minimize false positives, you want to increase the weight of precision, so you should choose a beta value less than 1; typically 0.5 is chosen, which gives the F0.5 score.

    Similarly, if you want to increase the importance of recall and reduce false negatives, you should choose a beta value greater than 1; typically 2 is selected, which gives the F2 score.

    In a nutshell, you should optimize F2 score to reduce false negatives and F0.5 score to reduce false positives.
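
    In practice, scikit-learn exposes this directly through fbeta_score; here is a minimal sketch on some made-up labels:

    from sklearn.metrics import fbeta_score

    y_true = [1, 1, 1, 0, 0, 1, 0, 1]
    y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

    # For these labels: precision = 0.75, recall = 0.60
    print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.71, weights precision more heavily
    print(fbeta_score(y_true, y_pred, beta=1.0))  # ~0.67, plain F1
    print(fbeta_score(y_true, y_pred, beta=2.0))  # ~0.62, weights recall more heavily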