Tag: Data Science Interview

  • sMAPE vs MAPE vs RMSE, when to use which regression metric

    I was going through Kaggle competitions when this competition caught my eye, especially the evaluation metric for it. Now the usual metrics for forecasting or regression problems are either RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(A_{t} - F_{t})^{2}} or MAPE = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|A_{t}-F_{t}\right|}{\left|A_{t}\right|}, but sMAPE is different.

    sMAPE (Symmetric Mean Absolute Percentage Error) is a metric used to evaluate the accuracy of a forecast model. It is calculated as the average of the absolute differences between the forecasted and actual values, expressed as a percentage of the average of the absolute actual and forecasted values. Mathematically, it can be expressed as:

    sMAPE = \frac{100}{n}\sum_{t=1}^{n}\frac{\left|F_{t} - A_{t} \right|}{(\left|A_{t} \right| + \left|F_{t} \right|)/2}

    So when to use which metric?

    • RMSE – When you want to penalize large outlier errors in your prediction model, RMSE is the metric of choice, as squaring the errors penalizes large errors more than small ones.
    • MAPE – When all errors should be treated equally in relative (percentage) terms, MAPE makes sense to use.
    • sMAPE – Typically used when the forecasted values and the actual values are both positive and of similar magnitudes. It is symmetric in that it treats over-forecasting and under-forecasting the same.

    It is important to note that MAPE is undefined when an actual value is 0, and sMAPE is undefined when an actual value and its forecast are both 0, as either case results in a division-by-zero error.
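
    To make the definitions concrete, here is a minimal sketch of all three metrics in NumPy; the actual and forecast arrays are made-up toy values for illustration.

    import numpy as np

    actual = np.array([120.0, 95.0, 230.0, 180.0])
    forecast = np.array([110.0, 100.0, 200.0, 195.0])

    # RMSE: square root of the mean squared error
    rmse = np.sqrt(np.mean((actual - forecast) ** 2))

    # MAPE: mean absolute error relative to the actual values (undefined if any actual == 0)
    mape = np.mean(np.abs(actual - forecast) / np.abs(actual)) * 100

    # sMAPE: mean absolute error relative to the average of |actual| and |forecast|
    smape = np.mean(np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2)) * 100

    print(f"RMSE: {rmse:.2f}, MAPE: {mape:.2f}%, sMAPE: {smape:.2f}%")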

  • Macro vs micro averages, how are they calculated?

    In this post, I’ll go over macro and micro averages, using precision as the example metric (the same logic applies to recall).

    What are macro and micro averages?

    A macro average computes the metric independently for each class and then takes the mean, thus giving equal weight to each class, whereas a micro average aggregates the contributions of all samples across classes, and so takes class imbalances into account when computing the average.

    When to use macro vs micro averages?

    If you suspect there is class imbalance in your data, the micro average is generally preferred over the macro average, since it weights each class by its number of samples rather than treating all classes equally.

    How are they different?

    Let’s take an example scenario (adapted from the scikit-learn documentation).

    from sklearn.metrics import precision_score

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]

    print(precision_score(y_true, y_pred, average='macro'))  # 0.22...
    print(precision_score(y_true, y_pred, average='micro'))  # 0.33...
    

    You can see that the precision score is different for macro calculation vs micro calculation.

    Let’s break down the calculation step by step.

    A quick recap: the precision formula for a binary classification problem is –

    Precision = \frac{TP}{TP+FP}



    For multi-class problems, the micro and macro formulas can be written as –

    Precision_{micro} = \frac{\sum_{i} TP_{i}}{\sum_{i} TP_{i}+\sum_{i} FP_{i}}



    Precision_{macro} = \frac{\sum_{i} Precision_{i}}{n}, where n is the number of classes



    So in the above example, the total TP across the three classes is 2 and the total FP is 4, so the micro precision is \frac{2}{2+4} = \frac{2}{6} = 0.33

    Similarly, the precision for each class individually is

    P(0) = 2/3 = 0.66, P(1) = 0/2 = 0, P(2) = 0/1 = 0

    So the macro precision is \frac{0.66 + 0 + 0}{3} = 0.22
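
    To double-check the arithmetic, here is a quick sketch using scikit-learn’s average=None option, which returns the per-class precisions directly:

    from sklearn.metrics import precision_score

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]

    per_class = precision_score(y_true, y_pred, average=None)
    print(per_class)         # [0.66... 0. 0.] -- P(0), P(1), P(2)
    print(per_class.mean())  # 0.22... -- the macro average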

    This is how micro and macro averages differ. Hope this article cleared up macro vs micro averages in ML metrics for you.

  • ReLU vs Leaky ReLU, when to use what

    ReLU (Rectified Linear Unit) and Leaky ReLU are both types of activation functions used in neural networks.

    ReLU

    ReLU is defined as f(x) = max(0, x), where x is the input to the function. It sets all negative input values to zero while allowing all non-negative values to pass through unchanged. Because it is cheap to compute and its gradient is simply 0 or 1, it can speed up training, and the sparse activations it produces can improve the performance of the model.

    Leaky ReLU is an extension of ReLU that aims to address the problem of “dying ReLUs”, in which some neurons in the network never activate because the gradient is zero for all input values less than zero. It can be defined mathematically as f(x) = max(x, kx), where k is a small positive constant (typically around 0.01), so that negative inputs get a small non-zero slope rather than being clipped to zero as in a standard ReLU.

    In practice, Leaky ReLU is used as a generalization of ReLU. The small slope for negative inputs helps avoid the dying ReLU problem, and it can also help a network train faster, since the gradients for negative inputs are no longer zero. A general rule of thumb when choosing between the two: if the problem does not have sparse inputs and the data set is not too small, Leaky ReLU may result in a more accurate model; otherwise, if the inputs are sparse and/or the data set is small, ReLU is a better choice.

    It also depends on personal preference and what the dataset is like: Leaky ReLU may work better in some cases and ReLU in others. It’s important to try out different activation functions and see which one gives the best performance on your dataset.
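
    As a quick illustration, here is a minimal NumPy sketch of both activations; the slope k = 0.01 is just the commonly used default, not a tuned value.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def leaky_relu(x, k=0.01):
        # For x >= 0 this returns x; for x < 0 it returns k * x instead of 0.
        return np.maximum(x, k * x)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))        # [0.  0.  0.  0.5 3. ]
    print(leaky_relu(x))  # [-0.03  -0.005  0.     0.5    3.   ]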

  • Why is Tanh a better activation function than sigmoid?

    You might be asked why, in neural networks, tanh is often considered a better activation function than sigmoid.

    As a reminder, sigmoid is defined as \sigma(x) = \frac{1}{1+e^{-x}} with outputs in (0, 1), while tanh is defined as \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} with outputs in (-1, 1).

    Andrew Ng also mentions in his Deep Learning Specialization course that tanh is almost always a better activation function than sigmoid. So why is that the case?

    There are a few reasons why the hyperbolic tangent (tanh) function is often considered to be a better activation function than the sigmoid function:

    1. The output of the tanh function is centered around zero, which means that negative inputs are mapped to negative values and positive inputs are mapped to positive values. Zero-centered activations make the learning process of the network more stable.
    2. The tanh function has a well-behaved derivative, 1 - \tanh^{2}(x), which is easy to compute and gives relatively stable gradients.
    3. The sigmoid function, on the other hand, maps all inputs to values between 0 and 1. This can cause the network to become saturated, and the gradients can become very small, making the network difficult to train.
    4. Another reason is that the range of the tanh function is [-1, 1], while the range of the sigmoid function is [0, 1], so tanh keeps the activations closer to zero mean. This makes the network less likely to saturate and more likely to have good gradient flow, leading to faster convergence.
    5. Another advantage is that the maximum gradient of tanh is 1 (at x = 0), whereas the maximum gradient of sigmoid is only 0.25, so gradients shrink less quickly when back-propagating through tanh layers (see the sketch after this list).
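
    A quick numerical sketch of that last point, comparing the maximum gradients of the two functions (sigmoid'(x) = \sigma(x)(1-\sigma(x)) and tanh'(x) = 1 - \tanh^{2}(x)):

    import numpy as np

    x = np.linspace(-5, 5, 1001)
    sig = 1.0 / (1.0 + np.exp(-x))

    sig_grad = sig * (1.0 - sig)        # derivative of sigmoid
    tanh_grad = 1.0 - np.tanh(x) ** 2   # derivative of tanh

    print(sig_grad.max())   # 0.25
    print(tanh_grad.max())  # 1.0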

    All that being said, whether to use sigmoid or tanh depends on the specific problem and context, and it’s not always the case that one is clearly better than the other.

  • When to use F2 or F0.5 score? (F-beta score)

    Whenever we come across an imbalanced class problem, the metric to measure is often F1 score and not accuracy. A quick reminder that the F1 score is the harmonic mean of precision and recall.

    Precision measures how accurate your ML model’s positive predictions are, i.e. what fraction of predicted positives are actually positive.

    Recall is a measure of the model’s ability to correctly identify the positive class.

    So the F1 score is a balanced measure of both recall and precision. But what if you want to prioritize reducing false positives or reducing false negatives? That’s where F-beta comes in. It’s a generalized metric in which a parameter beta controls the relative weight of recall versus precision:

    F_{\beta} = (1+\beta^{2})\cdot\frac{Precision \cdot Recall}{\beta^{2}\cdot Precision + Recall}

    This enables one to choose an appropriate beta value to tune for the task at hand. If you want to minimize false positives, you want to increase the weight of precision, so you should choose a value of beta less than 1; typically 0.5 is chosen, giving the F0.5 score.

    Similarly, if you want to increase the importance of recall and reduce false negatives, you should choose a value of beta greater than 1; typically 2 is selected, giving the F2 score.

    In a nutshell, you should optimize F2 score to reduce false negatives and F0.5 score to reduce false positives.
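
    Both are available in scikit-learn via fbeta_score; here is a minimal sketch with made-up labels:

    from sklearn.metrics import fbeta_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

    print(fbeta_score(y_true, y_pred, beta=0.5))  # weights precision more heavily
    print(fbeta_score(y_true, y_pred, beta=2))    # weights recall more heavily
    print(fbeta_score(y_true, y_pred, beta=1))    # equivalent to the ordinary F1 score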

  • Oddball Data Science Interview Questions on Decision Trees

    Decision trees are the building blocks of many popular ML models, such as random forests and gradient boosting. So questions regarding decision trees are often asked in data science interviews. In this post, I’ll try to cover some of those questions that often catch people by surprise.

    Are Decision Trees parametric or non-parametric models?

    Decision Trees are non-parametric models: they don’t assume a fixed functional form, and their structure grows with the data. Linear Regression and Logistic Regression are examples of parametric models.

    Why is the Gini Index preferred over Entropy for growing decision trees in Machine Learning libraries?

    The calculation for the Gini Index is computationally more efficient than that for Entropy, because it does not require computing logarithms.

    It’s for this reason that Gini is the preferred (and usually the default) criterion.

    How are continuous variables handled as predictor variables in decision trees?

    Continuous or numerical variables are handled by evaluating candidate split thresholds over their sorted values (effectively binning them); the threshold that best splits the node is chosen.

    What is optimised when the target is a continuous variable, i.e. when the task is regression?

    Variance reduction (equivalently, minimising the mean squared error within the child nodes) is used to choose the best split when the target is continuous.

    How do decision trees handle multiple classes, or in other words, how do they do multi-class classification?

    Splits are chosen based on information gain using Gini or Entropy, just as in the binary case. In a leaf where no further splits are possible, the class with the highest proportion of training samples is the predicted class; those class proportions can also be returned as predicted probabilities.
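
    As an illustration, here is a minimal scikit-learn sketch on the 3-class iris dataset; criterion='gini' is the library default, and max_depth=3 is an arbitrary choice for the example.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
    tree.fit(X, y)

    print(tree.predict(X[:2]))        # predicted class labels
    print(tree.predict_proba(X[:2]))  # class proportions in the corresponding leaf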

  • Null Hypothesis of Linear Regression Explained

    Ever wondered why we look for a p-value less than 0.05 for the coefficients when looking at linear regression results?

    Let’s quickly recap the basics of linear regression. In linear regression we try to estimate a best-fit line for the given data points. If we have only one predictor variable and a target, the linear equation will look something like

    Y = A + Bx

    Here A is the intercept and B is the slope, or coefficient.

    The null hypothesis for linear regression is that B = 0 (the predictor has no effect on the target) and the alternative hypothesis is that B \neq 0.

    This is the reason why we look for a p-value < 0.05: it lets us reject the null hypothesis and conclude that a statistically significant relationship exists between the target and the predictor variable.
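
    Here is a minimal sketch of where these p-values show up, using statsmodels on synthetic data (the data and the true values A = 2, B = 3 are made up purely for the example):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 + 3.0 * x + rng.normal(size=100)   # true intercept A = 2, slope B = 3

    X = sm.add_constant(x)                     # adds the intercept column
    results = sm.OLS(y, X).fit()

    print(results.params)   # estimated A and B
    print(results.pvalues)  # p-value for each coefficient; B's tests H0: B = 0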