Tag: Statistics

  • Correlation between numerical and categorical variable – Point Biserial Correlation

    We all know about Pearson correlation between numerical variables. But what if your target is binary and you want to measure the correlation between numerical features and that binary target? You can do so using the point-biserial correlation.

    The point-biserial correlation coefficient is a statistical measure that quantifies the relationship between a continuous variable and a dichotomous (binary) variable. It is a special case of the Pearson correlation coefficient: if you code the binary variable as 0/1 and compute Pearson's r, you get the same value.

    It is typically used when one variable is naturally dichotomous (e.g., pass/fail, yes/no) and the other variable is continuous (e.g., test scores, age).

    The coefficient ranges between -1 and +1, just like the Pearson correlation coefficient. A value of +1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates no linear relationship.

    The calculation compares the mean of the continuous variable in each category of the binary variable and scales that difference by the overall variability of the continuous variable. The formula for the point-biserial correlation coefficient is:

    r_{pb} = \frac{M_{1} - M_{0}}{s_{n}}\sqrt{pq}

    Here

    • M_{1} is the mean of the continuous variable for category 1 of the binary variable.
    • M_{0} is the mean of the continuous variable for category 0 of the binary variable.
    • s_{n} is the standard deviation of the continuous variable over all observations, computed with the population formula (dividing by n).
    • p is the proportion of cases in the “0” group.
    • q is the proportion of cases in the “1” group, so p + q = 1.
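
    To make the formula concrete, here is a minimal sketch that evaluates it term by term with NumPy and checks the result against SciPy. The data and the names `scores`, `passed`, and `point_biserial` are made up for illustration; the sketch assumes the binary variable is coded as 0/1.

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative data: test scores (continuous) and pass/fail labels (binary).
scores = np.array([52.0, 61.0, 48.0, 75.0, 83.0, 67.0, 90.0, 58.0])
passed = np.array([0, 0, 0, 1, 1, 1, 1, 0])


def point_biserial(x, y):
    """Point-biserial correlation computed directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    m1 = x[y == 1].mean()   # M_1: mean of the continuous variable in group 1
    m0 = x[y == 0].mean()   # M_0: mean of the continuous variable in group 0
    s_n = x.std(ddof=0)     # s_n: population standard deviation of all observations
    q = np.mean(y == 1)     # proportion of cases in the "1" group
    p = 1.0 - q             # proportion of cases in the "0" group
    return (m1 - m0) / s_n * np.sqrt(p * q)


r_manual = point_biserial(scores, passed)
r_scipy, _ = stats.pointbiserialr(passed, scores)
r_pearson, _ = stats.pearsonr(passed, scores)

print(r_manual, r_scipy, r_pearson)
```

    All three values should agree up to floating-point error, which also illustrates that the point-biserial coefficient is Pearson's r applied to a 0/1 variable.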

    You can also easily calculate this in Python using the scipy library.

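    import numpy as np
    import scipy.stats as stats

    # Illustrative data: a continuous feature and a 0/1 target
    continuous_variable = np.array([52.0, 61.0, 48.0, 75.0, 83.0, 67.0, 90.0, 58.0])
    binary_variable = np.array([0, 0, 0, 1, 1, 1, 1, 0])

    # Calculate the point-biserial correlation coefficient and its p-value
    r_pb, p_value = stats.pointbiserialr(binary_variable, continuous_variable)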
    

    Let me know in the comments if you have any questions regarding the point-biserial correlation.

  • Cohen’s D – How to measure the difference in distributions

    While the t-test or the Mann-Whitney U test can tell you whether two distributions differ significantly, neither tells you how large the difference is.

    For this purpose, you can calculate Cohen’s D.

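    d = \frac{M_{1} - M_{2}}{S_{pooled}}

    Here M_{1} and M_{2} are the means of the two groups. For two groups of equal size, the pooled standard deviation can be defined as

    S_{pooled} = \sqrt{\frac{s_{1}^{2} + s_{2}^{2}}{2}}

    where s_{1} and s_{2} are the standard deviations of the two groups.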

    After calculating Cohen’s D, you can gauge the size of the difference using this rule of thumb (applied to the absolute value of D):

    • Small effect: around 0.2
    • Medium effect: around 0.5
    • Large effect: around 0.8
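
    As a rough sketch, the rule of thumb can be turned into a small helper. The function name `effect_size_label` is hypothetical, and two choices here are assumptions rather than part of the rule itself: the three values are treated as lower bounds of bands, and anything below 0.2 is labelled "negligible".

```python
def effect_size_label(d):
    """Map Cohen's d to a rough verbal label (cutoffs treated as band lower bounds)."""
    size = abs(d)  # the sign only says which group has the larger mean
    if size < 0.2:
        return "negligible"  # assumed label; the rule of thumb does not name this band
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


for value in (0.1, -0.3, 0.5, 1.2):
    print(value, effect_size_label(value))
```

    These cutoffs are conventions, not hard boundaries, so a value such as 0.49 is better read as "small to medium" than as strictly "small".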

    Below you can find the code to calculate Cohen’s D in Python.

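    import numpy as np

    def cohens_d(x, y):
        """Cohen's d using the equal-weight pooled standard deviation."""
        # ddof=1 gives the sample variance s^2 of each group
        var_x = np.var(x, ddof=1)
        var_y = np.var(y, ddof=1)
        mean_x = np.mean(x)
        mean_y = np.mean(y)
        # Pooled standard deviation: square root of the average of the two variances
        pooled_sd = np.sqrt((var_x + var_y) / 2)
        return (mean_x - mean_y) / pooled_sd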
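
    When the two groups have different sizes, a common variant weights each group's variance by its degrees of freedom. The sketch below uses that variant (the name `cohens_d_pooled` is hypothetical) and compares it with a t-test on simulated data, to illustrate that a very small p-value can go together with a tiny effect size. The simulated groups and the seed are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats


def cohens_d_pooled(x, y):
    """Cohen's d with the sample-size-weighted pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    # Weight each sample variance by its degrees of freedom (n - 1)
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Simulated data: two large groups whose true means differ by only 0.05 standard deviations
rng = np.random.default_rng(0)
a = rng.normal(loc=0.05, scale=1.0, size=100_000)
b = rng.normal(loc=0.00, scale=1.0, size=100_000)

t_stat, p_value = stats.ttest_ind(a, b)
d = cohens_d_pooled(a, b)
print(f"p-value = {p_value:.3g}, Cohen's d = {d:.3f}")
```

    With samples this large the t-test flags the difference as significant, while Cohen’s D stays well below the "small" threshold of 0.2. For equal group sizes this weighted version reduces to the simple average of the two variances.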

    Write in the comments if you have any questions regarding Cohen’s D.