Tag: correlation

  • Correlation between numerical and categorical variable – Point Biserial Correlation

    We all know about Pearson correlation among numerical variables. But what if your target is binary and you want to calculate the correlation between numerical features and a binary target? You can do so using the point-biserial correlation.

    The point-biserial correlation coefficient is a statistical measure that quantifies the relationship between a continuous variable and a dichotomous (binary) variable. It is a special case of the Pearson correlation coefficient: computing Pearson's r with the binary variable coded as 0 and 1 gives exactly the same value.

    The point-biserial correlation coefficient is specifically designed to assess the relationship between a continuous variable and a binary variable that represents two categories or states. It is often used when one variable is naturally dichotomous (e.g., pass/fail, yes/no) and the other variable is continuous (e.g., test scores, age).

    The coefficient ranges between -1 and +1, just like the Pearson correlation coefficient. A value of +1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates no linear relationship.

    The calculation of the point-biserial correlation coefficient compares the means of the continuous variable in the two categories of the binary variable and scales the difference by the overall variability of the continuous variable. The formula is:

    r_{pb} = \frac{M_{1} - M_{0}}{s_{n}}\sqrt{pq}

    Here

    • M1 is the mean of the continuous variable for category 1 of the binary variable.
    • M0 is the mean of the continuous variable for category 0 of the binary variable.
    • s_{n} is the standard deviation of the continuous variable over all observations, computed as a population standard deviation (dividing by n rather than n - 1).
    • p = Proportion of cases in the “0” group.
    • q = Proportion of cases in the “1” group.
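
    To make the formula concrete, here is a minimal sketch that plugs numbers into it using NumPy. The scores and pass/fail labels are made-up toy data, and the variable names are only for illustration. It also computes plain Pearson's r on the same arrays so you can see that the two values agree.

```python
import numpy as np

# Hypothetical toy data: a continuous score and a 0/1 outcome per case.
scores = np.array([52.0, 61.0, 58.0, 70.0, 75.0, 66.0, 80.0, 49.0])
passed = np.array([0, 0, 0, 1, 1, 1, 1, 0])

m1 = scores[passed == 1].mean()   # M1: mean of the continuous variable in group 1
m0 = scores[passed == 0].mean()   # M0: mean of the continuous variable in group 0
s_n = scores.std(ddof=0)          # population standard deviation (divides by n)
q = passed.mean()                 # proportion of cases in the "1" group
p = 1.0 - q                       # proportion of cases in the "0" group

r_pb = (m1 - m0) / s_n * np.sqrt(p * q)

# Pearson's r on the same two arrays, for comparison
r_pearson = np.corrcoef(scores, passed)[0, 1]

print(f"formula: {r_pb:.6f}  pearson: {r_pearson:.6f}")
```

    Note that the agreement depends on using the population standard deviation (ddof=0). With the sample standard deviation, the formula needs a slightly different scaling factor.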

    You can also easily calculate this in Python using the scipy library.

    import scipy.stats as stats
    
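    # Calculate the point-biserial correlation coefficient.
    # SciPy documents the binary variable as the first argument; the result is the same in either order.
    r_pb, p_value = stats.pointbiserialr(binary_variable, continuous_variable)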
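
    If you want something you can run end to end, here is a small sketch on synthetic data. The data is randomly generated purely for illustration (the seed, sample size, and the shift of 8.0 are arbitrary assumptions). It also calls stats.pearsonr on the same inputs to show that both functions return the same coefficient.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

# Hypothetical data: a binary label and a continuous feature whose mean
# is shifted upward for the "1" group.
binary_variable = rng.integers(0, 2, size=200)
continuous_variable = rng.normal(loc=50.0, scale=10.0, size=200) + 8.0 * binary_variable

r_pb, p_value = stats.pointbiserialr(binary_variable, continuous_variable)
r, _ = stats.pearsonr(binary_variable, continuous_variable)

print(f"point-biserial r = {r_pb:.4f}, p-value = {p_value:.3g}")
print(f"pearson r        = {r:.4f}")
```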
    

    Let me know in the comments in case you’ve any questions regarding the point-biserial correlation.

  • How To Calculate Correlation Among Categorical Variables?

    We know that calculating the correlation between numerical variables is very easy: all you have to do is call df.corr().

    But how do you calculate the correlation between categorical variables?

    If you have two categorical variables, you can test whether they are related using the chi-squared test for independence.

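    The chi-squared test starts from a null hypothesis (H0) and returns a p-value.

    • H0: The two columns are not correlated.
    • H1: The two columns are correlated.
    • Result of the test: the p-value, i.e. the probability of seeing an association at least as strong as the observed one if H0 were true. It is not the probability that H0 itself is true.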
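
    Before reaching for a library, it can help to see what the test actually computes. The sketch below works through the chi-squared statistic by hand with NumPy on a made-up 2x3 table of counts (the numbers are hypothetical). It compares the observed counts with the counts you would expect if the two variables were independent.

```python
import numpy as np

# Hypothetical 2x3 table of observed counts (rows: outcome, columns: group).
observed = np.array([[30, 20, 10],
                     [10, 20, 30]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Counts expected in each cell if the two variables were independent
expected = row_totals @ col_totals / n

# Sum of squared deviations, each scaled by its expected count
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi2 = {chi2:.2f}, dof = {dof}")  # prints "chi2 = 20.00, dof = 2"
```

    The p-value is then the upper tail of the chi-squared distribution with dof degrees of freedom, which you can get from scipy.stats.chi2.sf(chi2, dof).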

    We will use the Titanic dataset to run the chi-squared test for independence on a couple of categorical variables.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from matplotlib import pyplot as plt
    
    df = sns.load_dataset('titanic')
    corr = df[['age', 'fare', 'pclass']].corr()
    
    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
    
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    

    Pretty easy to calculate the correlation among numerical variables.

    Let's first check whether the passenger's class and whether or not they survived are correlated.

    # Import the required function
    from scipy.stats import chi2_contingency

    # Contingency table of passenger class against survival
    cross_tab = pd.crosstab(index=df['class'], columns=df['survived'])
    print(cross_tab)

    # chi2_contingency returns the statistic, p-value, degrees of freedom and expected counts
    chi2, p_value, dof, expected = chi2_contingency(cross_tab)
    decision = "reject" if p_value < 0.05 else "fail to reject"

    print(f"The p-value is {p_value} and hence we {decision} the null hypothesis with {dof} degrees of freedom")

    The p-value is 4.549251711298793e-23 and hence we reject the null hypothesis with 2 degrees of freedom

    Similarly, we can test any other pair of categorical variables in the same way.
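
    If you have several categorical columns, one way to check them all is to loop over every pair. The helper below, chi2_pairs, is a hypothetical name of my own and not part of pandas or SciPy. The toy DataFrame is made-up stand-in data so the snippet runs without downloading anything. Treat it as a rough sketch rather than a polished utility.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pairs(frame, columns):
    """Run a chi-squared test of independence on every pair of columns."""
    rows = []
    for a, b in combinations(columns, 2):
        table = pd.crosstab(frame[a], frame[b])
        chi2, p_value, dof, _ = chi2_contingency(table)
        rows.append({"var_1": a, "var_2": b, "chi2": chi2,
                     "p_value": p_value, "dof": dof})
    return pd.DataFrame(rows)


# Hypothetical stand-in data: "flag" follows "group" exactly,
# while "coin" is balanced across both groups.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"] * 25,
    "flag": ["yes", "yes", "no", "no"] * 25,
    "coin": ["h", "t"] * 50,
})

results = chi2_pairs(toy, ["group", "flag", "coin"])
print(results)
```

    With the Titanic DataFrame you could pass columns such as 'class', 'sex' and 'survived' instead. Keep in mind that running many tests at once increases the chance of a spurious small p-value.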

    Hopefully, this clears up how you can check whether two categorical variables are correlated in Python. In case you have any questions, please feel free to ask them in the comments.