Tag: Chi-Squared

  • How To Calculate Correlation Among Categorical Variables?

    We know that calculating the correlation between numerical variables is very easy, all you have to do is call df.corr().

    But how do you calculate the correlation between categorical variables?

    If you have two categorical variables then the strength of the relationship can be found by using Chi-Squared Test for independence.

    The Chi-square test finds the probability of a Null hypothesis (H0).

    Assumption(H0): The two columns are not correlated. H1: The two columns are correlated. Result of Chi-Sq Test: The Probability of H0 being True

    We will be using the titanic dataset to calculate the chi-squared test for independence on a couple of categorical variables.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from matplotlib import pyplot as plt
    
    df = sns.load_dataset('titanic')
    corr = df[['age', 'fare', 'pclass']].corr()
    
    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
    
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    

    Pretty easy to calculate the correlation among numerical variables.

    Lets first calculate first whether the class of the passenger and whether or not they survive have a correlation.

    # importing the required function
    from scipy.stats import chi2_contingency
    cross_tab=pd.crosstab(index=df['class'],columns=df['survived'])
    print(cross_tab)
    
    chi_sq_result = chi2_contingency(cross_tab,)
    p, x = chi_sq_result[1], "reject" if chi_sq_result[1] < 0.05 else "accept"
    
    print(f"The p-value is {chi_sq_result[1]} and hence we {x} the null Hpothesis with {chi_sq_result[2]} degrees of freedom")
    
    The p-value is 4.549251711298793e-23 and hence we reject the null Hpothesis with 2 degrees of freedom

    Similarly, we can calculate whether two categorical variables are correlated amongst other variables as well.

    Hopefully, this clears up how you can calculate whether two categorical variables are correlated or not in python. In case you have any questions please feel free to ask them in the comments.