We all know about Pearson correlation between numerical variables. But what if your target is binary and you want to measure the correlation between a numerical feature and that binary target? You can do so using the point-biserial correlation.
The point-biserial correlation coefficient is a statistical measure that quantifies the relationship between a continuous variable and a dichotomous (binary) variable. It is a special case of the Pearson correlation coefficient, which measures the linear relationship between two variables: coding the binary variable as 0 and 1 and computing Pearson's r gives exactly the point-biserial coefficient.
It is often used when one variable is naturally dichotomous (e.g., pass/fail, yes/no) and the other variable is continuous (e.g., test scores, age).
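The coefficient ranges between -1 and +1, like the Pearson correlation coefficient. A value of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. With the binary variable coded 0/1, a positive value means the continuous variable tends to be higher in the “1” group.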
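The connection to Pearson can be checked numerically. The sketch below uses made-up scores and pass/fail labels (`scores` and `passed` are illustrative names, not from any real dataset) and compares SciPy's `pointbiserialr` with `pearsonr` on the same 0/1-coded data; the two should agree up to floating-point error.

```python
import numpy as np
from scipy import stats

# Hypothetical data: test scores and a pass (1) / fail (0) outcome
scores = np.array([52.0, 61.0, 48.0, 75.0, 83.0, 69.0, 90.0, 58.0])
passed = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Point-biserial correlation between the binary and the continuous variable
r_pb, p_pb = stats.pointbiserialr(passed, scores)

# Ordinary Pearson correlation on the same 0/1-coded data
r_pearson, p_pearson = stats.pearsonr(passed, scores)

print(f"point-biserial: {r_pb:.4f}")
print(f"Pearson on 0/1: {r_pearson:.4f}")
```

Because the two numbers coincide, everything you already know about interpreting Pearson's r (sign, range, sensitivity to outliers) carries over.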
The calculation of the point-biserial correlation coefficient compares the means of the continuous variable in the two categories of the binary variable and scales the difference by the overall variability of the continuous variable. The formula for calculating the point-biserial correlation coefficient is:

$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{p\,q}$$

Here
- $M_1$ is the mean of the continuous variable for category 1 of the binary variable.
- $M_0$ is the mean of the continuous variable for category 0 of the binary variable.
- $s_n$ is the standard deviation of the continuous variable over all observations, computed with the population formula (dividing by $n$).
- $p$ = Proportion of cases in the “0” group.
- $q$ = Proportion of cases in the “1” group.
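The quantities listed above can be computed by hand in a few lines of NumPy. This is a minimal sketch on made-up data (the arrays `x` and `y` are hypothetical), assuming the population standard deviation (`ddof=0`) is used for the overall spread; it then compares the hand-computed value with SciPy's result.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a continuous feature x and a 0/1 label y
x = np.array([2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 3.0, 5.1])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

m1 = x[y == 1].mean()   # mean of x in the "1" group
m0 = x[y == 0].mean()   # mean of x in the "0" group
s_n = x.std(ddof=0)     # population standard deviation of all x values
q = y.mean()            # proportion of cases in the "1" group
p = 1 - q               # proportion of cases in the "0" group

# r_pb = (M1 - M0) / s_n * sqrt(p * q)
r_manual = (m1 - m0) / s_n * np.sqrt(p * q)

# Reference value from SciPy (binary variable first, continuous second)
r_scipy, _ = stats.pointbiserialr(y, x)

print(f"manual: {r_manual:.6f}")
print(f"scipy:  {r_scipy:.6f}")
```

If you use the sample standard deviation (`ddof=1`) instead, the two values will no longer match unless the `sqrt(p * q)` factor is adjusted accordingly.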
You can also easily calculate this in Python using the scipy library.
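```python
import scipy.stats as stats

# Calculate the point-biserial correlation coefficient and its two-sided p-value.
# SciPy documents the binary variable as the first argument; the result is the same in either order.
r_pb, p_value = stats.pointbiserialr(binary_variable, continuous_variable)
```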
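For the use case from the introduction, several numerical features and one binary target, the same call can be applied feature by feature. The following is only a sketch on synthetic data: the feature names, the seed, and the effect size are all assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200
target = rng.integers(0, 2, size=n)  # synthetic binary target (0 or 1)

# Synthetic features (hypothetical names): one shifted by the target, one pure noise
features = {
    "related_feature": rng.normal(loc=target * 1.5, scale=1.0),
    "noise_feature": rng.normal(size=n),
}

# Point-biserial correlation of each feature with the binary target
results = {}
for name, values in features.items():
    r, p_value = stats.pointbiserialr(target, values)
    results[name] = (r, p_value)

# Sort by absolute correlation, strongest first
ranked = sorted(results.items(), key=lambda item: abs(item[1][0]), reverse=True)
for name, (r, p_value) in ranked:
    print(f"{name}: r_pb = {r:+.3f}, p = {p_value:.3g}")
```

Ranking by the absolute value keeps strongly negative relationships near the top as well. Keep in mind that this only captures linear association with the target.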
Let me know in the comments if you have any questions about the point-biserial correlation.