Correlation analyses measures the strength of the relationship between two variables. Correlation analyses can be used to test for associations in hypothesis testing. If the null hypothesis states there is correlation between the variables considered under the study. Then, the purpose is to investigate the possible association in the underlying variables.
The Scipy library in Python encompasees Stats module with pearsonr(), kendalltau(), spearsmanr() function to evaluate Pearson, Kendall and Spearsman Correlation co-efficient respectively.
Pearson’s Correlation
It is a parametric correlation test because it depends on the distribution of data. It measures the linear dependence between two variables x and y. It is the ratio between the covariance of two variables and the product of their standard deviation. The result always have a value between 1 and -1.
import pandas as pdfrom scipy.stats import pearsonr# Read the sample datadf = pd.read_csv("../data/NCCTG_Lung_Cancer_Data_535_29.csv")# Convert dataframe into seriesx = df['age']y = df['meal.cal']# Apply the pearsonr()corr, _ = pearsonr(x, y)print('Pearsons correlation: %.3f'% corr)
Pearsons correlation: -0.240
Kendall’s Rank
A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. The Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.
Spearman’s Rank Correlation is a statistical measure of the strength and direction of the monotonic relationship between two continuous variables. Therefore, these attributes are ranked or put in the order of their preference. It is denoted by the symbol “rho” (ρ) and can take values between -1 to +1. A positive value of rho indicates that there exists a positive relationship between the two variables, while a negative value of rho indicates a negative relationship. A rho value of 0 indicates no association between the two variables.
import pandas as pdfrom scipy.stats import spearmanr# Read the sample datadf= pd.read_csv("../data/NCCTG_Lung_Cancer_Data_535_29.csv")# Convert dataframe into seriesx = df['age']y = df['meal.cal']# calculate Spearman's correlation coefficient and p-valuecorr, pval = spearmanr(x, y)# print the resultprint("Spearman's correlation coefficient:", corr)print("p-value:", pval)