Correlation Analysis in Python

Correlation analysis measures the strength of the relationship between two variables. It can also be used in hypothesis testing to test for an association: the null hypothesis states that there is no correlation between the variables under study, and the test investigates whether the data provide evidence of an association between them.

The SciPy library in Python includes the stats module, whose pearsonr(), kendalltau(), and spearmanr() functions compute the Pearson, Kendall, and Spearman correlation coefficients, respectively.
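Each of these functions returns both a coefficient and a p-value, so the same call can serve the hypothesis test described above. The following is a minimal sketch, assuming synthetic data and a significance level of 0.05; the arrays, the seed, and the `alpha` threshold are illustrative choices, not part of the lung cancer dataset used later.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau, spearmanr

# Hypothetical data: y is a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

alpha = 0.05  # assumed significance level

results = {}
tests = [("Pearson", pearsonr), ("Kendall", kendalltau), ("Spearman", spearmanr)]
for name, func in tests:
    # Each function returns the coefficient first and the p-value second
    coef, pval = func(x, y)
    results[name] = (coef, pval)
    decision = "reject H0" if pval < alpha else "fail to reject H0"
    print(f"{name}: coefficient={coef:.3f}, p-value={pval:.3g} -> {decision}")
```

A p-value below the chosen threshold is evidence against the null hypothesis of no correlation; it says nothing about how strong the relationship is, which is what the coefficient reports.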

Pearson’s Correlation

It is a parametric correlation test because its significance test assumes that the data follow a particular distribution (approximately normal). It measures the linear dependence between two variables x and y, and is defined as the ratio between the covariance of the two variables and the product of their standard deviations. The result always lies between -1 and 1.
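The definition can be checked directly against pearsonr(). The sketch below uses a small set of made-up values (not taken from the dataset) and computes the coefficient as covariance divided by the product of the standard deviations.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical sample values, not taken from the lung cancer dataset
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([10.0, 9.0, 7.5, 7.0, 4.0, 3.5])

# Sample covariance and sample standard deviations (ddof=1 in both,
# so the normalisation factors cancel in the ratio)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Reference value from SciPy
r_scipy, _ = pearsonr(x, y)

print(f"manual: {r_manual:.6f}, scipy: {r_scipy:.6f}")
```

The two values should agree up to floating-point error, as long as the same ddof is used for the covariance and for both standard deviations.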

import pandas as pd
from scipy.stats import pearsonr

# Read the sample data
df = pd.read_csv("../data/NCCTG_Lung_Cancer_Data_535_29.csv")

# Select the two columns as Series
x = df['age']
y = df['meal.cal']

# Compute Pearson's correlation coefficient (the p-value is discarded)
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: -0.240

Kendall’s Rank

A τ test is a non-parametric hypothesis test for statistical dependence based on the Kendall τ coefficient. The coefficient is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. Here, the rank of an observation is its relative position within a variable (1st, 2nd, 3rd, etc.). The Kendall correlation between two variables is high when observations have similar ranks in both variables (identical ranks give a correlation of 1), and low when they have dissimilar ranks (fully reversed ranks give a correlation of −1).
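One common way to compute the coefficient is to compare every pair of observations: a pair is concordant when both variables order the two observations the same way, and discordant when they order them in opposite ways. The sketch below counts these pairs for a small made-up sample without ties and compares the result with kendalltau(). With no ties, (concordant − discordant) divided by the number of pairs matches the tau-b variant that SciPy returns by default; with ties the two would differ.

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical values without ties, not taken from the dataset
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    # Positive product: both variables order the pair the same way
    sign = (x2 - x1) * (y2 - y1)
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

# Reference value from SciPy
tau_scipy, _ = kendalltau(x, y)

print(f"concordant={concordant}, discordant={discordant}")
print(f"manual: {tau_manual:.4f}, scipy: {tau_scipy:.4f}")
```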

import pandas as pd
from scipy.stats import kendalltau

# Read the sample data
df = pd.read_csv("../data/NCCTG_Lung_Cancer_Data_535_29.csv")

# Select the two columns as Series
x = df['age']
y = df['meal.cal']

# Calculate the Kendall rank correlation (the p-value is discarded)
corr, _ = kendalltau(x, y)
print('Kendall Rank correlation: %.5f' % corr)

Kendall Rank correlation: -0.14581

Spearman’s Rank

Spearman’s Rank Correlation is a statistical measure of the strength and direction of the monotonic relationship between two continuous or ordinal variables. It is computed on the ranks of the values rather than on the values themselves. It is denoted by the symbol “rho” (ρ) and takes values between -1 and +1. A positive value of rho indicates a positive relationship between the two variables, while a negative value indicates a negative relationship. A rho value of 0 indicates no monotonic association between the two variables.
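Because the coefficient works on ranks, it can be reproduced by ranking both variables and applying Pearson's correlation to the ranks. The sketch below illustrates this with made-up data in which y is a monotonic but non-linear function of x, so the Spearman coefficient reaches 1 while the Pearson coefficient on the raw values does not.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical data: y = x**3 is monotonic in x but not linear
x = np.array([-3.0, -1.0, 0.5, 2.0, 4.0, 6.0])
y = x ** 3

# Spearman's rho computed directly by SciPy
rho_scipy, _ = spearmanr(x, y)

# Same quantity obtained as Pearson's r on the ranks
rho_manual, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson's r on the raw values, for comparison
r_linear, _ = pearsonr(x, y)

print(f"Spearman (scipy): {rho_scipy:.4f}")
print(f"Pearson on ranks: {rho_manual:.4f}")
print(f"Pearson on raw values: {r_linear:.4f}")
```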

import pandas as pd
from scipy.stats import spearmanr

# Read the sample data
df = pd.read_csv("../data/NCCTG_Lung_Cancer_Data_535_29.csv")

# Select the two columns as Series
x = df['age']
y = df['meal.cal']

# Calculate Spearman's correlation coefficient and p-value
corr, pval = spearmanr(x, y)

# Print the result
print("Spearman's correlation coefficient:", corr)
print("p-value:", pval)

Spearman's correlation coefficient: -0.21159256066613205
p-value: 0.006051240282927171